Real-time DNA barcoding in a remote rainforest using nanopore sequencing

Advancements in portable scientific instruments provide promising avenues to expedite field work in order to understand the diverse array of organisms that inhabit our planet. Here we tested the feasibility for in situ molecular analyses of endemic fauna using a portable laboratory fitting within a single backpack, in one of the world’s most imperiled biodiversity hotspots: the Ecuadorian Chocó rainforest. We utilized portable equipment, including the MinION DNA sequencer (Oxford Nanopore Technologies) and miniPCR (miniPCR), to perform DNA extraction, PCR amplification and real-time DNA barcode sequencing of reptile specimens in the field. We demonstrate that nanopore sequencing can be implemented in a remote tropical forest to quickly and accurately identify species using DNA barcoding, as we generated consensus sequences for species resolution with an accuracy of >99% in less than 24 hours after collecting specimens. In addition, we generated sequence information at Universidad Tecnológica Indoamérica in Quito for the recently re-discovered Jambato toad Atelopus ignescens, which was thought to be extinct for 28 years, a rare species of blind snake Trilepida guayaquilensis, and two undescribed species of Dipsas snakes. In this study we establish how mobile laboratories and nanopore sequencing can help to accelerate species identification in remote areas (especially for species that are difficult to diagnose based on characters of external morphology), be applied to local research facilities in developing countries, and rapidly generate information for species that are rare, endangered and undescribed, which can potentially aid in conservation efforts.

Introduction
Biodiversity is defined as the variety of life found on Earth, including variation in genes, species, and ecosystems. While about 1.9 million species have been described

Cutadapt [40]. The consensus sequences were then aligned to the Sanger sequences of the same amplicons, using SeaView [41] and AliView [42], to investigate the quality of the consensus sequences generated from MinION reads. Sanger sequencing reads were edited and assembled using Geneious R10 software [43], and mapping files were inspected by eye using Tablet [44].

We further tested the impact of coverage on consensus accuracy by randomly subsampling three sets each of 30, 100, 300 and 1,000 reads for the eyelash palm pitviper and gecko 1. Subsampling was performed with famas (https://github.com/andreas-wilm/famas). These sets were assembled de novo and processed using the same approach we used for the full data sets (see above).

We then created species alignments for all barcodes (using sequences obtained from GenBank; accession numbers can be found in the phylogenetic tree reconstructions in the Supplementary material). We inferred the best substitution model

Upon returning to UTI's lab in Quito, we created one additional DNA barcode library with new samples. With our remaining flow cell, we were interested in quickly generating genetic information for (a) additional specimens that were collected during our field expedition (gecko 2), (b) undescribed species collected the week before our expedition (genera Dipsas and Sibon), (c) an endangered species that would have been difficult to export out of the country (Jambato toad), (d) a rare species lacking molecular data (Guayaquil blind snake), and (e) combinations of barcoded samples through multiplexing (for the eyelash palm pitviper and gecko 1).

Initially, this second sequencing run appeared to perform well. However, after using Albacore to demultiplex the reads, we determined that the adapter ligation enzyme had likely degraded, because the output primarily consisted of adapter sequences (Supplementary Figure 3). Nevertheless, we were able to generate consensus sequences for 16S of the Jambato toad, the two Dipsas species, the dwarf gecko, and the Guayaquil blind snake (Fig. 4 and Fig. 5).

The pore count of the flow cells appeared to be unaffected by travel conditions, as indicated by the multiplexer (MUX) scan, an ONT program that performs a quality check by assessing flow cell active pore count. The first run in the field had an initial MUX scan of 478, 357, 177, and 31, for a total of 1,043 active pores, and after approximately two hours of sequencing the flow cell generated 16,484 reads. The second flow cell, run at UTI, had a MUX scan of 508, 448, 277, and 84, for a total of 1,317 active pores, and the run produced 21,636 reads within two hours. This is notable since this run was performed 8 days after arriving in Ecuador and the flow cell was stored at suboptimal conditions on site and during travel.

We were unable to confidently determine if PCR bands were present in the field by running a gel with SYBR Safe DNA Gel Stain (Thermo Fisher Scientific) and a handheld ultraviolet flashlight. Therefore, the presence or absence of PCR product and its size were later determined by gel electrophoresis, and the product was quantified with a Quantus Fluorometer (Promega) at UTI. Amplification for 16S and ND4 was successful for all samples, but amplification of CytB was unsuccessful, perhaps due to suboptimal PCR settings, as samples were run concurrently due to the limitation and time constraint of having only one miniPCR machine available (Supplementary Figure 2

adapters. The best contig created by Canu was based on 55 reads, to which 3,695 reads mapped for the polishing step. The consensus sequence was 501bp and showed a 100% nucleotide match to the respective Sanger sequence. For this species, we did not find any differences between the de novo and the reference-based mapping consensus sequences (generated by mapping against a reference from the same species). The individual clusters with all other B. schlegelii and B. supraciliaris (formerly recognized as a subspecies of B. schlegelii) sequences in the phylogenetic tree (Fig. 4A). While the CytB de novo assembly did not succeed (no two reads assembled together), the best supported contig for ND4 (864bp) was based on 50 sequences and achieved an accuracy of 99.4% after polishing (using 95 reads that mapped to the de novo consensus).

and South America. Dwarf geckos can be difficult to identify in the field and it is suspected that there are several cryptic species within this genus in Ecuador. We captured two individuals on the evening of the 11th of July 2017, and because the two geckos differed in the shape and size of the dorsal scales (Fig. 3B) and were difficult to confidently identify by morphological characters, we decided to investigate them further with DNA barcoding.

Gecko 1 (Lepidoblepharis aff. grandis)
Gecko 1 was included in the first sequencing run in the field. We obtained 4,834 reads for the 16S fragment, 63 reads for CytB, and 76 for ND4. The consensus sequence (522bp) for this individual showed a 100% nucleotide match to the respective Sanger sequence. We then performed reference-based mapping using L. xanthostigma (GenBank accession: KP845170) as a reference, and the resulting consensus had 99.4% accuracy. We found three insertions compared to the Sanger and the de novo consensus sequences (position 302: G and 350-351: AA). Next we attempted assemblies for CytB and ND4. While the assembly for the CytB reads failed, we were able to assemble the ND4 reads. However, the polished consensus sequence showed a relatively high error rate compared to the Sanger sequence (92.1% accuracy). We then blasted all ND4 reads against NCBI. Of these, 8 reads hit ND4 from squamates, 4 hit 16S (3 to a viper and 1 to a gecko), 3 hit the positive control, 10 gave very short (negligible) hits, and 46 had no BLAST hit. Interestingly, while only 8 reads were hits for ND4 from squamates, 72 reads mapped to the consensus of the de novo assembly. The higher error rate can thus be explained by the fact that contaminant reads were used to assemble and correct the consensus. The de novo assembled consensus showed an accuracy of 91.7% compared to 92.1% for the polished sequence.
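
The accuracy values reported here compare a nanopore consensus with the Sanger sequence of the same amplicon. The following is a minimal sketch of one way such a value could be computed, assuming the two sequences are already aligned (with "-" marking gaps) and that accuracy means matching columns divided by alignment length. The exact definition used in the study is not stated, and the function name and toy sequences are hypothetical.

```python
def percent_identity(aligned_query: str, aligned_ref: str) -> float:
    """Percent of alignment columns in which both sequences carry the same base.

    Assumption: both inputs come from the same pairwise alignment, so they
    have equal length and use '-' for gaps.
    """
    if len(aligned_query) != len(aligned_ref):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_ref:
        raise ValueError("alignment is empty")
    matches = sum(
        1
        for q, r in zip(aligned_query, aligned_ref)
        if q == r and q != "-"  # a gap never counts as a match
    )
    return 100.0 * matches / len(aligned_ref)


# Toy alignment (hypothetical sequences): one inserted base in the query.
query = "ACGTGACGTA"
ref = "ACGT-ACGTA"
print(round(percent_identity(query, ref), 1))  # 90.0
```

Under this definition, 522 matching columns plus three insertion columns give 522/525, or about 99.4%, which is in line with the value reported for the reference-based consensus of gecko 1.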

Gecko 2 (Lepidoblepharis aff. buchwaldi)
Gecko 2 was included in the second sequencing run at UTI. We generated 325 reads (for more information see discussion on the possible issue with the adapter ligation enzyme). After filtering for read quality and assembly, we found the best contig to be supported by 30 reads. Out of the 325 barcoded reads, 308 mapped to the consensus. After running Nanopolish, the consensus matched the Sanger sequence at 98.4%. All of the

indel. We next applied reference-based mapping (same protocol and reference as for gecko 1). The resulting consensus sequence showed an accuracy of 97.9%. Phylogenetic tree reconstruction shows that gecko 1 and gecko 2 are clearly two distinct species (see

Laboratory processing and sequencing for Atelopus ignescens was carried out in the lab at UTI using a preserved tadpole sample. We obtained 503 reads for this species. The best supported de novo assembled contig was based on 56 reads. We then mapped the reads back to this contig for the polishing step, which resulted in 491 mapped reads. However, while the total coverage was 434x for the segment, the average coverage was only 212x. The discrepancy can be explained by a high percentage of reads that

sequences. However, many of those reads were adapter sequences. The Canu de novo assembled sequence was generated from 16 reads. We then mapped 740 reads back to this consensus. After polishing, the consensus sequence matched 100% of the Sanger-generated sequence (Fig. 5A; 516bp consensus length). We further investigated the accuracy of reference-based mapping for this species. We used Trilepida macrolepis (GenBank accession: GQ469225) as a reference, which is suspected to be a close relative of T. guayaquilensis. However, the resulting consensus sequence had a lower accuracy (97.7%) compared to the de novo assembled consensus (100%). Our sequence is sister to the clade comprising Trilepida macrolepis and all Rena species in the phylogenetic tree.

Dipsas snakes (Genus: Dipsas)
Dipsas are non-venomous New World colubrid snakes that are found in Central and South America (Cadle 2005). Here we included two specimens collected one week prior to our expedition, which are suspected to be undescribed species.

Dipsas sp. (JMG378)
We generated 779 reads for Dipsas (JMG378). The best supported contig of the Canu de novo assembly (498bp consensus length) was based on 59 reads and matched the corresponding Sanger sequence at 99% after polishing (Fig. 5B). Three out of 5 mismatches were indels in poly-A stretches (positions 185, 287 and 411). The remaining two mismatches are a C to G at position 469 and a T to A at position 489 for the nanopore compared to the Sanger sequence. Interestingly, the reference-based consensus sequence (using Dipsas sp., GenBank accession: KX283341 as a reference) matched the Sanger sequence at 99.4% after polishing. We generated 816 reads for the CytB barcode. However, de novo assembly was not successful as none of the reads actually belonged to CytB.
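
The breakdown of mismatches into indels and substitutions described above can be illustrated with a short sketch. It assumes a pairwise alignment of the nanopore consensus against the Sanger sequence in which "-" marks a gap. The function name and the toy sequences are hypothetical and are not the authors' data or tooling.

```python
def classify_differences(aligned_nanopore: str, aligned_sanger: str):
    """List (column, kind) for every alignment column where the sequences differ.

    Columns are 1-based. A column containing a gap in either sequence is an
    'indel'; any other disagreement is a 'substitution'.
    """
    if len(aligned_nanopore) != len(aligned_sanger):
        raise ValueError("aligned sequences must have equal length")
    differences = []
    for column, (n, s) in enumerate(zip(aligned_nanopore, aligned_sanger), start=1):
        if n == s:
            continue
        kind = "indel" if "-" in (n, s) else "substitution"
        differences.append((column, kind))
    return differences


# Toy alignment (hypothetical): a deletion inside a poly-A stretch and one substitution.
nanopore = "ACGAAA-TCG"
sanger = "ACGAAAATCC"
print(classify_differences(nanopore, sanger))
# [(7, 'indel'), (10, 'substitution')]
```

If accuracy is taken as the share of agreeing positions over the consensus length (an assumption), five differences across 498 bp correspond to 493/498, or roughly 99.0%.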

Dipsas sp. (JMG396)
We generated 487 reads for Dipsas (JMG396). Sequences with a quality score of >13 were retained, resulting in 193 sequences. The best supported contig of the Canu de novo assembly was based on 59 reads (498bp consensus length). After polishing, the consensus sequence matched the corresponding Sanger sequence at 98.9% (Fig. 5B). The first two mismatches are typical nanopore errors, namely indels in poly-A stretches

Sibon snakes are found in northern South America, Central America and Mexico [48]. We generated 339 reads for the 16S barcode of this species. However, we were not able to create a consensus sequence for this barcode, as almost all the reads were adapter sequences (all but 11 reads). Furthermore, we generated 1,425 reads for the CytB barcode but were not able to create a consensus sequence.

Subsampling
We further investigated the read depth needed to call accurate consensus sequences using our approach. We used the eyelash palm pitviper and gecko 1 to test subsampling schemes, since we obtained thousands of reads for these samples. We randomly subsampled to 30, 100, 300 and 1,000 reads (in three replicates; see Supplementary

We further sequenced multiplexed barcodes (16S and ND4) for the eyelash palm pitviper and gecko 1. However, we did not obtain reads for this sample from sequencing run 2, most likely due to the adapter ligation issues. We thus generated artificial multiplexes for the eyelash palm pitviper, pooling random sets of 1,000 16S reads with all 96 ND4 reads, to investigate the performance of the de novo assembly using multiplexed samples. We assembled the reads de novo and processed them using the same approach as discussed above. In all three cases we found the first two contigs of the Canu run to be 16S and ND4 contigs. After polishing, the 16S consensus sequences achieved a 99.8% accuracy (all three assemblies showed a deletion in a stretch of four T's compared to the Sanger sequence) and the ND4 sequences a 99.4% accuracy. All errors in ND4 but one (which shows a T compared to the C in the Sanger sequence) are deletions in homopolymer stretches.
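
The subsampling scheme used in this section (random subsets of 30, 100, 300 and 1,000 reads, three replicates each) can be sketched in a few lines. The study itself used famas for this step. The version below swaps in Python's standard `random.sample` purely for illustration, and the function name, the seed and the read identifiers are assumptions.

```python
import random


def subsample_reads(reads, depths=(30, 100, 300, 1000), replicates=3, seed=0):
    """Draw `replicates` random subsets (without replacement) at each depth.

    Returns a dict mapping depth -> list of subsets. Depths larger than the
    number of available reads are skipped.
    """
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    subsets = {}
    for depth in depths:
        if depth > len(reads):
            continue
        subsets[depth] = [rng.sample(reads, depth) for _ in range(replicates)]
    return subsets


# Hypothetical read identifiers; 4,834 is the gecko 1 16S read count reported above.
reads = [f"read_{i}" for i in range(4834)]
subsets = subsample_reads(reads)
for depth, replicate_sets in subsets.items():
    print(depth, len(replicate_sets), len(replicate_sets[0]))
```

Each subset would then be assembled and polished like a full data set, so that consensus accuracy can be compared across read depths.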

Discussion
In this project, we investigated the feasibility of molecular species identification in a remote tropical field location. Below we discuss different aspects of the project such as performance, conservation implications, significance for local resource building, as well as outlooks and improvements for future development.

Performance in the field
Our objective was to employ a portable laboratory in a rainforest to identify endemic species with DNA barcoding in real-time. Our protocols resulted in successful DNA extraction, PCR amplification, nanopore sequencing, and barcode assembly. We observed that the MinION sequencing platform performed well in the field after extended travel, indicating the potential for nanopore-based sequencing on future field expeditions. Although we demonstrate that the successful molecular identification of organisms in a remote tropical environment is possible, challenges with molecular work in the field remain. Our field site was provided with inconsistent electrical power, but still allowed us

often required for such analytical tasks. In our study, utilizing short DNA fragments with a relatively small number of samples for barcoding allowed us to perform all bioinformatic analyses in the field, but larger data outputs may require additional storage and more computational resources.

Implications for conservation
Tropical rainforests, such as the Ecuadorian Chocó, are often rich in biodiversity, as well as species of conservation concern. The Chocó biogeographical region is one of the

fauna. Our rapidly obtained DNA barcodes allowed us to accurately identify organisms while in the field, and had an accuracy comparable to Sanger-generated sequences. When samples are not required to be exported out of the country to carry out molecular experiments, real-time sequencing information can contribute to more efficient production of biodiversity reports that advise conservation policy, especially in areas of high conservation risk.

Of particular note in this study was the critically endangered harlequin Jambato toad, Atelopus ignescens. Although not a denizen of the Chocó rainforests, this Andean toad is a good example to demonstrate how nanopore sequencing can aid in the conservation of critically endangered species. Atelopus ignescens was previously

international laws and treaties, sample transport requires permits that can often be difficult to obtain, even when research is expressly aimed at conservation, resulting in project delays that can further compromise sample quality. By working within the country, under permits issued by Ministerio del Ambiente de Ecuador to local institutions, we were able to generate sequence data for the endangered harlequin Jambato toad Atelopus ignescens within 24 hours of receiving the tissue, whereas obtaining permits to ship samples internationally in the same time frame would not have been possible. The last confirmed record of Atelopus ignescens dates back to 1988, and this species was presumed to be extinct before one population was rediscovered in 2016, 28 years later.

Species identifications
It is important to note that we do not intend for rapidly obtained portable sequence information to substitute for standard species description processes. Instead, we aim to demonstrate that obtaining real-time genetic information can have beneficial applications for biologists in the field, such as raising the interesting possibility of promptly identifying new candidate species, information which can be used to adjust fieldwork strategies or sampling efforts. As we have shown, the latter could be especially important with organisms and habitats facing pressing threats. Rapidly obtaining genetic sequence information in the field can also be useful for a range of other applications, including identifying cryptic species, hybrid zones, immature stages, and species complexes.

Furthermore, we acknowledge that in most cases multiple loci are needed to reliably infer species position in a phylogenetic tree. DNA barcoding has been shown to hold promise for identification purposes in taxonomically well-sampled clades, but may have limitations or pitfalls in delineating closely related species or in taxonomically understudied groups [58], [59]. However, our aim in this study was to demonstrate that portable sequencing can be used in the field and that the final sequences have the accuracy needed to achieve reliable identification of a specimen. While a recent study has demonstrated a field-based shotgun genome approach with the MinION to identify

invaluable specialist knowledge about specific groups of organisms. Even with the advent of molecular diagnostic techniques to describe and discover species, placing organisms within a phylogenetic context based on a solid taxonomic foundation is necessary. An integrative approach utilizing molecular data and morphological taxonomy can lead to greater insight into biological and ecological questions [60]. As noted by Bik, 2017, "There is much to gain and little to lose by deeply integrating morphological taxonomy with high-throughput sequencing and computational workflows."

Bioinformatic challenges
Although nanopore sequence reads show high error rates (12-35%; [61], [18], [62]), the consensus sequences generated in this study matched the respective Sanger sequences with high accuracy, ranging from 98.4% to 100% (Fig. 4 and Fig. 5), with four out of seven sequences showing an accuracy of 100%. In order to investigate the minimum coverage needed to accurately call bases, we used different subsampling schemes. Overall, a coverage of 30 reads achieved an accuracy of 99.4-99.8%. With 100 reads most assemblies were 100% accurate, indicating that an excessive number of reads is not needed to produce high quality consensus sequences. Furthermore, we applied Nanopolish to all consensus sequences. This tool has been shown to be very effective at correcting typical nanopore errors, such as homopolymer errors [35], [63]. As can be seen in section "Post-Nanopolish assembly identity" in [63], accuracy of the resulting consensus increases significantly after polishing. While we did not measure the improvement in accuracy in our study, we did notice a high accuracy after polishing. However, as can be seen in Fig. 5B, Nanopolish is not always able to accurately correct homopolymer stretches.

We further tested reference-based mapping versus de novo assembly, because a reference-based mapping approach may introduce bias, making it possible to miss indels. Overall, we see that consensus sequences generated using reference-based mapping have slightly lower accuracy. However, in two cases (the eyelash palm pitviper and the Jambato toad) an accuracy of 100% was achieved with reference-based mapping. Interestingly, in the case of Dipsas sp. (JMG378), reference-based mapping resulted in a slightly better accuracy than de novo assembly (99.4% compared to 99%). In general, we recommend the use of a de novo assembly approach, as this method can be applied even if no reference sequence is available and generally produced more accurate sequences. An alternative approach would be to generate consensus sequences by aligning the individual reads for each barcode to one another, which would not be affected by a reference bias. This method is implemented in the freely available software tool Allele Wrangler (https://github.com/transplantation-immunology/allele-wrangler/). However, at the time of submission this tool picks the first read as the pseudo reference, which can lead to errors in the consensus if this read is of low quality or an incorrect sequence. Future developments might establish this method as an alternative to de novo assembly algorithms, which are typically written for larger genomes (e.g. the minimum genome size in Canu is 1000bp) and can have issues with assemblies where the consensus sequence is roughly the size of the input reads (Adam Phillippy, personal communication).

Each of our two runs showed a very high number of reads not assigned to any barcode sequence after de-multiplexing with Albacore 1.2.5 (7,780 and 14,272 for the first and second sequencing run, respectively). In order to investigate whether these reads belong to the target DNA barcodes but did not get assigned to sequencing barcodes, or if they constitute other sequences, we generated two references (one for each sequencing run) comprising all consensus sequences found within each individual sequencing run. We then mapped all reads not assigned to barcodes back to the reference. We were able to map 2,874 and 4,997 reads to the reference for the first and the second sequencing run, respectively, which shows that a high number of reads might be usable if more efficient de-multiplexing algorithms become available. Here we used Albacore 1.2.5, an ONT software tool, to de-multiplex the sequencing barcodes. This tool is under constant development and thus might offer more efficient de-multiplexing in later versions.

Cost-effectiveness and local resource development
Next-generation sequencing technologies are constantly evolving, along with their associated costs. Most major next-generation sequencing platforms require considerable initial investment in the sequencers themselves, costing hundreds of thousands of dollars, which is why they are often consolidated to sequencing centers at the institutional level [65]. In this study, we used the ONT starter pack, which currently costs $1000 and includes two flow cells and a library preparation kit (6 library preparations), as well as the ONT 12 barcoding kit, which is currently $250 for 6 library preparations (for a full list of equipment and additional reagents see Supplementary Table 1)

processing costs approximately $10 per sample, independent of the throughput. Thus the Oxford Nanopore MinION has the potential to be a cost-effective sequencing option for resource-limited labs, especially in developing countries without access to standard sequencing devices.

The small size and low power requirements of the MinION will likely continue to enable its evolution as a field-deployable DNA sequencing device, opening up new avenues for biological research in areas where the typical laboratory infrastructure for genetic sequencing is unavailable. With some training, molecular analyses in the field could also potentially be performed by local students or assistants, providing an opportunity for capacity building and community involvement.

We also demonstrate that portable sequencing can allow data to be generated rapidly for endangered, rare, and undescribed species at nearby facilities within the country. In the context of conservation and biodiversity science, portable nanopore sequencing can be beneficial for applications including:
i. When it is exceedingly challenging or not possible to export biological material internationally or to a facility with a conventional sequencing device. The proper permits to collect samples and carry out experiments in the location of the study are still necessary, and collaborating with local researchers is encouraged.
ii. When the material to be sequenced may be compromised during transportation conditions, or during the time in between collection and sequencing. This can be applicable to experiments involving RNA in particular, which is subject to degradation if not adequately preserved or immediately frozen.
iii. For biodiversity reports aimed at quickly generating species data to inform conservation policy decisions, especially in areas of high conservation risk.
iv. To rapidly screen and sequence pathogens, such as chytrid fungus in amphibians. Studies using the MinION in the field have been applied during epidemics, including recent outbreaks of Ebola and Zika, and can be applied to non-human pathogens as well.
vi. To identify organisms in the field that are difficult to locate or capture by sampling environmental DNA (eDNA).

While we live in a period of amazing technological change, biodiversity and ecosystem health are decreasing worldwide. Portable sequencing will not be a silver bullet for conservation biology, but it can be a powerful tool to more efficiently obtain information about the diversity of life on our planet. This is particularly important for many tropical rainforests under high risk of habitat loss, such as the Ecuadorian Chocó. We present, to our knowledge, the first expedition to successfully demonstrate real-time animal species identification using DNA barcode sequencing in a remote tropical rainforest. We anticipate that as portable technologies develop further, this method will broaden the utility of biological field analyses including real-time species identification, cryptic species discovery, biodiversity conservation reports, pathogen detection, and environmental studies.