Pest Alert Tool—a web-based application for flagging species of concern in metabarcoding datasets

ABSTRACT
Advances in high-throughput sequencing (HTS) technologies and their increasing affordability have fueled environmental DNA (eDNA) metabarcoding data generation from freshwater, marine and terrestrial ecosystems. Research institutions worldwide increasingly employ HTS for biodiversity assessments, new species discovery and ecological trend monitoring. Moreover, even non-scientists can now collect an eDNA sample, send it to a specialized laboratory for analysis and receive an in-depth biodiversity record from a sampling site. This offers unprecedented opportunities for biodiversity assessments across wide temporal and spatial scales. The large volume of data produced by metabarcoding also enables incidental detection of species of concern, including non-indigenous and pathogenic organisms. We introduce an online app, the Pest Alert Tool, for screening nuclear small subunit 18S ribosomal RNA and mitochondrial cytochrome oxidase subunit I datasets for marine non-indigenous species as well as unwanted and notifiable marine organisms in New Zealand. The output can be filtered by minimum length of the query sequence and identity match. For putative matches, a phylogenetic tree can be generated through the National Center for Biotechnology Information’s BLAST Tree View tool, allowing for additional verification of the species of concern detection. The Pest Alert Tool is publicly available at https://pest-alert-tool-prod.azurewebsites.net/.


INTRODUCTION
Rapid development of high-throughput sequencing (HTS) technologies has revolutionized our ability to characterize biodiversity across ecosystems. They fueled ubiquitous applications of environmental DNA (eDNA) metabarcoding, a genetic method that amplifies homologous gene(s) across species to derive taxonomic constituents of a community (1), for new species discovery, ecological trend monitoring and environmental impact assessments (2-4). One of the increasingly touted eDNA metabarcoding applications is for biosecurity surveillance (5-7), a field specifically devoted to monitoring environments for pest species with potentially devastating ecological and economic consequences. The large volume of data produced by metabarcoding also raises the possibility of upscaling biosecurity surveillance capabilities, enabling indeliberate detection of species of concern through the broad assessment of entire biological communities (8, 9). The allure of this 'passive surveillance' approach is obvious: we do not always know which species of concern to expect, and casting a broad surveillance net across the taxonomic spectrum may allow early detection of new, unanticipated pest incursions. More broadly, passive surveillance could enable efficient leveraging of resources, as HTS technologies are increasingly applied for biodiversity monitoring purposes across a variety of biomes and contexts (10-13) and can be secondarily adopted for biosecurity surveillance (8).
Nucleic Acids Research, 2023, Vol. 51, Web Server issue W439
Reporting unverified biosecurity risks from HTS data, however, may lead to a rapid increase in workload for biosecurity managers for minimal benefit, and can even lead to legal actions against researchers in case of false-positive reporting (8). Furthermore, researchers conducting HTS-based biodiversity surveys might be unaware of the biosecurity threats occurring in their region and often overlook ramifications with key end users concerned with the spread of unwanted organisms. Therefore, there is a growing interest in simple tools that could alert both researchers and stakeholders to the potential biosecurity risks contained in HTS datasets (14). For instance, straightforward web-based informatics tools that screen pre-publication HTS datasets for the presence of species of concern could be implemented as part of best practice standards in eDNA-based research and biomonitoring (15) and allow researchers to conduct important secondary quality assurance steps.
We introduce the Pest Alert Tool, an online server (application), developed for screening HTS datasets for species of concern, and showcase its applicability for secondary quality assurance steps or reporting of putative threats to environmental agencies. This is an open tool, which combines an algorithm for the automatic generation and update of a reference database based on a curated list of species of interest and BLAST processing of submitted FASTA files in a user-friendly interface. The current version of the tool was created as a proof of concept to address the immediate demand of New Zealand biosecurity practitioners and is set up to screen nuclear small subunit 18S ribosomal RNA (18S rRNA) and mitochondrial cytochrome oxidase subunit I (COI) genes for marine non-indigenous species (NIS) as well as unwanted and notifiable marine organisms in New Zealand. However, it can easily be adjusted for other regions or applications following the guidelines provided in this account.

Implementation
The Pest Alert Tool is based on comparing HTS files in FASTA format against automatically generated customized nuclear 18S rRNA and COI gene databases (Figure 1). These databases consist of all the sequences belonging to genera that contain species that have been identified as marine pests in New Zealand, i.e. NIS (https://pest-alert-tool-prod.azurewebsites.net/examples/ NIS list 22-11-17.txt) and species on Biosecurity New Zealand's list of unwanted and notifiable marine organisms (https://pest-alert-tool-prod.azurewebsites.net/examples/ Unwanted notifiable 22-11-17.txt). The databases were constructed using the Creating Reference databases for Amplicon-Based Sequencing (CRABS) algorithm (16) with sequences obtained from the National Center for Biotechnology Information (NCBI) (17) for 18S rRNA and COI and from the Barcode of Life Data System (18) for COI only. The integrated CRABS algorithm allows automated construction of the databases and updates every 6 months. Sequences acquired by CRABS are quality checked (minimum length 250 bp; maximum number of N's = 1) and result in two clean FASTA files (one for 18S rRNA and one for COI). The FASTA files are then converted with the makeblastdb function into BLAST databases, against which the submitted sequences are aligned. The user can stay informed on the latest updates by checking the disclaimer box at the bottom of the page, which also provides information on reference database coverage for New Zealand marine NIS and unwanted and notifiable organisms. The instructions on how to use the tool can be accessed via the 'HELP' button in the upper panel (Supplementary File S1).
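The quality check on reference sequences can be illustrated with a short Python sketch. This is not the CRABS implementation: the function names (`read_fasta`, `passes_quality_check`, `filter_fasta`) are hypothetical, and only the two thresholds (minimum length 250 bp, at most one N) are taken from the text.

```python
from typing import Iterable, Iterator, List, Tuple

MIN_LENGTH = 250  # minimum sequence length in bp (threshold from the text)
MAX_N = 1         # maximum number of ambiguous 'N' bases (threshold from the text)


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def passes_quality_check(seq: str) -> bool:
    """Hypothetical filter: keep sequences >= 250 bp with at most one N."""
    return len(seq) >= MIN_LENGTH and seq.upper().count("N") <= MAX_N


def filter_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return only the records that pass the quality check."""
    return [(name, seq) for name, seq in read_fasta(lines)
            if passes_quality_check(seq)]
```

In this sketch a record is dropped if it is too short or contains two or more N's; a single N is tolerated.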
The minimum input required for the Pest Alert Tool is a text file containing DNA sequences originating from either the 18S rRNA or COI gene in FASTA format (*.fa or *.fasta extension). FASTA files are text files in which each sequence begins with a single-line description, followed by one or more lines of sequence data. The single-line description contains a greater than (>) symbol followed by the sequence name. These files can be opened and edited in any text-editing or word-processing program. Example FASTA files of various lengths, randomly compiled from published New Zealand marine datasets, are downloadable on the Pest Alert Tool web page (e.g. https://pest-alert-tool-prod.azurewebsites.net/examples/100.fa). The input FASTA files are HTS output files that have been pre-processed (i.e. filtered, trimmed, merged and denoised) with a bioinformatics pipeline [e.g. DADA2 (19)] prior to being uploaded into the Pest Alert Tool. Basic raw data processing is needed to reduce the screening time and potential error rate in the output results. Inexperienced HTS users commonly outsource the raw data processing steps to a sequencing or bioinformatics service provider, and FASTA is the usual output file format they receive. For users with some experience in data processing, we provide a fully annotated guide on how to construct a FASTA file from the raw sequencing files following the DADA2 pipeline in R (Supplementary File S2).
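Before uploading, a file can be checked against the FASTA layout described above. The following Python sketch is only an illustration of such a pre-upload check: `validate_fasta` is a hypothetical helper, not part of the Pest Alert Tool, and restricting sequences to IUPAC nucleotide codes is an assumption.

```python
import re
from typing import List

# IUPAC nucleotide codes; treating any other character as invalid is an assumption.
_VALID_SEQ = re.compile(r"^[ACGTURYSWKMBDHVN]+$", re.IGNORECASE)


def validate_fasta(text: str) -> List[str]:
    """Return a list of problems found; an empty list means the text looks valid."""
    problems = []
    current = None   # name of the record currently being read
    has_seq = False  # whether that record has any sequence line yet
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None and not has_seq:
                problems.append(f"record '{current}' has no sequence")
            current, has_seq = line[1:].strip(), False
            if not current:
                problems.append(f"line {number}: empty sequence name")
        elif current is None:
            problems.append(f"line {number}: sequence data before any '>' header")
        else:
            has_seq = True
            if not _VALID_SEQ.match(line):
                problems.append(f"line {number}: unexpected characters in sequence")
    if current is not None and not has_seq:
        problems.append(f"record '{current}' has no sequence")
    if current is None and not problems:
        problems.append("no records found")
    return problems
```

For example, `validate_fasta(open("my_asvs.fa").read())` (with a hypothetical file name) would return an empty list for a well-formed file and a list of messages otherwise.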
Using the input FASTA file, the Pest Alert Tool queries the appropriate BLAST database using blastn (20) with the megablast algorithm. The BLAST search is run with the maximum number of reported sequence matches set at 100.
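The database build and search steps correspond to standard BLAST+ command-line calls, which can be sketched in Python as below. The helper names, file paths and the tabular output format (`-outfmt 6`) are assumptions made for illustration; the tool's actual invocation may differ. Only the megablast task and the limit of 100 reported matches come from the text.

```python
import subprocess
from typing import List


def build_makeblastdb_command(fasta_path: str, db_path: str) -> List[str]:
    """Command to turn a cleaned reference FASTA file into a nucleotide BLAST database."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", "nucl", "-out", db_path]


def build_blastn_command(query_fasta: str, db_path: str, out_path: str) -> List[str]:
    """Command for a megablast search reporting at most 100 matches per query."""
    return [
        "blastn",
        "-task", "megablast",
        "-query", query_fasta,
        "-db", db_path,
        "-max_target_seqs", "100",
        "-outfmt", "6",  # tabular output; an assumption, not stated in the text
        "-out", out_path,
    ]


def run(command: List[str]) -> None:
    """Execute a command; requires BLAST+ to be installed and on the PATH."""
    subprocess.run(command, check=True)
```

A typical sequence would be `run(build_makeblastdb_command("coi_clean.fasta", "db/coi"))` followed by `run(build_blastn_command("query.fa", "db/coi", "hits.tsv"))`, where all file names are placeholders.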
To ensure consistency and ease of deployment, all backend services were containerized and managed through a single Docker Compose file. The API was constructed with FastAPI, a modern Python framework, and deployed using Gunicorn (https://gunicorn.org/) and NGINX (https://www.nginx.com/) web servers. The NGINX service also served the static content and ensured the availability and scalability of the website. To offload work from the application, we employed the Celery distributed task queue (21). To improve performance, we utilized Redis (https://redis.com/) as a caching engine, providing an in-memory data structure store. On the front end, we used Vue.js (https://vuejs.org/), a progressive JavaScript framework for building user interfaces. This framework offers a component-based programming model based on standard HTML, CSS and JavaScript. The whole code base developed to build the Pest Alert Tool is available in a public Git repository (https://gitlab.com/cawthron-public/marine-biosecurity-toolbox/pest-alert-tool).

Output
Once the FASTA file has been uploaded, the screening for putative pests commences instantly. When the screening is complete, the Pest Alert Tool will navigate to the 'EXPLORE RESULTS' screen (Figure 1). Here, users can see whether any of their submitted sequences match New Zealand marine NIS or unwanted and notifiable organisms. At the top, below the main toolbar, the number of reference sequence matches to marine NIS and notifiable organisms is reported. Values for the minimum percentage sequence identity match and minimum sequence length can be adjusted by the user to allow for variations in the stringency of the results. Allowed adjustment ranges are 98-100% for the minimum percentage sequence identity match and 100-600 bp for the minimum sequence length. By selecting a lower percentage sequence identity value (i.e. < 99%) or a short minimum sequence length threshold (i.e. < 250 bp), the user decreases the specificity of the match, potentially increasing the risk of incorrect species identification. Therefore, additional verification steps for such matches are recommended.
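The effect of the two adjustable thresholds can be sketched as a simple filter over BLAST hits. The `Hit` record and `filter_hits` function below are hypothetical and do not reproduce the tool's code; the allowed ranges (98-100% identity, 100-600 bp) come from the text, whereas the default values of 99% and 250 bp are assumptions chosen to match the cut-offs discussed above.

```python
from dataclasses import dataclass
from typing import Iterable, List

IDENTITY_RANGE = (98.0, 100.0)  # allowed minimum percentage identity (from the text)
LENGTH_RANGE = (100, 600)       # allowed minimum sequence length in bp (from the text)


@dataclass(frozen=True)
class Hit:
    """One query-to-reference match (hypothetical record layout)."""
    query_id: str
    species: str
    percent_identity: float
    query_length: int


def filter_hits(hits: Iterable[Hit],
                min_identity: float = 99.0,  # assumed default
                min_length: int = 250        # assumed default
                ) -> List[Hit]:
    """Keep hits meeting both thresholds; reject thresholds outside the allowed ranges."""
    if not IDENTITY_RANGE[0] <= min_identity <= IDENTITY_RANGE[1]:
        raise ValueError("min_identity must be between 98 and 100")
    if not LENGTH_RANGE[0] <= min_length <= LENGTH_RANGE[1]:
        raise ValueError("min_length must be between 100 and 600")
    return [hit for hit in hits
            if hit.percent_identity >= min_identity
            and hit.query_length >= min_length]
```

Relaxing either threshold returns more hits, which is why matches obtained with the lower settings call for additional verification.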
Further verification of query sequences that have resulted in hits to NIS or notifiable species can be undertaken by investigating the reference sequences by pressing the i-button under each species name in a list view. A phylogenetic tree can also be created via pairwise alignments in NCBI (see below). As a rule of thumb, matches supported by multiple reference sequences are expected to be more robust. However, due to existing gaps in sequence reference databases, some species might have only one or few reference sequences available. It is advised that matches to putative pests are further inspected to verify the robustness of the reference, for example, by verifying whether additional information is provided on species ID, provenance and/or vouchered specimen.
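The rule of thumb about reference support can be made concrete by counting how many distinct reference sequences back each match. The sketch below is illustrative only: the function names and the input layout of (query, species, reference) triples are hypothetical, and the cut-off of two references is an assumption rather than a value used by the tool.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Match = Tuple[str, str, str]  # (query_id, species, reference_id)


def reference_support(hits: Iterable[Match]) -> Dict[Tuple[str, str], int]:
    """Count distinct reference sequences supporting each (query, species) match."""
    refs = defaultdict(set)
    for query_id, species, reference_id in hits:
        refs[(query_id, species)].add(reference_id)
    return {key: len(ids) for key, ids in refs.items()}


def weakly_supported(hits: Iterable[Match], min_refs: int = 2) -> List[Tuple[str, str]]:
    """List matches backed by fewer than min_refs references (assumed cut-off)."""
    support = reference_support(hits)
    return sorted(key for key, count in support.items() if count < min_refs)
```

Matches returned by `weakly_supported` are the ones for which inspecting the reference record (species ID, provenance, voucher) matters most.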
For additional diagnostics of the sequence match specificity, it is highly recommended to verify the match by checking the phylogenetic tree of the wider range of related taxa from NCBI. This can be done directly from the Pest Alert Tool results in a list view mode, by pressing the 'Tree BLAST' button next to the species of interest. By pressing the button, the sequence from the user's dataset is submitted to NCBI's phylogenetic tree generation engine that runs BLAST pairwise alignments. When the tree is ready, a green 'VIEW TREE' button appears. The link on this button takes the user to the Tree View on the NCBI platform. For more details about interacting with the Tree View, see the NCBI tutorial (https://www.ncbi.nlm.nih.gov/tools/gbench/tutorial19/). To verify the match, users should check the location of the yellow highlighted query (the submitted sequence). A robust match is expected to be on a branch with the same named species and to form a separate phylogenetic clade from other species.

Examples of use
The anticipated primary use of the Pest Alert Tool is screening of metabarcoding datasets generated from New Zealand marine samples (from routine monitoring, environmental assessments or any research-focused projects) to identify the potential presence of risky taxa. The recent outburst of HTS-based biodiversity data (22-24) often leads to the publication of taxonomic inventories that include species of concern without cautioning prospective end users on the potential for misassignment or other sources of false-positive errors (8). For example, there have been cases where identification of species of concern in already published HTS data raised management responses that led to additional quality control analyses demonstrating the erroneous attribution of these species (25, 26). Such responses may incur significant financial burden to governmental agencies and/or expose researchers to legal entanglements. Therefore, the Pest Alert Tool offers a simple solution for marine eDNA practitioners and environmental managers alike, enabling the rapid screening of sequence data pre- and post-publication, and hence represents a powerful verification platform for improved quality assurance standards of HTS data. We see the tool being adopted by metabarcoding data holders (researchers, environmental consultants, biosecurity practitioners, community groups, etc.) as a routine quality assurance step. This would allow an informed decision on whether the data owner needs to inform authorities of the potential presence of species of concern, conduct additional verification steps or add a disclaimer on uncertainties associated with the detection and identification of these species (8). The tool can also be used retrospectively, on datasets already existing in the public domain, for example, to establish or verify the baselines of NIS incursions and distribution in coastal waters. This would provide useful information for biosecurity managers and decision-makers on likely vectors of spread, population dynamics and potential future introductions to keep an eye on.
We also see potential applicability of the tool for research purposes. For example, as part of the initial proof-of-concept test, the offline algorithm underlying the Pest Alert Tool has been tested in a study aimed at investigating potential affinity of marine NIS to particular biopolymer types in a marine deployment experiment (27). The algorithm allowed rapid screening of an extensive dataset obtained from 60 biofilm samples, representing 4225 unique amplicon sequence variants (ASVs), for a subset of marine NIS ASVs. Now, with the algorithm implemented in the web-based server with a user-friendly interface, subsetting of metabarcoding datasets can easily be performed by researchers or students interested in marine pests independently of their proficiency in bioinformatics pipelines. The tool can also be offered as an educational resource for community outreach activities or science curriculum at high school to raise public awareness of marine environmental issues and novel biodiversity surveillance approaches offered by HTS tools.

Future development
The Pest Alert Tool is intended initially to screen metabarcoding datasets for marine pests in New Zealand, but it can easily be adapted to detect other species of interest: freshwater or terrestrial species of concern, pathogens, and protected or indicator taxa. It can also be extended to other regions and other metabarcoding markers. The minimum requirement for such an extension is a curated list of species of interest. In Supplementary File S3, we provide detailed guidelines for how the tool can be adjusted to create a customized database for other regions or applications. The efficiency of the tool is reliant on the availability of reference sequences of target taxa; therefore, collaborative cross-national effort augmenting existing open access databases with quality references is warranted and of high priority for future developments. Moreover, with future improved knowledge of regional biodiversity and completeness of reference sequence databases, an extension to the screening server can be implemented to indicate indigenous biodiversity found in a sample and highlight known unwanted organisms, but also species that are not yet listed as unwanted or non-indigenous but are new to the region. A further extension to allow registering and tracking unwanted or suspicious detections in a searchable GIS format can be a useful live resource for informing targeted and precautionary management response to emerging biosecurity risks by governmental agencies.

Citing the Pest Alert Tool
Authors who make use of the Pest Alert Tool web server should cite this article as a general reference and should also include https://pest-alert-tool-prod.azurewebsites.net/. The web server pages will list additional articles for citation that relate to the algorithms employed, the software that implements them and the energy parameters it uses.

DATA AVAILABILITY
Pest Alert Tool is an open source collaborative initiative available in the GitLab repository (https://gitlab.com/cawthron-public/marine-biosecurity-toolbox/pest-alert-tool).