Abstract

We developed phyloBARCODER (https://github.com/jun-inoue/phyloBARCODER), a new web tool that identifies short DNA sequences obtained by metabarcoding to the species level. phyloBARCODER estimates phylogenetic trees from uploaded anonymous DNA sequences and reference sequences from databases. Alternative, similarity-based methods lack such phylogenetic context: they identify species names, and anonymous sequences of the same group, independently by pairwise comparisons between queries and database sequences, with the caveat that these must match exactly or very closely. By putting metabarcoding sequences into a phylogenetic context, phyloBARCODER accurately identifies (i) the species or classification of query sequences and (ii) anonymous sequences associated with the same species, or even the same populations, as the query sequences, and it presents the supporting trees and alignments. Version 1 of phyloBARCODER stores a database comprising all eukaryotic mitochondrial gene sequences. Moreover, by uploading their own databases, phyloBARCODER users can conduct species identification specialized for sequences obtained from a local geographic region or for nonmitochondrial genes, e.g. ITS or rbcL.

Introduction

Metabarcoding analyses are employed in a wide range of research fields, such as ecology or fisheries, by annotating short DNA sequences derived from environmental or bulk specimens (Deiner et al. 2017; Ruppert et al. 2019). Over the past 40 years, such sequence-based identification of taxa has inspired microbial and protist ecologists (Creer et al. 2016). Since the inception of high-throughput sequencing (HTS) technologies (Goodwin et al. 2016), the use of metabarcoding as a biodiversity detection tool has attracted immense interest. For a sample of biological material, these technologies typically produce thousands to millions or even billions of short genetic sequences called “reads” of barcoding genes, such as mitochondrial 12S rRNA or CO1 genes (Deiner et al. 2017). For this purpose, appropriate primers have been designed to enable metabarcode sequencing of some animal groups such as fish (Miya et al. 2015), corals (Shinzato et al. 2021), and metazoans (Leray et al. 2013). Therefore, accuracy and reliability of recent DNA metabarcoding analyses are heavily affected by species identifications and reference databases.

In metabarcoding, although composition-based methods such as the Bayesian classifier (Wang et al. 2007) are widely used in stand-alone programs (Bayer et al. 2024), a common method of species identification is to quantify similarity between reads and reference sequences using web tools such as NCBI BLAST (Altschul et al. 1997; Collins et al. 2021). However, this approach has several problems (Fig. 1): (i) BLAST searches provide a score based on local alignments, rather than global alignments, leading to a loss of information (Munch et al. 2008); (ii) such scores are calculated from pairwise alignments, not multiple alignments (Smith and Pease 2017); and (iii) identifications are provided without phylogenetic context (Czech et al. 2022). A newer approach, phylogenetic placement (EPA-NG [Barbera et al. 2019] and PPLACER [Matsen et al. 2010]), places genetic sequences on specific phylogenetic reference trees and compares statistical measures across all candidate positions, but users must first undertake labor-intensive preparation of huge reference databases (Czech et al. 2022). Recently, a user-friendly web tool, the MitoFish pipeline (Zhu et al. 2023), was developed to provide a workflow from raw sequences to species identification. However, the MitoFish pipeline employs BLAST searches to produce rough estimates of species lists and focuses only on reads of the fish 12S rRNA gene (Miya et al. 2015).

Fig. 1. Overview of phyloBARCODER and the alternative, similarity-based method. A similarity-based method is represented by BLAST search identification. Numbers beside nodes indicate bootstrap probabilities (>50%). Numbers beside anonymous sequence/species names indicate BLAST identities. Species-A, Species-B, and Species-C indicate BLAST hits obtained from reference databases.

For accurate species identification, there is a demand for high-quality public reference databases (Collins et al. 2021; Miya 2022). Development of primers and marker genes for various animal groups requires reference databases for the corresponding taxonomic groups and gene regions (Ruppert et al. 2019). Similarity-based identifications are generally conducted by querying commonly used databases such as GenBank (https://www.ncbi.nlm.nih.gov/genbank/) or the Barcode of Life Data System (http://www.boldsystems.org/) using online tools such as BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi). Although users of these tools can search up-to-date databases with maximal taxonomic breadth, repeatability is a problem because the set of reference sequences changes with each update (Federhen 2011; Collins et al. 2021), especially for analyses conducted for different localities or projects (Fig. 1). In addition, because large-scale metabarcoding projects across spatial or temporal gradients require complete species coverage for reliable species identification (Cribdon et al. 2020), region-specific reference databases with wide taxonomic and populational breadth are needed (Stoeckle 2020).

New Approaches

To address this need, we created a web tool called phyloBARCODER. By estimating gene trees with stochastic models of sequence evolution (Yang 2006) based on global alignments, phyloBARCODER users can distinguish reads even if they share the same BLAST scores (Fig. 1). Users can evaluate the existence of reference sequences of focal species in the pre-installed database and can apply their own reference databases by uploading them. Currently, the pre-installed database comprises curated reference sequences of all mitochondrial DNA genes of eukaryotes. phyloBARCODER automatically produces phylogenetic trees with bootstrap probabilities, multiple alignments, and a species list of query sequences. Thus, phyloBARCODER can evaluate species identifications produced by other metabarcoding–identification programs such as CLAIDENT (Tanabe and Toju 2013) and USEARCH (Edgar 2010).
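The point about tied similarity scores can be illustrated with a minimal Python sketch. The three pre-aligned toy sequences below are invented for illustration and are not data from this study, and the simple proportion of differing sites (p-distance) stands in for the model-based distances that phyloBARCODER uses. Each read differs from the reference at one site, so both have the same identity to the reference, yet the reads differ from each other at two sites. That difference becomes visible only when the queries are compared with each other.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    return diffs / len(a)


# Hypothetical toy alignment (10 sites); names and sequences are invented.
alignment = {
    "reference": "ACGTACGTAC",
    "read_1":    "ACGTACGTAA",  # differs from the reference at the last site
    "read_2":    "TCGTACGTAC",  # differs from the reference at the first site
}

# Pairwise distances among all sequences, including read versus read.
distances = {
    (n1, n2): p_distance(alignment[n1], alignment[n2])
    for n1, n2 in combinations(alignment, 2)
}

for pair, d in distances.items():
    print(pair, d)
```

Both reads are equally distant from the reference, so a pairwise search against the reference alone would score them identically, whereas the read-to-read distance separates them.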

The scope of this web tool is (i) species/group identification of focal sequences and (ii) delineation of sequences belonging to species/groups from all decoded anonymous sequences derived from high-throughput input data (hundreds or thousands of operational taxonomic units [OTUs] or amplicon sequence variants [ASVs]). Although phyloBARCODER version 1 analyzes all uploaded anonymous sequences at once, only up to 10 of them can be selected as query sequences. Thus, to make a species list from anonymous sequences comprising many species, e.g. >20 distantly related species, those sequences should first be roughly identified to species, for example using a similarity-based method, before running phyloBARCODER.

Results and Discussion

Interface

Figure 2 shows the top page of phyloBARCODER, version 1. phyloBARCODER involves three types of databases, “Anonymous DB,” “Pre-installed DB,” and “User DB.” Anonymous DB consists of uploaded metabarcoding (anonymous) sequences, including query sequences. Pre-installed DB consists of eukaryotic sequences from all mitochondrial genes and comprises two variants (MIDORI2, GenBank254: Leray et al. 2022): “Species DB” contains sequences representing a single sequence from each species and “Haplotype DB” contains all unique haplotypes belonging to each species. User DB can be created by uploading the user's own database.

Fig. 2. The front page of phyloBARCODER. phyloBARCODER operates in two modes, a) Species Identification and b) Sequence Extraction.

Two modes are available in phyloBARCODER (Fig. 2): (a) Species Identification and (b) Sequence Extraction.

Species Identification

Figure 1 shows the flow of information in this analytical mode. In “Species Identification” mode (Fig. 2a), decoded anonymous sequences (<250,000 sequences) should be uploaded in FASTA format. An example anonymous sequence set is shown in supplementary fig. S1, Supplementary Material online. For BLAST searches (Altschul et al. 1997), uploaded anonymous and user reference sequences are built into Anonymous DB and User DB, respectively, using MAKEBLASTDB, which is included in the BLAST+ package (Camacho et al. 2009). Then, a BLAST search is conducted to retrieve sequences similar to the query sequences (which are among the anonymous sequences) from these databases. BLAST hits are aligned using MAFFT (Katoh and Standley 2013), and poorly aligned sites are identified using TRIMAL (Capella-Gutierrez et al. 2009). To achieve faster analysis than is possible with the maximum likelihood method, phylogenetic analyses employ the neighbor joining method (Saitou and Nei 1987) implemented in APE (Popescu et al. 2012) in R. The most parameter-rich model in the program, the TN93 model (Tamura and Nei 1993), is applied with gamma-distributed rate heterogeneity among sites (Yang 1994).
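The steps above can be sketched as a sequence of command lines. The following Python sketch only assembles argument lists and runs nothing, and it is not the server's actual code. The choice of `blastn`, the default values, all file names, and the function name `build_pipeline_commands` are assumptions made for illustration. The step that gathers BLAST hit sequences into a FASTA file and the neighbor joining step in APE are left out.

```python
from pathlib import Path
from typing import List


def build_pipeline_commands(
    anonymous_fasta: str,
    query_fasta: str,
    workdir: str,
    num_alignments: int = 10,  # assumed default, not taken from phyloBARCODER
    evalue: str = "1e-5",      # assumed default, not taken from phyloBARCODER
) -> List[List[str]]:
    """Assemble (but do not run) command lines for the similarity search,
    alignment, and site-filtering steps."""
    work = Path(workdir)
    db_prefix = str(work / "anonymous_db")
    blast_out = str(work / "blast_hits.txt")
    hits_fasta = str(work / "hits.fasta")       # hypothetical: hit sequences gathered by the caller
    aligned = str(work / "hits.aln.fasta")      # hypothetical: MAFFT output saved by the caller
    trimmed = str(work / "hits.trimmed.fasta")  # hypothetical output name

    return [
        # 1. Build a nucleotide BLAST database from the uploaded sequences.
        ["makeblastdb", "-in", anonymous_fasta, "-dbtype", "nucl", "-out", db_prefix],
        # 2. Search the database with the query sequences.
        ["blastn", "-query", query_fasta, "-db", db_prefix,
         "-evalue", evalue, "-num_alignments", str(num_alignments),
         "-out", blast_out],
        # 3. Align the hits; MAFFT writes the alignment to standard output.
        ["mafft", "--auto", hits_fasta],
        # 4. Evaluate alignment columns with trimAl's gappyout heuristic.
        ["trimal", "-in", aligned, "-out", trimmed, "-gappyout"],
    ]


commands = build_pipeline_commands("anonymous.fasta", "queries.fasta", "work")
for cmd in commands:
    print(" ".join(cmd))
```

Where the tools are installed, each list could be passed to `subprocess.run(cmd, check=True)`. For the MAFFT step, standard output would need to be captured and saved.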

Before starting an analysis, users need to set options for the similarity search and tree estimation (Fig. 2). The option “Number of queries” sets the number of query sequences taken from the uploaded anonymous sequences. The option “-num_alignments” sets the number of BLAST hits retained for each query sequence. The option “-evalue” sets the expect-value threshold for saving hits. The option “Bootstrap analysis” sets the number of bootstrap replicates used to show branch confidence levels in phylogenetic gene trees as measures of confidence for species identification. When the example anonymous sequence file (supplementary fig. S1, Supplementary Material online) is submitted to phyloBARCODER, the result file is created after 4 seconds of computation.
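To show what a bootstrap replicate is, the following minimal Python sketch resamples alignment columns with replacement to create pseudo-alignments of the original length. In a full analysis, a tree would be estimated from each replicate, and the percentage of replicates that recover a clade would be reported as its bootstrap probability. That tree step is omitted here. The function names, the seed, and the toy alignment are assumptions for illustration, not part of phyloBARCODER.

```python
import random
from typing import Dict, List


def bootstrap_alignment(alignment: Dict[str, str], rng: random.Random) -> Dict[str, str]:
    """Resample alignment columns with replacement, keeping rows (sequences) intact."""
    if not alignment:
        raise ValueError("alignment is empty")
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all sequences must have the same aligned length")
    # Draw as many column indices as there are sites; repeats are allowed.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][i] for i in columns) for name in names}


def bootstrap_replicates(alignment: Dict[str, str], n_replicates: int,
                         seed: int = 0) -> List[Dict[str, str]]:
    """Generate n_replicates pseudo-alignments from one seeded random generator."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


# Hypothetical toy alignment; names and sequences are invented.
toy = {
    "query": "ACGTACGT",
    "ref_A": "ACGTACGA",
    "ref_B": "ACCTACGA",
}
replicates = bootstrap_replicates(toy, n_replicates=100)
print(replicates[0])
```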

The summary output (supplementary fig. S2, Supplementary Material online) can be seen by clicking the link after “Status Finished,” just above the “SUBMIT” button (Fig. 2). This summary output (supplementary fig. S2, Supplementary Material online) shows a phylogenetic gene tree as a drawing and provides Newick formats of trees, multiple alignment of BLAST hits, species list showing species name/classification, and setting details. In the species list, phyloBARCODER generates species names using BLAST hits derived from Pre-installed DB placed in the same clade with the query sequences. In the resultant alignment, links to the NCBI website are made for BLAST hits from Pre-installed DB. Using these links, users can evaluate reliabilities of reference sequences with vouchers, locations, and publications. Poorly aligned sites are identified using TRIMAL with the option “-gappyout.” Such sites are marked with “0,” whereas unambiguously aligned sites are identified with “1.” One can download output (in zip format) from the link shown after “Download” located at the top of this summary. After alignments, the species list is shown. This list shows identified species/classification names for each query sequence and identical anonymous sequences based on the estimated gene tree. However, users themselves should assess those identifications with reference to the estimated gene tree and alignment.
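The 0/1 site marks described above can be used to restrict an alignment to unambiguously aligned sites. The following minimal Python sketch assumes that the marks are available as a string with one character per alignment column. How phyloBARCODER stores them is not specified here, so this layout, the function name `apply_site_mask`, and the toy data are assumptions for illustration.

```python
from typing import Dict


def apply_site_mask(alignment: Dict[str, str], mask: str) -> Dict[str, str]:
    """Keep only columns marked "1" (unambiguously aligned); drop columns marked "0"."""
    if set(mask) - {"0", "1"}:
        raise ValueError("mask may contain only '0' and '1'")
    if any(len(seq) != len(mask) for seq in alignment.values()):
        raise ValueError("mask length must equal the alignment length")
    keep = [i for i, flag in enumerate(mask) if flag == "1"]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


# Hypothetical toy alignment with one gappy column (the third).
toy_alignment = {
    "query": "AC-GT",
    "ref":   "ACTGT",
}
toy_mask = "11011"

filtered = apply_site_mask(toy_alignment, toy_mask)
print(filtered)
```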

phyloBARCODER can evaluate whether anonymous sequences originate from primer-targeted regions amplified by polymerase chain reaction (PCR). To produce additional alignments including primer sequences, the option “-range” (Fig. 2a) determines lengths of 5′ upstream and 3′ downstream sequences of BLAST hits (supplementary fig. S3, Supplementary Material online).
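The effect of a flanking range can be sketched as follows. This minimal Python example extends a hit by a fixed number of bases on each side and clips the result to the ends of the reference sequence. It assumes 1-based inclusive coordinates on the plus strand, with the same length upstream and downstream. The function names and the toy sequence are hypothetical, and minus-strand hits are not handled.

```python
from typing import Tuple


def extend_hit(subject_length: int, hit_start: int, hit_end: int,
               flank: int) -> Tuple[int, int]:
    """Extend a 1-based inclusive hit by `flank` bases on each side, clipped to the subject."""
    if not 1 <= hit_start <= hit_end <= subject_length:
        raise ValueError("hit coordinates must lie within the subject (plus strand only)")
    if flank < 0:
        raise ValueError("flank must be non-negative")
    return max(1, hit_start - flank), min(subject_length, hit_end + flank)


def extract_region(subject: str, hit_start: int, hit_end: int, flank: int) -> str:
    """Return the hit plus its upstream and downstream flanks."""
    start, end = extend_hit(len(subject), hit_start, hit_end, flank)
    return subject[start - 1:end]  # convert to 0-based, end-exclusive slicing


# Hypothetical reference sequence of 12 bases; the hit covers positions 4 to 9.
reference = "AAACCCGGGTTT"
print(extract_region(reference, 4, 9, flank=2))
print(extract_region(reference, 4, 9, flank=10))
```

With a flank of 2 the region spans positions 2 to 11. With a flank of 10 it is clipped to the whole reference sequence.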

Sequence Extraction

Accuracy and reliability of species identification of metabarcoding depend on the breadth and quality of databases (Somervuo et al. 2017; Collins et al. 2021). phyloBARCODER can increase the reliability of species identification by accounting for species not present in the reference library. In “Sequence Extraction” mode (Fig. 2b), phyloBARCODER collects reference sequences when a keyword is provided by the user (supplementary fig. S4, Supplementary Material online). For this purpose, Pre-installed DB contains reference sequences with name lines, including names of genes, species, and classifications derived from the MIDORI2 database (Leray et al. 2022). We will include newly published reference databases for various taxa such as plants or even for protists (Creer et al. 2016) in response to user requests.
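A keyword search over name lines can be sketched in a few lines of Python. The example assumes that each FASTA name line carries the gene, species, and classification as plain text and that a record is reported when every keyword occurs in its name line, ignoring case. The name-line layout shown, the function names, and the records are invented for illustration and do not reproduce the actual MIDORI2 format.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (name line without ">", sequence)


def parse_fasta(text: str) -> List[Record]:
    """Parse FASTA text into (name, sequence) pairs."""
    records: List[Record] = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def search_name_lines(records: List[Record], keywords: List[str]) -> List[Record]:
    """Return records whose name line contains every keyword (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    return [(n, s) for n, s in records if all(k in n.lower() for k in wanted)]


# Hypothetical reference records; the "gene|species|family" layout is an assumption.
fasta_text = """\
>12S|Scomber japonicus|Scombridae
ACGTACGT
>12S|Scomber scombrus|Scombridae
ACGTACGA
>CO1|Scomber japonicus|Scombridae
TTGACCGA
>12S|Thunnus albacares|Scombridae
ACGAACGT
"""

hits = search_name_lines(parse_fasta(fasta_text), ["12S", "Scomber"])
for name, _ in hits:
    print(name)
```

Substring matching is deliberately simple. A keyword such as "Scomber" would also match longer names that contain it, so a real search may need to match whole fields.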

In principle, phyloBARCODER strives to provide phylogenetic positions of query sequences that accurately reflect the underlying pattern of speciation. We call this pattern the “species tree.” In practice, based solely on sequences of a single gene, phyloBARCODER produces a “gene tree” that may differ substantially from the species tree. Therefore, in addition to the existence of focal sequences, users should evaluate resultant species identifications by consulting the published scientific literature for trees.

Three Case Studies

We demonstrate the utility of phyloBARCODER using case studies involving three animal groups, including identification not only at the species level (fish) but also above (corals) and below (copepods) the species level.

Case Study 1: Fish

phyloBARCODER can graphically show species identification of fish environmental DNA (eDNA) reads obtained from the open ocean. At least for the CO1 barcode marker, fish are among the best-studied taxonomic groups (Weigand et al. 2019; Collins et al. 2021). However, marine fish are far less well represented than freshwater fish (Miya 2022), partly because taxonomic misassignment increases with geographic scale (Bergsten et al. 2012). Phylogenetic assignment algorithms are robust to poor database representation of taxa (Cribdon et al. 2020).

Yu et al. (2022) detected chub mackerel (Scomber japonicus) or blue mackerel (Scomber australasicus) (fig. 2b in Yu et al. 2022) by MiFish eDNA metabarcoding (Miya et al. 2015) of seawater from the Kuroshio Extension area. At 0 m depth at station B6 (point B6-0), they confirmed the MiFish detection of mackerel by species-specific quantitative PCR and by a direct net sampling survey.

phyloBARCODER users can evaluate reference sequences of Scomber in Pre-installed DB with reference to the species tree obtained from the literature (Fig. 3a). By finding the keyword in Pre-installed DB (supplementary fig. S2b, Supplementary Material online), phyloBARCODER lists search results as output. Example output of four hits for the keywords, “Species” (Database), “12S” (Gene), and “Scomber” (Classification), is shown in supplementary fig. S4, Supplementary Material online. This result indicates that the database contains 12S rRNA gene sequences for all four Scomber species (Nelson et al. 2016). Thus, misidentification is unlikely due to the absence of reference sequences of Scomber. One can download a file from the link shown after “Download” located at the top of this output.

Fig. 3. The species tree a) and results of a phyloBARCODER analysis b to e) for fish data. a) A species tree of the genus Scomber (Cheng et al. 2012). b) A 12S rRNA gene tree. c) The name line of each BLAST hit derived from Pre-installed DB. d) Alignment of BLAST hits. e) Species list. For detail, see supplementary figs. S2 and S5, Supplementary Material online.

Using an estimated gene tree, phyloBARCODER detects reads belonging to the same species from samples collected at multiple locations. To illustrate this capacity of phyloBARCODER, a species identification analysis (Fig. 2a) is conducted using 20 eDNA sequences (supplementary fig. S1, Supplementary Material online) obtained from points B6-0 and C0-300 (Yu et al. 2022), as anonymous sequences. An anonymous sequence, OTU_006, was used as a query because this sequence was already identified as S. japonicus/Scomber australasicus by Yu et al. (2022). To increase the precision of species identification, a custom reference sequence database (https://github.com/rogotoh/PMiFish, 2023 October 16, 15,254 sequences) was also uploaded as User DB.

The estimated gene tree (Fig. 3b and supplementary fig. S5, Supplementary Material online) placed OTU_006 in a clade consisting of reference sequences of S. japonicus, S. australasicus, and Scomber colias with a bootstrap probability of 100%. When available, geographic considerations can be applied. Here, an Atlantic species, S. colias (Martins et al. 2013), cannot be a candidate for a Pacific sample. The alignment of MiFish regions (Fig. 3d) indicates that S. japonicus and S. australasicus cannot be distinguished by the MiFish primers (Yu et al. 2022). Thus, the user can conclude with the highest statistical support that OTU_006 is S. japonicus or S. australasicus. In addition, phyloBARCODER identifies reads belonging to the same species. In the estimated gene tree (Fig. 3b), not only OTU_023 and OTU_031 from the same sample (B6-0) but also OTU_005 from a different sample (C0-300) was placed in the S. japonicus/S. australasicus clade with OTU_006. Alternative, i.e. similarity-based, methods collect reads belonging to the same group only after identifying each read individually, without direct sequence comparisons among queries. Even with statistical values for all phylogenetic positions produced by placement methods, users must evaluate sequences closely related to queries without their sequence alignments (Czech et al. 2022). Analyses without comparisons among query sequences decrease the reliability and repeatability of species delineation. Moreover, phyloBARCODER confirmed the boundary of the clade by placing sequences of a sister species, Scomber scombrus, as its sister group (Fig. 3b).
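The reasoning in this paragraph, finding the supported clade that contains the query and then reading off the anonymous and reference sequences within it, can be sketched in Python. The tree below is a simplified, rooted toy topology loosely modeled on the description above, not the actual tree in Fig. 3b. The `Node` class, the function `supported_clade`, the support threshold of 70, the reference labels, and the `OTU_` prefix rule are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: Optional[str] = None       # set for leaves
    support: Optional[float] = None  # bootstrap probability (%) for internal nodes
    children: List["Node"] = field(default_factory=list)

    def leaves(self) -> List[str]:
        """Names of all leaves below this node, in left-to-right order."""
        if not self.children:
            return [self.name]
        return [leaf for child in self.children for leaf in child.leaves()]


def supported_clade(root: Node, query: str, min_support: float) -> Optional[Node]:
    """Smallest clade containing `query` with support >= min_support, or None."""
    if query not in root.leaves():
        raise ValueError(f"{query} is not in the tree")
    best, node = None, root
    while node.children:
        if node.support is not None and node.support >= min_support:
            best = node  # deeper qualifying clades overwrite shallower ones
        # Descend into the child that contains the query.
        node = next(c for c in node.children if query in c.leaves())
    return best


# Toy tree: one clade with 100% support, and S. scombrus as its sister group.
clade = Node(support=100, children=[
    Node("OTU_006"), Node("OTU_023"), Node("OTU_031"), Node("OTU_005"),
    Node("S_japonicus_ref"), Node("S_australasicus_ref"), Node("S_colias_ref"),
])
root = Node(children=[clade, Node("S_scombrus_ref")])

members = supported_clade(root, "OTU_006", min_support=70).leaves()
anonymous = [m for m in members if m.startswith("OTU_")]
references = [m for m in members if not m.startswith("OTU_")]
print("anonymous:", anonymous)
print("references:", references)
```

The reference names in the clade give the candidate species, and the anonymous names give the reads delineated with the query. Geographic considerations, such as excluding an Atlantic species, would still be applied by the user.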

These results indicate that phyloBARCODER identifies OTU_006 as S. japonicus/S. australasicus with OTU_023, OTU_031 (B6-0m), and OTU_005 (C0-300m), with 100% statistical support. phyloBARCODER constructs species lists (Fig. 3e) of queries describing taxonomic assignment at or above the species level. Users, however, should visually evaluate species identifications based on the species tree, estimated gene tree, and alignments.

phyloBARCODER also succeeded in detecting S. japonicus/S. australasicus from sequences generated by the ASV method (supplementary fig. S6, Supplementary Material online). Recently, the ASV method has been used instead of the clustering (OTU) method (Miya 2022), because the latter lacks reusability, reproducibility, and comprehensiveness (Callahan et al. 2016). phyloBARCODER can also be applied to show multiple ASV sequences identical to a representative sequence generated by the OTU method.

Case Study 2: Corals

phyloBARCODER can detect eDNA sequences of coral genera. Shinzato et al. (2021) developed primers to amplify mitochondrial 12S rRNA gene sequences of diverse scleractinian corals and succeeded in detecting eDNA reads associated with 23 genera. Using 797 ZOTU (zero-radius OTU) sequences from their sample site 1D along the Okinawa seashore (fig. 5 in Shinzato et al. 2021), we verified that phyloBARCODER can detect reads from members of the genus Pocillopora (Fig. 4 and supplementary fig. S7, Supplementary Material online). When a read associated with Pocillopora was used as a query sequence, phyloBARCODER detected 31 Pocillopora sequences among the uploaded 797 sequences. Although Shinzato et al. (2021) did not show the ZOTU boundary of Pocillopora, phyloBARCODER clearly showed the boundary by placing reference sequences from sister genera, Stylophora and Seriatopora, as the closest relatives of the Pocillopora clade. Also, as shown in short internal branches among Pocillopora species, phyloBARCODER showed the difficulty of identification among Pocillopora species due to highly conserved mitochondrial genome sequences among scleractinian species (Shearer et al. 2002).

Fig. 4. A result from phyloBARCODER analysis a) and the species tree b) for coral data. a) A 12S rRNA gene tree. b) A species tree of the family Pocilloporidae and related taxa (Bhattacharya et al. 2016). For detail, see supplementary fig. S7, Supplementary Material online. Anonymous sequences used in the coral analysis are available from the top page of phyloBARCODER version 1.

Fig. 5. A phyloBARCODER analysis a) and the species tree b) for copepod data. a) A CO1 gene tree. “BlueHit” at the end of OTU names indicates a BLAST hit against second or later query sequences. b) A species tree for M. lucens and related taxa (Tessler et al. 2018). For detail, see supplementary fig. S8, Supplementary Material online. Anonymous sequences used in the copepod analysis are available from the top page of phyloBARCODER version 1.

Case Study 3: Copepods

phyloBARCODER can detect local populations from short DNA sequences derived from bulk extracted mixtures. Based on mitochondrial CO1 gene sequences (326 bp), Hirai et al. (2022) identified four local populations of Metridia lucens, North Pacific, North Atlantic, South hemisphere 1, and South hemisphere 2. Here, we verified that phyloBARCODER can detect these four M. lucens populations among 191 anonymous sequences used in the analysis of Hirai et al. (2022) (Fig. 5 and supplementary fig. S8, Supplementary Material online). For databases of species identification, obtaining and curating sequence data directly from a restricted list or regional study species is desirable (Collins et al. 2021). To test detection of local populations, 86 reference sequences of North Pacific M. lucens were retrieved from the MetaZooGene Atlas & Database (https://metazoogene.org). When employing anonymous sequences from four populations (Hirai et al. 2022) as queries, phyloBARCODER detected anonymous sequences from the MetaZooGene database as the North Pacific population and delineated boundaries of the four populations uncovered by Hirai et al. (2022).

When reference sequences corresponding to query species are not present in databases, phyloBARCODER can help species identification by producing BLAST identities. For copepod species identifications, Blanco-Bercial et al. (2014) explored the impact of missing data on accuracy and reliability. BLAST identities against the first queries were 98% to 100% for reads included in the M. lucens clade and 87% for a read of Metridia pacifica (Fig. 5). Even if M. lucens reference sequences were not in databases used in phyloBARCODER analyses, the BLAST identity for the M. pacifica read (87%) would give the user guidance for species identification with reference to those values between species (85.6% to 87.4%) (Blanco-Bercial et al. 2014).

Comparison with Other Web Tools

To illustrate the novelty of phyloBARCODER, its species identification for fish sample B6-0 (Yu et al. 2022) was compared with that produced by a pioneering web tool in this field, the MitoFish pipeline (Zhu et al. 2023), with special reference to the accuracy of detecting S. japonicus/S. australasicus (Fig. 3). As far as we know, apart from phyloBARCODER, the MitoFish pipeline is the only web tool that can identify source species of eDNA sequences of macroorganisms in response to user requests. The MitoFish pipeline can make a species list of all species detected from MiFish eDNA sequences. However, users cannot evaluate species identifications visually. When analyzing raw data of sample B6-0 (Yu et al. 2022), the MitoFish pipeline detected sequences identified as S. colias with low confidence, without showing the possibility of S. japonicus/S. australasicus (supplementary fig. S9, Supplementary Material online). Because it relies on a similarity-based method without multiple comparisons of anonymous and reference sequences, the MitoFish pipeline cannot show candidate species.

Species identifications were also compared with those from a broadly used metabarcoding approach, the Bayesian classifier (Wang et al. 2007) implemented in MOTHUR (Schloss et al. 2009), for fish, coral, and copepod data. When anonymous sequence data from these three groups were analyzed with the Bayesian classifier using the same databases (supplementary figs. S5, S7, and S8, Supplementary Material online), different results were obtained for fish and copepod data. For fish data (supplementary fig. S5, Supplementary Material online), the Bayesian classifier identified four anonymous sequences as S. colias with support values of 96 to 100, but did not show the other candidates, S. japonicus and S. australasicus. Because its analyses lack multiple comparisons, the Bayesian classifier cannot show candidate species. For the copepod data (supplementary fig. S8, Supplementary Material online), the Bayesian classifier identified the North Pacific population of M. lucens, but did not show other local populations. Because it does not delineate groups by comparing all query sequences, the Bayesian classifier cannot identify other local populations in copepod analyses. In addition to providing accurate, visualized results, phyloBARCODER is faster, e.g. 40 seconds versus 17 minutes in MOTHUR for copepod data, which enables users to reanalyze data easily under different conditions.

Limitations

phyloBARCODER currently has several limitations: (i) Users should evaluate the species list produced automatically by phyloBARCODER with reference to the outputs and the species tree. (ii) Before phyloBARCODER analyses, at least minimal preprocessing of HTS data should be done with other programs such as USEARCH (Edgar 2010) and QIIME2 (Bolyen et al. 2019). (iii) A maximum of 10 query sequences is allowed, although other anonymous sequences that closely match the queries are included in the multiple sequence alignment. (iv) To keep species identification quick and simple, abundance annotation is not handled automatically. (v) Further work is required to evaluate the use of bootstrap probabilities as measures of confidence for species identification.

Materials and Methods

The phyloBARCODER server runs on the Linux operating system. An Apache HTTP Server provides web services. Python scripts process all data and requests from users. All these resources have been extensively used and are well supported.

Supplementary Material

Supplementary material is available at Molecular Biology and Evolution online.

Acknowledgments

This work was conducted in part under the program of open science promotion of AORI and the FSI project, Ocean DNA: constructing “bio-map” of marine organisms using DNA sequence analyses from the University of Tokyo. We thank Steven D. Aird for the English language editing.

Funding

This work was supported by the Japan Society for the Promotion of Science (JSPS) Grants-in-Aid for Scientific Research (A) (21H04922).

Data Availability

All data and code used in this paper are available in a GitHub repository to ensure full reproducibility. For commercial purposes, output from phyloBARCODER may be used, but its script may not be modified or uploaded on a web server without permission.

References

Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997:25(17):3389–3402. https://doi.org/10.1093/nar/25.17.3389.

Barbera P, Kozlov AM, Czech L, Morel B, Darriba D, Flouri T, Stamatakis A. EPA-ng: massively parallel evolutionary placement of genetic sequences. Syst Biol. 2019:68(2):365–369. https://doi.org/10.1093/sysbio/syy054.

Bayer PE, Bennett A, Nester G, Corrigan S, Raes EJ, McInnes AS, Cooper M, Ayad ME, McVey P, Kardailsky A, et al. A comprehensive evaluation of taxonomic classifiers in marine vertebrate eDNA studies. bioRxiv 580601. https://doi.org/10.1101/2024.02.15.580601, 17 February 2024, preprint: not peer reviewed.

Bergsten J, Bilton DT, Fujisawa T, Elliott M, Monaghan MT, Balke M, Hendrich L, Geijer J, Herrmann J, Foster GN, et al. The effect of geographical scale of sampling on DNA barcoding. Syst Biol. 2012:61(5):851–869. https://doi.org/10.1093/sysbio/sys037.

Bhattacharya D, Agrawal S, Aranda M, Baumgarten S, Belcaid M, Drake JL, Erwin D, Foret S, Gates RD, Gruber DF, et al. Comparative genomics explains the evolutionary success of reef-forming corals. eLife. 2016:5:e13288. https://doi.org/10.7554/eLife.13288.

Blanco-Bercial L, Cornils A, Copley N, Bucklin A. DNA barcoding of marine copepods: assessment of analytical approaches to species identification. PLoS Curr. 2014:6:1–22. https://doi.org/10.1371/currents.tol.cdf8b74881f87e3b01d56b43791626d2.

Bolyen E, Rideout JR, Dillon MR, Bokulich NA, Abnet CC, Al-Ghalith GA, Alexander H, Alm EJ, Arumugam M, Asnicar F, et al. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat Biotechnol. 2019:37(8):852–857. https://doi.org/10.1038/s41587-019-0209-9.

Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJ, Holmes SP. DADA2: high-resolution sample inference from Illumina amplicon data. Nat Methods. 2016:13(7):581–583. https://doi.org/10.1038/nmeth.3869.

Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. BLAST+: architecture and applications. BMC Bioinformatics. 2009:10(1):421. https://doi.org/10.1186/1471-2105-10-421.

Capella-Gutierrez S, Silla-Martinez JM, Gabaldon T. Trimal: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009:25(15):1972–1973. https://doi.org/10.1093/bioinformatics/btp348.

Cheng J, Gao T, Miao Z, Yanagimoto T. Molecular phylogeny and evolution of Scomber (Teleostei: Scombridae) based on mitochondrial and nuclear DNA sequences. Chin J Oceanol Limnol. 2012:29(2):297–310. https://doi.org/10.1007/s00343-011-0033-7.

Collins RA, Trauzzi G, Maltby KM, Gibson TI, Ratcliffe FC, Hallam J, Rainbird S, Maclaine J, Henderson PA, Sims DW, et al. Meta-Fish-Lib: a generalised, dynamic DNA reference library pipeline for metabarcoding of fishes. J Fish Biol. 2021:99(4):1446–1454. https://doi.org/10.1111/jfb.14852.

Creer S, Deiner K, Frey S, Porazinska D, Taberlet P, Thomas WK, Potter C, Bik HM. The ecologist's field guide to sequence-based identification of biodiversity. Methods Ecol Evol. 2016:7(9):1008–1018. https://doi.org/10.1111/2041-210X.12574.

Cribdon B, Ware R, Smith O, Gaffney V, Allaby RG. PIA: more accurate taxonomic assignment of metagenomic data demonstrated on sedaDNA from the North Sea. Front Ecol Evol. 2020:8:8. https://doi.org/10.3389/fevo.2020.00084.

Czech L, Stamatakis A, Dunthorn M, Barbera P. Metagenomic analysis using phylogenetic placement—a review of the first decade. Front Bioinform. 2022:2:871393. https://doi.org/10.3389/fbinf.2022.871393.

Deiner K, Bik HM, Machler E, Seymour M, Lacoursiere-Roussel A, Altermatt F, Creer S, Bista I, Lodge DM, de Vere N, et al. Environmental DNA metabarcoding: transforming how we survey animal and plant communities. Mol Ecol. 2017:26(21):5872–5895. https://doi.org/10.1111/mec.14350.

Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010:26(19):2460–2461. https://doi.org/10.1093/bioinformatics/btq461.

Federhen S. Comment on ‘Birdstrikes and barcoding: can DNA methods help make the airways safer?’. Mol Ecol Res. 2011:11(6):937–938; discussion 939–942. https://doi.org/10.1111/j.1755-0998.2011.03054.x.

Goodwin S, McPherson JD, McCombie WR. Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet. 2016:17(6):333–351. https://doi.org/10.1038/nrg.2016.49.

Hirai J, Chen F, Itoh H, Tadokoro K, Lemay MA, Hunt BPV, Tsuda A. Molecular and morphological analyses to improve taxonomic classification of Metridia lucens/pacifica in the North Pacific. J Plankton Res. 2022:44(3):454–463. https://doi.org/10.1093/plankt/fbac020.

Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013:30(4):772–780. https://doi.org/10.1093/molbev/mst010.

Leray M, Knowlton N, Machida RJ. MIDORI2: a collection of quality controlled, preformatted, and regularly updated reference databases for taxonomic assignment of eukaryotic mitochondrial sequences. Environmental DNA. 2022:4(4):894–907. https://doi.org/10.1002/edn3.303.

Leray M, Yang JY, Meyer CP, Mills SC, Agudelo N, Ranwez V, Boehm JT, Machida RJ. A new versatile primer set targeting a short fragment of the mitochondrial COI region for metabarcoding metazoan diversity: application for characterizing coral reef fish gut contents. Front Zool. 2013:10(1):34. https://doi.org/10.1186/1742-9994-10-34.

Martins MM, Skagen D, Marques V, Zwolinski J, Silva A. Changes in the abundance and spatial distribution of the Atlantic chub mackerel (Scomber colias) in the pelagic ecosystem and fisheries off Portugal. Sci Mar. 2013:77(4):551–563. https://doi.org/10.3989/scimar.03861.07B.

Matsen FA, Kodner RB, Armbrust EV. Pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics. 2010:11(1):538. https://doi.org/10.1186/1471-2105-11-538.

Miya M. Environmental DNA metabarcoding: a novel method for biodiversity monitoring of marine fish communities. Ann Rev Mar Sci. 2022:14(1):161–185. https://doi.org/10.1146/annurev-marine-041421-082251.

Miya M, Sato Y, Fukunaga T, Sado T, Poulsen JY, Sato K, Minamoto T, Yamamoto S, Yamanaka H, Araki H, et al. MiFish, a set of universal PCR primers for metabarcoding environmental DNA from fishes: detection of more than 230 subtropical marine species. R Soc Open Sci. 2015:2(7):150088. https://doi.org/10.1098/rsos.150088.

Munch K, Boomsma W, Huelsenbeck JP, Willerslev E, Nielsen R. Statistical assignment of DNA sequences using Bayesian phylogenetics. Syst Biol. 2008:57(5):750–757. https://doi.org/10.1080/10635150802422316.

Nelson JS, Grande T, Wilson MVH. Fishes of the world. Hoboken (New Jersey): John Wiley & Sons; 2016.

Popescu AA, Huber KT, Paradis E. Ape 3.0: new tools for distance-based phylogenetics and evolutionary analysis in R. Bioinformatics. 2012:28(11):1536–1537. https://doi.org/10.1093/bioinformatics/bts184.

Ruppert KM, Kline RJ, Rahman MS. Past, present, and future perspectives of environmental DNA (eDNA) metabarcoding: a systematic review in methods, monitoring, and applications of global eDNA. Glob Ecol Conserv. 2019:17:e00547. https://doi.org/10.1016/j.gecco.2019.e00547.

Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987:4(4):406–425. https://doi.org/10.1093/oxfordjournals.molbev.a040454.

Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, et al. Introducing MOTHUR: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009:75(23):7537–7541. https://doi.org/10.1128/AEM.01541-09.

Shearer TL, Van Oppen MJ, Romano SL, Worheide G. Slow mitochondrial DNA sequence evolution in the Anthozoa (Cnidaria). Mol Ecol. 2002:11(12):2475–2487. https://doi.org/10.1046/j.1365-294X.2002.01652.x.

Shinzato C, Narisoko H, Nishitsuji K, Nagata T, Satoh N, Inoue J. Novel mitochondrial DNA markers for scleractinian corals and generic-level environmental DNA metabarcoding. Front Mar Sci. 2021:8:758207. https://doi.org/10.3389/fmars.2021.758207.

Smith SA, Pease JB. Heterogeneous molecular processes among the causes of how sequence similarity scores can fail to recapitulate phylogeny. Brief Bioinform. 2017:18(3):451–457. https://doi.org/10.1093/bib/bbw034.

Somervuo P, Yu DW, Xu CCY, Ji Y, Hultman J, Wirta H, Ovaskainen O. Quantifying uncertainty of taxonomic placement in DNA barcoding and metabarcoding. Methods Ecol Evol. 2017:8(4):398–407. https://doi.org/10.1111/2041-210X.12721.

Stoeckle MY. Improved environmental DNA reference library detects overlooked marine fishes in New Jersey, United States. Front Mar Sci. 2020:7:226. https://doi.org/10.3389/fmars.2020.00226.

Tamura K, Nei M. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol. 1993:10(3):512–526. https://doi.org/10.1093/oxfordjournals.molbev.a040023.

Tanabe AS, Toju H. Two new computational methods for universal DNA barcoding: a benchmark using barcode sequences of bacteria, archaea, animals, fungi, and land plants. PLoS One. 2013:8(10):e76910. https://doi.org/10.1371/journal.pone.0076910.

Tessler M, Gaffney JP, Crawford JM, Trautman E, Gujarati NA, Alatalo P, Pieribone VA, Gruber DF. Luciferin production and luciferase transcription in the bioluminescent copepod Metridia lucens. PeerJ. 2018:6:e5506. https://doi.org/10.7717/peerj.5506.

Wang Q, Garrity GM, Tiedje JM, Cole JR. Naïve Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol. 2007:73(16):5261–5267. https://doi.org/10.1128/AEM.00062-07.

Weigand H, Beermann AJ, Ciampor F, Costa FO, Csabai Z, Duarte S, Geiger MF, Grabowski M, Rimet F, Rulik B, et al. DNA barcode reference libraries for the monitoring of aquatic biota in Europe: gap-analysis and recommendations for future work. Sci Total Environ. 2019:678:499–524. https://doi.org/10.1016/j.scitotenv.2019.04.247.

Yang Z. Computational molecular evolution. Oxford: Oxford University Press; 2006.

Yang Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate method. J Mol Evol. 1994:39(3):306–314. https://doi.org/10.1007/BF00160154.

Yu Z, Ito SI, Wong MK, Yoshizawa S, Inoue J, Itoh S, Yukami R, Ishikawa K, Guo C, Ijichi M, et al. Comparison of species-specific qPCR and metabarcoding methods to detect small pelagic fish distribution from open ocean environmental DNA. PLoS One. 2022:17(9):e0273670. https://doi.org/10.1371/journal.pone.0273670.

Zhu T, Sato Y, Sado T, Miya M, Iwasaki W. MitoFish, MitoAnnotator, and MiFish pipeline: updates in 10 years. Mol Biol Evol. 2023:40(3):msad035. https://doi.org/10.1093/molbev/msad035.

This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Associate Editor: Naruya Saitou

Supplementary data