Marko Mutanen, Sami M. Kivelä, Rutger A. Vos, Camiel Doorenweerd, Sujeevan Ratnasingham, Axel Hausmann, Peter Huemer, Vlad Dincă, Erik J. van Nieukerken, Carlos Lopez-Vaamonde, Roger Vila, Leif Aarvik, Thibaud Decaëns, Konstantin A. Efetov, Paul D. N. Hebert, Arild Johnsen, Ole Karsholt, Mikko Pentinsaari, Rodolphe Rougerie, Andreas Segerer, Gerhard Tarmann, Reza Zahiri, H. Charles J. Godfray, Species-Level Para- and Polyphyly in DNA Barcode Gene Trees: Strong Operational Bias in European Lepidoptera, Systematic Biology, Volume 65, Issue 6, November 2016, Pages 1024–1040, https://doi.org/10.1093/sysbio/syw044
The proliferation of DNA data is revolutionizing all fields of systematic research. DNA barcode sequences, now available for millions of specimens and several hundred thousand species, are increasingly used in algorithmic species delimitations. This is complicated by occasional incongruences between species and gene genealogies, as indicated by situations where conspecific individuals do not form a monophyletic cluster in a gene tree. In two previous reviews, non-monophyly has been reported as being common in mitochondrial DNA gene trees. We developed a novel web service “Monophylizer” to detect non-monophyly in phylogenetic trees and used it to ascertain the incidence of species non-monophyly in COI (a.k.a. cox1) barcode sequence data from 4977 species and 41,583 specimens of European Lepidoptera, the largest data set of DNA barcodes analyzed in this regard. Particular attention was paid to accurate species identification to ensure data integrity. We investigated the effects of tree-building method, sampling effort, and other methodological issues, all of which can influence estimates of non-monophyly. We found a 12% incidence of non-monophyly, a value significantly lower than that observed in previous studies. Neighbor joining (NJ) and maximum likelihood (ML) methods yielded almost equal numbers of non-monophyletic species, but 24.1% of these cases of non-monophyly were only found by one of these methods. Non-monophyletic species tend to show either low genetic distances to their nearest neighbors or exceptionally high levels of intraspecific variability. Cases of polyphyly in COI trees arising as a result of deep intraspecific divergence are negligible, as the detected cases reflected misidentifications or methodological errors. Taking into consideration variation in sampling effort, we estimate that the true incidence of non-monophyly is ∼23%, but with operational factors still being included.
Within the operational factors, we separately assessed the frequency of taxonomic limitations (presence of overlooked cryptic and oversplit species) and identification uncertainties. We observed that operational factors are potentially present in more than half (58.6%) of the detected cases of non-monophyly. Furthermore, we observed that in about 20% of non-monophyletic species and entangled species, the lineages involved are either allopatric or parapatric—conditions where species delimitation is inherently subjective and particularly dependent on the species concept that has been adopted. These observations suggest that species-level non-monophyly in COI gene trees is less common than previously supposed, with many cases reflecting misidentifications, the subjectivity of species delimitation or other operational factors.
There has been endless debate over the definition of a species and whether the concept has any biological reality (Wheeler and Meier 2000; De Queiroz 2007; Mallet 2007; Hausdorf 2011). While there is now a broad consensus that more inclusive taxonomic categories are defined solely following cladistic principles (Hennig 1966) (i.e., by monophyly criterion and hierarchical order) and are largely arbitrary, species are generally viewed as natural entities with observable distances between them, resulting from the differentiation of lineages through speciation (Wright 1940; Mayr 1942; Coyne and Orr 2004). However, species boundaries are often much harder to discern when individuals are sampled across geographical scales or through time (Baselga et al. 2013), and the complexity in gathering direct evidence on the potential for interbreeding creates challenges for rigorous testing of species boundaries. Nonetheless, the species rank has maintained its status as a central concept in virtually all fields of biology, one with particular societal relevance because of its centrality in conservation, legislation, or food trade (e.g., Avise 1989; Isaac et al. 2004). Although there are species concepts that do not perceive species necessarily as monophyletic entities (for reviews see De Queiroz 2007; Mallet 2007), monophyly is a central criterion in most of them (e.g., the phylogenetic species concept) (Cracraft 1989).
A phylogenetic tree depicting relationships among species is known as a species tree. Because evolution is a unique and nonrecurrent process, there is only a single true topology that reflects evolutionary relationships among species. Systematists have traditionally inferred tree topologies based on morphological, ecological, or other life history characters, but now this is most often done based on DNA sequences. A much-discussed issue concerning the use of DNA sequences is that the evolution of a gene is not necessarily congruent with that of a species (Pamilo and Nei 1988; Maddison 1997). Because nuclear genes of sexually reproducing organisms are subject to recombination, their coalescence histories differ. With rare exceptions, mitochondrial genes are inherited uniparentally (usually maternally), show very limited recombination, have population genetics governed by an effective population size (Ne) that is one quarter of that for the nuclear genome, and are particularly susceptible to selective sweeps (Hurst and Jiggins 2005; Rubinoff et al. 2006).

Schematic overview of the different potential reasons for species to be classified as para- or polyphyletic, or false-positively monophyletic in a gene tree.
Rapidly accumulating DNA barcode libraries, such as the Barcode of Life Data System (BOLD; Ratnasingham and Hebert 2007), are boosting the number of species being described as new to science (Olave et al. 2014). Typically, DNA barcode gene trees are used as an important source of information in species delimitation. Several algorithmic species delimitation tools have been developed, some of them specifically for use with DNA barcodes (Pons et al. 2006; Puillandre et al. 2012; Fujisawa and Barraclough 2013; Ratnasingham and Hebert 2013; Zhang et al. 2013; Jones et al. 2015). However, these methods can misdiagnose species boundaries when species are non-monophyletic in gene trees, and hence it is important to know the frequency of paraphyly and polyphyly. A benchmark review concluded that on average 23% of species express non-monophyly in mitochondrial DNA markers (Funk and Omland 2003). In 146 studies of arthropods, the average percentage of non-monophyletic species was 26.5% and in other invertebrates as high as 38.6% (Funk and Omland 2003). Recently, Ross (2014) estimated the incidence of paraphyly (including polyphyly) with an extensive, but not completely validated, data set of the animal COI barcodes accessed from BOLD, and concluded that 19% of species were non-monophyletic. For Lepidoptera, he reported 17% non-monophyly.
Here, we studied the incidence of para- and polyphyly in trees built using the standard COI barcode gene. The data set we analyzed included 41,583 specimens of 4977 species of European Lepidoptera, 50.6% of all known species in this order from this continent. We developed a novel tool (named Monophylizer) that permits the automated identification of paraphyletic and polyphyletic lineages based on tree topology. Prior to carrying out the non-monophyly assessment, we paid particular attention to potential sources of operational bias, in particular by cross-checking identifications (based on current morphology-based taxonomy of all studied taxa) as carefully as possible. We then continued by assessing various other operational factors, particularly taxonomic uncertainties and identification difficulties. Our study is the first to attempt to estimate the significance of these effects. We also applied statistical modeling approaches not previously used in this context to test the effect of sampling effort.
MATERIAL AND METHODS
Target Group
The insect order Lepidoptera constitutes one of the most diverse animal groups; it contains approximately 157,500 described extant species (van Nieukerken et al. 2011). Lepidoptera is likely the best-studied insect order, although with strong geographic and taxonomic biases. Families with large species are generally better known than those dominated by small species, and species-rich tropical faunas are generally poorly investigated compared to those in temperate regions (Lees et al. 2013). At a global scale, Lepidoptera is the best represented order in the International Barcode of Life Project (iBOL) with approximately 1 million sequences on BOLD associated with 100,000 species. By way of context, 9846 lepidopteran species have been reported from Europe up to 2011 (Karsholt and van Nieukerken 2013).
Material Collection and Delimitation of Sampling Area
The data used in this study were largely collected within the framework of the iBOL as part of national or regional initiatives such as the Fauna Bavarica project (http://www.faunabavarica.de, last accessed June 2, 2016), the Finnish Barcode of Life project (http://www.finbol.org/, last accessed June 2, 2016), the “Lepidoptera of the Alps” campaign, the Norwegian Barcode of Life (http://www.norbol.org/, last accessed June 2, 2016) and the Nature of The Netherlands project, or as individual research projects.
Our analysis focused on specimens from Europe as defined by current political boundaries but excluding European Turkey, Cyprus, and most of the Macaronesian islands. Most tissue samples analyzed were from identified pinned specimens collected in the past 15 years, because older material was less likely to generate sequence data. Altogether 41,583 specimens, representing 4977 species, yielded a DNA sequence of over 500 base pairs (bp) in length. Specimens with shorter sequences were excluded from the analyses. From one to 146 specimens were sampled per species with an average of 8.4. Among this total, 697 species (14%) were represented by a single DNA barcode: the so-called singletons. Although singletons cannot show non-monophyly themselves, they were included in the analyses because they may render other species para- or polyphyletic by becoming “entangled” with them. Almost all sequenced species were included, several of them from species or species groups already known to be subject to species delimitation and identification difficulties. We excluded some specimens that could not be associated with any described species, but included many species likely or possibly encompassing cryptic species. In one case (the Stigmella salicis group, Nepticulidae), we included undescribed species and applied interim names, because in this case the presence of several species has been convincingly demonstrated (van Nieukerken et al. 2012). The only species group not wholly included is the Dahlica/Siederia complex of species (Psychidae), because the current taxonomy of this group is known to be largely inaccurate, parthenogenesis is frequent, and morphological characters for many species are misleading (Elzinga et al. 2014), rendering identification of species currently impossible by anything other than molecular data. 
Sampling was geographically somewhat biased, with 89% of the specimens collected in only 10 of the 51 European countries: Finland (9619), Germany (7922), Italy (4829), United Kingdom (4005), Austria (3184), France (2774), Spain (1440), Romania (1384), the Netherlands (1167), and Norway (868). Full taxonomic and collection information of specimens is available through individual specimen pages within the public data set DS-MARKALL (dx.doi.org/10.5883/DS-MARKALL) in the BOLD (www.boldsystems.org) barcode data repository. Collection localities are also available in .kml format (viewable with Google Earth) on Dryad at http://dx.doi.org/10.5061/dryad.k3mr1.
Sequencing
A 500–658 bp long amplicon of the 5’ terminus of the mitochondrial COI gene (the standard DNA barcode region for animals) was sequenced for all specimens. A single codon deletion occurs in three species of Scardiinae (Tineidae), but otherwise the target gene region does not show length variation in European Lepidoptera. Sequencing was predominantly conducted at the Canadian Centre for DNA Barcoding (CCDB), but also at Naturalis Biodiversity Center (the Netherlands) and laboratories of the authors’ research organizations. The CCDB’s sequencing protocol is described in detail in deWaard et al. (2008). The primer pair LepF1 and LepR1 (Brower and Egan 1997) was primarily used to amplify the barcode region, but, in cases of failure, other primer sets were also attempted. Full primer details, laboratory reports, trace files, sequences, and GenBank accession numbers can be retrieved from the sequence page of each record in BOLD and can be downloaded at dx.doi.org/10.5883/DS-MARKALL.
Verifying Identifications and Taxonomic Names
Although specimens were generally identified to species level by taxonomic experts based on morphology prior to sequencing, the resulting DNA barcodes provided an efficient way to cross-check the identifications. Since misidentifications easily produce false cases of non-monophyly, we carefully examined all anomalous cases. This necessitated at least the superficial re-examination of voucher specimens, but on many occasions also the dissection of their genitalia, whose morphology often carries important diagnostic features. This process revealed many misidentifications, which were corrected in BOLD prior to final analyses of species-level monophyly. Clerical errors and the application of different nomenclatures can similarly lead to false observations of non-monophyly, especially when the detection of non-monophyly is automated, as it was here. We therefore harmonized names throughout the complete data set following the nomenclature of Fauna Europaea (http://www.fauna-eu.org/, last accessed June 2, 2016). This revealed hundreds of cases where two or more names had been applied to a single species, but also cases of homonymy (the application of a single name to several species). Despite careful cross-checking of identifications, it is likely that some misidentifications remain in the data because of identification problems or taxonomic uncertainties in several species groups. We attempt to estimate the effect of this below.
Detection of Contamination, NUMTs and Chimeric Sequences
Prior to analysis, several validation steps were performed to increase the reliability of the results. First, Sanger sequencing trace electropherograms were reviewed for quality, excising sequences associated with a mean trace quality “phred” score below 30 and where more than 10% of the bases showed a quality score below 20 after trimming of the primer sequences. Sequences that met these quality criteria were reviewed to excise those that are likely pseudogenes (NUMTs) or chimeric in origin. Pseudogenes were detected by comparing each sequence to a Hidden Markov Model (Eddy 1998) of the COI protein (Finn et al. 2010). Low-scoring sequences contained either unusual amino acid substitutions, stop codons or reading frame shifts, all indicators of pseudogenization. Tests for chimeras involved dividing sequences into 100 bp fragments with each fragment independently searched against the barcode reference library. Resulting hits were compared to ensure that all fragments match similar reference records in the library. Sequences failing this test were manually evaluated and discarded if a chimeric origin was confirmed. Finally, sequences were compared against a reference library of common laboratory contaminants, discarding those that matched.
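The two screening steps described above can be sketched as follows. This is a minimal illustration and not the pipeline used by BOLD: the function names are hypothetical, a read is assumed to fail if either quality threshold is violated, and `best_hit` stands in for a search against the barcode reference library (here it simply returns a taxon label, whereas the text only requires that fragments match similar records).

```python
from statistics import mean

def passes_trace_quality(phred, min_mean=30, low_cutoff=20, max_low_fraction=0.10):
    """Return True if a primer-trimmed read meets both thresholds from the text.

    Assumption: a read is rejected when its mean phred score is below 30
    OR when more than 10% of its bases score below 20.
    """
    if not phred:
        return False
    low_fraction = sum(q < low_cutoff for q in phred) / len(phred)
    return mean(phred) >= min_mean and low_fraction <= max_low_fraction

def split_fragments(seq, size=100):
    """Cut a sequence into consecutive fragments of at most `size` bases."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def looks_chimeric(seq, best_hit, size=100):
    """Flag a sequence whose fragments point to different reference taxa.

    `best_hit` is a hypothetical callable (fragment -> taxon label).
    Flagged sequences would still need manual evaluation.
    """
    hits = {best_hit(fragment) for fragment in split_fragments(seq, size)}
    return len(hits) > 1
```

A flagged sequence is only a candidate: in the procedure described above, such sequences were evaluated manually and discarded only if a chimeric origin was confirmed.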
Phylogenetic Analyses
Distance-based NJ and optimality criterion-based ML phylogenetic methods were used to reconstruct DNA barcode gene trees. These methods are capable of analyzing large (>5000 sequences) data sets and were used to estimate the effect of the inference method on the incidence of paraphyly and polyphyly. In general, ML is expected to yield a more correct tree topology because of limitations inherent in the NJ method. These include especially the sensitivity of the method to the input order of specimens and the correctness of the distance matrix (Huelsenbeck and Hillis 1993; Farris et al. 1996). Despite these problems, NJ has repeatedly been shown to perform well for species delimitation and to approximate phylogenetic relationships (Huelsenbeck and Hillis 1993; Kumar and Gadagkar 2000; Mihaescu et al. 2009).
Since NJ is computationally less demanding than ML and permits rapid construction of trees with thousands of specimens, NJ trees were constructed without the exclusion of redundant (identical) haplotypes, which is expected to have minimal effect on the tree topology estimated by this method. In contrast, haplotype collapsing was done for the more demanding ML analyses to increase computational efficiency. However, redundant haplotypes were not removed when they occurred between different species (barcode-sharing). Haplotype collapsing was conducted using ALTER (Glez-Peña et al. 2010).
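The collapsing rule can be illustrated with a short sketch. It is not the algorithm implemented in ALTER; it only shows the constraint stated above, namely that identical haplotypes are merged within a species while identical sequences shared between species are all retained. The record layout and function name are assumptions made for illustration.

```python
def collapse_haplotypes(records):
    """Keep one record per (species, sequence) combination.

    `records` is an iterable of (specimen_id, species, sequence) tuples.
    Identical sequences from different species are kept, so
    barcode-sharing between species stays visible in the tree.
    """
    seen = set()
    kept = []
    for specimen_id, species, sequence in records:
        key = (species, sequence.upper())
        if key not in seen:
            seen.add(key)
            kept.append((specimen_id, species, sequence))
    return kept
```

In this sketch, sequences that differ only in length count as distinct haplotypes, because the comparison is an exact string match.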
Distance matrices for NJ trees were calculated under both the Kimura 2-parameter (K2P) (Kimura 1980) and P-distance model, using the BOLD alignment of sequences (amino acid based HMM). To estimate non-monophyly in NJ trees, we applied K2P because it is usually used for DNA barcode data, although it is not necessarily the best-fit model of nucleotide evolution of the COI gene (Srivathsan and Meier 2012). Trees generated with P-distance showed very similar topologies. Trials were conducted to estimate the effect of different nucleotide substitution models available in BOLD, but no effect on incidence of non-monophyly was detected (results not shown). Trees were rooted on a specimen representing a sister group (where known) or a closely related group as based on recent comprehensive Lepidoptera phylogenies (Mutanen et al. 2010; R Core Team 2013). Analyses were performed mostly per family or by grouping several related families, but due to large numbers of specimens (exceeding the maximum number permitted by BOLD), by subfamilies in Geometridae and by separating Noctuinae from the rest of Noctuidae. This partitioning is unlikely to lead to any case of non-monophyly remaining undetected, because non-monophyletic species are only in exceptional cases tangled outside a single genus (we have never observed this in our data). A few families or subfamilies (Bedelliidae, Urodidae, Schreckensteiniidae, Heterogyniidae, Riodinidae, Thyrididae, Orthostixinae in Geometridae) include only a single species in this study and our analysis thus makes them by definition monophyletic. However, all seven species have highly divergent barcodes and would remain monophyletic however they were treated.
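For reference, the two distance measures compared above can be computed for a pair of aligned sequences as sketched below. With P the proportion of transitional differences and Q the proportion of transversional differences, the P-distance is P + Q and the K2P distance is −½ ln[(1 − 2P − Q)√(1 − 2Q)]. The function names are hypothetical, and skipping sites with gaps or ambiguous bases is an assumption of this sketch rather than a description of BOLD's implementation.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def transition_transversion_fractions(seq1, seq2):
    """Return (P, Q): proportions of transitions and transversions.

    Only aligned sites where both sequences carry A, C, G or T are compared.
    """
    compared = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return transitions / compared, transversions / compared

def p_distance(seq1, seq2):
    """Uncorrected proportion of differing sites."""
    p, q = transition_transversion_fractions(seq1, seq2)
    return p + q

def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance (Kimura 1980)."""
    p, q = transition_transversion_fractions(seq1, seq2)
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

At the small divergences typical of closely related species the two measures are nearly equal, which is consistent with the very similar topologies reported above.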
ML trees were constructed using RAxML v. 8 (Stamatakis 2014) via the Black Box web server (http://embnet.vital-it.ch/raxml-bb/index.php, last accessed June 2, 2016). The analyses were conducted under the GTR+G model of nucleotide evolution (Tavaré 1986). Node support values were estimated with 100 bootstrap replicates. Analyses were mostly performed using the division applied in NJ analyses, except that families with very few specimens were combined in three groups, the first including two non-ditrysian families; the second the ditrysian families excluding the non-macroheteroceran families except for Riodinidae (a single monophyletic species); and the third Macroheterocera plus Riodinidae. This division is phylogeny based except for the placement of Riodinidae, which is not currently included in Macroheterocera. All trees were saved in Newick format for the detection of monophyly and are deposited in Dryad.
Detection of Non-Monophyly
Non-monophyly can be detected by eye in a graphical representation, but this approach is prone to human error. In trees with hundreds or thousands of terminals, internal branches are often short and detection by eye can be very difficult. Also, polyphyletic species dispersed among many other species might remain undetected. For these reasons, we developed a web service called “Monophylizer” that detects cases of non-monophyly. The service accepts Newick, Nexus, NeXML, and PhyloXML trees. The Monophylizer was designed to be rather permissive in the Newick syntax it allows because BOLD can emit syntactically invalid Newick tree descriptions. However, some of the database fields that BOLD includes allow text fields that may contain parentheses or commas, which file readers cannot distinguish from the commas and parentheses used by the Newick syntax. These must be avoided. Trees are parsed by the service using the Bio::Phylo toolkit (Vos et al. 2011), which can accept many tree format “dialects,” including most of the idiosyncrasies produced by BOLD.
Before proceeding, the web service applies an auto-incrementing integer index to each node both in a pre- and a postorder traversal. In a preorder tree traversal parent nodes are processed before their children, whereas in postorder children are processed before their parents. Thus, in this indexing scheme, each node is assigned the value of the incrementing index both before and after visiting its children, such that the tree ((A,B),C); is indexed as ((A{3.4},B{5.6}){2.7},C{8.9}){1.10}; if we signify the pre- and postorder node index as, respectively, the first and second integer of each statement between braces. The web service assesses monophyly using the following algorithm, which is applied to all distinct species in the tree:
1. Based on the species name, all leaf nodes that belong to the focal species are collected.
2. The most recent common ancestor (MRCA) of the collected leaf nodes is identified.
3. All leaf nodes subtended by the identified MRCA are collected.
4. If this set is the same as the set of leaf nodes in step 1, the species is monophyletic. If not, continue to step 5.
5. All internal nodes in the tree that subtend leaf nodes from the focal species as well as at least one other species are collected and sorted by their postorder index.
6. The collected, sorted internal nodes from step 5 are grouped into distinct root-to-tip paths. Internal nodes that are nested in each other are identified (and collected in the same group) by checking that the preorder index of the focal node is larger, and the postorder index of the focal node is smaller than that of the next node.
7. If there is more than one distinct root-to-tip path (i.e., group), the taxon is considered polyphyletic, otherwise paraphyletic.
8. For each first (i.e., most recent) node in each group, all subtended species are collected. The union of these sets across groups forms the set of entangled species.
The web service can be accessed at http://monophylizer.naturalis.nl/, last accessed June 2, 2016 and the source code is freely available at https://github.com/naturalis/monophylizer, last accessed June 2, 2016. The output of the web service can be configured to be either a table in a web page, or a tab-separated spreadsheet for high-throughput applications, for example, when combined with automated web clients. We used this web service to analyze the topologies of our gene trees.
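The indexing scheme and steps 1–8 can be sketched in a few lines. This is an illustrative reading of the description above, not the Monophylizer source code (which is built on Bio::Phylo). The nested-tuple tree representation, the "species|specimen" leaf label convention, and all function names are assumptions made for this example.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    pre: int = 0                      # index assigned before visiting children
    post: int = 0                     # index assigned after visiting children
    children: list = field(default_factory=list)
    leaves: frozenset = frozenset()   # labels of all subtended leaves

def index_tree(nested):
    """Build nodes from nested tuples; a single counter yields both indices."""
    counter = 0
    nodes = []
    def visit(item):
        nonlocal counter
        counter += 1
        node = Node(pre=counter)
        if isinstance(item, str):
            node.leaves = frozenset([item])
        else:
            node.children = [visit(child) for child in item]
            node.leaves = frozenset().union(*(c.leaves for c in node.children))
        counter += 1
        node.post = counter
        nodes.append(node)
        return node
    return visit(nested), nodes

def species_of(label):
    """Hypothetical label convention: 'species|specimen'."""
    return label.split("|")[0]

def classify(nested, species):
    """Return (status, entangled species) for one focal species."""
    root, nodes = index_tree(nested)
    focal = frozenset(l for l in root.leaves if species_of(l) == species)   # step 1
    if not focal:
        raise ValueError("species not present in tree")
    # Steps 2-3: the MRCA is the smallest clade containing all focal leaves.
    mrca = min((n for n in nodes if focal <= n.leaves), key=lambda n: len(n.leaves))
    if mrca.leaves == focal:                                                # step 4
        return "monophyletic", set()
    # Step 5: internal nodes subtending the focal species plus another species.
    mixed = [n for n in nodes if n.children and n.leaves & focal and n.leaves - focal]
    mixed.sort(key=lambda n: n.post)
    # Step 6: a node nested in the next one extends the current root-to-tip path.
    groups = [[mixed[0]]]
    for current, following in zip(mixed, mixed[1:]):
        if current.pre > following.pre and current.post < following.post:
            groups[-1].append(following)
        else:
            groups.append([following])
    status = "polyphyletic" if len(groups) > 1 else "paraphyletic"          # step 7
    # Step 8: species under the most recent node of each group.
    entangled = {species_of(l) for g in groups for l in g[0].leaves} - {species}
    return status, entangled
```

For example, `classify((("A|1", ("A|2", "B|1")), "C|1"), "A")` reports species A as paraphyletic and entangled with B, whereas B itself, represented by a single specimen, is reported as monophyletic.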
Estimating the Effects of Sampling Effort and Intra- and Interspecific Divergence
The frequency of observed species monophyly is strongly influenced by intraspecific genetic variation represented in the data and the genetic distance to related species. Both measures are affected by sampling intensity. Higher sampling effort will reveal more intraspecific variation and will tend to identify more closely related species. We explored the effect of both these factors and their interaction on the frequency of monophyly as determined by Monophylizer. The measures we used of maximum intraspecific divergence and minimum genetic distance to the nearest neighbor were based on the K2P model of nucleotide substitution and were calculated using the “barcode gap analysis” tool of BOLD using pairwise deletion setting for missing nucleotides. Sequences were aligned using the BOLD sequence aligner (amino acid based HMM). The analysis was carried out with a slightly reduced data set of 4921 species (56 species excluded) since Barcode Gap Analysis currently treats records with infraspecific names as different species.
We analyzed the occurrence of non-monophyly by fitting a generalized linear model using the function “glm” in R 3.0.0 (R Core Team 2013). Species monophyly versus non-monophyly was treated as a binary response variable while the explanatory variables were distance to the nearest neighbor, maximum intraspecific genetic variation and the number of specimens analyzed. All interactions among the explanatory variables were included in the model. We assumed a binomial error distribution and used a logistic link function. For visualization and inferences, the fitted values of the model were transformed to probabilities by using the inverse of the link function.
Some fitted values of the above model were either zero or one, which results in problems in applying the Wald approximation used in deriving P-values for parameters of the model (Venables and Ripley 2002). Therefore, we performed a permutation test to derive empirical P-values for the model parameters. We randomly reordered the observations 10,000 times and fitted the statistical model described above to each permuted data set. An empirical P-value can be calculated by comparing the estimate derived from the true data to the distribution of estimates produced by permutation.
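The logic of such a permutation test can be sketched as follows. Several details are assumptions, since the text does not specify them: the response values are shuffled relative to the explanatory variable, the comparison is two-sided, and one is added to numerator and denominator so that the P-value is never exactly zero. To keep the example self-contained, a simple difference in group means stands in for the GLM parameter estimates; it is not the statistic used in the study.

```python
import random

def empirical_p_value(estimate, x, y, n_perm=10_000, seed=0):
    """Permutation P-value for an arbitrary estimator `estimate(x, y)`.

    The response `y` is shuffled relative to `x`, the estimator is
    recomputed, and the observed value is compared with the resulting
    null distribution (two-sided, with a +1 correction).
    """
    rng = random.Random(seed)
    observed = abs(estimate(x, y))
    shuffled = list(y)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(estimate(x, shuffled)) >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (n_perm + 1)

def mean_difference(x, y):
    """Stand-in statistic: mean of x where y == 1 minus mean of x where y == 0."""
    ones = [xi for xi, yi in zip(x, y) if yi == 1]
    zeros = [xi for xi, yi in zip(x, y) if yi == 0]
    return sum(ones) / len(ones) - sum(zeros) / len(zeros)
```

Calling `empirical_p_value(mean_difference, x, y)` with a numeric predictor `x` and a 0/1 response `y` returns the proportion of permutations that produce an estimate at least as extreme as the observed one.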
To assess the predictive power of the statistical model, we performed a cross-validation analysis at the family level: each of the 71 families was, in turn, used to test the model fitted to the data on the remaining 70 families. We chose this approach because the incidence of non-monophyly may vary among families, potentially biasing the predictive power of the model toward a subset of families. The overall performance of the model was then assessed by comparing the predictions to observations with the area under a Receiver Operating Characteristic (ROC) curve method [function “AUC” (LeDell et al. 2014)].
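The cross-validation scheme and the AUC statistic can be outlined as below. This sketch is not the “AUC” function of LeDell et al. (2014); it computes the area under the ROC curve through its equivalent interpretation as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. Function names and data layout are hypothetical.

```python
def leave_one_family_out(families):
    """Yield (held_out_family, training_indices, test_indices).

    `families` holds the family name of each species, in data order.
    """
    for family in sorted(set(families)):
        test = [i for i, f in enumerate(families) if f == family]
        train = [i for i, f in enumerate(families) if f != family]
        yield family, train, test

def auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = non-monophyletic)."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```

In use, the model would be fitted on the training indices of each split, the held-out family would be scored with the fitted model, and `auc` would be applied to the pooled predictions. An AUC of 0.5 corresponds to chance-level prediction and 1.0 to perfect discrimination.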
Statistical Relationship between Sampling Effort and Non-Monophyly
Barcode-Sharing
If a species shares its barcode with another species, and both show no intraspecific variation, they are treated as monophyletic by Monophylizer. However, these species could equally be considered para- or polyphyletic. Furthermore, such species pairs would no longer be reciprocally monophyletic if even a single nucleotide substitution were to occur. We investigated the frequency of this phenomenon by searching for species that showed sequence identity to their nearest neighbor but had been ranked monophyletic. As the same issues can arise when neighboring species are very similar but not identical, we also searched for monophyletic species differing by less than 1% from their nearest neighbor. Searches were performed using the “barcode gap analysis” tool in BOLD.
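The search amounts to a simple filter over per-species results, sketched below. The searches in the study were run with the BOLD “barcode gap analysis” tool; the dictionary keys and function name here are hypothetical, and distances are assumed to be expressed in percent.

```python
def near_neighbor_candidates(rows, threshold=1.0):
    """Split monophyletic species into barcode-sharing and near-sharing sets.

    Each row is a dict with the hypothetical keys 'species', 'status'
    and 'nn_distance' (distance to the nearest neighbor, in percent).
    Returns (sharing, close): species at zero distance, and species at
    a non-zero distance below `threshold`.
    """
    sharing, close = [], []
    for row in rows:
        if row["status"] != "monophyletic":
            continue
        distance = row["nn_distance"]
        if distance == 0:
            sharing.append(row["species"])
        elif distance < threshold:
            close.append(row["species"])
    return sharing, close
```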
Estimation of Taxonomic Uncertainty and Misidentifications
Taxonomic inaccuracy and misidentifications are likely to yield many “false-positive” cases of non-monophyly due to the incorrect assignment of a specimen to a species. Although we carefully cross-checked identifications of doubtful records prior to the analyses, there are many species groups where unclear morphological limits among species can produce misidentifications. A more significant effect is the likely inaccuracy of the taxonomy itself in many groups. To avoid circular reasoning we did not remove such groups from our data (with the exception of the psychid Dahlica/Siederia group where the inadequacy of taxonomy is widely acknowledged), but instead attempted to estimate the magnitude of this effect. As the authors include many of Europe’s leading experts on Lepidoptera, this was done by asking the relevant specialist to categorize each non-monophyletic species as “species identification straightforward” or “species identification problematic,” and separately as “species limits well-defined” or “species limits poorly defined.” We also used expert judgment to assess the occurrence of potential cryptic species in the data. While we acknowledge that assessing these effects involves some subjectivity, we find their impact potentially significant. In making their assessments the taxon specialists were asked to be conservative and include only the most obvious cases of synonymy and misidentifications, while the presence of potential cryptic species was only accepted when additional independent evidence, such as morphological or ecological differences, supported the genetic differences (thus we excluded cases of deep intraspecific barcode splits lacking further evidence that cryptic species may be involved).
Estimating Effects of Allopatry and Parapatry
Estimating the effect of geography is especially challenging because geographic information is often used to delimit different species. This is particularly problematic when a species is composed of spatially isolated populations, which will show additional structure in their degree of genetic differentiation; their allocation to species is both subjective and depends on the species concept employed (Mutanen et al. 2012). As pointed out by McKay and Zink (2010), where such species clusters involve paraphyly, it could simply be eliminated by elevating allopatric populations to valid species. We estimated the effect of allopatry on the incidence of non-monophyly, an exercise that was greatly facilitated by the distributional data for European Lepidoptera, which is superb in comparison with any other diverse invertebrate group or faunal region.
RESULTS
Incidence of Non-Monophyly in NJ and ML Trees

Overlap in species classified as mono-, para-, and polyphyletic using either NJ or ML methods. The number of species is indicated in each partition (the counts for monophyly exclude species represented by singletons).
Non-monophyletic species unique to ML show both a higher average intraspecific K2P variability (ML: mean = 1.93, 95% adjusted bootstrap percentile [BCa] confidence interval [CI] = 1.50–2.52; NJ: mean = 1.29, 95% BCa CI = 0.882–1.905) and greater average minimum genetic distance to their nearest neighbor (ML: mean = 1.79, 95% BCa CI = 1.45–2.22; NJ: mean = 0.95, 95% BCa CI = 0.64–1.40). Only three non-monophyletic species unique to ML fully shared their barcode sequence (0.0% K2P distance) with their nearest neighbor, whereas with NJ this occurred in 29 species. A closer investigation of these cases showed that the difference is largely due to the tendency of the NJ method to place sequences that are identical except for length at slightly different nodes, a known pitfall of this method. As NJ, however, yielded fewer cases of non-monophyly (465 in NJ vs. 469 in ML) and several of these cases were due to the presence of haplotypes identical except for sequence length, ML seems to identify more species as non-monophyletic. Non-monophyletic species discovered only by ML have higher mean intraspecific variation because many of these species represent taxa with deep intraspecific splits, suggesting that ML recovers such species more frequently as non-monophyletic than NJ.
In 97.9% of species classed as non-monophyletic by ML, the polyphyly or paraphyly was due to one or more species from the same genus, but seven (1.5%) involved moths in closely related genera. Three of the latter cases involved a pair of genera that are currently being proposed for synonymy (Sciadia and Elophos, Geometridae, Ennominae—see also Huemer and Hausmann 2009), whereas a fourth involves two genera (Crombrugghia and Oxyptilus, Pterophoridae) separated by such minor characters that they are not accepted by all authorities (Kullberg et al. 2001). Three species (0.6%) showed phylogenetic “tangles” involving 5 (Glacies coracina, Geometridae), 36 (Dryobotodes monochroma, Noctuidae), and 71 (Deltote incognita, Noctuidae) genera, but these were very likely misidentification artifacts (see Supplementary Table 1 available on Dryad). Non-monophyly was not observed among species in different subfamilies or higher-level ranks.
Parameter estimates from a binomial generalized linear model (with a logistic link function) explaining the probability of non-monophyly
Parameter | Estimate | Std. Error | z | P-value | Empirical P-value |
Intercept | −0.354 | 0.190 | −1.86 | 0.063 | < 0.0001 |
Dist. to NN | −2.22 | 0.189 | −11.8 | < 0.0001 | < 0.0001 |
Intraspec. var. | 2.10 | 0.177 | 11.9 | < 0.0001 | < 0.0001 |
Specimens | 0.0417 | 0.0148 | 2.82 | 0.0048 | 0.00020 |
Dist. to NN × intraspec. var. | −0.0413 | 0.0258 | −1.60 | 0.11 | 0.0010 |
Dist. to NN × specimens | −0.0117 | 0.0109 | −1.079 | 0.28 | 0.00010 |
Intraspec. var. × specimens | −0.0257 | 0.00672 | −3.83 | 0.00013 | 0.00030 |
Dist. to NN × intraspec. var. × specimens | 0.00464 | 0.00150 | 3.10 | 0.0019 | < 0.0001 |
Notes: The explanatory variables were the genetic distance to the nearest neighbor species (dist. to NN), maximum intraspecific K2P variation (intraspec. var.), and the number of specimens analyzed (specimens). Empirical P-values were derived from a permutation test (see text for details) because some fitted probabilities were numerically either zero or one, which may result in overestimated P-values when using the Wald approximation.
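To make the fitted model concrete, the following minimal Python sketch plugs the rounded coefficients from the table above into the logistic link. It assumes that both distances are expressed in percent K2P divergence (the convention stated in the caption of the figure showing the estimated probabilities) and that the rounded estimates are adequate for illustration. The function and variable names are hypothetical and are not part of the original analysis.

```python
import math

# Rounded coefficients copied from the table above (binomial GLM, logit link).
COEF = {
    "intercept": -0.354,
    "dist": -2.22,             # distance to nearest neighbor (assumed in % K2P)
    "var": 2.10,               # maximum intraspecific variation (assumed in % K2P)
    "n": 0.0417,               # number of specimens
    "dist_var": -0.0413,
    "dist_n": -0.0117,
    "var_n": -0.0257,
    "dist_var_n": 0.00464,
}


def prob_non_monophyly(dist, var, n):
    """Predicted probability of non-monophyly for one species."""
    eta = (
        COEF["intercept"]
        + COEF["dist"] * dist
        + COEF["var"] * var
        + COEF["n"] * n
        + COEF["dist_var"] * dist * var
        + COEF["dist_n"] * dist * n
        + COEF["var_n"] * var * n
        + COEF["dist_var_n"] * dist * var * n
    )
    # Inverse logit transforms the linear predictor into a probability.
    return 1.0 / (1.0 + math.exp(-eta))


# A well-separated species with little internal variation.
p_distinct = prob_non_monophyly(dist=5.0, var=1.0, n=8)
# A species close to its neighbor with high internal variation.
p_close = prob_non_monophyly(dist=0.2, var=3.0, n=10)
```

Under these assumptions the first call gives a probability close to zero and the second a probability close to one, in line with the near step-like behavior described in the text.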

Proportions of species with different minimum K2P distances to their nearest neighbor in mono-, para-, and polyphyletic species. For monophyletic species, singletons were excluded.

Monophyletic species (215 in total) showing less than 0.01 minimum K2P distance (or less than 7 nucleotide substitutions difference) to their closest species. The number of nucleotide substitutions to the nearest neighbors are indicated with arrows. The curve is not cleanly stepped because of slight variation in sequence lengths and because the substitution model employed does not assume equal likelihoods of all substitutions. Forty-eight species having K2P divergence of zero to the closest heterospecific would be rendered non-monophyletic by a single nucleotide substitution.
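The relationship between substitution counts and K2P distance mentioned in the caption above can be illustrated with the standard Kimura two-parameter formula, d = −½ ln(1 − 2P − Q) − ¼ ln(1 − 2Q), where P and Q are the proportions of transitional and transversional differences. The sketch below is a minimal illustration and not the distance routine used in the study. The 658 bp sequence length is an assumption (the usual full-length COI barcode), and the function name is hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned sequences.

    Sites where either sequence has a gap or ambiguity code are skipped,
    so sequences of different effective length give different step sizes.
    """
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


reference = "A" * 658                       # assumed full-length barcode
six_subs = "G" * 6 + "A" * 652              # six transitions
seven_subs = "G" * 7 + "A" * 651            # seven transitions
d_six = k2p_distance(reference, six_subs)
d_seven = k2p_distance(reference, seven_subs)
```

With 658 comparable sites, six transitions stay below 0.01 and seven exceed it, which matches the threshold quoted in the caption. A shorter overlap or a different mix of transitions and transversions shifts each step slightly, which is why the curve is not cleanly stepped.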
High intraspecific variability is associated with the presence of non-monophyly. Of 3807 monophyletic species represented by more than one individual, the mean K2P intraspecific maximum variability is 0.99% (95% BCa CI 0.95–1.04; mean n=8.99). In paraphyletic species, the mean maximum intraspecific variation is 2.37% (95% BCa CI 2.14–2.62; mean n=14.58) and in polyphyletic species it rises to 3.26% (95% BCa CI 2.77–3.87; mean n=12.78). These differences remain after controlling for sampling effort across categories by dividing the mean maximum intraspecific variation for each species by the number of specimens (monophyletic: mean = 0.14, 95% BCa CI 0.13–0.15; paraphyletic: mean = 0.26, 95% BCa CI 0.23–0.30; polyphyletic: mean=0.53, 95% BCa CI 0.40–0.74).
The remaining analyses are based only on results obtained through ML. Identical haplotypes were not considered except in assessing the effects of sampling bias.
Effects of Intra- and Interspecific Divergence and Sampling Effort

The estimated probability of species non-monophyly from a generalized linear model including as explanatory factors: genetic distance (in percent with 1% distance equaling to 0.01 K2P divergence) to nearest neighbor (vertical axes), maximum within-species K2P genetic distance (horizontal axis) and the number of specimens included in the analysis (the figures show four values). Probability values are indicated by a grayscale gradient with black = 1 and white = 0.
Probability of finding non-monophyly as a function of the number of specimens per species included in the analysis. Points indicate proportions of non-monophyletic species in groups of species with equal number of analyzed specimens. The darkness of points indicates weights (inverses of bootstrap standard errors) used in fitting the regression curve, darker colors indicating higher weights. The curve [$y = 0.23 \times (1 - e^{-(x-1)e^{-2.5}})$] is fitted with nonlinear asymptotic regression (see text for details).
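The regression curve given in the caption can be evaluated directly. The sketch below reads the flattened formula as y = 0.23 × (1 − exp(−(x − 1) · e^(−2.5))), the usual form of an asymptotic regression with an offset. This reading, the rounded parameter values, and the function name are assumptions made for illustration.

```python
import math

ASYMPTOTE = 0.23        # estimated plateau of non-monophyly (rounded, from the caption)
LOG_RATE = -2.5         # natural log of the rate constant (rounded, from the caption)
OFFSET = 1.0            # a single specimen cannot reveal non-monophyly


def expected_non_monophyly(n_specimens):
    """Fitted proportion of non-monophyletic species at a given sampling depth."""
    rate = math.exp(LOG_RATE)
    return ASYMPTOTE * (1.0 - math.exp(-(n_specimens - OFFSET) * rate))


at_one = expected_non_monophyly(1)
at_mean_sampling = expected_non_monophyly(8.4)
at_very_deep_sampling = expected_non_monophyly(10**6)
```

The curve starts at zero for a single specimen and approaches the 0.23 plateau as sampling increases, which is how the asymptote serves as the estimate of the incidence under exhaustive sampling.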
Effect of Taxonomic and Identification Uncertainty
The taxonomic specialists among the authors estimate that 31.8% of the cases of non-monophyly may not be valid, as they are likely to reflect “over-splitting” of species (Supplementary Table 1 available on Dryad). These include (i) cases where highly similar but allopatric populations have been considered distinct species, (ii) parapatric species pairs with little information on the extent of gene flow between populations, (iii) ecological (e.g., altitudinal, latitudinal, habitat, or food-plant associated) forms or potentially polymorphic species, and (iv) sympatric pairs or groups of variable species separated by uncertain boundaries.
In 15.1% of non-monophyletic species, there was further independent evidence of cryptic diversity (undersplitting), in most cases from morphological differences associated with barcode divergence. Of these, 78.9% involve sympatric splits, indicating that many cases are likely to represent reproductively isolated, but morphologically similar species pairs or groups. Several of these cases are undergoing taxonomic revision using an integrative approach, but with a single exception (Stigmella salicis group, see above) we followed currently recognized taxonomic boundaries.
In 31.6% of non-monophyletic species, we identified problems with species identification, despite our initial careful validation. In many but not all cases, these difficulties were associated with possible cases of oversplitting. In some cases, such as in Yponomeuta (Yponomeutidae), the problems arose because reliable identifications require larval characters and are much harder for adults. Often, but not always, the difficulty in identifying species is linked to the likely presence of cryptic species or oversplit species.
Altogether, 58.6% of non-monophyletic species were estimated as potentially being affected by at least one taxonomic uncertainty (undersplitting, oversplitting, or identification difficulties). In fact, 19.9% of species were estimated as being affected by more than one taxonomic issue.
Effect of Geography-Related Patterns among Species
Altogether, 78.7% of the non-monophyletic species are sympatric with at least one of the species responsible for polyphyly or paraphyly. An entirely allopatric relationship was detected in 14.5% of species and parapatry in 5.8% of species. Of the species suspected of being non-monophyletic due to oversplitting, at least one of the species responsible was sympatric in 61.5% of the cases, allopatric in 21.5% of the cases and parapatric in 14.8% of the cases. In four cases, the geographic relationships between species are uncertain due to poor distributional data.
DISCUSSION
Our survey examined 41,583 specimens belonging to 4977 lepidopteran species, the largest data set of DNA barcodes so far published for any taxonomic group. Moreover, our data are of higher quality than previously published arthropod data sets due to the efforts made to resolve issues such as uncertainties in identification. For example, the benchmark review of species non-monophyly (Funk and Omland 2003) included 2319 species, whereas Ross (2014) examined 21,337 species. Both of these studies examined a broad range of taxa, but relied largely upon unpublished and poorly validated data, without examining the reliability of the identifications.
Funk and Omland (2003) reported the occurrence of non-monophyly to be 26.5% in arthropods (figure for Lepidoptera not estimated separately), whereas Ross (2014) reported an 18% non-monophyly in Arthropoda and 17% in Lepidoptera (16% with interim operational names included). We observed a considerably lower level of 12.3% of non-monophyly in our “raw” data. This is not explained by differences in sampling effort, since Ross sampled an average of 8.1 specimens per species among Arthropoda, whereas our average was 8.4. Ross’ Lepidoptera data are based partially on the same cleaned data used in our study, indicating that the level of non-monophyly in data not used in our study was likely higher than 17%. Since Funk and Omland’s data were retrieved from GenBank while Ross’s data came from BOLD, the higher incidence may be due to the higher proportion of misidentifications in GenBank (Harris 2003; Bidartondo 2008; Groenenberg et al. 2011). A second possible cause is that before the advent of systematic species sampling through the iBOL, data sets were biased because of more concentrated efforts on difficult species groups, which in turn are likely to express unusually high levels of non-monophyly. We also believe that the lower level of non-monophyly in our study compared with that by Ross reflects our considerable effort in data validation. Our data, however, are certainly not free of identification errors. The true incidence of non-monophyly is further exaggerated by several other operational factors, but attenuated by limited sampling in many species. We attempted to assess the significance of all these factors in our study.
Our results are remarkably congruent with a detailed study of paraphyly on birds (McKay and Zink 2010). Funk and Omland (2003) found that 16.7% of birds exhibited non-monophyly, a value close to the 14.3% reported by McKay and Zink (2010). However, a detailed examination of 856 species by the latter authors revealed that 55.7% of these cases were due to incorrect taxonomy, a value close to our estimate of 58.6% of cases generated by taxonomic issues in our Lepidoptera data set.
Two genetic factors play a crucial role in influencing the likelihood of species monophyly: the degree of intraspecific variation and the extent of divergence between species. Our data clearly demonstrate that relatively high intraspecific variation and relatively low genetic distance from the closest species characterize virtually all non-monophyletic species, and that where both occur the species concerned is nearly always non-monophyletic. Monophyletic species have, on average, nearly six times higher genetic distances to their nearest neighbors than paraphyletic species and almost seven times higher than polyphyletic species. Similarly, their intraspecific variation is less than half that of paraphyletic species and less than one-third of that of polyphyletic species. The statistical analysis indicates that the probability of non-monophyly increases practically as a step function from zero to one with increasing intraspecific genetic variation and decreasing distance to the nearest neighbor, and that non-monophyly becomes more likely with increasing sampling effort.
Since sampling of insect populations is never complete, only a fraction of the total genetic diversity is represented in any data set. Both the study by Ross (2014) and our work show a clear positive correlation between sampling effort and the incidence of non-monophyly. Previous studies, with the exception of that of McKay and Zink (2010) in birds, did not attempt to estimate the actual level of non-monophyly as we did. However, many species in our study have distributions that extend beyond Europe and sampling across their entire range will likely reveal additional genetic variation that might affect estimates of non-monophyly. Based on our data, we estimated that the actual level of non-monophyly in European Lepidoptera would be about 23% (95% confidence interval 16–48%) without considering the impact of operational factors (see below). The point estimate is slightly less than the 26.5% reported by Funk and Omland (2003), although their value falls within the confidence limits of our estimate.
We detected identification issues in 31.6% of non-monophyletic species. Cases of oversplitting are strikingly frequent in our data, since we estimated that up to 31.8% of all non-monophyletic species may represent “false species.” The taxonomic issues affecting the distribution of non-monophyly in our study occur for two main interconnected reasons. First, lepidopteran taxonomy has a long tradition in Europe, where the fauna is well investigated in many areas, leading to the situation in which “taxonomic resolution” eventually gets very fine in those groups that have been studied by many workers. While this effort undoubtedly helps to reveal many cryptic species, the side effect is that species that fail to match standard criteria are more likely to be considered as valid. This is exemplified by the many cases of allopatric populations of European Lepidoptera that over time and with increased taxonomic scrutiny have been accorded species status (Mutanen et al. 2012). Second, taxonomic tradition favors species splitting at the expense of species lumping. This is exemplified by Euxoa tritici (Noctuidae), which was split into three species in a non-peer-reviewed revision of the group (Fibiger 1997). Despite subsequent morphometric studies indicating broad overlap and the poor performance of the diagnostic characters separating the proposed species (Mutanen 2005), many checklists still list them as distinct (DNA barcodes also do not support the presence of more than one species). The International Code of Zoological Nomenclature (ICZN 1999) does not require peer review for a new name to be valid, while synonymization of species typically requires thorough studies, which then have to become generally accepted by the taxonomic community. This situation leads to the gradual accumulation of poorly justified species over time.
We estimated that 14.9% of all non-monophyletic species in our study are actually a species pair or group. This estimate is likely conservative since we included only cases with independent evidence aside from DNA barcodes supporting the presence of cryptic species. In several cases, the description of one or several new species is underway (e.g., Huemer and Mutanen 2015). It is likely that many more cases with deep intraspecific splits will eventually be revealed as species complexes since DNA barcodes are very effective in revealing potential cryptic species (Hausmann et al. 2009a, 2009b, 2013; Dincă et al. 2011, 2015; Huemer and Hebert 2011; Huemer et al. 2013, 2014; Mutanen et al. 2013; Huemer and Timossi 2014; Kirichenko et al. 2015). On the other hand, some other studies have not found evidence of cryptic species among taxa showing unusually high intraspecific barcode variation (Webb et al. 2011; Hogner et al. 2012; Kvie et al. 2013).
The assignment of allopatric populations to species in a standardized way is one of the most difficult challenges of alpha taxonomy (Mutanen et al. 2012). Geographic distance often leads to a breakdown of gene flow between populations, resulting in the gradual differentiation of populations over time and eventually speciation. Under these circumstances, the taxonomic delimitation of populations is inherently subjective and greatly affected by the underlying species concept. Of the non-monophyletic species in this study, 14.5% involved two or more allopatric species and 5.8% one or more parapatric species. Many other non-monophyletic species are allopatric or parapatric in relation to some, but not all, associated species. This suggests that issues related to the geographic relationships of the species and resulting taxonomic difficulties play a significant role in species poly- and paraphyly.
Altogether, 58.6% of non-monophyletic species detected in this study are likely to be due to methodological rather than biological causes. Thus, the observed level of 12.3% non-monophyly in our data would drop to 5.1% if these methodological issues were taken into account. Similarly, our extrapolation to estimate the actual level of non-monophyly (23%) would drop to 9.5%, albeit with relatively broad confidence intervals. It is not possible to precisely classify all cases of non-monophyly as due to methodological or biological causes because there are different interpretations of species concepts, but our data indicate the two are of roughly equal importance. Furthermore, we are not able to provide any estimate of the frequency of hybrid specimens in our data. Hybrid specimens could easily be misidentified or even described as valid species. Recent in-depth studies have revealed several such cases (Anderson et al. 2007; Rougerie et al. 2012). Because of maternal inheritance, hybrid specimens cannot be identified by mitochondrial DNA markers such as the barcode locus (COI).
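The two adjusted figures follow from removing the 58.6% share attributed to methodological causes from each estimate. A minimal sketch of that arithmetic, with hypothetical names, is:

```python
METHODOLOGICAL_SHARE = 0.586   # share of non-monophyletic species with taxonomic issues


def biological_incidence(incidence_percent, share=METHODOLOGICAL_SHARE):
    """Incidence of non-monophyly left after discounting methodological cases."""
    return incidence_percent * (1.0 - share)


observed_adjusted = round(biological_incidence(12.3), 1)      # observed incidence
extrapolated_adjusted = round(biological_incidence(23.0), 1)  # asymptotic estimate
```

This simple proportional discount assumes that the methodological share is the same at the observed and the extrapolated sampling depth, which the text does not test.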
NJ and ML methods yielded similar estimates of the incidence of paraphyly or polyphyly, suggesting that any differences with prior studies were not caused by the tree-construction method. Most studies included in Funk and Omland (2003) were based on NJ analysis. We deliberately used the K2P nucleotide substitution model in NJ analyses because it is employed by most DNA barcoding studies, although it is not always the best-fit model (Srivathsan and Meier 2012). Based on our trials (results not shown), the substitution model very seldom had any effect on the tree topology using NJ. NJ is known to be prone to a variety of artifacts, such as the order of specimens in the input file (Farris et al. 1996) and long-branch attraction (Felsenstein 1978). For this reason, its use in molecular phylogenetic studies is often disfavored. We also observed that NJ tends to place sequences that are identical except for length at different nodes. We, therefore, adopted ML as the basis for most of our analyses. Although the numbers of non-monophyletic species recovered by the two methods were similar, there was only a 76% overlap in species composition. This highlights the importance of method selection in DNA barcode-related studies, especially when topology-related questions are addressed.
Based on our observations, species-level polyphyly in COI gene trees involving deep genetic divergence is very rare in Lepidoptera, but the situation may be different in groups with different biology (e.g., with haplodiploid genetic system, see Patten et al. 2015). Non-monophyly above the genus level is exceptional, and is likely to involve either misidentifications or oversplitting of genera while non-monophyly involving higher taxonomic groups was never observed. Cases of COI barcode-sharing between closely related species have been reported in many taxa, although their frequency is usually low and is often due to oversplitting (e.g., Hausmann et al. 2013; Pentinsaari et al. 2014). Under these circumstances, paraphyly, polyphyly, and monophyly are not distinct phenomena as a single nucleotide substitution can change the type of relationship. For the same reason, some of the species observed as monophyletic in this study may be revealed as non-monophyletic with increased sampling (cf. Fig. 4). Actually, under perfect barcode-sharing between two or more species with no intraspecific variation, species could equally be considered mono-, para-, or polyphyletic. We considered such cases (usually species pairs) reciprocally monophyletic, but even a single mutation would negate this. A significant fraction of non-monophyletic species represents cases of high genetic similarity between two or more species. Many species having deep intraspecific variation (usually with a deep split) appear paraphyletic because other species are nested within them.
Is the prevalence of non-monophyly in Lepidoptera typical of other groups? Both Funk and Omland (2003) and Ross (2014) compared the incidence of non-monophyly among major animal groups. They detected significant differences amongst taxa but with Arthropoda typically being near the mean. We found that the probability of non-monophyly declined as the average genetic distance to the nearest neighbor increased. Several recent DNA barcode data release papers enable us to explore whether this pattern extends to comparisons across higher taxa. Pentinsaari et al. (2014) studied 1972 Coleoptera species and found that the mean K2P difference to the nearest neighbor was over twice that of Lepidoptera (11.99% vs. 5.80%; 5.22% in our data for Lepidoptera). Their estimate of the frequency of non-monophyly was only 2.2% without adjustment for methodological issues such as cryptic species. The data are geographically more limited and the average sampling effort lower, but there is little doubt that the level of non-monophyly in beetles is lower than in Lepidoptera. Ward et al. (2005) reported a mean interspecific divergence from the nearest neighbor of 22.03% in fishes, Hebert et al. (2004) and Kerr et al. (2009) 11.82% and 12.64% in birds, respectively, Chang et al. (2009) 18.66% in earthworms, Zhou et al. (2009) 15.54% in Trichoptera, 23.89% in Ephemeroptera, and 19.24% in Plecoptera, Ball et al. (2005) 25.02% in Ephemeroptera, Sheffield et al. (2009) 13.81% in Hymenoptera (Apoidea), Blagoev et al. (2009) 6.77% in Araneae, and Hogg and Hebert (2004) 21.03% in Collembola.
DNA barcoding has great potential to accelerate taxonomic workflows by enabling rapid sorting of specimens into tentative species or operational taxonomic units (Zhou et al. 2007; Kekkonen and Hebert 2014). Molecular data have the advantage of potentially permitting species delimitation in a quantitative and standardized way (Tautz et al. 2003; Leaché et al. 2014). While a final taxonomic framework has to be based on more comprehensive genomic (Leaché et al. 2014) or broadly integrative data (Padial et al. 2010; Schlick-Steiner et al. 2010), DNA barcodes are very valuable because they are easy to obtain at low cost (including from older museum specimens) and reference libraries with broad taxonomic coverage already exist. Therefore, an increasing number of taxonomic revisions is based in whole or in part on DNA barcodes. Several quantitative species delimitation algorithms for molecular data have been developed over the past decade, including approaches dedicated to DNA barcodes such as ABGD and BIN (Puillandre et al. 2012; Ratnasingham and Hebert 2013). Other approaches such as GMYC (Pons et al. 2006), bGMYC (Fujisawa and Barraclough 2013), DISSECT (Jones et al. 2015), and PTP (Zhang et al. 2013) permit the analysis of varied genetic markers, whereas other methods enable species delimitation based on multi-marker or even genome-wide SNP data (Yang and Rannala 2010; Leaché et al. 2014; Pante et al. 2014). Regardless of the method, species-level non-monophyly forms a major challenge as no method can correctly delimit species showing polyphyly in a gene tree and only exceptionally can they deal with paraphyly. While our study suggests that species-level non-monophyly is less frequent than previously thought, problems in algorithmic DNA-based species delimitation remain. More attention should be paid to separating the true cases of non-monophyly from those resulting from technical and methodological causes. We hope a schematic stepwise chart (Fig. 1) will serve as a general blueprint for taxonomic studies. Our open-access tool “Monophylizer” should help taxonomists to rapidly and confidently separate monophyletic, paraphyletic, and polyphyletic species from each other based on phylogenetic trees in a variety of common file formats. Cases of non-monophyly should be flagged for careful reappraisal and deep-level species polyphyly studied for the presence of overlooked species.
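The core test behind flagging a species for reappraisal is whether all of its specimens form one exclusive clade in the gene tree. The sketch below illustrates that test only and is not the Monophylizer implementation. It assumes a rooted tree written as nested Python tuples, tip labels of the hypothetical form "Species|specimen", and it does not distinguish paraphyly from polyphyly.

```python
from collections import Counter


def tips(node):
    """Return all tip labels below a node (a tip is a plain string)."""
    if isinstance(node, str):
        return [node]
    labels = []
    for child in node:
        labels.extend(tips(child))
    return labels


def species_of(tip):
    """Species name is the part of the label before the '|' separator."""
    return tip.split("|")[0]


def monophyletic_species(tree):
    """Species whose specimens make up exactly one clade of the rooted tree."""
    totals = Counter(species_of(t) for t in tips(tree))
    found = set()

    def visit(node):
        node_tips = tips(node)
        names = {species_of(t) for t in node_tips}
        if len(names) == 1:
            # A single-species clade counts only if it holds every specimen.
            name = next(iter(names))
            if len(node_tips) == totals[name]:
                found.add(name)
            return
        for child in node:
            visit(child)

    visit(tree)
    return found


example_tree = ((("A|1", "A|2"), "B|1"), ("B|2", ("C|1", "C|2")))
mono = monophyletic_species(example_tree)
non_mono = {species_of(t) for t in tips(example_tree)} - mono
```

In the example tree, species A and C each form an exclusive clade, whereas the two specimens of B sit on different branches and would be flagged. Conspecific tips that share an unresolved polytomy with other species are treated as non-monophyletic under this strict definition, which is a design choice of the sketch.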
CONCLUSIONS
Species delimitation is increasingly based on molecular data. However, non-monophyly represents a major challenge for algorithmic species delimitations. Processes such as incomplete lineage sorting and introgression give rise to biological non-monophyly that cannot be resolved by increased geographic or genetic sampling. However, our results suggest that current estimates overestimate the extent of non-monophyly in trees based on mitochondrial DNA. We found that a very high fraction of cases of non-monophyly reflects methodological issues, such as misidentifications, oversplitting of species, overlooked species, and the inherent subjectivity of species delimitations, especially when allopatric populations are concerned. Species polyphyly in mtDNA is rare and mostly attributable to cases of very shallow divergence between species, but in rare cases it may also result from mitochondrial introgression. Whether or not a species appears monophyletic in a tree is also affected by the method used to build the tree. Overall, our study supports the argument that, when used with care and in conjunction with other techniques, DNA barcodes are a valuable addition to the tools available for taxonomic work on animals.
SUPPLEMENTARY MATERIAL
Data available from the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.k3mr1.
FUNDING
Most of the sequences used in this study were generated at the Biodiversity Institute of Ontario under the International Barcode of Life Project, funded by the Government of Canada through Genome Canada and the Ontario Genomics Institute. The generation of German data was funded by grants from the Bavarian State Ministry of Education, Culture, Science and the Arts (Barcoding Fauna Bavarica, BFB) and the German Federal Ministry of Education and Research (German Barcode of Life GBOL2: BMBF #01LI1101B). Molecular laboratory infrastructure and sequencing within the Nature of The Netherlands project was funded by a FES grant from the Dutch Ministry of Finance. The Finnish Barcode of Life project was funded by the Kone Foundation, the Finnish Cultural Foundation, and the University of Oulu. Support for this research was provided by the Spanish Ministerio de Economía y Competitividad [projects CGL2010-21226/BOS and CGL2013-48277-P to R.V.], by a Région Haute-Normandie post-doctoral fellowship [to R.R.], and by a Marie Curie International Outgoing Fellowship within the 7th European Community Framework Programme [project no. 625997 to V.D.]. Sequencing of Norwegian material was supported by the Natural History Museum, University of Oslo, and the Norwegian Barcode of Life Network (NorBOL). Sequencing within the framework of the Lepidoptera of the Alps Campaign was supported by the Promotion of Educational Policies, University and Research Department of the Autonomous Province of Bolzano – South Tyrol with funds to the project “Genetic biodiversity archive - DNA barcoding of Lepidoptera of the central Alpine region (South, East and North Tyrol),” the Austrian Federal Ministry of Science, Research and Economics with funds to ABOL (Austrian Barcode of Life), and by the regional institutions Tiroler Landesmuseen, inatura and Landesmuseum Kärnten. S.M.K. was funded by the international fellowship program at Stockholm University and Finnish Cultural Foundation.
Sampling of Lepidoptera from Upper-Normandy (France) was supported by a grant by Conseil Régional de Haute-Normandie to Thibaud Decaëns, then member of the ECODIV laboratory at the University of Rouen.
ACKNOWLEDGMENTS
We are indebted to a large number of taxonomic experts and collectors who have contributed to the Lepidoptera Barcode of Life campaign in multiple ways, especially by providing material for DNA barcoding and by identifying specimens. We are grateful to the ICT staff at Naturalis, and especially David Heijkamp, for hosting the Monophylizer web service through their infrastructure. We are very grateful to staff at the Biodiversity Institute of Ontario for their key role in generating sequences, photographing specimens, entering data elements into BOLD, and aiding the curation of this information. We thank Jess Johansson and Megan Milton for help with GenBank submissions and BOLD data set. Finally, we thank Frank Andersson, Karl Kjer, and an anonymous referee for their comments on an earlier version of the study.