PLANiTS: a curated sequence reference dataset for plant ITS DNA metabarcoding

Abstract DNA metabarcoding combines DNA barcoding with high-throughput sequencing to identify different taxa within environmental communities. The ITS has already been proposed and widely used as universal barcode marker for plants, but a comprehensive, updated and accurate reference dataset of plant ITS sequences has not been available so far. Here, we constructed reference datasets of Viridiplantae ITS1, ITS2 and entire ITS sequences including both Chlorophyta and Streptophyta. The sequences were retrieved from NCBI, and the ITS region was extracted. The sequences underwent identity check to remove misidentified records and were clustered at 99% identity to reduce redundancy and computational effort. For this step, we developed a script called ‘better clustering for QIIME’ (bc4q) to ensure that the representative sequences are chosen according to the composition of the cluster at a different taxonomic level. The three datasets obtained with the bc4q script are PLANiTS1 (100 224 sequences), PLANiTS2 (96 771 sequences) and PLANiTS (97 550 sequences), and all are pre-formatted for QIIME, being this the most used bioinformatic pipeline for metabarcoding analysis. Being curated and updated reference databases, PLANiTS1, PLANiTS2 and PLANiTS are proposed as a reliable, pivotal first step for a general standardization of plant DNA metabarcoding studies. The bc4q script is presented as a new tool useful in each research dealing with sequences clustering. Database URL: https://github.com/apallavicini/bc4q; https://github.com/apallavicini/PLANiTS.


Introduction
DNA metabarcoding allows the automated identification of species present in an environmental sample (1) combining DNA barcoding and high-throughput sequencing techniques. Nowadays, DNA metabarcoding has found a broad range of applications, representing a key tool for understanding evolutionary history, functions and ecological diversity of organismal communities (2). It has also been applied for plant identification in different fields, such as palynology (3), melissopalynology (4), plant-pollinator interactions (5)(6)(7), identification of allergens (8,9), environmental monitoring (10), dietary analyses (11,12) and composition of herbal products (13).
Though many biases can be introduced by sampling strategy and sample processing, primer design, sequencing errors and bioinformatics pipelines (14), the accuracy and reproducibility of DNA metabarcoding heavily depend on the availability of a comprehensive, updated, accurate and standardized reference dataset for the targeted DNA barcode (5).
Sequence datasets of the nuclear ribosomal internal transcribed spacer (ITS rRNA; UNITE for fungi, 15, and ITSoneDB for eukaryotes, 16), the small (16S rRNA) and the large subunit ribosomal RNA (23S rRNA) for prokaryotes (RDP, 17; Greengenes, 18; SILVA, 19), the small (18S rRNA) and the large subunit ribosomal RNA (28S rRNA) for eukaryotes (SILVA,19) and cytochrome oxidase subunit I (COI) for metazoans (CO1 Classifier, 20) have already been established. On the contrary, an acknowledged reference dataset specifically dedicated to the kingdom Planta is still missing, likely due to the deficiency of a robust, effective and largely accepted DNA barcode for plants. In this context, the determination of an official plant DNA barcode is still debated; indeed, different markers (either used individually or combined) have been proposed among several plastidial (e.g. rbcL and matK) and nuclear (e.g. ITS) loci (21).
However, multiple pieces of evidence support the suitability of the ITS region, ITS2 in particular, as the preferential marker for DNA metabarcoding in plants. This is mostly due to high levels of interspecific divergence and thus discriminatory power, which are higher than in short plastidial barcodes; further, it presents fewer amplification and sequencing problems (22). While both ITS1 and ITS2 subregions have been tested successfully over a broad range of plant taxa (23)(24)(25), the recent development of long read sequencing technologies such as MinION (Oxford Nanopore Technologies) or SMRT sequencing (Pacific Biosciences) enables the coverage of the whole ITS region. We expect that in the next years the discriminatory power of this barcode will be exploited for plant studies, as it has already been demonstrated for fungi (26).
National plant datasets have been established, such as the Great Britain database of national native plants and grasses (27), which has been recently used for DNA metabarcoding analyses (9). Other datasets regard specific geographic or ecologically characterized areas of interest, such as the one prepared for tropical herbal plants (28). Even though these datasets are of high quality, these resources are limited to only a small fraction of the whole plant biodiversity. This is also the case of other, global resources, such as the Barcode of Life Data system (BOLD; 29), which considers so far only about the 20% of the formally described land plants (30).
Sickel et al. (31) integrated the global 'ITS2 database' in the bioinformatic pipeline (32) when performing a study of honeybee-collected pollen using DNA metabarcoding. However, this database was updated in 2015 for the last time and held mainly information on the secondary structure of the ITS2 region but it does not include any identity check of the sequences. Recently, a rbcL reference dataset was developed by Bell et al. (5), as this marker, combined with the ITS2, aids in the identification at species level of some angiosperms.
At present, the large part of plant DNA metabarcoding studies rely on sequence data stored at and downloaded from NCBI (4,8,11,12,33) and sometimes complemented with own produced sequences (34). However, using the NCBI resources holds a grave disadvantage as the sequences are not checked for mistakes in taxonomy during deposition. Namely, a rather abundant number of sequences assigned to plant taxa but belonging to fungi can be frequently found in NCBI. These records usually belong to fungi that are present on or inside the plant tissue and that are sequenced instead of the target organism. To avoid taxonomic validation of each reference matching a sequence in a DNA metabarcoding survey, new bioinformatic pipelines that detect such taxonomic misidentifications automatically and new dedicated reference databases with a minimized number of misidentified sequences are needed.
In the frame of a DNA metabarcoding project aimed at analyzing plant and fungal diversity in aerobiological samples in Italy, we came across the problem of the lack of a reliable and comprehensive reference dataset for the ITS region of Viridiplanta. To overcome such issue, we provide a new script that checks whether the generated clusters in the reference database are composed mainly by one species and selects the representative sequence according to this. Furthermore, we generated a new, curated ITS reference dataset called 'PLANiTS' that we present here along with its innovative features.

Materials and methods
Viridiplants ITS sequences were downloaded from NCBI on 11 April 2019 using the Entrez query 'ITS' OR 'internal transcribed spacer' OR 'rRNA' OR 'ribosomal RNA' AND 'Viridiplantae' [ORGN] AND '200: 4000' [SLEN], to exclude genome contigs. In this way, we retrieved sequences of the ITS1 and the ITS2 fragments and complete the ITS region.
As one of the key points during the construction of a plant ITS database is the removal of sequences deposited in NCBI as plant sequences, but truly belonging to fungi, this study tackled this problem using two similarity-based methods, as follows: (i) MEGAN (35) assigns sequences to taxa using the last common ancestor (LCA) algorithm and displays the inferred taxonomy. We used it to highlight sequences that, though having putatively the same taxonomy (based on the header of the fasta file), could not be aligned due to sequence divergence. In this case, the LCA algorithm cannot assign the reads at species level but places them into the 'Not assigned' or other higher taxonomic level, such as Eukarya. An ITS database with the downloaded sequences was firstly formatted using the 'build' tool from MALT 0.4.1 package (36) using the default parameters and 100% identity. Secondly the 'run' tool was used to run the formatted database against DNA metabarcoding samples that contained plant sequences. In this case, we used as test some aerobiological samples coming from our DNA metabarcoding study (Banchi et al. under review; see Results and discussion section) that presented a wide range of plant taxa and were suitable to this purpose. The produced rma files (compressed, indexed binary files that include sequence reads, alignments and taxonomic assignments) were imported in MEGAN 6 (35) for the benchmarking. The 'Inspector' tool provided in MEGAN was used to view the individual sequence comparisons upon which the assignment of a sequence to a taxon was based (35). Sequences suspected to be GenBank entries with a wrong taxonomic assignment were checked manually and further blasted in NCBI.
(ii) The BLAST command-line algorithm was used to blast the curated fungal UNITE database (15) against the downloaded ITS sequences to detect any positive match, potentially belonging to fungi. First, a database with sequences downloaded from NCBI was constructed using the 'makeblastdb' command, and then the UNITE database was blasted against it with the 'blastn' command at default parameters. The sequences that had a similarity hit were checked in NCBI.
As a result of the control procedures, several taxonomically misidentified fungal sequences were detected and discarded.
ITSx (37) was then used to extract ITS1, 5.8 and ITS2 fragments from the GenBank sequences.
To reconstruct the complete ITS region, the sequences that presented ITS1, 5.8 and ITS2 regions after ITSx were concatenated to obtain the complete region.
For each sequence, taxonomy was added at Phylum, Class, Order, Family, Genus and Species level with entrez_qiime (NA: not assigned; 38) and manually edited when needed.
To reduce redundancy and computational effort, the sequences were clustered with CD-HIT (39) at 99% identity. CD-HIT uses a greedy incremental clustering algorithm method and selects the longest sequence as representative of a certain cluster. Even if this approach is extremely fast and commonly accepted to make non redundant databases, it is prone to errors (i) if the longest sequence of a cluster has been erroneously identified while the others were corrected, or (ii) if the DNA marker has not enough discriminating power to resolve a certain taxon (e.g. at species level) and therefore a species is selected among others correctly represented in the cluster. To overcome these biases, we have developed a script called 'better clusters for QIIME' (bc4q) that checks whether the generated clusters are composed mainly by one species (more than 90% of the members of each cluster) and ensures that the representative sequence of each cluster is the most frequent one, by eventually replacing it with the one showing the highest number of counts. If no defined main species can be found in a cluster, the check is repeated at the genus level, and in case of no clear genus, at family level, re-ranking the members of the clusters to ensure that the most frequent genus (or family) is picked. If the genus (or family) check succeeds for a cluster, the cluster is kept, and the species (or the genus) is saved as 'NA' (not assigned) in the output file for taxonomy association. If the check fails also at the family level, the cluster is discarded, and the event is reported to the log together with the taxonomic information of the members of the failed cluster for manual review (Figure 1; code available at https://github.com/apallavicini/bc4q).
After a first run of the script, we manually checked those 'warning' clusters to identify misidentified plant accessions in ITS1, ITS2 and total ITS databases. These misidentified accessions were removed with CLC Genomics Workbench v. 12 (Qiagen) from the un-clustered databases that were clustered again with CD-HIT (39) at 99% identity. The script was re-run on the cleaned clusters. At the end, we produced three curated datasets which were named PLANiTS, PLANiTS1 and PLANiTS2, respectively, and are freely available (https://github.com/apallavicini/PLANiTS) in QIIME format, being QIIME the most common tool to analyze metabarcoding data (39). Moreover, the same format is required in the Microbial Genomics module of CLC Genomics Workbench v.12 (Qiagen). We are committed to updating the datasets with the newest GenBank sequences every 6 months to 1 year.
To further validate the performance of PLANiTS in plant taxonomic assignment, we compared its performance with the ITS2 Database (32) on the same dataset. From the ITS2 Database (32; http://its2.bioapps.biozentrum. uni-wuerzburg.de/), we downloaded the Viridiplantae sequences (consisting of 114 733 sequences) and we formatted them in QIIME format.
The PLANiTS database was not tested because the entire ITS fragment (up to 1500 bp) is still not sequenced by the most used DNA metabarcoding sequencing technologies used so far (e.g. Illumina, IonTorrent), which can read only up to ∼400 bp. However, this database will be useful for long-read sequencing technologies that will become more common in the next years. The PLANiTS1 database was created because of the research that we were contemporaneously conducted on aerobiological samples, which induced us to generate the PLANiTS databases, as described in the Results and discussion section. In this study now under review, indeed, only the ITS2 was amplified, being it the most common marker in plant and fungal DNA barcoding and metabarcoding studies (21,25) also in aerobiological field (14). Therefore, we principally chose to test PLAN-iTS2.
As NCBI does not double check the uploaded sequences, manual curation is needed to prevent the inclusion of sequences that may lead to incorrect taxonomic assignment and propagation of misidentifications (30). This issue is of particular importance for plants, as endophytic fungi are commonly amplified instead of-or co-amplified withtheir targeted plant host. Fungi are known components of the plant microbiota present superficially or inside plant tissues as parasites, endophytes, pathogens and symptomless infections (41). It is therefore not surprising that the majority of the misidentified sequences corresponded to filamentous fungi, mould and yeast, which are widely distributed either as plant pathogens or endophytes (Alternaria and Aureobasidium) or are common in environments, such as soil and air (Cladosporium, Aspergillus, Fusarium, Phoma). Many of these sequences were erroneously included as Viridiplantae in other databases, such as the ITS2 Database of Ankebrand et al. (32).
In DNA metabarcoding studies, even if plant-specific primers are used, the potential co-amplification with other organisms is difficult to prevent (21); alternatively, in some region was always complete, as sequenced between ITS1 and ITS2. The accessions that presented the three markers were then concatenated to produce the ITS reference, consisting of 267 752 ITS sequences.
After the analysis with the bc4q script, 16 154 ITS1, 17 301 ITS2 and 15 729 whole ITS clusters were re-assigned at genus level as they contained more than one species. Moreover, 1529 ITS1, 1894 ITS2 and 1380 ITS clusters were highlighted as they contain more than two species. After the manual check, a total of 1558 accessions were considered as misidentified and were removed from the datasets.
After all the cleaning and clustering steps we obtained three curated final datasets (Table 1) Among the total 168 viridiplant orders, 130-134 (∼80%) are present in the reference datasets (Table 2). During the compilation of this reference dataset, we noticed that some classes are under-represented for the ITS barcode (i.e. the hornworts Anthocerotopsida, the liverwort Marchantiopsida and the fern Polypodiopsida). An increased sequencing effort in these groups would lead to a substantial improvement of the representation of the plant kingdom, which would lead to more complete and reliable taxonomic results in DNA barcoding and metabarcoding studies.   Orders in bold are the one present in PLANiTS reference dataset. * Orders present only in PLANiTS1, * * orders present only in PLANiTS2. For each order the total number of sequences identified in the three databases (PLANiTS, PLANiTS1 and PLANiTS2) is reported, and the corresponding number of sequences identified at the species level is additionally reported in parentheses In the three datasets, 15 691 ITS1, 16 949 ITS2 and 15 643 whole ITS clusters were re-assigned at genus level, while 1204 ITS1, 1338 ITS2 and 1049 ITS clusters were reassigned at family level. Cluster re-assigned at family level mainly contain few recurrent species from the following plant families: Scenedesmaceae and Siphonocladaceae (Chlorophyceae, Chlorophyta), Brachytheciaceae (Bryopsida, Streptophyta), Lejeuneaceae (Jungermanniopsida, Streptophyta) and Araceae (Liliopsida, Streptophyta). Whether a genus or species identification is needed for these genera, we suggest improving the resolution of molecular identification by adding another barcode, such as matK.
The bc4q script can also serve as a new resource for the construction of other reference databases, as the clustering step is a crucial part of data management.
We have successfully tested PLANiTS2 reference database with an ad hoc mock community and 110 aerobiological samples collected with volumetric samplers in different sites of North and Central Italy from a threeseason survey in which DNA metabarcoding was applied to study plant diversity (Banchi et al. under review). The same samples were also analyzed using ITS2 Database (32) in order to compare the performance of the two resources.
The mock community was composed by seven plant taxa: one species of Chlorophyta, i.e. Trebouxia gelatinosa, and six species of Streptophyta, i.e. the Gymnosperm Taxus baccata and the Angiosperms Acer campestre, Campanula sp., Corylus avellana, Tulipa gesneriana and Wisteria sp.
The plant samples were chosen among the plants available at the Botanic Garden of the University of Trieste with the aim to include the widest range of taxa. The pollen was collected timely in the seasons. Chlorophyta was chosen among the algal cultures stored at the University of Trieste, representing a cosmopolitan terrestrial algal genus.
DNA extraction, library preparation and sequencing were performed as described in Banchi et al. (42) using as forward primer the reverse complement of ITS-u2 and ITS-p4 as reverse primer (21). Taxonomic assignment was performed with QIIME2 (40) with the alignment-based taxonomy consensus method based on vsearch 2.0.3 (43) applying the 97% identity limit and PLANiTS2 or ITS2 Database (32) as reference.
The taxonomic composition of the mock community analyzed with PLANiTS2 allowed for correct assignment at the genus level. In particular, C. avellana, A. campestre and T. gesneriana were assigned at species level, while T. gelatinosa, T. baccata and Wisteria sp. were identified up to their genera.
The analysis of aerobiological samples using PLANiTS2 recovered 158 plant genera. Plant taxonomic composition was mostly influenced by season. Corylus was the most abundant genus recovered in spring, whereas in summer and autumn the highest abundance was detected for Brassica followed by Linum, Cucurmis and Daucus (Banchi et al. under review).
With ITS2 Database (32), the total number of genera (168) and the results for the most represented taxa across samples (Corylus, Brassica, Linum, Campanula, Cucumis, Daucus) are comparable between the two databases. However, also here fungal taxa misidentified as plant taxa, are present even in high rank position (i.e. 14th) and represent ∼5% of the sequences. These are listed as Pueraria, Dioscorea, Pteris and Fallopia, belonging instead to Debaryomyces sp., Cladosporium sp., Epicoccum sp. and Filobasidium sp.
These results show how the cleaning and the clustering of plant sequences are important for a reliable taxonomic assignment, especially in the analysis of mixed environmental samples, such as air samples, where plants and fungi are present.
PLANiTS1, PLANiTS2 and PLANiTS are curated, reliable and updated reference databases and we propose them as a pivotal first step for a general standardization of plant DNA metabarcoding studies, in the prospect of facilitating the comparison of data among different researches dealing with plant identification at deep taxonomic level.

Funding
Finanziamenti di Ateneo per progetti di Ricerca scientifica (FRA2016) assigned to L.M. by the University of Trieste.