PEMA: from the raw .fastq files of 16S rRNA and COI marker genes to the (M)OTU-table, a thorough metabarcoding analysis

Background Environmental DNA (eDNA) and metabarcoding, allow the identification of a mixture of individuals and launch a new era in bio- and eco-assessment. A number of steps are required to obtain taxonomically assigned (Molecular) Operational Taxonomic Unit ((M)OTU) tables from raw data. For most of these, a plethora of tools is available; each tool’s execution parameters need to be tailored to reflect each experiment’s idiosyncrasy. Adding to this complexity, for such analyses, the computation capacity of High Performance Computing (HPC) systems is frequently required. Software containerization technologies ease the sharing and running of software packages across operating systems; thus, they strongly facilitate pipeline development and usage. Likewise are programming languages specialized for big data pipelines, incorporating features like roll-back checkpoints and on-demand partial pipeline execution. Findings PEMA is a containerized assembly of key metabarcoding analysis tools with a low effort in setting up, running and customizing to researchers’ needs. Based on third party tools, PEMA performs reads’ pre-processing, clustering to (M)OTUs and taxonomy assignment for 16S rRNA and COI marker gene data. Due to its simplified parameterisation and checkpoint support, PEMA allows users to explore alternative algorithms for specific steps of the pipeline without the need of a complete re-execution. PEMA was evaluated against previously published datasets and achieved comparable quality results. Conclusions Given its time-efficient performance and its quality results, it is suggested that PEMA can be used for accurate eDNA metabarcoding analysis, thus enhancing the applicability of next-generation biodiversity assessment studies.


Background
Metabarcoding inaugurates a new era in bio-and eco-monitoring. However, from the output of a sequencer to an amplicon study analysis results, it takes a long way.
Well-established pipelines are available to process metabarcoding data (mothur [1], QIIME 2 [2], LotuS [3]) for the case of 16S rRNA marker gene and bacterial communities. However, there is none that can be used in a straightforward way for metabarcoding analysis of eukaryotic organisms. For this to be functional, adaptation to other marker genes (e.g COI) is required. Furthermore, the pipelines mentioned above, although entrenched, they still suffer from a series of hurdles: technical difficulties in installation and use, strict limitations in setting parameters for the algorithms invoked, incompetence in partial re-4 execution of an analysis, are among the most prominent.
Moreover, given the computational demands of such analyses, access to High Performance Computing (HPC) systems, might be mandatory, for example, to process studies with large number of samples. This is rather timely given the ongoing investment of national and international efforts (for example [4]) to serve the broad biological community via commonly accessible infrastructures.
PEMA is an open-source pipeline that bundles state-of-the-art bioinformatic tools for all necessary steps of amplicon analysis and aims to address the issues mentioned above. It is designed for paired-end sequencing studies and is implemented in the BigDataScript (BDS) [5] programming language. BDS's ad hoc task parallelism and task synchronization, supports heavyweight computation which PEMA inherits. In addition, BDS supports checkpoint files that can be used for partial re-execution and crash recovery of the pipeline. PEMA builds on this feature to serve tool and parameter exploratory customization for optimal metabarcoding analysis fine tuning. Switching effortlessly between clustering algorithms is a pertinent example. Finally, via the Docker [6] and Singularity [7], the latter HPCcentered, software containerization technologies, PEMA is distributed in an easy to download and install fashion on a range of systems from regular computers, to cloud or HPC environments.
Beyond the technical aspects and from the biology perspective, PEMA supports the metabarcoding analysis of both prokaryotic communities (based on the 16S rRNA marker gene) and eukaryotic ones (based on the COI marker gene).
Two clustering algorithms, Swarm v2 [8] and CROP [9], are employed for the clustering of reads in Molecular Operational Taxonomic Units (MOTUs) in the COI marker gene case. VSEARCH [10] is used for the 16S rRNA gene case. Taxonomy assignment is performed in an alignment-based approach, making use of the CREST LCAClassifier [11] and the Silva database [12]. For the COI marker gene, the RDPClassifier [13] and the MIDORI database [14] are used. In the 16S marker gene case, phylogenybased assignment is also supported, based on RAxML-ng [15], EPA-ng [16] and Silva, as well as ecological and phylogenetic analysis via the "phyloseq" R package [17].
All the pipeline-controlling and third-party-module parameters are defined in a plain parameter-value 5 pair text file. Its straightforward format eases the analysis fine tuning complementary to the aforementioned checkpoint mechanism. A tutorial about PEMA and installation guidance can be found on PEMA's GitHub repository (https://github.com/hariszaf/pema).

Implementation
PEMA's architecture comprises four main parts taking place in tandem ( Figure 1). Detailed description of the tools invoked by PEMA and their licences is included in Additional file 1: Supplementary Methods.

Part 1: Quality control and pre-processing of raw data
Before all else, FastQC [18] is used to obtain an overall read-quality summary. Beyond this visual inspection and to correct the errors are produced by a sequencer, PEMA incorporates a number of tools.
Trimmomatic [19] implements a series of trimming steps, namely: either to remove parts of the sequences corresponding to the adapters or the primers, or to trim and crop parts of the reads, or even remove a read completely, when it fails to the quality filtering sequences. BayesHammer [20], an algorithm of the SPAdes assembly toolkit [21], determines specific-position errors, where a particular base has been called incorrectly and revises them. PANDAseq [22] assembles the overlapping pairedend reads and then the 'obiuniq' program of OBITools [23] groups all the identical sequences in every sample, keeping a track of their abundances. The VSEARCH package [10] is invoked for the chimera removal.  Second, CROP [9], an unsupervised probabilistic Bayesian clustering algorithm that models the clustering process using Birth-death Markov chain Monte Carlo (MCMC). The CROP clustering algorithm is adjusted by a series of parameters need to be tuned by the user (namely b, e and z). These parameters depend on specific dataset properties like the length and the number of reads). PEMA, automatically adjusts b, e and z by collecting such information and applying the CROP recommended parameter-setting rules [9].
Any singletons occurring among the (M)OTUs after this step are removed.

Part 3: Taxonomy assignment
Alignment-based taxonomy assignment is supported for both the 16S rRNA and the COI marker gene analyses. In the 16S rRNA marker gene alignment-based case, the LCAClassifier algorithm of the CREST set of resources and tools [11], is used together with the Silva database [12]  For the COI marker gene, PEMA uses the RDPClassifier [13] and the MIDORI reference to assign taxonomy of the MOTUs. The MIDORI dataset [14] contains quality controlled metazoan mitochondrial gene sequences from GenBank [24].
Intended primarily for studies from less explored environments, phylogeny-based assignment is available for 16S rRNA marker gene data. PEMA maps OTUs to a custom reference tree of 1000 Silvaderived consensus sequences (created using RAxML-ng [15] and gappa (phat algorithm) [25], Figure   2A). PaPaRa [26] and EPA-ng [16] combine the OTU clustering output and the reference tree to produce a phylogeny-aware alignment and map the 16S rRNA OTUs to the custom reference tree. Beyond the context of PEMA, users may visualize the output with tree viewers like iTOL [27] ( Figure 2B).

Part 4: (Μ)OTU and sample abundance tables, biodiversity downstream analysis
In both the 16S rRNA and COI marker gene analysis, an (M)OTU-table is returned by PEMA. For each sample of the analysis, a subfolder containing statistics about the quality of its reads, as well as the 7 taxonomies and their abundances, is also generated.
For the 16S rRNA marker gene case, a downstream analysis of the OTUs including alpha-and betadiversity analysis, taxonomic composition, statistical comparisons and calculation of correlations between samples is also supported, thanks to the "phyloseq" R package [17]. When this option is selected, then besides phyloseq's output, a Multiple Sequence Alignment (MSA) and a phylogenetic tree of the OTUs are returned; for the MSA, the MAFFT [28] aligner is invoked while the latter is being built by RAxML-ng [15].

PEMA container-based installation
An easy way of installing PEMA is via its containers. A dockerized PEMA version is available at https://hub.docker.com/r/hariszaf/pema. Singularity users can pull the PEMA image from https://singularity-hub.org/collections/2295. Between the two containers, the Singularity-based one is recommended for HPC environments due to Singularity's improved security and file accessing properties [29]. For detailed documentation, visit https://github.com/hariszaf/pema.

PEMA output
All PEMA-related files (i.e. intermediate files, final output, checkpoint files and per-analysis-parameters) are grouped in distinct (self-explanatory) subfolders per major PEMA pipeline step. In the last subfolder, the results are further split in folders per sample. This eases further analysis both within the PEMA framework (like partial re-execution for parameter exploration) or beyond.

Results and discussion Evaluation
To evaluate PEMA, two publicly available datasets from published studies were used. For the 16S rRNA marker gene, the dataset reported by study [30] was used while for the COI case, the one of [31] (accession numbers: PRJEB20211 and PRJEB13009 respectively). In both cases, the respective .fastq files were downloaded from ENA-EBI using 'ENA File Downloader version 1.2' [32]. For both cases, PEMA was run on the in-house HPC cluster.

Comparison to existing software
By the means of evaluation, PEMA's features were compared with those QIIME 2 [2], mothur [1], a LotuS [3]. Table 1 presents a detailed comparison among the four tool features in terms of marker ge support, diversity and phylogeny analysis capability, parameter setting and mode of execution, operati system availability and HPC suitability. As shown, PEMA is equally feature-rich, if not richer in certa feature categories to the other software packages. In particular, PEMA's support for COI marker ge studies is distinctive; two methods for the taxonomy assignment are supported and PEMA's eas parameter setting, step-by-step execution and container distribution render it user and analysis friendly.

COI marker gene analysis evaluation
The selected study [31] created two COI libraries of different sizes: COIS (235 bp amplicon size) a COIF (658 bp amplicon size). The sequencing reads of COIS were selected for PEMA's evaluation; t COIF sequencing read pairs had no overlap so as to be merged.
Regarding the creation of the MOTU table, [31] used VSEARCH [10] with a clustering at 97 similarity threshold. Afterwards, the BLAST+ (megablast) algorithm [33] was used against a manual created database including all NCBI GenBank COI sequences of length >100 bp (June 2015) wh excluding environmental sequences and higher taxonomic level information [31]. As discussed in t publication, this approach resulted in 138 unique MOTUs out of which 73 were assigned to speci level. For PEMA's evaluation, the chosen clustering algorithm was Swarm v2, using different optio for the cluster radius (d) parameter (Table 2); according to [8], this is the most important parameter as affects the number of MOTUs that are being created. The resulting MOTUs were classified against t MIDORI reference database [14] using RDPClassifier [13]. The results of the processing of t sequences are shown in Additional file 2: Table S1.
As shown in Table 2  The taxonomic assignment of the retrieved MOTUs is shown in Figures 3-4. Certain .fastq files contained very few reads, such as those for sample ERR1308241, and therefore resulted in zero MOTUs upon the completion of PEMA; thus, these samples are not included in Figure 3. It is worth mentioning that four of the 138 MOTUs were found in both the positive control samples of the published study as well as the present study (Table 3). Also, in three cases, PEMA resulted in the same genera as the positive control samples of the published study (Table 3). The computational time required by PEMA for the completion of the analysis is also shown in Table 2.
Regardless of the value of the d parameter, all analyses were completed in about 2 hours, ie. adequately fast to allow parameter testing and customization.
It is known that the choice of parameters affects the output of each analysis; therefore, it is expected that The execution time and the reported OTU number of each tool are presented in Table 4. LotuS and PEMA resulted in a final number of OTUs comparable to that of [30]. Clearly, due to PEMA's parallelexecution support, the analysis time can be significantly reduced (~1.5 hours in this case). The executional time is depending on the parameters chosen for each software (see Additional file 1: Supplementary Methods).
Due to the non-full overlap of the sequence reads, mothur resulted in an inflated number of OTUs; thus, is was excluded from further analyses. The results of all the pipelines were analyzed with the phyloseq script that is provided with PEMA. The taxonomic assignment of the PEMA retrieved OTUs is shown in Figure 5. The phyla that were found in the samples are similar to the ones that were found in [30].
Although the lowest number of OTUs was found in the marine station (Kal) ( Table 5), which is not in accordance with [30], the general trend of the decreasing number of OTUs with the increasing salinity was observed as it was in [30]. Notably, this result was not observed with the other tested pipelines (Table 5). Furthermore, each of the pipelines resulted in a different taxonomic profile (Additional files 4-6: Figure S1-3) with an extreme case of missing the Order of Betaproteobacteriales (Additional files 7-9: Figure S4-6).  L_LOout_A  2451  67640  156  5878  177  47954  791  17888   L_LOout_B  3432  95835  180  5978  248  62416  1182  19571   L_LOout_C  2987  97592  180  9216  221  59346  1048  24947   L_LOinA  3656  85882  200  6284  264  62253  1176  18346   L_LOinB  2935  76545  173  6357  194  46191  954  18750   L_LOinC  3149  71222  183  5849  219  49890  1012  Overall, PEMA's output is in accordance with [30] and if all the different outputs are taken into consideration, it can be concluded that it performed better than the other tested pipelines in capturing the microbial community diversity and composition of the samples.

Beyond environmental ecology, on-going and future work
PEMA is mainly intended to support eDNA metabarcoding analysis and be directly applicable to nextgeneration biodiversity/ecological assessment studies. Given that community composition analysis may also serve additional research fields, eg. microbial pathology, the potential impact of such pipelines is expected to be much higher. On-going PEMA work focuses on serving a wide scientific audience and on making it applicable to more types of studies. The easy set up and execution of PEMA, allows users to work closely with national and European HPC/e-infrastructures (e.g. ELIXIR Greece [34], LifeWatch ERIC [35], EMBRC ERIC [36]). The aim of this effort is to outreach their communities and address both ongoing as well as future analysis needs.
In a mid-to meso-term perspective, pipeline development work will support the analysis of the ITS and 18S rRNA marker gene studies too. When operational, a holistic biodiversity assessment approach would be possible through PEMA and eDNA; the analysis of the most commonly used marker genes (16S rRNA, ITS, COI/18S rRNA) for each of the greatest taxonomic groups (Bacteria, Fungi, Eukaryotes) would be implemented by the same pipeline. Finally, it is our intention to allow ad hoc and in-house databases to be used as reference for the (M)OTU taxonomy assignment.

Conclusions
PEMA is an accurate, execution friendly and fast pipeline for the metabarcoding analysis of the 16S rRNA and COI marker genes. It provides a per-sample analysis output, different taxonomy assignment methods and graphics-based biodiversity/ecological analysis. This way, in addition to (M)OTU calling, it provides users with both an informative study overview and detailed result snapshots.      Description of the commands, along with their parameters, used to run PEMA, mothur, LotuS and QIIME 2.
Additional file 2: Table S1: Number of sequences after each pre-processing step for the case of COI, dataset from [31].
Additional file 3: Table S2: Number of sequences after each pre-processing step for the case of 16S rRNA gene.
Additional file 4: Figure S1: Bar plot depicting the taxonomy of the retrieved OTUs from LotuS at the Phylum level.
Additional file 5: Figure S2: Bar plot depicting the taxonomy of the retrieved OTUs from QIIME 2 using Deblur at the Phylum level.
Additional file 6: Figure S3: Bar plot depicting the taxonomy of the retrieved OTUs from QIIME 2 using DADA2 at the Phylum level.
Additional file 7: Figure S4: Bar plot depicting the taxonomy of the retrieved OTUs from LotuS at the class of Betaproteobacteriales.
Additional file 8: Figure S5: Bar plot depicting the taxonomy of the retrieved OTUs from QIIME 2 using Deblur at the class of Betaproteobacteriales.
Additional file 9: Figure S6: Bar plot depicting the taxonomy of the retrieved OTUs from PEMA at the class of Betaproteobacteriales.