YAMP: a framework enabling reproducibility in metagenomics research

YAMP is a user-friendly workflow that enables the analysis of whole shotgun metagenomics data while using containerisation to ensure computational reproducibility and facilitate collaborative research. YAMP can be executed on any UNIX-like system, and offers seamless support for multiple job schedulers as well as for Amazon AWS cloud. Although YAMP has been developed to be ready-to-use by non-experts, bioinformaticians will appreciate its flexibility, modularisation, and simple customisation. The YAMP script, parameters, and documentation are available at https://github.com/alesssia/YAMP.


Background
Thanks to the increased cost-effectiveness of high-throughput technologies, the number of studies collecting and analysing large amounts of data has surged, opening new challenges for data analysis and research reproducibility. A ubiquitous lack of repeatability and reproducibility has in fact been observed, and a recent Nature survey of 1,576 researchers showed that more than 50% and 70% of them failed to reproduce their own and other scientists' experiments, respectively [1]. Unavailability of primary data and computational experimentation have been named as the major culprits for this reproducibility crisis, with many studies relying on ad hoc scripts and not publishing the necessary code and/or sufficient details to reproduce the reported results [2,3,4], and with variations across workstations and operating systems representing another obstacle [5,6]. To overcome this issue, tools allowing the development of workflows [7] and software containers [8] have been proposed [9]. In fact, containerised well-structured workflows allow storing every detail of the workflow execution, including software versions and parameters (provenance, [10]), and nullify systems' variations [6], guaranteeing studies' repeatability and reproducibility. Containerised workflows also facilitate collaborative projects, by ensuring identical analysis processes, and thus comparable results, and allow the automation of data-intensive repetitive tasks [11]. Moreover, they save users with little bioinformatics or computational expertise from the hassles of installing the required pieces of software, and of designing and implementing often complex analysis orchestrations, while expert bioinformaticians can use them as a starting point for customised analyses, thus avoiding redundant solutions.
In metagenomics research, several analysis pipelines have been developed so far. However, they either do not support containerisation (e.g., MetAMOS [12], MOCAT2 [13]), thus potentially compromising reproducibility, or require users to upload their unpublished and/or confidential data to third-party servers (e.g., MG-RAST [14]), where, depending on the available resources, the data can spend several days waiting to be processed [15], and where data privacy is a concern for some researchers [16].
Here we present "Yet Another Metagenomic Pipeline" (YAMP), a ready-to-use containerised workflow that processes raw shotgun metagenomics sequencing data up to the taxonomic and functional annotation. YAMP is implemented in Nextflow, a framework that allows defining workflows that are highly parallel, easily portable (including on distributed systems), and very flexible and customisable [6]. We integrated our Nextflow pipeline with a Docker (https://www.docker.com) and a Singularity (http://singularity.lbl.gov) container. While the former defines a platform-independent virtualised light-weight operating system that includes all the pieces of software required by YAMP and traces their versioning, the latter allows these features to be transferred to High Performance Computing (HPC) systems, with which Docker is inherently incompatible.

The YAMP workflow
The YAMP workflow is composed of three analysis blocks: the quality control (QC; Figure 1, green rectangle), complemented by several steps of assessment and visualisation of data quality (Figure 1, orange rectangle), and the community characterisation (Figure 1, pink rectangle). The QC starts with an optional de-duplication step, where identical reads, potentially generated by PCR amplification, are removed. Next, reads are filtered to remove adapters, known artefacts, and phiX, and are then quality-trimmed. Reads that become too short after trimming are discarded, while, when paired-end reads are at hand, singleton reads (i.e., paired-end reads whose mates have been removed) are preserved in order to retain as much information as possible. Finally, reads are screened for contaminants, i.e., reads that do not belong to the studied ecosystem. The QC is performed by means of a number of tools belonging to the BBmap suite [17], namely clumpify, BBduk, and BBwrap, which are computationally efficient and allow processing both single- and paired-end reads from all the major sequencing platforms (i.e., Illumina, Roche 454 pyrosequencing, Sanger, Ion Torrent, PacBio, and Oxford Nanopore). FastQC [18] is used to perform QC assessment and visualisation of read quality, and to evaluate the effectiveness of the trimming and decontamination steps. The QC is followed by multiple steps aimed at the taxonomic and functional characterisation of the microbial community. Taxonomic binning and profiling, i.e., the identification and quantification of the micro-organisms present in the metagenomics sample, is performed with MetaPhlAn2 [19], which uses clade-specific markers both to detect the micro-organisms and to estimate their relative abundance.
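The handling of short reads and singletons described for the QC block can be illustrated with a minimal Python sketch. It is not the BBduk implementation used by YAMP: the function name, the in-memory read representation, and the minimum length passed in the usage example are assumptions made purely for illustration.

```python
from typing import List, Tuple

# Hypothetical in-memory representation: (read identifier, trimmed sequence).
Read = Tuple[str, str]


def split_pairs_after_trimming(
    pairs: List[Tuple[Read, Read]], min_length: int
) -> Tuple[List[Tuple[Read, Read]], List[Read]]:
    """Separate trimmed read pairs into surviving pairs and singletons.

    A read is kept only if its trimmed sequence has at least `min_length`
    bases. When exactly one mate of a pair survives, it is preserved as a
    singleton instead of being thrown away with its mate.
    """
    kept_pairs: List[Tuple[Read, Read]] = []
    singletons: List[Read] = []
    for forward, reverse in pairs:
        forward_ok = len(forward[1]) >= min_length
        reverse_ok = len(reverse[1]) >= min_length
        if forward_ok and reverse_ok:
            kept_pairs.append((forward, reverse))
        elif forward_ok:
            singletons.append(forward)
        elif reverse_ok:
            singletons.append(reverse)
        # If both mates are too short, the whole pair is discarded.
    return kept_pairs, singletons


if __name__ == "__main__":
    example = [
        (("r1/1", "ACGTACGT"), ("r1/2", "TTGCAAGC")),  # both mates survive
        (("r2/1", "ACGTACGT"), ("r2/2", "AC")),        # reverse mate too short
        (("r3/1", "A"), ("r3/2", "C")),                # both mates too short
    ]
    # min_length=5 is an arbitrary illustrative threshold, not a YAMP default.
    pairs_out, singletons_out = split_pairs_after_trimming(example, min_length=5)
    print(len(pairs_out), "pairs kept;", len(singletons_out), "singleton kept")
```

Keeping singletons in a separate collection mirrors the idea that they are retained for downstream profiling while no longer being treated as properly paired reads.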
The functional capabilities of the microbial community, i.e., the functions carried out by the identified micro-organisms, are assessed by the HUMAnN2 pipeline [20], which first stratifies the community in known and unclassified organisms using the MetaPhlAn2 results and the ChocoPhlAn pan-genome database, and then combines these results with those obtained through an organism-agnostic search on the UniRef proteomic database. QIIME [21] is used to evaluate multiple α-diversity measures based on the taxonomic profile.
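To make the notion of α-diversity concrete, the following Python sketch computes three widely used measures from a taxonomic profile. The text does not state which measures are evaluated, so the choice of observed richness, Shannon index, and Gini-Simpson index is an assumption; the code is a plain re-implementation of the textbook formulas and does not use the QIIME API.

```python
import math
from typing import Dict, List


def _proportions(abundances: Dict[str, float]) -> List[float]:
    """Convert non-negative abundances into proportions of the observed taxa."""
    total = sum(abundances.values())
    if total <= 0:
        raise ValueError("the profile must contain at least one positive abundance")
    return [value / total for value in abundances.values() if value > 0]


def observed_richness(abundances: Dict[str, float]) -> int:
    """Number of taxa with a non-zero abundance."""
    return len(_proportions(abundances))


def shannon_index(abundances: Dict[str, float]) -> float:
    """Shannon index H = -sum(p_i * ln(p_i)), using the natural logarithm."""
    return -sum(p * math.log(p) for p in _proportions(abundances))


def simpson_index(abundances: Dict[str, float]) -> float:
    """Gini-Simpson index 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in _proportions(abundances))


if __name__ == "__main__":
    # Hypothetical profile with four equally abundant species.
    profile = {"species_a": 25.0, "species_b": 25.0,
               "species_c": 25.0, "species_d": 25.0}
    print(observed_richness(profile))
    print(shannon_index(profile))
    print(simpson_index(profile))
```

For a perfectly even community of n taxa the Shannon index equals ln(n) and the Gini-Simpson index equals 1 - 1/n, which gives a quick sanity check for any implementation.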
YAMP accepts as input both single- and paired-end FASTQ files, and users can customise the workflow execution either by using command line options or by modifying a simple plain-text configuration file, where parameters are set as key-value pairs. While the parameters can be tuned according to the dataset at hand, to assist non-expert users in their analyses we provide a set of default parameters derived from our own analysis experience. The output generated by YAMP includes a FASTQ file of QC'ed reads, the taxonomy composition along with the microbe, gene and pathway relative abundances, the pathway coverage, and multiple α-diversity measures. An option allows users to retain temporary files, such as those generated by the QC steps or during the HUMAnN2 execution. Additionally, YAMP outputs several QC reports, a detailed log file recording information about each analysis step (Supplementary Figure S1), and statistics of memory usage and time of execution (Supplementary Figure S2).
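The interplay between default key-value parameters and command line options can be sketched in Python as follows. This is only a conceptual illustration: in YAMP the parsing and precedence are handled by Nextflow itself, and the parameter names (`threads`, `min_length`), the comment syntax, and the `--name value` option style used here are hypothetical.

```python
from typing import Dict, List


def parse_key_values(text: str) -> Dict[str, str]:
    """Read 'key = value' lines, skipping blank lines and '//' comment lines."""
    params: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("//") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Surrounding quotes are dropped so that 'x' and x are treated alike.
        params[key.strip()] = value.strip().strip("'\"")
    return params


def apply_overrides(defaults: Dict[str, str], cli_args: List[str]) -> Dict[str, str]:
    """Return a copy of the defaults where '--name value' pairs take precedence."""
    merged = dict(defaults)
    for flag, value in zip(cli_args[::2], cli_args[1::2]):
        if flag.startswith("--"):
            merged[flag[2:]] = value
    return merged


if __name__ == "__main__":
    # Hypothetical configuration content; not taken from the YAMP repository.
    config_text = """
    // illustrative defaults
    threads = 4
    min_length = 60
    """
    defaults = parse_key_values(config_text)
    print(apply_overrides(defaults, ["--threads", "8"]))
```

Keeping the defaults in a file and allowing per-run overrides means that a non-expert can run the workflow unchanged, while any deviation from the defaults is explicit in the command used.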

Results
To facilitate the discussion of YAMP's computational requirements, and to assess its ability to correctly identify microbial communities, we analysed 18 randomly selected samples from six different body sites sequenced during the Phase III of the Human Microbiome Project [22] (Table 1). On average, the selected samples included 12.6M paired-end reads (25.2M reads in total), which yielded 13.3M QC'ed reads (including both paired-end and singleton reads), and were processed in an average time of two hours using four threads on a machine equipped with a 2.60GHz Intel Xeon processor and 32 GB of RAM (Table 1). At the phylum level, each body site showed a characteristic signature (Figure 2), with a predominance of Actinobacteria in the airways, Firmicutes in the vagina, Bacteroidetes in the stool, and a mixture of Actinobacteria, Firmicutes and Proteobacteria in the oral cavity, as already observed in previous studies [23]. A site-specific microbial signature was also present at the species level, where both the principal coordinate analysis (PCoA) evaluated using the Bray-Curtis dissimilarity (Supplementary Figures S3 and S4), and the hierarchical clustering computed on the Manhattan distances among species relative abundances (Figure 3) showed that the taxonomy composition was sufficient to discriminate among body sites, even though it had limited ability in distinguishing among different loci in the oral cavity.

Discussion
In conclusion, with YAMP, we provide a user-friendly workflow that enables the analysis of whole shotgun metagenomics data. By supporting containerisation, YAMP allows for computational reproducibility, also enabling collaborative studies. In fact, while software versions are described in the Docker/Singularity container, the Nextflow script and configuration file capture all the details needed to fully track each step of data processing, thus satisfying the provenance requirements. Indeed, to ensure reproducibility, researchers need only provide the YAMP configuration file and a link to the container image. Being based on Nextflow, YAMP runs on any UNIX-like system, provides out-of-the-box support for several job schedulers (e.g., PBS, SGE, SLURM) and for the Amazon AWS cloud, and its integration with Docker/Singularity is completely user-transparent. Moreover, YAMP does not require users to upload unpublished and/or confidential data to third-party servers, as for instance required by the MG-RAST [14] or EBI Metagenome [24] pipeline. Finally, while YAMP has been developed to be ready to use by non-experts, and potentially does not require any software installation or parameter tuning, bioinformaticians will value its flexibility and simple customisation. In fact, the well-defined YAMP modularisation and the usage of standard data formats allow both an easy integration of new analysis steps and a customisation of the existing ones.
YAMP is made available as a Nextflow script, which allows a user-friendly execution via the command line. The source code is available in the YAMP GitHub repository (https://github.com/alesssia/YAMP), which includes a wiki with full documentation and several tutorials. The Docker/Singularity image can be downloaded and installed from DockerHub (https://hub.docker.com/r/alesssia/yampdocker).

Potential implications
YAMP has been designed with the specific goals of enabling reproducible metagenomics analyses, facilitating collaborative projects, and helping researchers with limited computational experience who are approaching this field of research. However, we are confident that other areas of research would be aided by a more widespread use of containerised well-structured workflows. Indeed, as outlined in the Background section, a lack of reproducibility is nowadays ubiquitous, and, besides undermining the credibility of scientific research, it has an economic cost, quantified, for instance, at US$28B/year for preclinical research [25]. On the other hand, ensuring reproducibility does not come for free: anecdotal evidence suggests that the time spent on a project may increase by 30-50% [1], and that reproducing the analysis of a single computational biology paper can require up to 280 hours [26]. YAMP represents a proof-of-concept showing a simple way to enable reproducible and collaborative research. We also advocate the sharing of such containerised workflows, which will benefit a wide group of researchers, regardless of their computational experience [11].

Data Availability
The 18 randomly selected samples used to assess YAMP belong to the Phase III of the Human Microbiome Project [22], and were downloaded from the European Nucleotide Archive website (Study accession number: PRJNA275349, https://www.ebi.ac.uk/ena/data/view/PRJNA275349). Samples were collected from healthy adults residing in the USA at the time of sample collection. After genomic DNA extraction, metagenomics library preparation was performed using the NexteraXT library construction protocol. Paired-end metagenomics sequencing was performed on the Illumina HiSeq2000 platform with a read length of 100 bp. Samples' accession numbers are reported in Table 1.

Data Analysis
Samples were processed with YAMP using the default parameters, as defined in the published YAMP configuration file (https://raw.githubusercontent.com/alesssia/YAMP/master/nextflow.config). The Bray-Curtis dissimilarity values were evaluated using the species relative abundances as estimated by YAMP using MetaPhlAn2 [19] and the vegdist function in the vegan R package (version 2.4.3) [27]. Principal coordinate analysis (PCoA) was evaluated on the Bray-Curtis dissimilarity values using the pcoa function in the ape R package (version 4.1) [28]. Hierarchical clustering was computed using the Manhattan distance among species relative abundances and the pvclust function in the pvclust R package (version 2.0) [29]. 10,000 bootstrap iterations were used to evaluate the P values supporting each cluster.
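The two distance measures used in this analysis can be written out explicitly. The Python sketch below is an illustrative re-implementation of the standard definitions and is not the code used for the reported results, which relied on the R packages cited above; the sample profiles and species names in the usage example are hypothetical.

```python
from typing import Dict


def manhattan_distance(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Sum of absolute abundance differences over the union of taxa."""
    taxa = set(a) | set(b)
    return sum(abs(a.get(t, 0.0) - b.get(t, 0.0)) for t in taxa)


def bray_curtis(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i)."""
    taxa = set(a) | set(b)
    denominator = sum(a.get(t, 0.0) + b.get(t, 0.0) for t in taxa)
    if denominator == 0:
        raise ValueError("both profiles are empty")
    return manhattan_distance(a, b) / denominator


if __name__ == "__main__":
    # Hypothetical species relative abundances for two samples.
    sample_1 = {"species_a": 0.5, "species_b": 0.5}
    sample_2 = {"species_a": 0.5, "species_c": 0.5}
    print(manhattan_distance(sample_1, sample_2))
    print(bray_curtis(sample_1, sample_2))
```

When both profiles sum to the same total T, as is the case for relative abundances, the Bray-Curtis dissimilarity equals the Manhattan distance divided by 2T, so the two measures rank pairs of samples identically and differ only in scale.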