nf-core/mag: a best-practice pipeline for metagenome hybrid assembly and binning

Abstract The analysis of shotgun metagenomic data provides valuable insights into microbial communities, while allowing resolution at the level of individual genomes. In the absence of complete reference genomes, this requires the reconstruction of metagenome-assembled genomes (MAGs) from sequencing reads. We present the nf-core/mag pipeline for metagenome assembly, binning and taxonomic classification. It can optionally combine short and long reads to increase assembly continuity and utilize sample-wise group information for co-assembly and genome binning. The pipeline is easy to install (all dependencies are provided within containers), portable and reproducible. It is written in Nextflow and developed as part of the nf-core initiative for best-practice pipeline development. All code is hosted on GitHub under the nf-core organization (https://github.com/nf-core/mag) and released under the MIT license.


Comparison of existing pipelines for metagenome assembly and binning
shows a comparison of nf-core/mag to existing pipelines that are implemented using workflow management systems, such as Snakemake or Nextflow, and that allow scalable and easy-to-use application on HPC clusters. Note that this comparison focuses on pipelines for the assembly and binning of metagenomes and does not include pipeline features for other analysis types. In the following, we briefly discuss the tool choices for nf-core/mag v2.1.0 with respect to its core functionalities.
Assembly. To compute assemblies based on short reads, the state-of-the-art assembly tools metaSPAdes and MEGAHIT can be used within nf-core/mag. Both tools showed consistently high performance across a variety of datasets in the CAMI II challenge. MEGAHIT manages memory (RAM) very efficiently, while metaSPAdes has a higher memory requirement, which can cause problems, particularly when computing large metagenomes or co-assemblies. To compute hybrid assemblies based on short and long reads, hybridSPAdes is used. Should the community need assemblies based solely on long reads (for example, when Nanopore sequencing becomes cheaper, allowing higher sequencing depths), the pipeline can be extended with tools such as metaFlye (4).
Binning. To reconstruct MAGs, the pipeline uses the binning tool MetaBAT2, a choice based on user requests. In the CAMI II challenge, MetaBAT2 was shown to achieve a higher purity than other binning tools such as MaxBin2 (5) or CONCOCT (6), while reaching a lower completeness than CONCOCT and a lower ARI (adjusted Rand index) than MaxBin2. In the future, the pipeline can be extended to include different binning tools that can be selected depending on the experimental requirements, or to additionally use bin refinement tools such as DAS Tool (7), which combines the results from multiple binning methods to further improve the quality.
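To make the three binning metrics named above concrete, the following is a minimal sketch of how purity, completeness and the adjusted Rand index could be computed for a toy binning with known ground truth. It is an illustration under stated assumptions, not the CAMI II evaluation code: the function name `binning_metrics` and the toy data are hypothetical, and contigs are counted equally here, whereas benchmark evaluations typically weight by base pairs.

```python
from collections import Counter
from math import comb


def binning_metrics(truth, bins):
    """Per-bin purity/completeness and the adjusted Rand index (ARI).

    truth: dict mapping contig id -> true genome id
    bins:  dict mapping contig id -> predicted bin id
    Assumption: every contig counts once (no base-pair weighting).
    """
    contigs = sorted(set(truth) & set(bins))
    pair_counts = Counter((bins[c], truth[c]) for c in contigs)
    bin_sizes = Counter(bins[c] for c in contigs)
    genome_sizes = Counter(truth[c] for c in contigs)

    per_bin = {}
    for b, size in bin_sizes.items():
        # The genome contributing most contigs to this bin is its "majority" genome.
        genome, overlap = max(
            ((g, n) for (bb, g), n in pair_counts.items() if bb == b),
            key=lambda item: item[1],
        )
        per_bin[b] = {
            "genome": genome,
            "purity": overlap / size,                      # share of the bin that is correct
            "completeness": overlap / genome_sizes[genome],  # share of the genome recovered
        }

    n = len(contigs)
    if n < 2:
        return per_bin, 1.0

    # ARI from the contingency table of bins versus genomes.
    sum_pairs = sum(comb(v, 2) for v in pair_counts.values())
    sum_bins = sum(comb(v, 2) for v in bin_sizes.values())
    sum_genomes = sum(comb(v, 2) for v in genome_sizes.values())
    expected = sum_bins * sum_genomes / comb(n, 2)
    max_index = (sum_bins + sum_genomes) / 2
    if max_index == expected:
        return per_bin, 1.0
    return per_bin, (sum_pairs - expected) / (max_index - expected)


if __name__ == "__main__":
    # Hypothetical toy data: two genomes, two bins, one misplaced contig.
    truth = {"c1": "A", "c2": "A", "c3": "A", "c4": "B", "c5": "B", "c6": "B"}
    bins = {"c1": "bin1", "c2": "bin1", "c3": "bin2", "c4": "bin2", "c5": "bin2", "c6": "bin2"}
    per_bin, ari = binning_metrics(truth, bins)
    for name, metrics in per_bin.items():
        print(name, metrics)
    print("ARI:", round(ari, 3))
```

In this toy case the small bin is pure but incomplete, while the large bin is complete but contaminated, which mirrors the trade-off between purity and completeness described for the different binning tools.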
The quality of the retrieved bins is assessed with BUSCO v5. Another widely used tool to assess the bin quality is CheckM (8), which makes use of lineage-specific marker genes. Both tools seem to perform comparably (9). While CheckM can handle bacterial and archaeal genomes, BUSCO v5 can additionally assess the quality of eukaryotic genomes as well as of genomes for a subset of viruses.
Furthermore, since version 5, BUSCO can automatically select the most specific lineage dataset (containing the single-copy orthologs for benchmarking) for each bin, aiming to increase the resolution and allowing the analysis of genomes of unknown origin (9).
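As an illustration of how single-copy ortholog counts translate into a quality estimate, the sketch below converts marker counts into the percentage categories that BUSCO reports (complete, single-copy, duplicated, fragmented, missing), where complete is the sum of single-copy and duplicated. The function name `busco_percentages` and the example counts are hypothetical; this is not BUSCO's own code.

```python
def busco_percentages(single, duplicated, fragmented, missing):
    """Convert marker counts into BUSCO-style percentage categories.

    All arguments are counts of markers from one lineage dataset.
    'complete' is the sum of single-copy and duplicated markers.
    """
    n_markers = single + duplicated + fragmented + missing
    if n_markers == 0:
        raise ValueError("the lineage dataset must contain at least one marker")

    def pct(count):
        return 100.0 * count / n_markers

    return {
        "complete": pct(single + duplicated),
        "single": pct(single),
        "duplicated": pct(duplicated),
        "fragmented": pct(fragmented),
        "missing": pct(missing),
        "n_markers": n_markers,
    }


if __name__ == "__main__":
    # Hypothetical counts for one bin scored against a 124-marker dataset.
    result = busco_percentages(single=118, duplicated=2, fragmented=1, missing=3)
    for key, value in result.items():
        print(key, round(value, 1))
```

A more specific lineage dataset generally contains more markers, so the same bin is scored at a finer resolution, which is the motivation given above for automatic lineage selection.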
Taxonomic classification. MAGs can be taxonomically classified in this pipeline with CAT/BAT or GTDB-Tk. For this, CAT/BAT aligns predicted protein sequences against the NCBI non-redundant reference database (10), while GTDB-Tk uses a set of marker genes. Thus, GTDB-Tk is computationally more efficient than CAT/BAT, but can only classify MAGs that pass a certain quality threshold. Another important aspect is that GTDB-Tk allows easy maintenance and reproducibility, since it is bi-annually updated together with the GTDB (11) and older database versions remain accessible. Besides CAT/BAT and GTDB-Tk, the tool PhyloPhlAn 3.0 (12) was developed for taxonomic classification and published at around the same time, and it remains to be evaluated how these tools perform in comparison.
Additional pipeline extensions for functional annotation as well as for assembly and bin refinement steps, as partly already implemented in Muffin and ATLAS, are envisioned for future releases.

nf-core and DSL2
All nf-core pipelines must be based on the nf-core template. This template was recently ported to the new Nextflow DSL2 syntax, which enables a modularised structure and reuse of components, with each process using its own BioContainer (13). nf-core/mag has used DSL2 since version 2.0.0. For a detailed description of the nf-core framework, see the main nf-core publication (14) or the nf-core website (https://nf-co.re).

Reproducibility
Generating results that can be reproduced is a major challenge, and many findings published in the scientific literature still cannot be replicated by other scientists (15). The nf-core framework enables reproducibility as described in the 'Material and Methods' section. To additionally ensure that the individual tools generate reproducible results, several reproducibility settings were implemented for nf-core/mag. The results of MEGAHIT and SPAdes, for example, depend on multi-threading parameters, so the number of CPUs used for computation can affect the final results. In nf-core/mag, the number of CPUs used can be fixed and reported accordingly to generate reproducible assemblies. This ensures that the specified number of CPUs is not increased if these processes are re-submitted (as is usually the case for nf-core pipelines when the specified resource requirements for a process do not suffice). For MetaBAT2, deterministic behaviour is enabled by default within this pipeline via a fixed seed parameter. Moreover, specific settings allow the generation and/or saving of databases for BUSCO or CAT, for which the required public databases do not always remain accessible.
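The interaction between re-submission and a fixed CPU count can be sketched as follows. This is a simplified model, assuming that the requested CPUs scale linearly with the attempt number up to a configured maximum; the function `requested_cpus` and its parameters are hypothetical and do not reproduce the pipeline's actual configuration logic.

```python
def requested_cpus(base_cpus, attempt, max_cpus, fixed_cpus=None):
    """Return the number of CPUs requested for a given submission attempt.

    base_cpus:  CPUs requested on the first attempt
    attempt:    1 for the first submission, 2 for the first re-submission, ...
    max_cpus:   upper limit available on the system
    fixed_cpus: if set, this value is used for every attempt (reproducible mode)

    Assumption: without a fixed value, the request grows linearly with the attempt.
    """
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    if fixed_cpus is not None:
        # A fixed setting overrides the escalation, so the thread count
        # (and with it the assembly result) stays the same across retries.
        return fixed_cpus
    return min(base_cpus * attempt, max_cpus)


if __name__ == "__main__":
    for attempt in (1, 2, 3):
        scaled = requested_cpus(base_cpus=8, attempt=attempt, max_cpus=32)
        fixed = requested_cpus(base_cpus=8, attempt=attempt, max_cpus=32, fixed_cpus=8)
        print(f"attempt {attempt}: scaled={scaled}, fixed={fixed}")
```

Under this model, a process that fails on its first attempt would otherwise rerun with a different thread count, which is exactly the situation that a fixed CPU setting is meant to avoid.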

Results on simulated data
We ran nf-core/mag with the different assembly settings on the simulated metagenomic data. The following command was used (Nextflow v21.04.1) to generate short read and hybrid sample-wise assemblies: > nextflow run nf-core/mag -r 2. Besides comparing the resulting assemblies (see Figure 2), we compared the reconstructed genomes with respect to commonly used MAG metrics. The results shown in Figure S1 demonstrate that the average number of contigs per MAG was lower for hybrid assemblies than for short read assemblies.
However, the average quality across all MAGs does not increase when using the hybrid or co-assembly setting. For example, performing group-wise co-assemblies results in a higher average contamination compared to sample-wise assemblies (see Figure S2 d)), and hybrid assemblies result in a lower average completeness compared to short read assemblies (see Figure S2 c)). This is likely caused by the highly increased number of reconstructed MAGs when using these settings (see Figure 2E).
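The averaged MAG-wise metrics discussed above could be derived from per-MAG values roughly as in the following sketch, which groups MAGs by the assembly they originate from and averages each metric. The record layout, the key names and the example values are hypothetical and chosen only for illustration; they are not taken from the pipeline output.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical metric names for one MAG record.
METRICS = ("n_contigs", "length_bp", "completeness", "contamination")


def average_mag_metrics(mags):
    """Average each metric over all MAGs that stem from the same assembly.

    mags: iterable of dicts with an 'assembly' key plus the keys in METRICS.
    Returns a dict: assembly id -> {metric name -> mean value, 'n_mags' -> count}.
    """
    by_assembly = defaultdict(list)
    for mag in mags:
        by_assembly[mag["assembly"]].append(mag)

    summary = {}
    for assembly, group in by_assembly.items():
        summary[assembly] = {metric: mean(m[metric] for m in group) for metric in METRICS}
        summary[assembly]["n_mags"] = len(group)
    return summary


if __name__ == "__main__":
    # Invented example values: one assembly yields two MAGs, the other only one.
    example = [
        {"assembly": "s1_hybrid", "n_contigs": 10, "length_bp": 2_000_000,
         "completeness": 90.0, "contamination": 1.0},
        {"assembly": "s1_hybrid", "n_contigs": 30, "length_bp": 1_000_000,
         "completeness": 50.0, "contamination": 3.0},
        {"assembly": "s1_short", "n_contigs": 200, "length_bp": 2_100_000,
         "completeness": 92.0, "contamination": 1.5},
    ]
    for assembly, values in average_mag_metrics(example).items():
        print(assembly, values)
```

The invented values also show how recovering an additional, less complete MAG can lower the average completeness of an assembly even though more genomes were reconstructed.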

Figure S1: MAG-wise metrics obtained using different nf-core/mag assembly settings on the simulated data: sample-wise assembly, group-wise co-assembly, short read only assembly or hybrid assembly. Each point corresponds to one assembly, originating either from one sample or one group. Metrics displayed are a) average number of contigs per MAG, b) average total MAG length in base pairs, c) average MAG completeness and d) average MAG contamination. The number of contigs and the total length of each MAG were summarised by QUAST, the MAG completeness and contamination were estimated by BUSCO.