Mapache: a flexible pipeline to map ancient DNA

Abstract Summary We introduce mapache, a flexible, robust and scalable pipeline to map, quantify and impute ancient and present-day DNA in a reproducible way. Mapache is implemented in the workflow manager Snakemake and is optimized for low-space consumption, allowing to efficiently (re)map large datasets—such as reference panels and multiple extracts and libraries per sample — to one or several genomes. Mapache can easily be customized or combined with other Snakemake tools. Availability and implementation Mapache is freely available on GitHub (https://github.com/sneuensc/mapache). An extensive manual is provided at https://github.com/sneuensc/mapache/wiki. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
Mapping sequencing reads to a reference genome is a fundamental step in genomic analyses.Compared to modern data, ancient DNA (aDNA) presents a number of challenges as data is sparse, contaminated, and contains post-mortem damage, such as fragmentation and deamination.Furthermore, samples are often re-sequenced many times to maximize their yield, making the mapping step iterative.As a result, robust, efficient, yet easy to use bioinformatic pipelines are needed to map and remap the sequencing reads in a reproducible way, while allowing a wide range of users to modify predefined options and navigate through the results to, e.g., select promising samples, extracts, or libraries.
Two mapping pipelines optimized for aDNA have been published: PALEOMIX (Schubert, et al., 2014) which was designed for stand-alone servers (e.g., without a queuing system) and nfcore/eager (Yates, et al., 2021) which is based on the workflow manager Nextflow (Di Tommaso, et al., 2017) and is designed to make use of distributed computer infrastructures used nowadays (e.g., cluster or cloud services).Both tools have common and specific functionalities.For instance, only PALEOMIX allows for mapping to multiple reference genomes and both tools require considerable amounts of storage for temporary files.
Here, we introduce mapache, a pipeline designed to map in a reproducible manner both ancient and present-day DNA sequences to one or multiple reference genomes and to analyse the resulting alignments by outputting summary statistics.The focus of the pipeline is to have a lightweight, efficient, flexible, and easy to use mapping framework.Mapache is implemented in the workflow manager Snakemake (Mölder, et al., 2021) and it thus inherits its flexibility, scalability, and reproducibility.Snakemake is based on Python allowing for experienced users to easily adapt the code or to combine the workflow with other Snakmake workflows.Compared to PALEOMIX, mapache can run on a single workstation, a HPC system or in the cloud, and compared to nf-core/eager, mapache can map reads to multiple reference genomes.Moreover, unlike the other tools, mapache manages temporal files efficiently, keeping the storage space small.Finally, mapache allows imputing (low-coverage) genomes with GLIMPSE (Rubinacci, et al., 2021).

Reproducibility
Scalability.Mapache scales well for different (but common) research cases: screening for aDNA quality (mapping of many small FASTQ files), mapping of high-coverage genomes (many large FASTQ files) to a single reference, as well as reconstructing microbial genomes from a metagenomic sample (mapping many FASTQ files to many reference genomes).Another important feature in mapache is its good handling of space for intermediate files.By default, these files are automatically erased when no longer needed, which is especially important when storage space is limited.Finally, depending on the input size, mapache can be launched locally (for small datasets) or on a distributed system (preferred setup for medium and large projects).These features make mapache an ideal tool for running large projects on a cluster: the mapping of six high-coverage ancient human genomes (246 GB, 743 million reads; Clemente et al. ( 2021)) took 2.5 hours and the mapping of 150 modern human samples (15 TB of input data; European Genome-Phenome Archive: EGAD00001007082) took less than a week on a cluster.Automation.Mapache is launched with a single command line and can be re-launched with the same command if for any reason a mapping run fails (e.g., due to a lack of memory or time).Portability.Mapache can easily be installed on any Linux computer.Dependencies are automatically installed using Conda.

Workflow
Configuration file.Mapache can be customized using a configuration file.Most parameters contain default values optimized for mapping ancient DNA reads to the human reference genome.The configuration is highly flexible, allowing the user to adapt mapache to the input data: 1) some steps of the workflow can be skipped, 2) different software are available for some steps, and 3) any tool-specific parameter may be included in the pipeline.For example, the mapping pipeline can be customized with parameters suited to map present-day DNA; the user can select among BWA-aln, BWA-mem and bowtie2 for mapping (see Table 1 for a complete list of tools and appropriate references); any valid parameter can be passed to AdapterRemoval with the option "adapterremoval_params".Finally, the workflow can be adapted to the available computer infrastructure by specifying the number of threads, amount of memory and running time for each step in the configuration file.

Default mapping workflow.
A detailed description of all the steps can be found on the wiki and the list of used software is listed in Table 1.The default workflow (Figure 1) is optimized for mapping ancient DNA reads to the human reference genome.In brief, adapters, short reads and low-quality bases are first removed.Reads are mapped with BWA-aln, disabling the seed (-l 1024, Schubert et al. (2012)).The mapped reads are then merged by library and duplicates are removed.The reads are merged by sample and reads are realigned locally around indels.Finally, the MD flag is recomputed.As an option, unmapped/low-quality reads can be stored in a separate BAM file per sample.

Imputation (optional).
Mapache allows for imputation and phasing of (low-coverage) genomes using GLIMPSE, a tool that is shown to accurately impute both low-coverage present-day (Rubinacci, et al., 2021) and ancient human genomes (Sousa Da Mota, et al., 2022).
Output.Mapache's main outputs are the alignments per sample and genome as BAM files and, if requested, a) low-quality and unmapped reads and b) imputed genotypes in VCF format.Mapache outputs summary statistics on the final BAM and intermediate files, including depth of coverage for all or a subset of the chromosomes, genomic sex, damage patterns, and read lengths.Statistics are returned as tables and graphs, allowing to quickly investigate the run.For in-depth analyses, mapache can generate several individual reports (fastqc, qualimap, multiqc).At the end of a run a Snakemake report can be generated which combines all the summary statistics and graphics, runtime statistics (runtime, time of execution, and executed code chunks) in a single self-contained html, which can be easily shared and stored.

Conclusion
Mapache is a flexible mapping workflow optimized for runtime and minimal storage usage to map efficiently both ancient and present-day DNA sequences in a reproducible manner.Mapache comes with an extensive wiki and many pre-defined parameters making it easy to get started.Furthermore, as the underlying language is Python, if users want to go beyond the implemented capabilities, it is easy to modify the code or to combine the workflow with any other Snakemake workflow.Finally, mapache includes downstream processes that are also standalone and can be run on existing BAM files.Table 1.List of software used for the different steps of the workflow. Step

Figure 1 .
Figure 1.Main workflow of mapache.Each box shows a step and related software.Dashed boxes are optional.The grey box shows standalone steps which can be computed on any BAM file.