MeCorS: Metagenome-enabled error correction of single cell sequencing reads

Summary: We present a new tool, MeCorS, to correct chimeric reads and sequencing errors in Illumina data generated from single amplified genomes (SAGs). It uses sequence information derived from accompanying metagenome sequencing to accurately correct errors in SAG reads, even from ultra-low coverage regions. In evaluations on real data, we show that MeCorS outperforms BayesHammer, the most widely used state-of-the-art approach. MeCorS performs particularly well in correcting chimeric reads, which greatly improves both accuracy and contiguity of de novo SAG assemblies.
Availability and implementation: https://github.com/metagenomics/MeCorS
Contact: abremges@cebitec.uni-bielefeld.de
Supplementary information: Supplementary data are available at Bioinformatics online.


Introduction
The vast majority of microbial species found in nature has yet to be grown in pure culture, turning metagenomics and, more recently, single cell genomics into indispensable methods to access the genetic makeup of microbial dark matter (Brown et al., 2015; Rinke et al., 2013). Frequently, single amplified genomes (SAGs) and shotgun metagenomes are generated from the same environmental sample, and are methodologically combined, e.g. to validate metagenome bins with single cells or to improve the SAG's assembly contiguity (Campbell et al., 2013; Hess et al., 2011). However, a single cell's DNA needs to be amplified prior to sequencing, as usually accomplished by multiple displacement amplification (MDA; Lasken, 2007). This amplification is heavily biased, leading to uneven sequencing depth including ultra-low coverage regions with basically no informed error correction possible (Chitsaz et al., 2011; Supplementary Fig. S1). Moreover, chimera formation occurs roughly once per 10 kbp during MDA, further complicating SAG assembly (Nurk et al., 2013; Rodrigue et al., 2009).
While an array of error correction tools exists for a variety of use cases (Laehnemann et al., 2016), only one tool was specifically designed to correct SAG data: hammer (Medvedev et al., 2011), recently refined to BayesHammer (Nikolenko et al., 2013). We propose a metagenome-enabled error correction strategy for single cell sequencing reads. Our method takes advantage of largely unbiased metagenomic coverage, enabling it to correct positions whose coverage is too low for SAG-only error correction, and to correct chimeric SAG reads through non-chimeric metagenome reads.

Methods
We correct potential errors using an algorithm similar to solving the spectral alignment problem (Pevzner et al., 2001). Given a set of trusted k-mers, we use a heuristic method to find a sequence with minimal corrections such that each k-mer on the corrected sequence is trusted. Using a k-mer size of 31, we consider a k-mer trusted if it occurs at least twice in the accompanying metagenome. This coverage threshold was determined empirically to work with most datasets (Supplementary Fig. S2).
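The idea of trusted k-mers can be illustrated with a minimal Python sketch. It is not the MeCorS implementation: the function names, the `max_edits` limit and the greedy substitution search are assumptions made for illustration, and strand handling (reverse complements), insertions, deletions and base qualities are ignored. Only the k-mer size of 31 and the minimum count of two are taken from the text.

```python
from collections import Counter

K = 31         # k-mer size given in the text
MIN_COUNT = 2  # trusted = seen at least twice in the metagenome


def trusted_kmers(metagenome_reads, k=K, min_count=MIN_COUNT):
    """Collect k-mers occurring at least `min_count` times in the metagenome."""
    counts = Counter()
    for read in metagenome_reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # skip k-mers with ambiguous bases
                counts[kmer] += 1
    return {kmer for kmer, n in counts.items() if n >= min_count}


def count_untrusted(read, trusted, k=K):
    """Number of k-mers in `read` that are not in the trusted set."""
    return sum(read[i:i + k] not in trusted for i in range(len(read) - k + 1))


def correct_read(read, trusted, k=K, max_edits=3):
    """Greedy stand-in for the heuristic (hypothetical, substitutions only):
    apply the single-base change that leaves the fewest untrusted k-mers,
    and repeat until the read is fully trusted or no change helps."""
    for _ in range(max_edits):
        best_count = count_untrusted(read, trusted, k)
        if best_count == 0:
            break
        best_read = None
        for pos in range(len(read)):
            for base in "ACGT":
                if base == read[pos]:
                    continue
                candidate = read[:pos] + base + read[pos + 1:]
                n = count_untrusted(candidate, trusted, k)
                if n < best_count:
                    best_count, best_read = n, candidate
        if best_read is None:  # no substitution improves the read; give up
            break
        read = best_read
    return read
```

With a toy k-mer size, `correct_read("ACGTTGCATGGCTTAC", trusted_kmers(["ACGTTGCAAGGCTTAC"] * 2, k=5), k=5)` restores the metagenome-supported sequence. The exhaustive search over all positions is chosen for readability; it scales poorly and is unlikely to resemble a production corrector.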
Our correction algorithm was inspired by fermi (Li, 2012) and BFC (Li, 2015), but we do not act on the assumption of uniform

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

(Bowers et al., 2015). Based on metagenome read mapping, we estimate the relative abundance of E. coli to amount to 0.15%, corresponding to a mean per-base coverage of only 20.7× (Supplementary Table S2). We evaluated MeCorS along with BayesHammer (Nikolenko et al., 2013), a widely used error correction tool for SAG data. Our method corrects more errors than BayesHammer, producing a significantly higher fraction of better and perfect reads after correction (Table 1; Supplementary Table S3). In contrast to BayesHammer, MeCorS reduces the amount of chimeric SAG reads by one order of magnitude, likely due to the non-chimeric nature of the metagenome reads. MeCorS works well with modern single cell assemblers, most notably reducing the misassembly rate of both IDBA-UD (Peng et al., 2012) and SPAdes (Bankevich et al., 2012) by half, while providing high sequence contiguity (Fig. 1). In particular poorly amplified SAGs benefit from metagenome-enabled error correction, yielding improved assembly accuracy and contiguity (Supplementary Tables S4 and S5).
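The read-level categories used in the comparison (perfect, better, worse) can be sketched in a few lines of Python. This is a simplified illustration, not the evaluation pipeline of Li (2015): it assumes the true sequence of each read is already known and has the same length as the read, it counts mismatches with a Hamming distance instead of aligning reads to a reference, and it treats a read as perfect when it has no mismatches left. Chimeric reads are not detected here, since that requires read mapping. All function names are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def classify(truth, raw, corrected):
    """Return the set of labels that apply to one corrected read."""
    before = hamming(truth, raw)
    after = hamming(truth, corrected)
    labels = set()
    if after == 0:
        labels.add("perfect")  # assumption: perfect = no remaining mismatch
    if after < before:
        labels.add("better")   # correction removed errors
    elif after > before:
        labels.add("worse")    # correction introduced errors
    return labels


def summarize(triples):
    """Percentage of reads per label over (truth, raw, corrected) triples."""
    triples = list(triples)
    tally = Counter(label for t in triples for label in classify(*t))
    return {label: 100.0 * tally[label] / len(triples)
            for label in ("perfect", "better", "worse")}
```

A read can carry more than one label: a read whose only error was fixed counts as both perfect and better, so the percentages need not sum to 100.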
We note that such a hybrid error correction of SAG data may result in the miscorrection of rare variants. If the captured cell contains a variant that is rare or absent in the corresponding metagenome, correction will be biased towards the most abundant variant in the metagenome sequence. If strain resolution is desired, we suggest polishing the SAG assembly using the uncorrected raw data. In all other cases, SAG assemblies benefit directly from metagenome-enabled error correction via MeCorS.
Uneven genome coverage and chimera formation present the biggest challenges in the downstream processing and analysis of SAG datasets to date. We propose MeCorS for the correction of SAG reads when complementary metagenome datasets are available. Error and chimera correction is essential for improved SAG assembly and demonstrates a powerful application of combined shotgun metagenome and single cell sequencing.

Conflict of Interest: none declared.

Fig. 1. Effect on SAG assembly. We corrected the raw reads (R) with BayesHammer (B; Nikolenko et al., 2013) or MeCorS (M). We then used IDBA-UD (Peng et al., 2012) and SPAdes (Bankevich et al., 2012) to assemble the SAGs. Brackets indicate all statistically significant changes (P < 0.05; two-tailed Wilcoxon signed-rank test). Quality assessment with QUAST (Gurevich et al., 2013); Supplementary Tables S4 and S5 contain in-depth assembly statistics.

Table 1. Mean percentage and standard deviation of perfect reads, chimeric reads (i.e. reads with parts mapped to different places), and corrected reads becoming better and worse than the raw reads. Evaluation as described in Li (2015); please refer to Supplementary Table S3 for per-SAG metrics, including runtime and memory usage.