MAPseq: highly efficient k-mer search with confidence estimates, for rRNA sequence analysis

Abstract Motivation Ribosomal RNA profiling has become crucial to studying microbial communities, but meaningful taxonomic analysis and inter-comparison of such data are still hampered by technical limitations, between-study design variability and inconsistencies between taxonomies used. Results Here we present MAPseq, a framework for reference-based rRNA sequence analysis that is up to 30% more accurate (F½ score) and up to one hundred times faster than existing solutions, providing in a single run multiple taxonomy classifications and hierarchical operational taxonomic unit mappings, for rRNA sequences in both amplicon and shotgun sequencing strategies, and for datasets of virtually any size. Availability and implementation Source code and binaries are freely available at https://github.com/jfmrod/mapseq Supplementary information Supplementary data are available at Bioinformatics online.

-MAPseq algorithm and reference. Together with the MAPseq software, a large full-length ribosomal RNA reference dataset is provided along with pre-computed hierarchical OTUs and taxonomic assignments from NCBI and SILVA. The algorithm requires a k-mer pre-clustering of the reference dataset which is provided for the MAPseq reference and can be built for custom, user-provided references. The classification of a query sequence begins with a 2-stage k-mer search; first, the representatives of k-mer clusters are searched, and then the members of the top clusters are searched. Finally, the sequences are fully aligned and confidences are estimated for each taxonomic level.

Reference taxonomy
NCBI taxonomies for the reference dataset were assembled either by referring to the annotated taxonomy for sequences extracted from the RefSeq database, or by scanning for GenBank entries annotated to be culture collection strains. This taxonomy reference set comprised 89,315 sequences, covering 15,810 species, 3,637 genera, 1,493 families, 800 orders, 404 classes, and 185 phyla. Another, independent taxonomy annotation was obtained from SILVA Living Tree Project (LTP) (Yilmaz, et al., 2014) by mapping the sequences annotated in the SILVA LTP database to the MAPseq reference set. The set of sequences with a trusted taxonomy was defined as the gold set and was used to predict the taxonomy of the remaining, un-annotated sequences. The updated reference (MAPref 2.0) covers in the NCBI taxonomy: 16'323 species, 6'981 genera, 2'654 families, 1'091 orders, 412 classes, and 175 phyla.

Hierarchical OTUs
The set of aligned sequences in the Archaea, Bacteria and Eukarya reference datasets were separately clustered using an average-linkage hierarchical clustering algorithm implemented in HPC-CLUST (Matias Rodrigues and von Mering, 2014) down to 90% sequence identity. Each sequence was then assigned to an OTU at five different identity cutoffs: 99%, 98%, 96% and 90%, yielding 144'596, 92'862, 51'904, and 19'957 OTUs, respectively. The updated reference (MAPref 2.0) includes one additionally level at the 97% identity cutoff, for a total of 6 levels.

MAPseq algorithm
MAPseq achieves a big advance in efficiency and accuracy at searching databases of highly similar sequences, by improving upon the k-mer counting approach used in many other sequence search tools such as BLAST or USEARCH. To achieve this efficiency, the reference dataset is pre-clustered into clusters of sequences. This pre-clustering is made available to MAPseq together with the actual sequence data and the taxonomic information. At runtime, MAPseq uses a 2-stage k-mer search -first to the k-mer cluster representatives, and second to the cluster members -to effectively reduce the number of sequences to be searched exponentially and to significantly improve the memory requirements of the program.

K-mer pre-clustering
Sequences in the datasets to be used with MAPseq are pre-clustered on the basis of the number of shared k-mers, in two phases. The first sequence in the dataset is used as the seed for the first cluster, afterwards, every sequence in the dataset is iteratively compared on the basis of shared k-mers to the existing cluster representatives and are either assigned to one of the existing clusters or used as the representative for a new cluster. Sequences are added to existing clusters when the sequence being considered shares at least 80% of its k-mers with a cluster representative. The first phase of pre-clustering is complete once all sequences are assigned to a cluster or used as representatives for new clusters. In the second phase, all cluster representatives are kept from the first phase, but the membership assignment for the other sequences is recalculated, this ensures that sequences assigned to a cluster at some stage can be assigned to another cluster representative to which they are more similar. This situation can occur because cluster representatives can be created at any point after sequences have already been earlier assigned to other clusters in the first phase. Pre-clustering of our dataset consisting of 918'803 sequences results in 56'169 k-mer clusters which implies a 16 times reduction in number of sequences that need to be searched in the first stage of the k-mer search step. In the updated reference (MAPref 2.0), the clustering of 1'585'280 sequences yielded 91'181 clusters.

Mapping of sequence reads to the reference dataset
When a query is first searched against the database, the number of shared k-mers between the query and one representative for each cluster is computed. In a second step, the number of shared k-mers is computed between the query and all of the members belonging to the clusters that ranked highest in the previous step. Finally, the top hits are aligned to the query sequence. Dynamic programming is used to identify the set of identical segments yielding the largest alignment score, followed by alignment of the regions between these segments using the banded Needleman-Wunsch algorithm. The ends of the alignment are determined using the banded Smith-Waterman algorithm with drop-off such that the alignment is not extended if the alignment score drops to more than 20 below best score found to that point. The best match is taken as the best guess for taxonomy or OTU assignment, and the remaining hits are used to estimate the confidence in the assignment at each level.

Estimating assignment confidence
MAPseq achieves higher accuracy by assigning a mapping confidence to better control false classifications. There are two different types of false classifications: i) misclassifications due to missing sequences in the reference dataset, and ii) incorrect assignments among the existing sequences in the database. These two types of errors are controlled for independently. The first case is controlled by computing a confidence on the basis of identity cutoffs previously optimized for each level for a given taxonomy and reference dataset. The identity confidence is computed using the following formula: Where id i is the identity of the query compared to the reference sequence i, id cutoff is the identity cutoff, and Δ cutoff is a parameter that controls the strength of the effect of identity differences on the confidence; these two parameters were optimized previously for each taxonomic level. The additional 0.02 was added so that the id cutoff term matches the OTU cutoffs in OTU mapping. The second case is controlled using the formula described below that weighs the score of the top hit against subsequent hits to different taxa: Where S i is the alignment score of reference sequence i in the top hit list, T is the score of the best aligned sequence, w i is the weight of sequence i, W is the sum over all weights, α is a parameter controlling the influence of lower scoring hits. In essence, the higher the difference in scores between the top hit and the closest second-best hits to any conflicting taxonomy/OTU, the better the confidence placed in the assignment. This automatically solves the problem of sequence reads originating in highly conserved parts of the 16S rRNA sequence, because top hits will tend to have similar scores even when belonging to conflicting taxonomies. The final confidence is calculated as the minimum between the two confidences: ( ) = ( 6789: ( ), #$ ( )).
Another advantage of this approach is that higher confidences are automatically computed for lower taxonomic levels, since second-best hits will necessarily have higher score differences at lower taxonomic levels.

Validation of sequence read mapping
Benchmarking of the taxonomic classification and OTU mapping performance was done using as a starting point a set of nearly full-length 16S/18S ribosomal subunit sequences compiled from the Genbank database. The taxonomy from the All-Species Living Tree Project (LTP) (Yarza, et al., 2010) which comprises a set of manually curated sequence and taxonomies was used as the gold standard in our benchmarks. For the OTU mapping, sequences were clustered using HPC-CLUST (Matias Rodrigues and von Mering, 2014) after alignment with the INFERNAL aligner (Nawrocki, et al., 2009), or with UCLUST.
To benchmark the accuracy of MAPseq and other classification tools commonly used in metagenomic data analysis (RDP Classifier 2.6, NINJA-OPS 1.5, VSEARCH 2.4.0 and USEARCH 9.2.64) we generated a benchmark query and reference dataset from our full-length rRNA gene dataset in which the true OTU assignments or taxonomies were known. We avoided benchmarking trivial cases when the query and database sequence were nearly identical, by excluding from the benchmark reference any sequences originating from the same species or OTU (at 99% identity cutoff) as any of the query sequences. In addition, we evaluated the ability of different methods to correctly ignore sequences that had no representative in the reference database by generating the reference set such that in the best case only 50% of the query sequences actually had a representative of their genus or OTU (at 98% identity cutoff). This test set was generated by randomly selecting one sequence for each genus from the full set while respecting the previous conditions.
Confidence thresholds per method were chosen as the thresholds yielding maximum F 1/2 -scores averaged over the scores obtained for different read lengths and different rRNA hypervariable regions. For USEARCH and VSEARCH, which output no confidences, we used the difference above the identity cutoff for each level as a measure of confidence, for example when a query had 98.1% identity to a reference sequence then it had 0.1 confidence when mapping to an OTU at 98% identity cutoff.
The same test reference dataset and queries were used with all methods tested.

Computation speed and memory benchmarks
The computation speed and memory benchmarks were performed by running each tool with 8 threads (except RDP classifier which does not support multithreading) on the same dedicated Dell Blade M605 computer with 2 quad-core Opteron 2.33 GHz processors and 24 GB of random access memory. For the benchmark (Figure 1a), the input data used was the Human Microbiome Project sequencing run (700016012) consisting of amplicon raw data targeting the V3V5 region of the 16S rRNA gene. This sample was mapped to different subsamples of the MAPseq reference database consisting of 918 803 sequences, and in (Figure 1g,h) the input data used was a 10% subsampled set of all raw reads found in 2,194 HMP samples for which both V1-V3 and V3-V5 sequencing runs existed. USEARCH could not be run on reference datasets larger than 650'000 sequences, due to memory usage limitations of the 32bit version (the only freely available version). The following commands were used for running the benchmarks:

Concordance analysis of independent V1V3 and V3V5 sequence data of the Human Microbiome Project
Raw 16S rRNA V1V3 and V3V5 amplicon sequencing data and sample metadata of the Human Microbiome Project (HMP) were downloaded from the NCBI Sequence Read Archive and the HMP data depository (hmpdacc.org). Chimeric reads were detected using UCHIME (Edgar, et al., 2011) in both de novo mode and with a custom in-house reference database of non-chimeric sequences; reads labeled as chimeric by both approaches were removed from further analyses. Filtered reads were then aligned to a 16S rRNA model using INFERNAL (Nawrocki, et al., 2009) and pruned to the respective alignment flanking positions for the V1V3 and V3V5 primer sets, as described above. Reads that did not align to these regions, that had too many gaps within flanking positions or that were not observed in at least two samples independently at a 5bp error tolerance were removed from further analyses. Moreover, all biological samples were removed that did not contain at least 1,000 sequences of both V1V3 and V3V5. After these filtering steps, there remained 2,194 samples containing a total of 17,890,946 (V1V3) and 26,627,383 (V3V5) sequences.
Sequences were assigned to OTUs by reference mapping. Consistency between V1V3 and V3V5 data from the same biological samples was assessed as direct per-sample Jaccard similarity (i.e., the overlap of reference OTUs at 97% called in both V1V3 and V3V5 sequencing of the same sample) and as the Pearson correlation of abundances of OTUs shared between both sequencing sets per sample. Reference mapping was performed on three different reference databases: One including all full-length sequences for each OTU (full reference, Figure 1g,h), one with 30% randomly picked representatives per OTU (reduced reference) and one including a single random representative for each OTU (representatives only). Speed estimates are based on reduced input files (random 10% of the sequences for each sample) and tool parameters as in the previous section, while consistency (Jaccard similarity, Pearson correlation) was computed on all input sequences with 40 to 80 threads on a larger workstation.

Accuracy reduction when using representative reference datasets
A common practice used in ribosomal RNA marker gene analysis is to make the reference database non-redundant or use only representatives of each OTU. In Fig. S2 we show that this practice leads up to 0.1 less F-score than using the full reference set (Fig. 1f). This reduction in F 1/2 -score was not limited to MAPseq but was also observed for all other tools tested. Figure S2 -F1/2 scores for sequence reads of different lengths, mapped to OTUs at 98% cutoff using: a) nonredundant/representative reference dataset and b) the full reference, identical to Fig. 1f and included here for direct comparison.

HPC-CLUST based OTUs outperform UCLUST OTUs in mapping consistency
We chose a subset of HMP metagenomic data in which the same samples had been sequenced twice, targeting two different regions of the rRNA gene (V1 to V3 and V3 to V5); these were in total 2,194 samples. Abundance correlations for mapped OTUs at 97% identity between sequenced subregions per sample were very high (median=0.83, mean=0.73; Figure S2a) when using MAPseq and our reference OTU clustering. When using MAPseq to map to the same reference (but pre-clustered by UCLUST, the tool used in the GreenGenes and SILVA databases) the correlation was lower (median=0.62, mean=0.59), and when using UCLUST (v1.2.22q) to map the sequences the correlations were yet significantly lower both when mapping to our reference (median=0.39, mean=0.41) as well as to the UCLUST-clustered reference (median=0.22, mean=0.29; this latter approach corresponds to the default for "closedreference OTU picking" in the widely used QIIME pipeline). We observed a similar trend in the fraction of identified OTUs common to both pairs of sequencing runs for the same sample ( Figure S3b): MAPseq mapping to the MAPseq reference resulted in the highest overlap (median=0.46, mean=0.43), followed by MAPseq mapping to the UCLUST-clustered reference (median=0.33, mean=0.31). Using UCLUST resulted in a much-reduced overlap between OTUs common to pairs of sequencing runs when mapping to OTUs (median=0.30,mean=0.29) and when mapping to UCLUST representatives (median=0.23, mean=0.22).

Figure S3
-Abundance correlations and Jaccard overlaps for OTUs identified in pairs of sequencing runs of identical HMP samples targeting two different ribosomal RNA regions (V1-V3 and V3-V5). The same reference dataset was used with two different clustering approaches: hierarchical clustering using average-linkage computed with HPC-CLUST, and a UCLUST clustering.
Finally, we compared these two OTU clustering methods to de novo clustering of the metagenomic samples. OTU set compositional consistency between reference-mapped partitions and their respective de novo counterparts was checked by calculating the Adjusted Mutual Information (AMI, (Vinh, et al., 2009)) between the respective OTU sets (e.g., comparing MAPseq against an average linkage pre-clustered reference with an average linkage de novo clustered partition). Ideally, the results of de novo clustering and reference-mapping strategies should correspond to each other -if consistent sequence processing and clustering algorithms are applied. However, this correspondence has recently been called into question for UCLUST (Westcott and Schloss, 2015). To test this, we quantified partition similarity in terms of OTU composition (per-sequence cluster membership) as Adjusted Mutual Information (AMI); AMI values of 1 indicate perfect partition identity, and AMI values of zero indicate random compositional agreement as expected by chance. For the HMP dataset, we observed that MAPseq against an average linkage (AL) OTU reference provided a good correspondence with AL de novo clustering (AMI=0.83). This indicates that MAPseq reference mapping, although computationally much more efficient, may indeed approximate hierarchical AL de novo clustering, which is arguably still a gold standard in marker gene processing. In contrast, UCLUST mapping against a UCLUST-clustered non-redundant reference (the default method in QIIME, see above) showed much lower agreement with de novo UCLUST clustering (AMI=0.66). UCLUST performed better against our AL OTU reference (AMI=0.75), but was outperformed by MAPseq against a UCLUST reference (AMI=0.79). Thus, both reference preclustering and choice of mapping tool have a strong effect on consistency, and UCLUST was clearly outperformed on both, by MAPseq mapping and by AL clustering. Figure S4 -Analysis of ten mock communities. Top panels, box plots of sum of squared errors between predicted and expected taxa abundance. Bottom panels, number of times each tool ranked first over the ten mock communities. *) RDP classifier does not report species classifications and therefore no result could be computed at the species level.

MAPseq outperforms NINJA-OPS, VSEARCH, and RDP Classifier in the analysis of mock communities.
We downloaded ten mock communities available in the Mockrobiota dataset (Bokulich, et al., 2016), specifically: (Gohl, et al., 2016;Kozich, et al., 2013;Tourlousse, et al., 2017). After downloading, we mapped the forward reads to our full reference using MAPseq, NINJA-OPS, VSEARCH, and RDP Classifier. As a measure of performance, we computed the sum of squared errors (SSE) between predicted and expected abundances. Fig. S4 shows the box plots of the SSE obtained for each tool over all the sequence runs and how often each tool ranked first over all samples for the species, genus, and family taxonomic levels. MAPseq outperformed the other tools, obtaining an overall smaller median of the SSE and consistently ranking first more frequently than other tools.

MAPseq exhibits linear complexity with input size.
We benchmarked the time it took to analyse several downsamplings of the HMP dataset used in the concordance analysis of the V13 and V35 sequencing runs. The downsampled queries ranged between a thousand reads and a million reads. Fig. S5 shows that the MAPseq runtime is practically linear when using 8 cores over a large range of input sizes. The full MAPseq reference was used as the mapping target.