Hybrid-denovo: a de novo OTU-picking pipeline integrating single-end and paired-end 16S sequence tags

Abstract Background Illumina paired-end sequencing has been increasingly popular for 16S rRNA gene-based microbiota profiling. It provides higher phylogenetic resolution than single-end reads due to a longer read length. However, the reverse read (R2) often has significant low base quality, and a large proportion of R2s will be discarded after quality control, resulting in a mixture of paired-end and single-end reads. A typical 16S analysis pipeline usually processes either paired-end or single-end reads but not a mixture. Thus, the quantification accuracy and statistical power will be reduced due to the loss of a large amount of reads. As a result, rare taxa may not be detectable with the paired-end approach, or low taxonomic resolution will result in a single-end approach. Results To have both the higher phylogenetic resolution provided by paired-end reads and the higher sequence coverage by single-end reads, we propose a novel OTU-picking pipeline, hybrid-denovo, that can process a hybrid of single-end and paired-end reads. Using high-quality paired-end reads as a gold standard, we show that hybrid-denovo achieved the highest correlation with the gold standard and performed better than the approaches based on paired-end or single-end reads in terms of quantifying the microbial diversity and taxonomic abundances. By applying our method to a rheumatoid arthritis (RA) data set, we demonstrated that hybrid-denovo captured more microbial diversity and identified more RA-associated taxa than a paired-end or single-end approach. Conclusions Hybrid-denovo utilizes both paired-end and single-end 16S sequencing reads and is recommended for 16S rRNA gene targeted paired-end sequencing data.


Introduction
The microbiome plays an important role in global ecology, nutrient cycling, and disease [1]. Targeted sequencing of the hypervariable region of the 16S rRNA gene is now routinely used to profile microbiota. Identifying related groups of organisms known as operational taxonomic units (OTUs) remains a central part of the analysis of microbiome data. Both de novo and reference-based approaches have been proposed for processing 16S rDNA reads-each with complementary strengths and weaknesses. De novo OTU-picking naively clusters reads based on sequence similarity. It has the advantages of not requiring any prior knowledge or reference about the target molecule, and produces OTU groupings that are more naturally aligned to the data. However, de novo approaches require comparison of the same gene region. Reference-based approaches can get around this limitation, but rely on a preexisting set of OTU representatives that may or may not be appropriate for a particular dataset [2].
To perform a de novo approach, one of the challenges presented by Illumina paired-end reads is that the reverse read (R2) often has a much lower base quality than the forward read (R1). For the 16S datasets generated at the Mayo Clinic Core Facility, only 24% of R2s passed quality control (QC) between 2013 and 2015, as opposed to 83% for R1s ( Supplementary Fig. 1). We are then left with a smaller set of high-fidelity paired-end reads (R1-R2) and a deeper set of single-end reads (R1). Thus, we would have to choose between the more accurate taxonomic identification using R1-R2 or improved detection of rare taxa using R1 [3]. To integrate information from both paired-end and singleend reads, we propose hybrid-denovo, a pipeline that combines paired-end and single-end reads in order to retain the advantages of de novo OTU-picking while maximizing the ability to detect rare taxa.

Methods
Hybrid-denovo first constructs an OTU backbone using only paired-end reads. The remaining single-end reads (R1) are mapped to the OTU backbone, creating new OTUs if unmapped ( Fig. 1 A). The same quality control and OTU-picking process as implemented in IM-TORNADO is used to build the OTU backbone [3]. Specifically, quality filtering was performed using Trimmomatic [4] with a hard cutoff of PHRED score Q3 for 5 and 3 ends of the reads, trimming the 3 end with a moving average score of Q15, with a window size of 4 bases, and removing any remaining reads shorter than 75% of the original read length. Reads with any ambiguous base calls were discarded. Surviving read pairs were further trimmed down to specified cutoffs to uniform the length of both reads, then concatenated and sorted by cluster size. Afterwards, a de novo OTU-picking was conducted via UPARSE algorithm [5,6]. Though the UPARSE algorithm has performed de novo chimera removal, we additionally used UCHIME [7] to perform a reference-based chimera removal against the GOLD database [8] resulting in a set of high-quality OTU representatives. We then mapped the single-end R1s to the R1 end of the OTU representatives using USEARCH (if there are multiple hits with the same score, the most abundant one will be chosen by default). The remaining unmapped R1s were clustered into new OTUs via the UPARSE algorithm and added to the list of OTUs generated by the paired-end reads. Thus, the OTU representatives consist of a mixture of single-end and pairedend reads. We then aligned all the OTU representatives using the structure alignment algorithm Infernal trained on the Ribosomal Database Project's (RDP's) database [9,10]. OTU representatives that were not aligned were removed as they hypothetically represented nonbacteria. A phylogenetic tree was built from the aligned OTU representatives using FastTree [11]. Fast-Tree has little penalty on end-gaps, which is favorable when processing a mixture of single-end and paired-end reads. Finally, R1 and R2 reads were stitched together with ambiguous nucleotides (a string of Ns) in between and then assigned a taxonomy by    Figure 2: Comparison of mothur, QIIME, and hybrid-denovo on genus-level profiles. Hybrid-denovo is run on data sets with different percentages of good-quality R2 reads (100%, 75%, 50%, and 25%). Each column represents the microbiota profile of an individual averaged over all replicates. The overlaps of detected genera between the 3 pipelines are shown in the Venn diagram. the RDP classifier [12] trained on the Greengenes database [13] OTUs not classified as bacteria and singleton OTUs were removed as they were presumed contaminants. Note that this step may have lost diversity that is not represented in the database and is a tradeoff between accuracy and completeness. The complete workflow of our pipeline is given in Supplementary Fig. 2.
To validate our approach, we created a gold standard data set with high-quality paired-end reads based on the 837 highcoverage human fecal samples sequenced at the Mayo Core Facility (V3-V5 16S amplicon, 694 nt, 357F/926R primers) [14]. These fecal samples were collected from 20 subjects using 6 different methods (no additive, RNAlater, 70% ethanol, EDTA, dry swab, and fecal occult blood test [FOBT]). The samples were immediately frozen or stored at room temperature for 4 days to study the stability of the microbiota. Each condition had 2-3 technical replicates to assess the reproducibility. We ran Trimmomatic [4] for quality control and trimmed R1s down to 250 bp and R2s down to 200 bp to ensure high base quality, resulting in nonoverlapping paired-end reads. For each sample, we retrieved 8000 high-quality paired-end reads. We then performed OTU-picking and taxonomy assignment based on these pairedend reads using IM-TORNADO. These resulting OTUs and their associated taxonomy constitute the gold standard data set. We then subset the gold standard with 25%, 50%, and 75% of the R2 reads remaining. The 3 sub-data sets represented different levels of R2 quality encountered in practice. We compared hybrid-denovo with de novo approaches based on single-end R1s or paired-end reads using the sub-data sets. Performance was evaluated by calculating the Spearman's correlation with the gold standard in terms of microbial β-diversity (unweighted and weighted UniFrac, and Bray-Curtis distance) and genus-level relative abundances.
We also compared our pipeline with QIIME and mothur (versions 1.8.0 and 1.39.3, respectively) [15,16] on the gold standard data set. As QIIME and mothur currently do not support de novo OTU-picking based on nonoverlapping reads, we ran QIIME and mothur on the R1 reads. Parameter settings were chosen to be comparable to that of hybrid-denovo. As we created good-quality reads by using Trimmomatic, we reduced potential variation in performance between pipelines by not applying additional read QC filters. An RDP classifier trained on Greengenes v13.5 was used to classify reads for all pipelines. Singletons and nonbacteria OTUs (based on taxonomy) were filtered out. The major differences between the 3 pipelines in addition to the commands used to reproduce the results are documented in Supplementary Note 1. We assessed performance by investigating (1) the number of detected genera and percentage of unclassified reads at the genus level, (2) Mantel correlation using Bray-Curtis (BC) matrices, and (3) the intraclass correlation coefficient (ICC) for these core OTUs and genera observed in more than 90% of the samples. ICC is a measure of the correlation between the technical replicates. A high value indicates less measurement error. ICC was calculated using the R ICC package [17].
Finally, we demonstrated the performance of the proposed method on a data set from the study of the stool microbiome of rheumatoid arthritis (RA) patients, which consists of 40 RA patients and 49 controls (V3-V5 16S amplicon, 694 nt) [18]. We applied DESeq2 to the taxa count data for differential abundance analysis [19] and compared the RA-associated OTUs and genera recovered by different approaches.

Results
The correlation of microbial β-diversity with the gold standard was generally high for all the 3 approaches (Fig. 1B). However, the approach based on single-end R1 tended to have a lower correlation when BC distance was used (the single-end R1 approach was invariant to the number of R2s). The paired-end approach, on the other hand, had a much lower correlation for unweighted UniFrac when only 25% of R2s remained. This was due to the fact that unweighted UniFrac captures community membership, which is contributed mainly by rare taxa, and many rare taxa are no longer detectable by the paired-end A B Figure 3: Comparison of mothur, QIIME, and hybrid-denovo on intra-class correlation coefficients (ICCs) of the core genera (A) and OTUs (B). ICCs are calculated based on the technical replicates for 6 different fecal collection methods. Hybrid-denovo is run on data sets with different percentages of good-quality R2 reads (100%, 75%, 50%, and 25%).
approach due to loss of reads. In contrast, Hybrid-denovo was very robust and had the best or close to the best correlation with the gold standard in both diversity measures. For weighted UniFrac distance, the correlation was similarly high for all the 3 methods as the weighted UniFrac is most influenced by dominant taxa and all the methods quantify these dominant taxa very well (Fig. 1B). We next studied the performance of taxonomic profiling of the proposed approach. Based on the 56 genera with a prevalence greater than 10%, hybrid-denovo had a much higher correlation with the gold standard across all scenarios considered, and its performance was not very sensitive to the percentage of R2s remaining (Fig. 1C). In contrast, the performance of the paired-end approach depends strongly on the R2 quality and had a much lower correlation when R2 quality was low. The singleend R1 approach was invariant to the number of R2s expected and performed better than the paired-end approach only when R2 quality was low. Supplementary Fig. 3 showed the individual genus correlations. For the single-end approach, 2 genera showed 0 correlation with the gold standard because all of their R1 reads were reclassified at the family level due to their short length (Lachnobacterium mapped to Ruminococcaceae and Erwinia mapped to Enterobacteriaceae), indicating the increased phylogenetic resolution using paired-end reads. For the paired-end approach, genera with low abundance exhibited a lower correlation, indicating the decreased quantification accuracy due to loss of paired-end reads.
We also compared hybrid-denovo with mothur and QIIME, the 2 predominant pipelines for 16S data, based on the gold standard data set. Mothur and QIIME took around 24 and 6 hours, respectively, to complete the analysis of the gold standard dataset (n = 837), compared with around 1 hour for our pipeline. Mothur and QIIME produced a total of 4599 and 2898 nonsingleton OTUs, respectively, while hybrid-denovo produced 1094, 1086, 1079, and 1049 nonsingleton OTUs on data sets with different percentages of good-quality R2 reads (100%, 75%, 50%, and 25%). Though our pipeline resulted in a smaller number of OTUs, we detected a larger number of genera than mothur and QIIME. For exam-ple, application of hybrid-denovo to the data set with 50% goodquality R2 reads yielded a total of 110 genera, compared with 70 and 84 for QIIME and mothur, respectively (Fig. 2, upper right, Venn diagram). Using BLAST on the paired-end counterparts of the QIIME and mothur-specific genera (classified based on R1 reads) against the Greengenes database re-assigns many of the reads to other genera. This indicates that those genera were probably misclassified due to shorter reads. Though the genuslevel microbiota profiles for the 20 subjects were similar for all the pipelines (Fig. 2), hybrid-denovo had a much lower proportion of reads with unknown genus identity (5%) than mothur and QI-IME (14% and 18%, respectively). Taken together, these observations demonstrated that hybrid-denovo had increased taxonomic resolution due to the use of longer reads. Interestingly, all the pipelines could yield similar intersample relationships as measured by Mantel correlation coefficients based on Bray-Curtis distance matrices (Table 1). The availability of technical replicates of the data set allows us to compare different pipelines using intraclass correlation coefficients. A high ICC indicates less variability introduced by the bioinformatics pipeline. We calculated the ICCs for different fecal collection methods for the core OTUs and genera, which occurred in more than 90% of the samples. Our pipeline generally had higher ICCs (less variation between technical replicates) than mothur and QIIME (Fig. 3). In contrast, mothur and QIIME did not perform as well on the core OTUs and genera, respectively.
We also applied our method to a data set from an RA study [18], where about 40% of R2s were discarded after quality control (Supplementary Table 1). Hybrid-denovo resulted in the largest number of OTUs and genera, as expected (Fig. 4A), and covered all genera from the paired-end approach and the majority genera from the single-end R1 approach (Fig. 4C). Among the 5 R1-specific genera, Bacteria Firmicutes Clostridia Clostridiales Clostridiaceae 02d0 and Bacteria Firmicutes Clostridia Clostridiales Clostridiaceae Sarcina were reclassified to Bacteria Firmicutes Clostridia Clostridiales Clostridiaceae Clostridium when their pairedend counterparts were used, indicating that the R1-specific genera were misclassified due to their short read length.  Figure 4: Comparison of the R1, Paired, and Hybrid approaches on the RA data set. A, Number of detected OTUs (red) and genera (blue). B, Number of significant OTUs (red) and genera (blue) from differential abundance analysis (FDR ≤ 0.01). C, Venn diagram of the genera detected. D, Venn diagram of significant genera from differential abundance analysis.
Apart from the comparison of the detected genera, we also demonstrated the advantage of hybrid-denovo in the context of differential abundance analysis using DESeq2 [19]. We excluded OTUs that occurred in less than 10% samples from testing. A total of 758, 578, and 393 OTUs were tested using hybrid-denovo, paired, and R1 approaches, respectively. Due to higher read counts and increased phylogenetic resolution, hybrid-denovo recovered more differential OTUs (Fig. 4B). We identified a total of 126 significant OTUs at an FDR-adjusted P value of 0.01 compared with 93 and 80 OTUs for paired-end and single-end R1 approaches, respectively. Since different methods had their own definition of OTUs and direct comparison of the differential OTUs is difficult, we instead compared the genus identity of the identified OTUs. The differential OTUs identified by hybrid-denovo were classified into 33 genera, in comparison with 32 and 34 for the paired-end and single-end R1 approaches (Fig. 4B). There were 20 significant genera shared by all 3 methods (Fig. 4D), many of which were reported by previous studies [18,20,21]. For example, Bacteroides is enriched in control samples, while Collinsella, Eggerthella, Prevotella, and Clostridium are enriched in RA samples. Even though the total number of differential genera were similar for all the methods, hybrid-denovo identified the most genera (n = 11) that were shared by either 1 of the other 2 approaches, compared with 6 and 9 for the pairedend and single-end R1 approaches, indicating that the hybriddenovo approach was able to identify differential genera that were otherwise missed by either the paired-end or single-end R1 approach. Furthermore, hybrid-denovo had the lowest number of method-specific genera (n = 2) in contract to paired-end (n = 6) and R1 single-end (n = 5). The method-specific genera might be less reliable due to lack of support from other methods. For example, the R1 approach found Veillonella to be enriched in control samples, which conflicts with a previous study [20]. Interestingly, among the 2 hybrid-denovo-specific genera, Klebsiella, which was enriched in healthy people, was reported by Zhang et al. [21].

Discussion
We proposed hybrid-denovo for de novo OTU-picking based on paired-end 16S sequence tags. Through simulations and real data examples, we showed that our approach had better performance than the single-end or paired-end approach in quantifying the microbial diversity and taxonomic abundance, due to the full use of the information in the paired-end reads.
Based on the size of 16S amplicons and the length of the paired-end reads, we could have overlapping or nonoverlapping paired-end reads. For example, sequencing of the V4 region (252 nt, 515F/806R primers) produces overlapping paired-end reads, while sequencing of the V3-V5 region (694 nt, F357/R926 primers) results in nonoverlapping paired-end reads using Illumina MiSeq (250 bp × 2). As QIIME and mothur currently do not support de novo OTU-picking based on nonoverlapping pairedend reads, the main advantage of our pipeline lies in the ability to process nonoverlapping paired-end reads. However, our pipeline could also be applied to overlapping paired-end reads by using PANDAseq [22] to stitch the paired-end reads together. It is noted that some existing pipelines could also process a mixture of paired-end and single-end reads with different capacities. For example, the recently proposed LotuS pipeline uses good-quality R1 reads to build OTUs, followed by a postclustering merging of R1 and R2 to increase the accuracy of the taxonomy [23]. However, the OTU-level resolution is still determined by R1 reads.
There are new pipelines that have been developed for 16S data. It is interesting to benchmark hybrid-denovo against these state-of-the-art pipelines. We selected DADA2 and LotuS [23,24] for comparison as they have been demonstrated to have an overall better performance than QIIME and mothur and have been increasingly used by the community. We repeated the same analysis on the gold standard data set with complete read pairs. The specific command lines used for DADA2 and LotuS are documented in Supplementary Note 1. DADA2 produced 18 389 sequence variants (SVs), while LotuS produced 472 OTUs. The Mantel correlation on the OTU/SV-level Bray-Curtis distance is high between hybrid-denovo and LotuS (ρ = 0.93) but moderate between hybrid-denovo and DADA2 (ρ = 0.71). Interestingly, the Mantel correlation on the genus-level Bray-Curtis distance is high between all methods (ρ > 0.97), indicating that all methods could produce similar genus-level profiles ( Supplementary  Fig. 4). Similar ICC analysis demonstrated that all the methods had relatively high ICCs, but hybrid-denovo had the overall the best performance ( Supplementary Fig. 5).
One problem for de novo OTU-picking is the potential inflated OTU number, which could be due to sources such as sequencing errors, chimera, and environmental contaminants [6]. In hybrid-denovo, we used various quality filtering criteria to reduce the number of spurious OTUs. For example, we applied Trimmomatic [4] to trim and remove reads with low base quality, removed reads with any ambiguous bases, removed singleton OTUs, used the Infernal package [9] to remove non-structurally aligned OTUs, and used reference-based UCHIME as an additional chimera removal process [6]. However, even these filters might fall short of reducing the inflated diversity estimate due to unknown sequencing errors. Improving the diversity estimate from hybrid-denovo will be the focus of our future work.