SCRAPT: an iterative algorithm for clustering large 16S rRNA gene data sets

Abstract 16S rRNA gene sequence clustering is an important tool in characterizing the diversity of microbial communities. As 16S rRNA gene data sets are growing in size, existing sequence clustering algorithms increasingly become an analytical bottleneck. Part of this bottleneck is due to the substantial computational cost expended on small clusters and singleton sequences. We propose an iterative sampling-based 16S rRNA gene sequence clustering approach that targets the largest clusters in the data set, allowing users to stop the clustering process when sufficient clusters are available for the specific analysis being targeted. We describe a probabilistic analysis of the iterative clustering process that supports the intuition that the clustering process identifies the larger clusters in the data set first. Using real data sets of 16S rRNA gene sequences, we show that the iterative algorithm, coupled with an adaptive sampling process and a mode-shifting strategy for identifying cluster representatives, substantially speeds up the clustering process while being effective at capturing the large clusters in the data set. The experiments also show that SCRAPT (Sample, Cluster, Recruit, AdaPt and iTerate) is able to produce operational taxonomic units that are less fragmented than popular tools: UCLUST, CD-HIT and DNACLUST. The algorithm is implemented in the open-source package SCRAPT. The source code used to generate the results presented in this paper is available at https://github.com/hsmurali/SCRAPT.

1 Proofs of the lemmas

1.1 Lemma-1

Lemma 1: Given N sequences and a sample obtained by uniformly sampling n sequences, the probability \bar{p}_k of missing a cluster c_k of size \rho_k is bounded by

\bar{p}_k = \frac{\binom{N-\rho_k}{n}}{\binom{N}{n}} + \frac{\rho_k \binom{N-\rho_k}{n-1}}{\binom{N}{n}} \le e^{-\frac{\rho_k n}{N}} + \frac{\rho_k n}{N} e^{-\frac{\rho_k (n-1)}{N}}.   (1)

Proof. The equality in the lemma follows from the fact that a cluster of size \rho_k is missed only if either exactly one sequence from cluster k is present in the sample, or no sequence from cluster k is present in the sample. Adding the probabilities of these two cases gives the probability of missing the cluster, i.e. the left-hand side of Equation 1. The inequality comes from approximating Equation 1. Using the well-known fact that \binom{n}{k} \le \frac{n^k}{k!}, we have

\frac{\rho_k \binom{N-\rho_k}{n-1}}{\binom{N}{n}} \approx \frac{\frac{(N-\rho_k)^{n-1}}{(n-1)!} \cdot \rho_k}{\frac{N^n}{n!}} = \frac{(N-\rho_k)^{n-1} \cdot \rho_k \cdot n}{N^n} = \frac{\rho_k n}{N} \left(1 - \frac{\rho_k}{N}\right)^{n-1},

and similarly \frac{\binom{N-\rho_k}{n}}{\binom{N}{n}} \approx \left(1 - \frac{\rho_k}{N}\right)^n. Applying the approximation (1-x)^r \le e^{-xr}, for all 0 \le x \le 1, to both terms yields the stated bound.
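As an illustration of Lemma 1, the short sketch below (illustrative only; the database size N, sample size n and cluster sizes \rho_k are arbitrary choices, not values from our experiments) computes the exact miss probability from the equality in Equation 1 and compares it with the exponential upper bound.

```python
# Illustrative sketch (not part of SCRAPT): exact miss probability from Lemma 1
# versus its exponential upper bound, for arbitrary example values of N, n and rho_k.
from math import comb, exp

def miss_prob_exact(N, n, rho_k):
    """P(0 or 1 sequences of a cluster of size rho_k appear in a uniform sample of n of N)."""
    denom = comb(N, n)
    p0 = comb(N - rho_k, n) / denom               # no sequence from the cluster is sampled
    p1 = rho_k * comb(N - rho_k, n - 1) / denom   # exactly one sequence is sampled
    return p0 + p1

def miss_prob_bound(N, n, rho_k):
    """Upper bound from Lemma 1 obtained via (1 - x)^r <= exp(-x r)."""
    return exp(-rho_k * n / N) + (rho_k * n / N) * exp(-rho_k * (n - 1) / N)

if __name__ == "__main__":
    N, n = 100_000, 1_000          # hypothetical database and sample sizes
    for rho_k in (10, 100, 1_000):
        print(rho_k, miss_prob_exact(N, n, rho_k), miss_prob_bound(N, n, rho_k))
```

As expected, the bound is loose for small clusters (which are missed with high probability) and both quantities decay quickly as \rho_k grows.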

Lemma-2
Lemma 2: Given N sequences in the database and n sequences in the sample, the probability \bar{p}_m of missing m clusters of sizes \rho_1, \rho_2, ..., \rho_m, with \rho_1 \le \rho_2 \le ... \le \rho_m, is given by

\bar{p}_m = \sum_{j=0}^{m} \; \sum_{I \subseteq \{1,...,m\},\, |I| = j} \frac{\left(\prod_{i \in I} \rho_i\right) \binom{N - \sum_{i=1}^{m} \rho_i}{n - j}}{\binom{N}{n}}.   (2)

Proof. We begin by noting that the events of missing the m clusters are not independent of one another, since they all depend on the number of sequences sampled, n. Hence the probabilities of missing a single cluster given by Lemma 1 cannot simply be multiplied. Recall that a cluster is missed in a given round if it has either 0 or 1 representatives in the sample. Expanding the summation in Equation 2, the first term corresponds to the case where no sequences are sampled from any of the m clusters, the second term corresponds to the case where exactly one sequence is sampled from one of the m clusters while the remaining m − 1 clusters have 0 representatives, and so on. We prove that this expression holds by induction on the number of clusters m.
Base case: We show that the statement holds for the smallest value of m, namely m = 1. For m = 1 the expression reduces to the probability \bar{p}_1 of missing one cluster of size \rho_1, which holds as a consequence of Lemma 1.
Inductive step: Assume the statement holds for m clusters; we now compute the probability of missing m + 1 clusters. Missing m + 1 clusters amounts to missing the m clusters and additionally missing the cluster of size \rho_{m+1}, which is missed if either 0 or 1 of its sequences is part of the sample. Combining these cases gives the expression of Equation 2 with m replaced by m + 1. Given that \rho_1 \le \rho_2 \le ... \le \rho_{m+1}, the probability of missing m + 1 clusters of size at least \rho_1 can be upper bounded by the probability of missing m + 1 clusters of size \rho_1.
Conclusion: Since both the base case and the inductive step hold, by mathematical induction the expression for \bar{p}_m holds for every whole number m.
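The expression in Equation 2 can be sanity-checked by simulation. The sketch below (illustrative only; the parameters are arbitrary, not taken from our data sets) estimates, by Monte Carlo sampling, the probability that each of m given clusters contributes at most one sequence to a uniform sample, i.e. that all m clusters are missed.

```python
# Illustrative Monte Carlo check (arbitrary parameters, not from the paper's experiments):
# estimate the probability that every one of m given clusters has at most one
# representative in a uniform sample of n of N sequences (all m clusters "missed").
import random
from collections import Counter

def estimate_miss_all(N, n, cluster_sizes, trials=10_000):
    """Label sequences by cluster index (or -1 for the rest) and sample n of N without replacement."""
    labels = []
    for k, rho in enumerate(cluster_sizes):
        labels.extend([k] * rho)
    labels.extend([-1] * (N - sum(cluster_sizes)))
    missed = 0
    for _ in range(trials):
        sample = random.sample(labels, n)
        counts = Counter(lab for lab in sample if lab >= 0)
        # all m clusters are missed if none of them has 2 or more representatives
        if all(counts.get(k, 0) <= 1 for k in range(len(cluster_sizes))):
            missed += 1
    return missed / trials

if __name__ == "__main__":
    print(estimate_miss_all(N=50_000, n=500, cluster_sizes=[20, 50, 100]))
```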

Theorem-1
Theorem 1: Given N sequences in the database, computing the probability p^t_k of capturing t clusters of size at least k is NP-complete.
Proof. Define the event of capturing cluster i and its probability as c_i and P(c_i), respectively, and define the event of missing cluster i and its probability as \bar{c}_i and P(\bar{c}_i), respectively. The hardness of computing p^t_k follows from expanding p^t_k into probability form and applying the Inclusion-Exclusion Principle:

p^t_k = P(c_1 \cap c_2 \cap ... \cap c_t) = 1 - P(\bar{c}_1 \cup \bar{c}_2 \cup ... \cup \bar{c}_t) = 1 - \sum_{\emptyset \ne I \subseteq \{1,...,t\}} (-1)^{|I|+1} P\left(\bigcap_{i \in I} \bar{c}_i\right).

In order to compute the second term of the expanded formula, we need the probabilities of a list of events and of their intersections, which amounts to enumerating P({\bar{c}_1, \bar{c}_2, ..., \bar{c}_t}), the power set of the event set {\bar{c}_1, \bar{c}_2, ..., \bar{c}_t}.
Note that, by applying Lemma 2, each individual probability term P(\bar{c}_i \cap \bar{c}_j \cap ...) can be computed exactly; however, to compute p^t_k, the total number of probability terms to compute is |P({\bar{c}_1, \bar{c}_2, ..., \bar{c}_t})| = 2^t, and therefore the computation of p^t_k is NP-complete.
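The exponential blow-up is apparent when the expansion is written as code. In the sketch below (illustrative only), miss_prob is a placeholder for the exact joint miss probability that Lemma 2 provides; the inclusion-exclusion loop enumerates all 2^t − 1 non-empty subsets of the t miss events.

```python
# Illustrative sketch: computing p_k^t by inclusion-exclusion enumerates every
# non-empty subset of the t miss events. miss_prob(subset) is a placeholder for the
# exact joint miss probability of the clusters in `subset` (as given by Lemma 2).
from itertools import combinations

def capture_prob(t, miss_prob):
    """p = 1 - P(union of misses), expanded by inclusion-exclusion over 2^t - 1 subsets."""
    union = 0.0
    for size in range(1, t + 1):
        for subset in combinations(range(t), size):   # exponentially many terms
            union += (-1) ** (size + 1) * miss_prob(subset)
    return 1.0 - union

if __name__ == "__main__":
    # Toy check with independent misses of probability 0.1 each:
    # the capture probability should equal (1 - 0.1)^3 = 0.729.
    print(capture_prob(3, lambda subset: 0.1 ** len(subset)))
```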

Mode-shifting Algorithm
Algorithm 1: Mode-shifting
Input: clusters {c_1, c_2, ..., c_n}; a dictionary of counts for sequences, counts; M
Output: a set of representatives
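As context for Algorithm 1, the sketch below illustrates the core idea of mode-shifting under simplifying assumptions: each cluster is a list of member sequences, and counts maps each dereplicated sequence to its multiplicity; the representative of a cluster is shifted to its mode, i.e. the member with the highest multiplicity. This is a simplified illustration, not the exact pseudocode of Algorithm 1.

```python
# Simplified illustration of a mode-shifting step (not the paper's exact Algorithm 1):
# the representative of each cluster is shifted to the member sequence with the
# highest multiplicity (the mode) in the dereplicated data set.
from typing import Dict, List

def mode_shift(clusters: List[List[str]], counts: Dict[str, int]) -> List[str]:
    """Return one representative per cluster: the member with the highest count."""
    representatives = []
    for cluster in clusters:
        # Break ties arbitrarily by taking the first sequence with the maximal count.
        mode = max(cluster, key=lambda seq: counts.get(seq, 0))
        representatives.append(mode)
    return representatives

if __name__ == "__main__":
    clusters = [["ACGT", "ACGA", "ACGG"], ["TTTT", "TTTA"]]
    counts = {"ACGT": 120, "ACGA": 3, "ACGG": 9, "TTTT": 5, "TTTA": 40}
    print(mode_shift(clusters, counts))   # ['ACGT', 'TTTA']
```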

Simulation Analysis
We studied the clustering algorithm using simulations, since the distributions of cluster sizes were not known beforehand. To that end, as mentioned in the main document, we assumed different distributions on the cluster sizes and studied the relationship between the entropy of the cluster sizes and the number of sequences remaining to be clustered in the database. We observed that the entropy dropped with a decreasing number of sequences, and that this relationship remained independent of the nature of the cluster-size distribution. In the main document, we showed plots where we assumed a geometric distribution on the cluster sizes. Here we show results for a truncated normal distribution (Figures S1a and S1b) and, as an extreme case, a uniform distribution (Figures S1c and S1d).
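A minimal version of such a simulation is sketched below (the distribution and parameters are illustrative, not the settings used for Figure S1): cluster sizes are drawn from a geometric distribution, clusters are removed from largest to smallest to mimic the iterative clustering process, and the entropy of the remaining cluster-size distribution is tracked against the number of sequences left in the database.

```python
# Illustrative simulation sketch (assumed parameters, not the paper's settings):
# track the entropy of the remaining cluster-size distribution as the largest
# clusters are removed, mimicking an iterative clustering process.
import numpy as np

rng = np.random.default_rng(0)
sizes = rng.geometric(p=0.001, size=5_000)   # simulated cluster sizes
sizes = np.sort(sizes)[::-1]                 # largest clusters are identified first

remaining_seqs, entropies = [], []
for i in range(len(sizes)):
    rest = sizes[i:]
    probs = rest / rest.sum()
    entropies.append(-(probs * np.log(probs)).sum())   # entropy of remaining cluster sizes
    remaining_seqs.append(int(rest.sum()))              # sequences left in the database

print(list(zip(remaining_seqs[:5], entropies[:5])))
```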
3.2 Evaluating the performance of SCRAPT on the lupus microbiome data set

As described in the main document, a few results on the lupus microbiome data set are included here for want of space. We chose not to include them in the main document as they convey little new information.

Performance of SCRAPT for different similarity thresholds
We studied the performance of SCRAPT with different similarity thresholds. SCRAPT consistently outperforms all the other tools considered in this study, independent of the similarity threshold. However, for want of space we describe our results and interpretations only for a similarity threshold of 98% (0.98); for all the other similarity thresholds the results and conclusions hold as well. The results for all other similarities for each of the data sets considered in this study can be found here: https://drive.google.com/drive/folders/1n976A2n18WUg1jJrSXNZEFUz559jwLB3?usp=sharing

Figure S2: The cumulative number of sequences clustered by (a) SCRAPT, (b) naive sampling (fixed α) with mode-shifting, (c) naive sampling (fixed α) without mode-shifting, and (d) adaptive sampling without mode-shifting, for different values of α on the lupus microbiome data set at a similarity threshold of 0.98.

Comparing mode against the longest sequence for centroid
We compared choosing the longest sequence in a cluster as the centroid against choosing the cluster's mode. For a similarity threshold of 0.98 and an initial sampling rate of 0.1, we ran SCRAPT with the sequence with the highest multiplicity (mode) within a cluster as the centroid and with the longest sequence as the centroid, and compared the resulting clustering quality. We show the cumulative number of sequences clustered and the measure of fragmentation in Figure S4. We see from the figure that SCRAPT with the mode as the centroid clearly outperforms SCRAPT with the longest sequence as the centroid: for the same choice of initial clustering parameters it produces larger and less fragmented clusters, reinforcing our intuition that the mode is a better choice of centroid than the longest sequence.

Comparing SCRAPT against MeShClust 3.0
We compared SCRAPT against MeShClust 3.0, which uses a sampling and mean-shift based approach to perform sequence clustering. We computed the fragmentation measure between SCRAPT and MeShClust 3.0 and further benchmarked the run-times. In its output, MeShClust 3.0 distinguishes members of a cluster, C, cluster centroids, and E, extended members of the cluster. Extended members can be thought of as weak members that violate the similarity criterion by a small amount. From Figure S5a we see that SCRAPT produces less fragmented clusters than MeShClust 3.0, even when its extended members are considered. In Figure S5b we compare the run-times of SCRAPT and MeShClust 3.0 and see that SCRAPT is orders of magnitude faster than MeShClust 3.0.

Comparing the extent of fragmentation between SCRAPT and DADA2
We computed the extent of fragmentation between DADA2 and SCRAPT. We see from Figure S6 that on all three data sets SCRAPT produces less fragmented clusters. However, we see a strong correlation between the sizes of the clusters produced by the two methods. To explain why DADA2 produces more fragmented clusters, we would need the exact members of the ASVs produced by DADA2.

Commands
• We merged the paired-end demultiplexed data using the script multiple_join_paired_ends.py from qiime2 with the following command