A novel conceptual approach to read-filtering in high-throughput amplicon sequencing studies

Adequate read filtering is critical when processing high-throughput data in marker-gene-based studies. Sequencing errors can cause the mis-clustering of otherwise similar reads, artificially increasing the number of retrieved Operational Taxonomic Units (OTUs) and therefore leading to the overestimation of microbial diversity. Sequencing errors will also result in OTUs that are not accurate reconstructions of the original biological sequences. Herein we present the Poisson binomial filtering algorithm (PBF), which minimizes both problems by calculating the error-probability distribution of a sequence from its quality scores. In order to validate our method, we quality-filtered 37 publicly available datasets obtained by sequencing mock and environmental microbial communities with the Roche 454, Illumina MiSeq and IonTorrent PGM platforms, and compared our results to those obtained with previous approaches such as the ones included in mothur, QIIME and USEARCH. Our algorithm retained substantially more reads than its predecessors, while resulting in fewer and more accurate OTUs. This improved sensitivity produced more faithful representations, both quantitatively and qualitatively, of the true microbial diversity present in the studied samples. Furthermore, the method introduced in this work is computationally inexpensive and can be readily applied in conjunction with any existing analysis pipeline.


SN1.1 Error probability distribution associated with a sequence
Let us suppose we have one sequence of length N nucleotides, each nucleotide i with a potentially non-equal (and known) probability p_i of being erroneous and a probability (1 − p_i) of being correct. We are interested in obtaining the probability of the sequence having j = 0, 1, 2, ..., N erroneous nucleotides. Statistically, our problem can be analysed as the probability distribution of the number of successes in a sequence of N independent yes/no experiments with success probabilities p_1, p_2, ..., p_N. This is equivalent to the sum S_N of N independent Bernoulli distributed random variables X_1, X_2, ..., X_N such that

P(X_i = 0) = 1 − p_i,    P(X_i = 1) = p_i,    (SN1.1)

where P(X_i = j) stands for the probability of obtaining j errors in nucleotide i. The stochastic variable S_N is known to follow a Poisson binomial distribution, from which the method proposed in this paper takes its name (see (Poisson, 1837) and (Cramér, 1999) for more information on this statistical distribution). The mean and variance of this distribution are

μ = Σ_{i=1}^{N} p_i,    σ² = Σ_{i=1}^{N} p_i (1 − p_i).

Following Eq. (SN1.1) and taking into account that all X_i are independent, the probability of the sequence having zero errors is

P(S_N = X_1 + X_2 + ... + X_N = 0) = Π_{i=1}^{N} (1 − p_i),

and, more generally, the probability of having exactly j errors is

P(S_N = j) = Σ_σ [ Π_{r=1}^{j} p_{σ_r} ] [ Π_{s ∈ σ^c} (1 − p_s) ],    (SN1.2)

where σ is a map from {1, 2, ..., j} to E = {1, 2, ..., N} that represents a combination of N elements taken j at a time without repetition and satisfies σ_r ≤ σ_s for r < s, and the sum runs over all such combinations. Note that σ^c is the complement of σ, i.e. σ^c = {1, 2, ..., N} \ σ.

SN1.2 Algorithm to obtain the error probability distribution of a sequence
While Eq. (SN1.2) is exact for all j, it becomes useless in practice for moderate values of j, as the number of elements in the summation of the equation is the binomial coefficient

C(N, j) = N! / (j! (N − j)!),

which grows extremely fast with j. We explain here an algorithm inspired by (Butler & Stephens, 1993) that allows us to calculate P(S_N = j) for all j in a simple and efficient way.
If we have two random variables Y and Z, each of them taking discrete values 0, 1, 2, ..., the probability of the sum Y + Z taking value j is

P(Y + Z = j) = Σ_{i=0}^{j} P(Y = i) P(Z = j − i).    (SN1.3)

As we know the estimated probability p_i of each nucleotide being erroneous, the basic idea of the algorithm is to calculate the probability of the first two nucleotides having j errors, add the third nucleotide to calculate the probability of the first three nucleotides having j errors, and continue recursively until we reach the N-th nucleotide and finally obtain the probability that the total sequence has j errors. When this is done for j = 0, 1, 2, ..., N errors, we obtain the error probability distribution of the total sequence. The algorithm reads:

1. Obtain P(X_1 = j) following Eq. (SN1.1). Let U = X_1.

2. For each nucleotide i = 2, ..., N:

(a) Obtain P(X_i = j) following Eq. (SN1.1).

(b) Calculate P(Y + Z = j) applying Eq. (SN1.3), with Y = U and Z = X_i. Note that all the terms P(Y = i) and P(Z = j − i) for i = 0, ..., j needed in Eq. (SN1.3) were already calculated in previous steps of the algorithm. Let U = Y + Z.

3. The probability of the sequence under study having j errors, P(S_N = j), is given by U when i = N.
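The recursion above can be sketched in a few lines of Python. This is a minimal illustration; the function name and interface are ours, not necessarily those of the moira.py implementation:

```python
def error_distribution(probs):
    """Error probability distribution P(S_N = j) of a sequence, given the
    per-nucleotide error probabilities p_1, ..., p_N (a Poisson binomial
    distribution, computed by the iterated convolution of steps 1-3)."""
    # Step 1: distribution of the first nucleotide, Eq. (SN1.1).
    dist = [1.0 - probs[0], probs[0]]
    # Step 2: convolve one nucleotide at a time, Eq. (SN1.3).
    for p in probs[1:]:
        new = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            new[j] += q * (1.0 - p)  # nucleotide correct: error count stays at j
            new[j + 1] += q * p      # nucleotide erroneous: error count becomes j + 1
        dist = new
    # Step 3: after the last nucleotide, dist[j] = P(S_N = j).
    return dist
```

Computing the full distribution costs O(N²) operations, but in practice the recursion can be truncated once the accumulated probability exceeds the confidence coefficient ξ (see subsection SN1.3), which is what makes the method computationally inexpensive.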

SN1.3 Filtering of sequences based on their error distribution
1. In practice, steps 1-3 of the algorithm presented in the former section must only be repeated up to j = j_max, where j_max is the lowest value of j that satisfies

Σ_{r=0}^{j_max} P(S_N = r) ≥ ξ,

and 0 < ξ < 1 is a confidence coefficient (in our case ξ = 0.995). Let the predicted maximum errors j_ξ be the number such that the sequence has a probability ξ of having fewer than j_ξ errors. It is obtained by interpolating the accumulated error probability of the sequence between the values r = j_max − 1 and r = j_max to obtain its exact value at r = j_ξ. A linear interpolation yields

j_ξ = j_max − 1 + ( ξ − Σ_{r=0}^{j_max − 1} P(S_N = r) ) / P(S_N = j_max).    (SN1.4)

2. Let j_tol be the maximum tolerable number of errors per sequence, that is, the maximum number of errors allowed for a correct clustering. In our case, j_tol was equal to 1% of the sequence length (i.e. j_tol = 2 for N = 200 nt or j_tol = 2.5 for N = 250 nt in the samples analysed in this work). The sequence under study is discarded if j_ξ > j_tol, and accepted as correct otherwise. At this point, the calculation for this particular sequence is finished, and the whole algorithm (subsections SN1.2 and SN1.3) is repeated for the rest of the sequences of the population.
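The two steps above can be expressed compactly in code. Again, this is an illustrative sketch with names of our own choosing, not the literal moira.py implementation:

```python
def predicted_max_errors(dist, xi=0.995):
    """Predicted maximum errors j_xi, Eq. (SN1.4): the sequence has a
    probability xi of having fewer than j_xi errors. `dist` is the error
    probability distribution P(S_N = j) of the sequence."""
    accumulated = 0.0
    for j_max, p in enumerate(dist):
        previous = accumulated
        accumulated += p
        if accumulated >= xi:  # j_max found: interpolate linearly
            return j_max - 1 + (xi - previous) / p
    return float(len(dist) - 1)  # only reachable through rounding error

def accept_sequence(dist, seq_length, errors_per_base=0.01, xi=0.995):
    """Step 2: accept the sequence iff j_xi <= j_tol, where j_tol is a
    fixed fraction of the sequence length (1% by default)."""
    return predicted_max_errors(dist, xi) <= errors_per_base * seq_length
```

Note that the loop stops at j_max, so only the low-error head of the distribution ever needs to be computed.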
Let us note that the default value of j_tol was chosen as 1% of the sequence length instead of a fixed integer number. We believe that the predicted number of errors alone is not a reliable estimator of sequence goodness, as its interpretation depends on sequence size (e.g. the effect of having 2 errors is more critical in a 10 nt sequence than in a 200 nt sequence). We therefore propose to use the number of errors per base as a filtering parameter, for the sake of generality. Similarly, clustering is usually done based on relative sequence similarity (e.g. 97% similarity) instead of absolute Hamming distance. As a result, the same choice of parameters has similar effects regardless of sequence length. Note, however, that all this could lead to non-integer values in the number of errors. We believe this is not a problem: the concept of predicted maximum errors j_ξ should be understood as a maximum bound in a statistical context. In fact, the possibility of having non-integer error thresholds allows the user to fine-tune the parameters in order to reach an optimal solution. In any case, we provide the user with the possibility of using only integer error values (via the --round flag of the moira.py script, or by selecting an integer error threshold, for example via the --maxerrors parameter of the moira.py script). Supplementary Figures SN1.1a,b show the error probability distribution and the accumulated error probability distribution, respectively, for a high quality sequence, an ambiguous sequence and a low quality sequence. The results obtained by the algorithm presented above (i.e. the Poisson binomial filtering method) are compared to those obtained with a Poisson approximation (see next subsection). Supplementary Figure SN1.1c shows a zoom of (b) to clarify the algorithm just presented.

Supplementary Figure SN1.1. Error probability distribution (a) and accumulated error probability distribution (b) for three particular sequences of N = 200 nucleotides. The Poisson binomial exact calculation (circles connected with solid lines) and the Poisson approximation (triangles connected with dashed lines) are plotted for a high quality sequence (black), an ambiguous sequence (red) and a low quality sequence (blue). The predicted maximum errors j_ξ stands for the 100ξ-th percentile of the error probability (i.e. a sequence has a probability ξ of having fewer than j_ξ errors). In (c) we plot a zoom of (b) in which the calculation of j_ξ for the Poisson approximation of the ambiguous sequence is sketched (and in which the difference between the Poisson binomial and its Poisson approximation is most evident). j_max is the lowest integer for which the accumulated probability of having j_max errors is greater than ξ. In our calculations, ξ = 0.995 and the maximum tolerable number of errors per sequence is j_tol = 2. Sequences with j_ξ > j_tol are discarded in the filtering step.

SN1.4 A simple numerical example to clarify the Poisson binomial filtering algorithm
Let us numerically solve an especially simple case to clarify the applicability of the Poisson binomial filtering algorithm.
We have a short sequence of N = 3 nucleotides, whose error probabilities are p 1 = p 2 = p 3 = 1/3. They are all the same in this particular case to simplify the calculations, but in general they can be non-equal.
In summary, we have that the sequence of 3 nucleotides of error probabilities p 1 = p 2 = p 3 = 1/3 has a probability 8/27 of having 0 errors, probability 4/9 of having 1 error, probability 2/9 of having 2 errors and probability 1/27 of having 3 errors.
Let us see now if it would be accepted or discarded according to the Poisson binomial filtering algorithm. Following Eq. (SN1.4) and taking into account that j max = 3 and ξ = 0.995, we obtain j ξ = 2.865. In case j tol = 2, the algorithm would discard this sequence.
Finally, let us note that the binomial distribution is a special case of the Poisson binomial distribution, when all probabilities are the same, that is, when p 1 = p 2 = · · · = p N . Therefore, the calculations shown in this example could be checked with the explicit probability mass function of the binomial distribution (Cramér, 1999).
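The numbers in this example are easy to verify in exact rational arithmetic. The following sketch reproduces the convolution result, the binomial cross-check, and the filtering decision:

```python
from fractions import Fraction
from math import comb

probs = [Fraction(1, 3)] * 3  # p_1 = p_2 = p_3 = 1/3, N = 3

# Iterated convolution of subsection SN1.2, kept exact with Fractions.
dist = [1 - probs[0], probs[0]]
for p in probs[1:]:
    new = [Fraction(0)] * (len(dist) + 1)
    for j, q in enumerate(dist):
        new[j] += q * (1 - p)
        new[j + 1] += q * p
    dist = new

# Probabilities of 0, 1, 2 and 3 errors, exactly as quoted above.
assert dist == [Fraction(8, 27), Fraction(4, 9), Fraction(2, 9), Fraction(1, 27)]

# Cross-check: with equal p_i, the Poisson binomial reduces to a binomial.
binomial = [comb(3, j) * Fraction(1, 3) ** j * Fraction(2, 3) ** (3 - j)
            for j in range(4)]
assert dist == binomial

# Filtering decision, Eq. (SN1.4): j_max = 3 and xi = 0.995 give
# j_xi = 2 + (0.995 - 26/27) / (1/27) = 2.865 > j_tol = 2, so discard.
xi = Fraction(995, 1000)
j_xi = 2 + (xi - sum(dist[:3])) / dist[3]
assert j_xi == Fraction(573, 200)  # = 2.865
```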

SN1.5 Approximate calculation of the error probability distribution of a sequence
In the former subsections we presented the Poisson binomial filtering (PBF) method to calculate the exact distribution of the error probability of a sequence, provided the error probabilities p_i of each of its nucleotides. The Poisson binomial filtering algorithm is fast and computationally simple. Here, we compare it with several approximations that can be used instead. As our problem can be regarded as the sum of N binomial distributions with probabilities p_i and number of trials n = 1, it can be approximated by a Poisson distribution as long as N is high and p_i ≪ 1. The probability mass function is then given by

P(S_N = j) ≈ λ^j e^{−λ} / j!,

where P(S_N = j) is the probability of a sequence having j errors, and the mean value and variance of the Poisson distribution, λ, is obtained as

λ = Σ_{i=1}^{N} p_i.

Le Cam's theorem affirms that the sum S_N has approximately a Poisson distribution, and the inequality

Σ_{j=0}^{∞} | P(S_N = j) − λ^j e^{−λ} / j! | < 2 Σ_{i=1}^{N} p_i²

bounds the approximation error in terms of the total variation distance between the exact solution and the approximation. The Poisson approximations following these equations are plotted in Supplementary Figure SN1.1 together with the exact calculation. We can see that the accuracy of the Poisson approximation decreases for high values of p_i, that is, for sequences with low-quality bases. Therefore, the exact method and the Poisson approximation could lead to differences in the acceptance or rejection of sequences for j_tol sufficiently large and j_ξ ∼ j_tol, i.e. for ambiguous sequences.
Note that the difference between the results obtained by the Poisson binomial filtering method and its Poisson approximation can be relevant in real cases. Supplementary Figure SN1.2 shows the relationship between both methods for the sequences present in the Even1P dataset (see Supplementary Note 3). The Poisson binomial filtering discarded 11% fewer sequences than its Poisson approximation, which is remarkable considering that the goal of the method presented here is to improve the precision of classifications as much as possible. But what makes this 11% of sequences be correctly accepted by the PBF method and wrongly rejected by the Poisson approximation? In typical cases, we have found that this difference in precision is due to the existence of a single nucleotide with a low quality score, which prevents the satisfaction of the hypothesis p_i ≪ 1 cited above. Let us show a simple example to clarify this point. A sequence of N = 200 nucleotides, one of them with Q = 10 and the rest with Q = 30, leads to a bound for the difference between the Poisson binomial distribution and its Poisson approximation, obtained from Le Cam's theorem, of 2 Σ_{i=1}^{N} p_i² = 0.0204 (applying the equivalence p_i = 10^(−Q/10)). This value is 50 times larger than the bound obtained for the same sequence in which all nucleotides show the same quality Q = 30 (in which case 2 Σ_{i=1}^{N} p_i² = 4 × 10⁻⁴, a very low value that makes the Poisson approximation a valid assumption). In fact, a large number of real cases from the Even1P dataset that were accepted by the PBF method and wrongly rejected by its Poisson approximation gave rise to differences of one or even two orders of magnitude in the Le Cam bound.
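The bound quoted above is straightforward to reproduce (`phred_to_prob` and `lecam_bound` are our illustrative names):

```python
def phred_to_prob(q):
    """Per-base error probability from a Phred quality score: p = 10**(-Q/10)."""
    return 10.0 ** (-q / 10.0)

def lecam_bound(probs):
    """Le Cam bound, 2 * sum(p_i^2), on the total variation distance between
    the exact Poisson binomial distribution and its Poisson approximation."""
    return 2.0 * sum(p * p for p in probs)

# A 200 nt read with a single Q = 10 base among otherwise Q = 30 bases,
# versus a read whose 200 bases are all Q = 30.
mixed = [phred_to_prob(10)] + [phred_to_prob(30)] * 199
uniform = [phred_to_prob(30)] * 200

print(lecam_bound(mixed))    # ~0.0204: Poisson approximation unreliable
print(lecam_bound(uniform))  # ~4e-4: Poisson approximation valid
```

A single low-quality base thus dominates the bound, which is why the Poisson approximation fails precisely on the ambiguous sequences that matter most for filtering.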
In summary, the precision of the Poisson approximation is extremely dependent on the existence of a few (even one) bad-quality nucleotides. In those cases, the values of j_ξ obtained by the Poisson binomial filtering and by its Poisson approximation are sufficiently different to substantially influence the number of sequences that are accepted or not. Other approximations of higher precision than the Poisson approximation, but of a more complex implementation, such as the Kolmogorov-type approximations or the ones obtained using the Pearson family of distributions, are also available (Butler & Stephens, 1993). However, the accuracy, simplicity and speed of the exact method make it in general a better option than the approximative methods, as the latter will be either less precise or more complex to implement.

Supplementary Figure SN1.2. Zoom showing high quality sequences and our error cutoff of 2 predicted maximum errors per sequence (1% of sequence length). Numbers in green: sequences accepted by the Poisson binomial filtering (y axis) and its Poisson approximation (x axis). PBF discarded 11% fewer sequences than its Poisson approximation.

SN2.1 Calculation of the statistical properties of a filtered population of sequences
Let us suppose that we have M sequences belonging to k different taxa. In a process of filtering we eliminate m < M sequences chosen randomly from the total population, in a way that all sequences are equally likely to be erased. We name n_i(m) the number of sequences of taxon i = 1, ..., k that remain in the population after removing m sequences, and p_i(m) = n_i(m) / (M − m) the fraction of such remaining population that belongs to taxon i. In this calculation, we are interested in the statistical properties (i.e., mean and variance) of the fraction p_i(m) of sequences of each taxon i over the total number of remaining sequences as the number of removed sequences m grows.
The hypergeometric distribution applies to sampling without replacement from a finite population whose elements can be classified into two mutually exclusive categories (in our case, each taxon i and the rest of the population). Therefore, the stochastic variable X_i that represents the number of sequences removed from taxon i when m sequences are randomly chosen from the total population follows a hypergeometric distribution of population size M, number of initial success states in the population n_i(m = 0) and number of draws m. In consequence, the probability mass function of X_i is given by

P(X_i = x) = C(n_i(0), x) C(M − n_i(0), m − x) / C(M, m),

where P(X_i = x) is the probability of erasing exactly x sequences from taxon i after eliminating m sequences from the total population. Its mean and variance are, respectively,

E[X_i] = m n_i(0) / M

and

Var[X_i] = m (n_i(0) / M) (1 − n_i(0) / M) (M − m) / (M − 1),

for every taxon i.

As mentioned above, we are interested in the fraction of the remaining total population that belongs to taxon i after removing m sequences, that is,

p_i(m) = n_i(m) / (M − m) = (n_i(0) − X_i) / (M − m),

and its statistical properties are

E[p_i(m)] = (n_i(0) − E[X_i]) / (M − m) = n_i(0) / M = p_i(0),    (SN2.1)

Var[p_i(m)] = Var[X_i] / (M − m)² = p_i(0) (1 − p_i(0)) m / ((M − m)(M − 1)).    (SN2.2)

Equation (SN2.1) yields that the mean value E[p_i(m)] = p_i(0), and it is therefore independent of the number of removed sequences m. On the contrary, its variance grows monotonically with m, as Eq. (SN2.2) shows. In summary, we observe that the filtering process of a total population leads to fractions of each taxon that are maintained on average but grow in variance as the number of removed sequences m increases. This leads to the following general statement: the more sequences are erased, the larger the error we will commit when measuring the composition of the total population.
Finally, note that the term M − 1 in the denominator of Eq. (SN2.2) moderates the influence of m in the generation of taxonomic bias (see Supplementary Note 6 for empirical evidence). A greater source of taxonomic bias during sequence filtering is described in the next subsection.
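The behaviour described by Eqs. (SN2.1) and (SN2.2) can be checked with a small Monte Carlo simulation (illustrative code, not part of the published pipeline):

```python
import random

def remaining_fraction_stats(n_taxon, M, m, trials=20000, seed=42):
    """Empirical mean and variance of p_i(m), the fraction of a taxon with
    n_taxon initial sequences that remains after randomly erasing m of the
    M sequences in the population."""
    population = [1] * n_taxon + [0] * (M - n_taxon)  # 1 = taxon i, 0 = rest
    rng = random.Random(seed)
    fracs = []
    for _ in range(trials):
        kept = rng.sample(population, M - m)  # erase m sequences at random
        fracs.append(sum(kept) / (M - m))
    mean = sum(fracs) / trials
    var = sum((f - mean) ** 2 for f in fracs) / trials
    return mean, var

# M = 100 sequences, 30 of them from taxon i, so p_i(0) = 0.3.
for m in (10, 50, 90):
    mean, var = remaining_fraction_stats(30, 100, m)
    predicted_var = 0.3 * 0.7 * m / ((100 - m) * 99)  # Eq. (SN2.2)
    print(m, mean, var, predicted_var)
```

The empirical mean stays close to 0.3 for all m, while the empirical variance grows with m and tracks the prediction of Eq. (SN2.2).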

SN2.2 Taxonomic bias is inherent to raw reads in 454 and Illumina sequencing platforms
Most of the discussion on taxonomic bias in massive sequencing projects has focused on the coverage bias (i.e. some fragments yielding a larger number of amplicons than others) caused by PCR amplification (Polz & Cavanaugh, 1998; Huber et al., 2009; Kumar et al., 2011; Ross et al., 2013; Klindworth et al., 2013). However, it is generally assumed that, after sequencing a pool of DNA molecules, the length and quality distributions of the obtained reads are independent of their taxonomic origin. If that assumption did not hold, the length trimming and quality filtering processes would result in the under-representation of the taxa with smaller average length and quality, generating a bias in the retrieved community composition.
We have found evidence of this problem occurring in the 454 datasets analysed for this work. After taxonomically classifying all the raw reads from the 454 Even3T library with mothur, we found that the reads classified into the genus Staphylococcus (the most abundant taxon in the sample) were significantly shorter than the reads classified as Streptococcus (the second most abundant taxon in the sample) (Supplementary Figure SN2.1a,b). We also found that the Staphylococcus reads had more expected errors per base than the Streptococcus reads (Supplementary Figure SN2.1c). This is not surprising: since in the 454 platform the read quality drops towards the end of the reads (Edgar, 2013), the longer reads are expected to have a higher average quality. Our pipeline included a pre-filtering step in which reads were trimmed to a fixed length of 250 nt, and reads shorter than 250 nt were discarded. Because of the differences in the read length distributions, 11.7% of the Staphylococcus reads were discarded during this step, in contrast to only 7.8% of the Streptococcus reads. This produced a 4.4% (since 92.2/88.3 = 1.044) artificial enrichment of Streptococcus versus Staphylococcus in the pre-filtered dataset. Pre-filtering also amplified the differences in quality between the Staphylococcus and the Streptococcus reads. Due to the relationship between position and quality in 454 reads, trimming to a fixed length will leave only high quality nucleotides in the longer reads, but may leave low quality nucleotides at the end of the shorter ones. Differences in length distribution will thus result in differences in quality distribution after trimming (Supplementary Figure SN2.1c,d). Even if the pre-filtering step considerably reduced the average expected errors per base, it also caused a significant increase in the taxonomic bias.
Performing quality filtering with our default cutoff of 0.01 errors per base would have discarded 59.6% of the Staphylococcus reads but only 37.0% of the Streptococcus reads, resulting in a 56.2% artificial enrichment of Streptococcus versus Staphylococcus in the filtered dataset. Note that this comes in addition to the previous enrichment caused by the length trimming step.
Length and quality biases were also detected for the IonTorrent platform, a fact that has been described elsewhere (Salipante et al., 2014). For example, filtering with our 200 nt length and 0.01 errors per base cutoffs removed 89.46% of the Bacteroides reads (the most abundant genus in the sample) from the IonTorrent Even2P dataset, but 94.56% of the Streptococcus reads (the second most abundant genus in the sample), resulting in a 192% artificial enrichment of Bacteroides versus Streptococcus. A similar problem, albeit to a lesser extent, was found in reads generated using the Illumina MiSeq platform. In this case, reads from different taxa showed different quality distributions. When quality-filtering raw reads from the MiSeq Even1M library, we found that 63.03% of the reads classified as Staphylococcus (the most abundant genus in the sample) had 0.01 errors per base or more, while 68.37% of the reads classified as Acinetobacter did. Filtering with those parameters caused a 16.9% artificial enrichment of Staphylococcus versus Acinetobacter.
These biases likely originate during base/quality calling: for instance, 454 reads show a systematic decrease in quality after homopolymer regions (Brockman et al., 2008), which will penalise the taxa with longer homopolymer stretches in their 16S gene. We propose the incorporation of a simple step into the filtering pipeline that substantially minimises this problem. It is based on the assumption that, in spite of differences in quality, identical sequences should have the same origin, as it is unlikely that two biologically unrelated sequences would become identical due to sequencing errors. Thus, identical sequences were collapsed, and the representative with the highest overall quality was used to decide whether the whole group was discarded or allowed into the filtered dataset. In practice, this mitigates the effect of biases in quality distribution, as even low abundance sequences are expected to have a high quality representative. Our solution rendered similar quality distributions for the different taxa, even after length trimming (Supplementary Figure SN2.1e,f), and significantly lower taxonomic biases than other filtering approaches, especially for 454 data (Figure 2c). Every method that relies on quality scores for sequence filtering will be affected by this source of bias. We therefore propose the approach described above as a general solution to this problem, since its simplicity makes it very easy to integrate into any filtering pipeline.

Supplementary Figure SN2.1. The arrows in (b) indicate the fraction of reads from each taxon removed after discarding sequences shorter than l_min = 250 nt. (c, d, e) Errors-per-base distributions of Streptococcus and Staphylococcus reads in (c) the raw dataset, (d) after trimming the reads to 250 nt and discarding the ones shorter than the cut-off, and (e) after collapsing the trimmed reads. The dashed lines indicate average errors per base. Note that length trimming substantially increases the difference between the Streptococcus and Staphylococcus error distributions (d) when compared to that of the raw reads (c). Filtering at this point would cause a 56.2% overrepresentation of Streptococcus versus Staphylococcus. Collapsing identical reads prior to filtering solves this problem (e), reducing the overrepresentation to 1%. (f) Compositional bias generated during the pre-processing and filtering of the six 454 mock community samples, measured as the Bray-Curtis dissimilarity between the raw and the processed read communities. This shows that the results in (c, d, e) can be generalised to all the taxa present in all the samples.
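A minimal sketch of the collapsing step (hypothetical data structures of our own choosing; the actual moira.py implementation may differ):

```python
def collapse_reads(reads):
    """Collapse identical sequences into one group, keeping as representative
    the copy with the fewest expected errors (i.e. the highest overall
    quality). The filter verdict on the representative then applies to the
    whole group.

    `reads` is an iterable of (sequence, per_base_error_probs) pairs.
    Returns a dict: sequence -> (best_error_probs, number_of_copies)."""
    best = {}    # sequence -> (expected_errors, error_probs) of the best copy
    counts = {}  # sequence -> number of identical copies seen so far
    for seq, probs in reads:
        expected_errors = sum(probs)  # expected number of errors of this copy
        counts[seq] = counts.get(seq, 0) + 1
        if seq not in best or expected_errors < best[seq][0]:
            best[seq] = (expected_errors, probs)
    return {seq: (probs, counts[seq]) for seq, (_, probs) in best.items()}
```

Because even low-abundance sequences are likely to have at least one high-quality copy, judging each group by its best representative decouples the filtering decision from taxon-specific quality distributions.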

Dataset accession numbers and abbreviations
This table shows details on the 16S datasets used in this study. NCBI Short Read Archive accession numbers are provided, unless otherwise specified. The 454 even mock community 1 was taken from the example in mothur's 454 SOP webpage (http://www.mothur.org/wiki/454_SOP). The Illumina mock communities were taken from Bokulich et al. (2013) and Edgar (2013). For Illumina datasets, the number of pairs is provided. The IonTorrent environmental communities were taken from mothur's SOP webpage (http://www.mothur.org/wiki/Ion_Torrent_sequence_analysis_using_Mothur).
* When a dataset contained more than 40,000 reads, all analyses were performed over a random subsample of 40,000 raw reads. For each of those datasets, an additional file containing the names of the randomly selected sequences is provided as a Supplementary resource.

SN4.1. Poisson binomial filtering accurately discriminates between good and erroneous sequences
When applying our default cut-off of 1% errors allowed per sequence, our algorithm accurately classified 96% of the mock community sequences from the Even1M dataset (Figure 1d). 3% of the sequences were incorrectly discarded while, remarkably, only 1% of the sequences were incorrectly retained. Moreover, most of those incorrectly retained sequences had only 3 true errors (1.2% errors per sequence), meaning that they would likely cluster correctly when applying the standard 3% OTU distance cut-off. The rest of the Illumina datasets rendered similar results. The accuracy of our method was slightly lower for the 454 and IonTorrent datasets, but it nevertheless resulted in a minimum of 88% (for 454) and 79% (for IonTorrent) correctly classified sequences (Supplementary Figure SN4.1).

SN4.2. Performance of the different filtering methods on the mock community datasets
Publicly available datasets from even and staggered mock communities from the Human Microbiome Project were filtered with PBF, mothur, USEARCH and QIIME (Figure 2, Supplementary Tables SN4.2, SN4.4). These artificial communities contain known amounts of 16S rRNA gene copies from 20 different bacterial organisms. The fact that both the qualitative and quantitative composition of the samples is known beforehand allowed us to thoroughly compare the effects of the different filtering methods in terms of OTU accuracy, alpha diversity and community composition. The accuracy of an OTU was defined as the maximum similarity of its representative sequence to the 16S sequences of the microorganisms used to build the mock community, as previously described in Edgar (2013). We were also interested in determining how the different filtering processes affected the observed community composition. The taxonomic bias in community composition caused by any given filtering method was calculated as the Bray-Curtis dissimilarity between the raw and the filtered datasets, after taxonomically classifying their reads down to the genus level.
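As a concrete reference, this dissimilarity can be computed from two genus-level count tables roughly as follows. This sketch uses one common convention (Bray-Curtis on relative abundances); the exact normalization used in the published pipeline may differ:

```python
def bray_curtis(raw, filtered):
    """Bray-Curtis dissimilarity between two taxonomic profiles, given as
    dicts mapping genus -> read count. 0 means identical composition,
    1 means no shared taxa."""
    raw_total = sum(raw.values())
    filtered_total = sum(filtered.values())
    taxa = set(raw) | set(filtered)
    # On relative abundances the denominator sum(a_t + b_t) equals 2,
    # so BC = sum(|a_t - b_t|) / 2.
    return sum(abs(raw.get(t, 0) / raw_total -
                   filtered.get(t, 0) / filtered_total)
               for t in taxa) / 2.0

# A filter that removes taxon B reads faster than taxon A biases the profile:
raw = {"Staphylococcus": 500, "Streptococcus": 500}
filtered = {"Staphylococcus": 300, "Streptococcus": 100}
print(bray_curtis(raw, filtered))  # 0.25
```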
In the even datasets, which contain the same number of 16S rRNA gene copies for each organism, all methods resulted in more than 20 OTUs after clustering. This was not surprising, since contaminations, PCR errors and sequencing errors were expected to inflate the observed diversity. In the staggered communities, in which the number of 16S rRNA gene copies varied by several orders of magnitude between the different organisms, the observed diversity was generally lower, due to some species being present at very low abundances. The total number of reported OTUs greatly varied between filtering methods, with Poisson binomial filtering consistently resulting in values that were the closest to the true diversity of the samples.
PBF also produced the highest proportion of accurate OTUs in all the 16S mock datasets for both sequencing platforms, while minimizing the number of singletons and spurious OTUs retrieved (Figure 2a,b). In the 454 and IonTorrent datasets it also discarded the smallest number of reads and resulted in the smallest taxonomic bias (Figure 2c,d). In the Illumina datasets QIIME retrieved a larger number of reads, while both QIIME and mothur caused smaller taxonomic biases than our method (Figure 2c,d - Illumina). However, we believe that this was the result of too shallow a filtering by mothur and QIIME, since both methods produced a remarkably lower proportion of accurate OTUs and a larger number of OTUs and singletons (Figure 2a,b - Illumina). QIIME produced an especially high number of spurious OTUs, a fact that has also been discussed elsewhere (Edgar, 2013). Their pipeline (Bokulich et al., 2013) deals with this problem by applying a post-hoc OTU size cut-off at the cost of sensitivity. Nonetheless, our results show that, even after the removal of singletons from the QIIME-filtered dataset, its number of OTUs would exceed that of the dataset filtered with our method, including singletons (Supplementary Table SN4.4).
The two filtering algorithms included in the USEARCH suite showed an intermediate performance in terms of the number and accuracy of the OTUs retrieved for both the 454 and Illumina platforms. Quality trimming yielded the smallest number of reads and resulted in the highest taxonomic bias, which supports the idea that over-stringent filtering may lead to undesirable effects. In the IonTorrent datasets, USEARCH filtering (as recommended in http://www.brmicrobiome.org/#!16sprofilingpipeline/cuhd) performed below Poisson binomial filtering in all the studied benchmarks (Figure 2 - IonTorrent). Finally, the mothur implementation of the PyroNoise algorithm (Quince et al., 2009) showed lower OTU accuracy than the other methods tested for filtering 454 reads. It has been previously described that the denoising process can introduce minor alterations in the original reads (Gaspar & Thomas, 2013), a phenomenon that might explain these results. It must be noted that, although a pipeline for filtering IonTorrent reads with PyroNoise has been described, the publicly available IonTorrent mock community datasets were only available in Fastq format (Stephen Salipante, personal communication), which precluded the use of flowgram denoising algorithms. However, this limitation was not present for the environmental datasets, and a comparison of quality filtering algorithms for IonTorrent datasets that includes PyroNoise can therefore be found in Supplementary Figure

Figure SN4.1. Poisson binomial filtering accurately discriminates between good and erroneous sequences.
Comparison between the maximum predicted errors j_ξ calculated by the Poisson binomial algorithm and the true number of errors for all sequences from the 454 and Illumina MiSeq mock community datasets. Dots represent unique sequences. True mock community sequences are plotted in blue, contaminant sequences are plotted in gray, and chimeric sequences are plotted in red. The blue background represents sequence abundance. Red lines indicate our error cutoff of 2.5 errors per sequence (j_tol; 2 errors per sequence for the IonTorrent datasets). The percentage of true mock community sequences present in each quadrant is also indicated. The graph is plotted in logarithmic scale (the 0 in the x-axis is added for clarity).

This table gives the taxonomic composition of the raw datasets from the 16S mock communities sequenced with the Illumina MiSeq platform, classified at the genus level with mothur's classify.seqs command. The taxonomic composition from both the assembled paired-end reads (labeled with the suffix "P") and the forward single reads (labeled with the suffix "S") is provided.

SN5.1. Performance of the different filtering methods on the environmental communities datasets
The performance of the different filtering methods was also evaluated by quality-filtering publicly available datasets obtained by sequencing environmental communities (Supplementary Figures SN5.1, SN5.2, SN5.3). The results were similar to the ones obtained with the mock communities, with Poisson binomial filtering being the most consistent method in producing the smallest number of OTUs and singletons. Additionally, the OTUs obtained with PBF were overall the most similar to the 16S sequences present in the SILVA 16S reference alignment, which suggests that they contained the smallest number of errors. In the environmental 454 datasets, PyroNoise showed better results than in the 454 mock communities, but did so in an irregular fashion, especially in terms of OTU accuracy (Supplementary Figure SN5.1d). This inconsistency may again be due to the alteration of the original reads, and suggests that PyroNoise requires a finer parameter optimization than other approaches in order to be fully effective.
In the environmental IonTorrent datasets, PyroNoise discarded the smallest number of reads, but resulted in the highest numbers of singletons and OTUs, which also bore the least similarity to the reference alignment. USEARCH showed an intermediate performance.

Figure SN5.3. Results on the IonTorrent environmental communities. (a, b): Number of singletons (a, bars), total species (a, symbols) and reads (b) retrieved after filtering the raw reads with the different methods and performing chimera removal and clustering with a common pipeline. OTU and singleton numbers were obtained by averaging the results from 100 independent library size standardizations. (c): Taxonomic bias caused by the different filtering methods, measured as the Bray-Curtis dissimilarity between the raw and the filtered read communities. (d): Average percent OTU similarity to their best hit in the SILVA bacterial 16S reference alignment. This number was obtained by averaging the results from 100 independent library size standardizations.

As demonstrated in Supplementary Note SN2.1, random removal of reads from a community will have a minor effect on the observed community composition. In order to exemplify this with experimental data, we randomly removed an increasing number of reads from six mock and environmental communities sequenced with the three platforms analyzed in this study, and calculated the resulting taxonomic bias (measured as the Bray-Curtis dissimilarity between the unfiltered and the subsampled communities, as described in the methods section). For each data point, ten independent subsamples were taken, and the average and standard deviation were reported. We compared the resulting biases with those obtained when applying the different filtering algorithms.
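The Bray-Curtis dissimilarity used throughout these comparisons has a simple form: the summed absolute abundance differences divided by the total abundance. A minimal sketch, with hypothetical per-genus read counts made up for illustration:

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance vectors.

    0 means identical composition, 1 means no shared taxa.
    """
    assert len(x) == len(y)
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den if den else 0.0

# Hypothetical per-genus read counts before and after filtering.
raw      = [500, 300, 150, 50]
filtered = [480, 310, 140, 20]
print(bray_curtis(raw, filtered))  # a small value indicates little taxonomic bias
```

Note that the counts are compared directly, so in practice both communities should first be subsampled to the same library size, as done in this study.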

Supplementary Note 7. Effect of the choice of quality-filtering method on the inference of ecological patterns
While the previous sections (and previous works; see references in the main text) show that the choice of filtering method can greatly affect features such as the number of discarded sequences or the observed numbers of OTUs and singletons, molecular ecology studies are more likely to be focused on inter-sample comparisons. Therefore, analysing the same set of samples with two filtering methods could theoretically lead to the same ecological interpretation, even if the individual results for both methods differ greatly on a sample-by-sample basis. In this section we explore how the choice of filtering method can affect the inference of ecological patterns based on relative community richness, diversity and composition. The datasets used in this section were obtained from the mothur SOPs for the analysis of 454, Illumina and IonTorrent datasets, and can be downloaded from the following URL addresses:
http://www.mothur.org/wiki/454_SOP
http://www.mothur.org/wiki/MiSeq_SOP
http://www.mothur.org/wiki/Ion_Torrent_sequence_analysis_using_Mothur

SN7.1. Effect of the choice of filtering method on relative community richness
The datasets for the three sequencing platforms, obtained from their corresponding mothur SOP webpages, were processed using the different filtering methods as described in Supplementary Note 9, and the Chao1 richness index was calculated for each sample and filtering method. This table shows the Pearson correlation of the Chao1 index profiles across all samples for each possible pair of filtering methods. In all cases, the filtered datasets were subsampled to a similar number of sequences per sample in order to avoid size effects. Abbreviations: PBF, Poisson binomial filtering; UEE, USEARCH expected errors; QT, quality trimming.

Simulated datasets were generated using the XS program (Pratas et al., 2014). The script moira.py was run on an Intel Core i7-2670QM at 2.2 GHz, using a single core. Abbreviations: PB/CPython: Poisson binomial algorithm, naive Python implementation run on the CPython interpreter; PB/CPython+C: Poisson binomial algorithm, Python + C extension run on the CPython interpreter; PB/PyPy: Poisson binomial algorithm, naive Python implementation run on the PyPy interpreter; Poisson/CPython: Poisson approximation, naive Python implementation run on the CPython interpreter; Poisson/PyPy: Poisson approximation, naive Python implementation run on the PyPy interpreter.
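For reference, the Chao1 index can be computed from an OTU abundance vector via its singleton and doubleton counts. The sketch below uses the bias-corrected form of the estimator (which avoids division by zero when no doubletons are present) and a made-up OTU table:

```python
from collections import Counter

def chao1(abundances):
    """Bias-corrected Chao1 richness estimate from per-OTU read counts."""
    counts = Counter(abundances)          # abundance value -> number of OTUs
    s_obs = sum(1 for a in abundances if a > 0)
    f1 = counts[1]                        # singleton OTUs
    f2 = counts[2]                        # doubleton OTUs
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

# Hypothetical OTU table for one sample: many singletons inflate the estimate,
# which is why spurious singleton OTUs distort richness comparisons.
otu_counts = [10, 5, 3, 2, 2, 1, 1, 1, 1]
print(chao1(otu_counts))  # -> 11.0
```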

Supplementary Note 9. Theoretical and practical differences between the Poisson binomial filtering algorithm and USEARCH expected errors filtering
During the review process of this manuscript, new information regarding the USEARCH expected errors (E_max) filtering algorithm was published by Edgar & Flyvbjerg (2015). They calculate the expected errors of a sequence by using a Poisson distribution. This has the advantage of being extremely simple, as the expected errors of a sequence can then be calculated as the sum of the error probabilities of its bases. However, we have shown that approximating a Poisson binomial distribution with a Poisson distribution may lead to significant differences in practice, as even one bad-quality base in an otherwise good-quality sequence will result in approximation errors according to Le Cam's theorem (Supplementary Note SN5.5).
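The gap between the two distributions is easy to reproduce. In the sketch below (an illustration, not code from either implementation), the exact Poisson binomial probability of an error-free sequence is the product of the per-base probabilities of being correct, while the Poisson approximation gives exp(-E); the per-base error probabilities are example values derived from Phred scores via p = 10^(-Q/10):

```python
import math

def p_zero_exact(probs):
    """Exact Poisson binomial probability of zero errors: prod(1 - p_i)."""
    out = 1.0
    for p in probs:
        out *= 1.0 - p
    return out

def p_zero_poisson(probs):
    """Poisson approximation: P(0 errors) = exp(-E), with E = sum(p_i)."""
    return math.exp(-sum(probs))

# 249 high-quality bases (Q30, p = 1e-3) plus one bad base (Q2, p ~ 0.63).
probs = [1e-3] * 249 + [0.63]
print(p_zero_exact(probs), p_zero_poisson(probs))
```

Since exp(-p) >= 1 - p for every base, the Poisson approximation always overestimates the probability of an error-free read, and the single Q2 base makes the overestimation substantial here.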
The USEARCH expected errors filter is therefore different from the Poisson binomial filtering algorithm presented in this work, but equivalent to its Poisson approximation, as shown by Edgar & Flyvbjerg in Eq. (6) of their paper. Even in that case, we believe that the maximum predicted errors j_ξ presented in our work may be clearer than Edgar & Flyvbjerg's E/E_max, since the term E_max (and our own term, maximum predicted errors) could suggest that an upper bound is being placed on the number of errors, that is, that no sequence passing the filter should have more than E_max (or j_tol, in our notation) errors.
The relationship between the expected errors E and the probability of having fewer than E errors is non-trivial. In Supplementary Figure SN9.1 we show the fraction of sequences that, having E < 1 (and therefore being accepted as correct by the E_max method), would indeed have 0 true errors, assuming a Poisson distribution. For very low values of E, almost all sequences are classified correctly, but for E close to 1, less than 40% of the sequences accepted as having zero errors would really have zero errors, leading to a large underestimation of the number of errors.
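These figures follow directly from the Poisson model, under which the probability of an error-free sequence is exp(-E). A quick check for a few values of E:

```python
import math

# Under a Poisson model with expected errors E, P(zero errors) = exp(-E).
for E in (0.01, 0.1, 0.5, 1.0):
    print(f"E = {E:4}:  P(0 errors) = {math.exp(-E):.3f}")
# At E = 1, exp(-1) ~ 0.368, i.e. less than 40% of the accepted
# sequences would truly be error-free.
```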
In order to test the impact of this issue, we have validated both methods by classifying the sequences of the Even1P dataset, as described in the methods section of the main text ("Validation of Poisson binomial filtering on Mock Community data"). The flag --eeout was provided to USEARCH in order to obtain the expected errors of each sequence. Parameters were adjusted so that both methods discarded approximately the same number of sequences. Supplementary Figure SN9.2 shows that, for a similar number of discarded sequences, USEARCH produced far more false negatives (that is, sequences with E < E_max that nonetheless had more than E_max true errors) than PBF.
Even a small proportion of erroneous sequences could significantly affect the results of 16S-rRNA gene-based molecular ecology studies. In this context, users will want to establish a clear upper bound for the number of errors present in their quality-filtered sequences. The error numbers reported by PBF can be easily interpreted by the user: the sequence has a fixed (but user-settable) probability of having more errors than reported. On the other hand, an E_max filter will pass many sequences with more than E_max errors, in a proportion that is not fixed but depends on E_max. This makes the use of mock/PhiX data to correctly set the E_max parameter a critical step, as described in Edgar & Flyvbjerg (2015). PBF, or its Poisson approximation (as used in this work), is free of this particular issue.
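The guarantee described above can be sketched in code: the full Poisson binomial distribution is obtained by convolving the per-base error probabilities, and the maximum predicted errors j_ξ is the smallest error count whose cumulative probability reaches the chosen confidence level. This is an illustrative re-implementation, not the actual moira.py code, and the 0.005 significance level is just an example value:

```python
def poisson_binomial_pmf(probs):
    """P(j errors) for j = 0..N, by iterative convolution (O(N^2))."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for j, q in enumerate(pmf):
            nxt[j] += q * (1.0 - p)      # this base is correct
            nxt[j + 1] += q * p          # this base is erroneous
        pmf = nxt
    return pmf

def max_predicted_errors(probs, alpha=0.005):
    """Smallest j such that P(errors <= j) >= 1 - alpha.

    A sequence then has at most probability alpha of carrying
    more than j errors, which is the fixed guarantee described above.
    """
    cum = 0.0
    for j, q in enumerate(poisson_binomial_pmf(probs)):
        cum += q
        if cum >= 1.0 - alpha:
            return j
    return len(probs)

# A mostly Q30 read (p = 1e-3) with one bad base (p = 0.5):
# j_xi is driven almost entirely by that single base.
probs = [1e-3] * 99 + [0.5]
print(max_predicted_errors(probs))  # -> 2
```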
In any case, we want to note that Edgar & Flyvbjerg recommend the use of mock/PhiX data to measure the correlation of Q-score-derived benchmarks with the true error rate of the sequences. While Q-scores were generally a good predictor of the true error rates for all the mock datasets analysed in this study (which covered the 454, Illumina MiSeq and IonTorrent platforms), we fully agree with Edgar & Flyvbjerg (and others) that using mock/PhiX data to fine-tune the filtering parameters will result in more accurate quality-filtering.

Figure SN9.2. Comparison between the numbers of errors j_ξ and E predicted by the Poisson binomial and USEARCH filtering algorithms (with and without the previous application of the collapsing step presented in this work), and the true number of errors for all sequences from the Even1P mock community dataset. Dots represent unique sequences. True mock community sequences are plotted in blue, contaminant sequences in gray, and chimeric sequences in red. The blue background represents sequence abundance (note that for PBF and USEARCH-collapsed, few unique sequences may have a high number of representatives, and vice versa). Red lines indicate the error cut-off (j_tol), which was adjusted so that all the methods retained approximately the same number of sequences (42-44%). The plot is thus divided into four quadrants corresponding to correctly retained sequences (lower left), correctly discarded sequences (upper right), incorrectly discarded sequences (upper left) and incorrectly retained sequences (lower right). The percentage of true mock community sequences present in each quadrant is also indicated.

SN10.3 Commands used for filtering the Illumina datasets

SN10.3.1 Length truncation of contigs generated from Illumina paired-end reads
While the sequences used in this study were normally truncated to 250 nucleotides, some environmental communities were sequenced with pairs of primers that generated smaller contigs. In those cases, contigs were truncated to a smaller length. This modified the values of the following parameters:
- mothur: minlength, keepfirst
- Quality-trimming and USEARCH expected errors: fastq_trunclen
- Poisson binomial filtering: truncate
A file containing the truncation lengths used for those datasets is provided as a Supplementary resource.

SN10.3.2 Commands for mothur
NOTE: mothur developers recommended a value of 275 for the maxlength parameter when analyzing the V4 region of the 16S rRNA gene. Some of the environmental communities were sequenced using different primers. For them, a custom maxlength value was selected, based on the contig length distribution, in order to discard clearly misassembled contigs. A file containing the maxlength values used in these datasets is provided as a Supplementary resource.

mothur
make.contigs(ffasta=reads_1.fasta, rfasta=reads_2.fasta, fqfile=reads_1.qual, rqfile=reads_2.qual)
trim.seqs(fasta=reads.trim.contigs.fasta, maxambig=0, maxlength=275, minlength=250, keepfirst=250)
quit()