Shepherd: accurate clustering for correcting DNA barcode errors

Abstract

Motivation: DNA barcodes are short, random nucleotide sequences introduced into cell populations to track the relative counts of hundreds of thousands of individual lineages over time. Lineage tracking is widely applied, e.g. to understand evolutionary dynamics in microbial populations and the progression of breast cancer in humans. Barcode sequences are unknown upon insertion and must be identified using next-generation sequencing technology, which is error prone. In this study, we frame the barcode error correction task as a clustering problem with the aim of identifying true barcode sequences from noisy sequencing data. We present Shepherd, a novel clustering method based on an indexing system of barcode sequences using k-mers and a Bayesian statistical test incorporating a substitution error rate to distinguish true from error sequences.

Results: When benchmarking with synthetic data, Shepherd provides barcode count estimates that are significantly more accurate than state-of-the-art methods, producing 10–150 times fewer spurious lineages. For empirical data, Shepherd produces results consistent with the improvements seen on synthetic data. These improvements enable higher resolution lineage tracking and more accurate estimates of biologically relevant quantities, e.g. the detection of small effect mutations.

Availability and implementation: A Python implementation of Shepherd is freely available at: https://www.github.com/Nik-Tavakolian/Shepherd.

Supplementary information: Supplementary data are available at Bioinformatics online.


1.B. Marginal Likelihood of Model M_2
In a similar way we can also find an expression for the marginal likelihood of model M_2. As before, the probability chain rule is used to obtain,

P(f_c, S_c | S_p, f_p, ρ, M_2) = P(f_c | ρ, M_2) P(S_c | S_p, M_2). (S8)
In Eq. (S8), we use the property that the read count of the sequence under consideration, f_c, is independent of the nearby putative barcode S_p and its read count f_p. Strictly speaking, the probability of observing S_c is not entirely independent of S_p, since the two sequences are known to be distinct; for barcodes of realistic length this dependence is negligible. Furthermore, the probability of observing S_c does not depend on the error rate ρ, since it is a randomized sequence under model M_2. On the other hand, the probability of observing f_c will depend on ρ. This becomes clear if we think about f_c as the number of times S_c was read without errors. Since S_c is a random DNA sequence under M_2, with 4 possible nucleotides at each of its l positions, it follows that,

P(S_c | S_p, M_2) = P(S_c | M_2) = 1/4^l. (S9)

The probability P(f_c | ρ, M_2) is more difficult to determine. It is the probability of observing the read count f_c for a true barcode S_c in the population, given the error rate ρ. Since the observed count distribution of the sequences includes error sequences, the distribution of the observed true barcode counts is unknown. What we do know is the maximum observed count f_max. Given this maximum, the possible count values range between 1 and f_max. With no additional information we want to assume as little as possible about the count distribution. This is achieved by choosing the maximum entropy distribution, which in our case is the discrete uniform distribution. It follows that for f_c ∈ [1, f_max],

P(f_c | ρ, M_2) = 1/f_max. (S10)

The marginal likelihood of model M_2 is now given by,

P(f_c, S_c | S_p, f_p, ρ, M_2) = P(f_c | ρ, M_2) P(S_c | M_2) = 1/(4^l f_max). (S11)

1.C. Sequence Classification using Bayes Factor
The marginal likelihood of each model is now computable. To compare the models and determine which one describes a given sequence better, we use the log Bayes factor ln K, the logarithm of the ratio between the marginal likelihoods of models M_1 and M_2, given by,

ln K = ln P(f_c, S_c | S_p, f_p, ρ, M_1) − ln P(f_c, S_c | S_p, f_p, ρ, M_2) = ln p(f_c; n̂, p̂_pc) + ln p̂_pc + l ln 4 + ln f_max. (S12)

To decide whether the current sequence is an error sequence or a true barcode we consider M_1 as our null model. We only want to reject this model if its marginal likelihood is significantly lower than the marginal likelihood of the alternative model M_2. By default, we reject the null model if ln K ≤ −4, i.e. if the marginal likelihood of model M_2 is approximately 55 times greater than that of model M_1. The threshold can be adjusted by the user to control the trade-off between false positives and false negatives: increasing the threshold increases the number of false positives while reducing the number of false negatives.
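As an illustration, ln K from Eq. (S12) can be evaluated numerically. The exact definitions of n̂ and p̂_pc come from the model M_1 derivation (Eq. S7), which is not reproduced in this section; the substitution-error form of p̂_pc used below, and passing n̂ in directly, are assumptions made for this sketch only:

```python
import math

def log_binom_pmf(x, n, p):
    # Log of the Binomial(n, p) pmf at x, via lgamma for numerical stability.
    return (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
            + x * math.log(p) + (n - x) * math.log1p(-p))

def log_bayes_factor(f_c, d, l, rho, f_max, n_hat):
    # Assumed form of p_pc: probability that a single read of the putative
    # barcode S_p is observed as the specific sequence S_c at Hamming distance d.
    p_pc = (rho / 3) ** d * (1 - rho) ** (l - d)
    ln_m1 = log_binom_pmf(f_c, n_hat, p_pc) + math.log(p_pc)  # model M_1
    ln_m2 = -l * math.log(4) - math.log(f_max)                # model M_2, Eq. (S11)
    return ln_m1 - ln_m2                                      # Eq. (S12)

def is_error_sequence(ln_k, threshold=-4.0):
    # M_1 (error sequence) is the null model; it is rejected when
    # ln K <= threshold, i.e. the sequence is then a true barcode.
    return ln_k > threshold
```

With these (hypothetical) parameter values, a count-1 sequence one mismatch away from a high-count barcode is classified as an error, while a count-500 sequence at the same distance is kept as a true barcode.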

MULTIPLE TIME POINT MODE
Instead of performing clustering at each time point independently, Shepherd constructs the k-mer index at the first time point and only performs clustering for the first time point, using the procedure described in section 2.2 in the main text. For subsequent time points, Shepherd treats the error correction task as a classification problem. Let t_n denote the nth time point. Once the clustering is performed for t_1, each sequence at t_2 is assigned to one of the clusters identified at t_1. Consider a sequence from t_2 that we want to assign to a cluster from t_1. To find the most suitable cluster for the sequence, the k-mer index from t_1 is used to find its closest putative barcode from t_1. The sequence is then assigned to the cluster of this putative barcode if it is within distance ε. If two or more putative barcodes at t_1 have the same Hamming distance to the sequence under consideration at t_2 and appear in its ε-neighborhood, the sequence is assigned to the one in the higher count cluster at t_1. This is because the sequence is more likely to belong to the higher count cluster given that its distance to the putative barcodes is equal. At later time points the same classification procedure is used to assign sequences from time point t_m (m > 1) to the putative barcodes from the previous time point t_{m−1}.
The computational advantage of the above multiple time point procedure is that the construction of the k-mer Index is only performed at the first time point, reducing the computational time required for error correction in subsequent time points. Moreover, by assigning sequences from later time points to clusters from previous time points, one connects barcodes from different time points and performs error correction simultaneously.
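The classification rule above (closest putative barcode within the maximum merging distance, denoted ε here, with ties broken by cluster count) can be sketched as follows. In Shepherd the k-mer index restricts the candidate barcodes; this simplified sketch scans all of them:

```python
def hamming(a, b):
    # Hamming distance between two equal-length sequences.
    return sum(x != y for x, y in zip(a, b))

def classify(seq, putative, eps):
    # putative: dict mapping barcode -> cluster count at the previous time point.
    # Ties in distance go to the higher-count cluster, since the sequence is
    # more likely to belong to it when the distances are equal.
    best_key, best_bc = None, None
    for bc, count in putative.items():
        d = hamming(seq, bc)
        if d <= eps and (best_key is None or (d, -count) < best_key):
            best_key, best_bc = (d, -count), bc
    return best_bc  # None -> the sequence stays unassigned
```

For example, a sequence equidistant from a count-10 and a count-50 barcode is assigned to the count-50 cluster.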

2.A. Identification of Emerging Barcodes
Some barcodes exist in the population but may not be detectable in the first time point due to low sequencing depth and a low barcode count. However, these barcodes may rise in frequency at later time points and could have a significant impact on the evolutionary dynamics of the population.
Shepherd is capable of identifying barcodes that emerge at later time points. This is done for each time point after the sequences have been classified to putative barcodes from the previous time point using the procedure described in the previous section.
Barcodes that emerge within the ε-neighborhoods of putative barcodes are separated from those barcodes. This is done by using the statistical test described in section 1 to determine whether a sequence is an error sequence or a new putative barcode. If a new putative barcode is identified, all sequences in the same cluster are reassigned to either the original putative barcode or the new putative barcode, using the statistical test to determine which barcode each sequence is more likely to originate from. Furthermore, unassigned sequences are assigned to one of these new putative barcodes if they are within distance ε.
Unassigned sequences not found in the ε-neighborhood of any putative barcode are clustered using the procedure described in section 2.2 in the main text. This step yields new putative barcodes that are discarded unless they are also found in the next time point. This is done to prevent the formation of false positives. When a new putative barcode is identified, its k-mers are added to the k-mer index so that it can be found at later time points.

2.B. Algorithm
The procedure for multiple time point error correction can be summarised in steps. Given a time point t_m (m > 1), we perform the following steps to achieve accurate error correction.
1. Classify sequences at time t_m to putative barcodes from time t_{m−1}.
2. Separate emerging barcodes that were assigned to putative barcodes from t_{m−1} using the statistical test.
3. Reclassify all sequences in separated clusters to either the original putative barcode or the new putative barcode using the statistical test.
4. Assign sequences not assigned in step 1 to new putative barcodes introduced in step 3, if they are within distance ε.
5. Cluster remaining unassigned sequences using the single time point procedure. New putative barcodes from this step are discarded unless an exact match also appears at time t_{m+1}.

PARAMETER SELECTION AND OPTIMIZATION
In this section we detail how the parameters of Shepherd are automatically determined from the input data. We also discuss how the clustering procedure is optimized to improve performance. The per-nucleotide substitution error rate ρ is determined from the input data using the procedure described in section 3.A. The maximum merging distance ε and the substring length k used for k-mer indexing can also be determined from the input data, using the procedures described in sections 3.B and 3.C, respectively. Furthermore, the parameters τ and f_t, introduced for performance optimization, are also estimated from the input data, as described in section 3.D.
It is important to note that the only parameters that impact the clustering results are ε and ρ. Specifically, the parameter ε dictates which sequences are considered for merging, and ρ is used in the statistical test to determine whether a sequence is an error sequence or a true barcode. The other parameters of Shepherd only affect computational time and memory usage.

3.A. Determining ρ
Given the input data, the parameter ρ is estimated by considering all possible single-nucleotide polymorphisms (SNPs) of the highest count sequences. Specifically, we consider the N_h highest count sequences in the input data. For each of these sequences we find the set of all possible SNPs.
Since the highest count sequences in the input data are almost certainly true barcodes, the counts of these sequences correspond to the number of times these barcodes were sequenced without errors. Let n_0 denote the combined counts of the N_h highest count sequences and let n_1 denote the combined counts of their SNPs. Under the binomial model for sequencing errors proposed in the previous section, the probability of reading a sequence without errors is given by,

P(X = 0) = (1 − ρ)^l. (S13)

Under the same model, the probability of a SNP during sequencing is given by,

P(X = 1) = lρ(1 − ρ)^{l−1}. (S14)

To obtain an estimate of ρ we consider the following ratio of Eq. (S13) and Eq. (S14),

P(X = 1)/P(X = 0) = lρ/(1 − ρ). (S15)

Empirical estimates of the probabilities P(X = 0) and P(X = 1) are given by P̂(X = 0) = n_0/n and P̂(X = 1) = n_1/n, respectively, where n is the unknown total number of reads generated when sequencing the N_h highest count sequences. By substituting these estimates into Eq. (S15) we obtain,

P̂(X = 1)/P̂(X = 0) = n_1/n_0 = lρ/(1 − ρ). (S16)

Note that the unknown quantity n cancels out in Eq. (S16), leaving a relationship between ρ and the known quantities l, n_0 and n_1. By rearranging the terms in Eq. (S16) we obtain the following estimate of the error rate,

ρ̂ = n_1/(n_1 + l n_0). (S17)

By default the number of sequences used for the estimation is set to N_h = min(N, 500), where N is the total sequence count.
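The estimate ρ̂ = n_1/(n_1 + l·n_0) is simple to compute directly from the observed counts. A minimal sketch, assuming counts are given as a dict from sequence to read count and that SNPs which are themselves among the top sequences are excluded (an assumption of this sketch):

```python
def estimate_rho(counts, l, n_top=500):
    # counts: dict mapping sequence -> read count; l: barcode length.
    # Take the n_top highest-count sequences (almost certainly true barcodes).
    top = sorted(counts, key=counts.get, reverse=True)[:n_top]
    top_set = set(top)
    n0 = sum(counts[s] for s in top)
    # Sum the observed counts of every single-nucleotide variant (SNP)
    # of each top sequence.
    n1 = 0
    for s in top:
        for i in range(l):
            for nt in "ACGT":
                if nt != s[i]:
                    snp = s[:i] + nt + s[i + 1:]
                    if snp in counts and snp not in top_set:
                        n1 += counts[snp]
    # Eq. (S17): rho = n1 / (n1 + l * n0)
    return n1 / (n1 + l * n0)
```

For instance, one barcode sequenced correctly 1000 times with 20 reads spread over its SNPs gives ρ̂ = 20/(20 + 4·1000) for l = 4.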

3.B. Determining ε
We will start by fixing ε appropriately. We want to choose ε so that the vast majority of error sequences fall within the ε-neighborhoods of their source barcodes. However, we do not want to choose a larger ε than necessary. Firstly, the memory and time complexity of the algorithm increase for larger values of ε. This is because a larger ε necessitates that the sequence is divided into more partitions with smaller k, since we require that the number of partitions p > ε. In general, this increases the number of k-mer combinations and consequently the number of entries in the k-mer index. As a result, the index takes up more space in memory and takes longer to construct. Another reason to avoid choosing a larger ε than necessary is that the k-mer neighborhoods become larger as ε increases. Since we search the k-mer neighborhoods for putative barcodes, this leads to an increase in search time. If the distance between a sequence under consideration S_c and a putative barcode S_p is large enough, our statistical test will classify S_c as a putative barcode regardless of the counts of the sequences. Therefore a reasonable choice for ε is the largest distance such that S_c could still be classified as an error sequence for some count combination of f_p and f_c. To find this distance we start by considering count combinations that clearly favour model M_1 for a given distance d between S_c and S_p. We fix the count of the putative barcode to f_p = f_max. It remains to find the value of f_c that maximizes the marginal likelihood of model M_1. From Eq. (S7) we see that maximizing the marginal likelihood of model M_1 with respect to f_c is equivalent to finding the mode of the binomial distribution with parameters n̂ and p̂_pc. However, since n̂ is a function of f_c, we replace it with n̂_mle to determine the mode.
We can do this since the value of f_c that maximizes the marginal likelihood of model M_1 will be much smaller than f_p = f_max, as it represents the most likely count of an error sequence originating from sequence S_p. Consequently, it is safe to assume that n̂_mle > f_p + f_c, which implies n̂ = n̂_mle. Since f_c > 0, it follows that, given f_p = f_max and a distance d between S_c and S_p, the value of f_c that maximizes the marginal likelihood of model M_1 is given by,

f_c = max(⌊(n̂_mle + 1) p̂_pc⌋, 1). (S18)

To find ε we apply our statistical test using the count combination f_c (as defined in Eq. S18) and f_p = f_max, for increasing values of d. The largest value of d for which S_c is classified as an error sequence is chosen as the value of ε.

3.C. Determining k
When choosing k we need to make sure that p > ε. However, in most cases several choices of k satisfy this constraint. On the one hand, we want to choose a small k so that kε − ε is small; this corresponds to the distance between the dashed circle and the solid circle in Figure 2 being small. Since we only consider ε-neighbors for merging, this ensures that the number of irrelevant sequences in each neighborhood with distance greater than ε is minimized. This decreases the size of each neighborhood, resulting in shorter search times. However, as mentioned previously, a smaller k also increases the number of k-mer combinations, increasing the memory use and running time of the algorithm.
To find a reasonable value for k we will only consider the true barcodes, in the absence of errors. The reason is that error sequences will be close to their source barcodes in sequence space. Consequently, if we focus on excluding true barcodes that are not associated with a given true barcode, we simultaneously exclude many of the error sequences of those distinct barcodes as well. We will also assume that ε has already been fixed. It should be noted that the optimal value for k depends on the hardware used to run the algorithm. However, our approach does not consider the hardware and only attempts to find a reasonable choice based on the theoretical distribution of Hamming distances for random barcodes.
For a given barcode we want to ensure that the number of distinct barcodes in the region between the dashed circle and the solid circle in Figure 2 is small. As discussed in section 2, the Hamming distance X from a given barcode to a random barcode follows a binomial distribution with l trials and success probability 3/4. Given this distribution we require that,

P(ε < X ≤ kε) ≤ 1/2. (S19)

This constraint ensures that for a given barcode the majority of distinct barcodes are expected to lie either within distance ε or beyond distance kε, and it ensures that we do not pick a k that is too large. Given Eq. (S19) and the constraint p > ε, we pick the largest integer satisfying both as our value for k. The constraint in Eq. (S19) is somewhat arbitrary, since there is no inherent reason to choose 1/2 as the threshold. Nevertheless, we have chosen it here because it resulted in appropriate choices for k in practice. For example, our parameter selection scheme found k = 4 to be the optimal value for synthetic datasets A and B (see Table S2). For both datasets the barcode length was l = 20 and ε = 3 was chosen. By considering the other two possible values for k, 2 and 5, we can see why this choice is optimal. If k = 2, the k-mer index takes up too much space in memory. If k = 5, the index takes up slightly less space in memory than for k = 4. However, this choice increases the k-mer neighborhood size significantly, since kε = 5 × 3 = 15 for k = 5 compared to kε = 4 × 3 = 12 for k = 4. If we consider the Hamming distance distribution for true barcodes, we see that many of the distinct barcodes will be within Hamming distance 12 to 15. Therefore, k = 5 is not a suitable choice, since it would cause a large number of unrelated barcodes to be included in the k-mer neighborhoods.
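The selection rule can be sketched directly, assuming the Eq. (S19) constraint takes the form P(ε < X ≤ kε) ≤ 1/2 for X ~ Binomial(l, 3/4), and assuming (for this sketch) that k must divide the barcode length l, so that the number of partitions is p = l/k:

```python
import math

def binom_cdf(x, n, p):
    # P(X <= x) for X ~ Binomial(n, p).
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(x + 1))

def choose_k(l, eps):
    # Pick the largest k such that (a) the number of partitions p = l // k
    # exceeds eps, and (b) the Eq. (S19) mass constraint holds for the
    # Hamming distance X ~ Binomial(l, 3/4) to a random barcode.
    best = None
    for k in range(1, l + 1):
        if l % k != 0 or l // k <= eps:
            continue
        mass = binom_cdf(min(k * eps, l), l, 0.75) - binom_cdf(eps, l, 0.75)
        if mass <= 0.5:
            best = k  # keep the largest k satisfying both constraints
    return best
```

For l = 20 and ε = 3 this reproduces the paper's choice k = 4: k = 5 gives kε = 15, and the Binomial(20, 3/4) distribution (mean 15) places well over half its mass in (3, 15].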

3.D. Performance Optimization
The statistical test described in section 1 is important for classifying a sequence in cases when it is unclear whether it is a true barcode or an error sequence. However, a simple threshold for the distance d or the count f c will suffice to classify the sequence accurately in many of the cases encountered. By introducing appropriate thresholds to identify these cases we can avoid performing the statistical test repeatedly, which leads to a reduction in computational time.
There are primarily two common classes of sequences that we want to focus on for performance optimization.
The first common case is a sequence whose count is high enough that, regardless of how close it is to a neighboring barcode, it is still very likely to be a true barcode. For these cases we want to find a high count threshold f_t such that any sequence with count f_c ≥ f_t is more likely to be a true barcode than an error sequence. To do this we consider a sequence S_c with the smallest nonzero Hamming distance, d = 1, to the true barcode S_p, and we set f_p = f_max. The idea is to maximize the marginal likelihood of model M_1. To find f_t we perform our statistical test for increasing values of f_c, starting from the value of f_c given in Eq. (S18). We are looking for the smallest value of f_c for which the marginal likelihood of model M_2 is greater than the marginal likelihood of M_1. Consequently, the first value of f_c for which the statistical test classifies the sequence S_c as a true barcode is chosen as the count threshold f_t. Once we have obtained f_t we can classify all sequences with count f_c ≥ f_t as true barcodes, without having to perform the test again for each of these cases.
There is also another case that we want to deal with separately to decrease the running time of our algorithm. Many of the error sequences will have count 1. This is because most errors that originate from the same barcode are unique under reasonable assumptions on the error rate, the barcode length and the barcode count distribution. To save time we want to find a distance threshold, τ, such that any sequence with read count f c = 1 that is within Hamming distance τ to a barcode S p is more likely to be an error sequence than a true barcode, regardless of the count of S p . As f p increases, the likelihood that the current sequence is a true barcode decreases. Because of this we will now consider the case when f p = 1, which is when S c has the highest likelihood of being a true barcode for a given distance d from S p . We will now start with distance d = 1 and perform our statistical test for increasing values of d. The largest value of d for which S c is classified as an error sequence will be chosen as the value for τ. Any sequence with count 1 within distance τ of its barcode neighbor can now be classified as an error sequence.
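Both precomputations above follow the same pattern: scan one variable while the statistical test's other inputs are pinned to their extreme values. A generic sketch, where `test(f_c, f_p, d)` is a placeholder for the statistical test (returning True when the sequence is classified as an error sequence):

```python
def find_count_threshold(test, f_from, f_max):
    # Scan f_c upward from the Eq. (S18) starting value; the first count
    # classified as a true barcode at distance d = 1 from a barcode with
    # count f_max becomes the threshold f_t.
    for f_c in range(f_from, f_max + 1):
        if not test(f_c, f_max, 1):
            return f_c
    return f_max + 1

def find_tau(test, eps):
    # Largest distance d <= eps at which a count-1 sequence next to a
    # count-1 barcode (the case most favourable to a true barcode) is
    # still classified as an error sequence.
    tau = 0
    for d in range(1, eps + 1):
        if test(1, 1, d):
            tau = d
        else:
            break
    return tau
```

Any count-1 sequence within distance τ of a putative barcode, and any sequence with count at least f_t, can then be classified without re-running the test.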
Finally, the computational time can be further reduced when computing the Hamming distance between a sequence and its nearby putative barcode. Since sequences beyond distance ε are not considered, a truncated Hamming distance can be employed. For sequences S_a and S_b of length l separated by Hamming distance d, their truncated Hamming distance is defined as,

h_ε(S_a, S_b) = min(d, ε + 1).

The truncated Hamming distance allows us to save time by stopping the computation of the Hamming distance once it has exceeded ε. We note that the optimizations described in this section only affect the computational time of the procedure and do not affect the clustering results.
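The early-exit computation described above can be sketched in a few lines; any distance greater than ε is reported uniformly as ε + 1:

```python
def truncated_hamming(a, b, eps):
    # Count mismatches, but stop as soon as the distance exceeds eps:
    # every distance greater than eps is reported as eps + 1.
    d = 0
    for x, y in zip(a, b):
        if x != y:
            d += 1
            if d > eps:
                return eps + 1
    return d
```

Since merging only ever considers distances up to ε, the truncation loses no information relevant to the clustering.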

4.A. Single Time Point Data
The single time point evaluation of Shepherd, Bartender and Starcode is based on 3 synthetic datasets.
The procedure for generating these datasets is as follows. First 500 000 barcodes are created, each one with 20 random nucleotides and 6 constant nucleotides. Then a read count is assigned to each barcode by drawing a sample from the exponential distribution with mean 100 and applying the ceiling function to obtain an integer value.
The read count of each barcode corresponds to the number of times it is sequenced. Each time a barcode is sequenced, we perform a Bernoulli trial at each of its nucleotide positions, with the chosen error rate as the probability of success. When one of these trials is successful, an error has occurred and the nucleotide at that position is replaced by one of the other 3 nucleotides with equal probability. Once the errors have been introduced, the synthetic datasets consist of a set of unique sequences and their read counts. If a barcode was destroyed in the error generating process, i.e., if an error was introduced every time it was sequenced, all sequences associated with that barcode were removed from the dataset. This was done to simplify the evaluation, since destroyed barcodes that have been clustered correctly are difficult to distinguish from false positives.

Table S1. Parameters used for Shepherd on each dataset. ε is the maximum Hamming distance considered for merging a given sequence with a putative barcode. The parameter k controls the length of the k-mers. τ is the maximum Hamming distance at which a count-1 sequence is merged with a nearby putative barcode without performing the statistical test. Finally, f_t is a count threshold: all sequences with count f_c ≥ f_t are classified as true barcodes without performing the statistical test.
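The generation procedure can be sketched at a smaller scale (the paper uses 500 000 barcodes; the constant nucleotides and the removal of destroyed barcodes are omitted here for brevity):

```python
import math
import random
from collections import Counter

def simulate_reads(n_barcodes=1000, l=20, mean_count=100, rho=0.005, seed=0):
    # Generate random barcodes, draw each barcode's read count from an
    # exponential distribution (ceiled to an integer), then introduce
    # per-nucleotide substitution errors at rate rho into every read.
    rng = random.Random(seed)
    reads = Counter()
    for _ in range(n_barcodes):
        barcode = "".join(rng.choice("ACGT") for _ in range(l))
        count = math.ceil(rng.expovariate(1 / mean_count))
        for _ in range(count):
            # Bernoulli trial at each position; on success substitute one
            # of the other three nucleotides with equal probability.
            read = "".join(
                c if rng.random() >= rho
                else rng.choice([n for n in "ACGT" if n != c])
                for c in barcode)
            reads[read] += 1
    return reads
```

The result is the same shape as the benchmark input: a multiset of unique sequences with their read counts.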

4.B. Simulation Procedure for Multiple Time Point Data
To generate synthetic multiple time point data we start with a single time point synthetic dataset with 500 000 barcodes of length 26 (20 random nucleotides and 6 constant nucleotides). The count of each barcode is obtained by applying the ceiling function to a sample from the exponential distribution with mean 100. To simulate selective advantage, 5000 of these barcodes are given an increased growth rate of between 5% and 15%. These barcodes are randomly chosen without replacement, with higher count barcodes having a higher probability of gaining a growth rate increase. Specifically, the probability of a barcode gaining a growth rate increase is given by its proportion in the population. The rationale is that once we start tracking the barcodes at the first time point, some lineages may already have acquired a selective advantage. As a result, these lineages are more likely to have a high count at the first time point. Let g_i denote the growth advantage of lineage i. Note that g_i = 0 if lineage i is not one of the 5000 lineages with a selective advantage. The growth advantage for a lineage i chosen to receive a selective advantage is given by,

g_i = min(0.05 + x, 0.15),

where x is a sample from the exponential distribution with mean 0.05.
Let n_{i,t} denote the number of individuals/cells of lineage i in generation t, and let p_{i,t} denote the probability that a cell in lineage i undergoes mitosis (cell division) once before the next time point, t + 1. Given that lineage i has n_{i,t} cells at time t, we want to simulate the number of cells in the lineage at the next time point, n_{i,t+1}. We consider two possible outcomes for each cell in lineage i: either the cell undergoes mitosis once before time t + 1, with probability p_{i,t}, or it dies, with probability 1 − p_{i,t}, independently of all other cells in the population. It follows that n_{i,t+1} has a binomial distribution with 2n_{i,t} trials and trial probability p_{i,t}. While n_{i,t} is known at time t, the cell division probability p_{i,t} must be determined in order to sample from this binomial distribution and obtain an updated cell count for lineage i.
To determine p_{i,t} while incorporating the growth advantage g_i of lineage i we impose the constraint given in Eq. (S21), where K is the number of lineages and N denotes the total number of cells in the population at the first time point. Given that g_i has been fixed for each lineage i, the only unknown quantity in Eq. (S21) is p_{i,t} at time t. Solving Eq. (S21) for p_{i,t} yields Eq. (S22). At each time point t the cell division probability for each lineage i is found using Eq. (S22), and the new cell count for the lineage is obtained by sampling from a binomial distribution with 2n_{i,t} trials and success probability p_{i,t}.
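One generation of this branching process can be sketched as follows. Since Eqs. (S21)–(S22) are not reproduced above, the normalization used here is an assumption: p_{i,t} is taken proportional to (1 + g_i), scaled so that the expected total population stays at N.

```python
import random

def step_population(counts, growth, rng):
    # counts: cells per lineage n_{i,t}; growth: advantage g_i per lineage.
    # Assumed normalization (stand-in for Eq. S22): p_{i,t} proportional to
    # (1 + g_i), scaled so the expected total population remains N.
    N = sum(counts)
    denom = sum((1 + g) * n for g, n in zip(growth, counts))
    new_counts = []
    for n, g in zip(counts, growth):
        p = (1 + g) * N / (2 * denom)
        # Each cell divides with probability p or dies otherwise, and the
        # paper models the new lineage size as Binomial(2n, p).
        new_counts.append(sum(rng.random() < p for _ in range(2 * n)))
    return new_counts
```

Lineages with g_i > 0 then grow in expectation at the expense of neutral lineages, while the total population stays approximately constant.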

5.A. Preprocessing of Experimental Illumina HiSeq Data
Before applying the error correction methods to the experimental data the dataset is filtered to remove extremely low quality sequences. These error sequences have a high enough average per nucleotide error rate that no error correction method is able to reliably correct for them. In fact, in some cases these sequences can accumulate errors at almost all nucleotide positions due to phasing effects (Pfeiffer et al., 2018).
We apply the same sequence quality filter as the one used by Levy et al. (2015). In particular, any sequence with an average Phred quality score less than 30 is excluded. Furthermore, we filter out any sequences that do not match the following regular expression:

5.B. Comparison to Starcode on Experimental Illumina HiSeq Data
We applied Starcode to the experimental Illumina HiSeq data using the default settings of the method, with the distance threshold set to 2. Starcode identified 1 131 999 barcodes, 1 012 458 of which were also identified by Shepherd (see Figure S5). All three methods, Shepherd, Bartender and Starcode, identified 963 112 barcodes in common. Figure S4 shows a comparison of the effective cluster radius (r_e) for each method. We see that Starcode has a large number of clusters for which r_e is extremely high. This is a clear sign that Starcode is merging unrelated sequences.

5.C. Runtime Performance
We benchmarked the time performance of Shepherd and Bartender on the experimental sequencing data. Shepherd clustered the data in 9 minutes and 4 seconds, and Bartender performed the clustering task in 27 minutes and 23 seconds. The time measured is the wall-clock time. The benchmark was performed on a desktop PC with an Intel Core i7-6700K processor and 32 GB of system memory. The number of threads used by Bartender was set to 4 to match the core count of the processor. We observed that the time performance of the methods varies considerably between the datasets in the study. Specifically, Bartender is faster than Shepherd on all synthetic datasets and Shepherd is faster on the experimental sequencing data.

Algorithm S1. Read Clustering Procedure
Input: sequences (including their read counts), k-mer index, ε, error rate estimate ρ
Output: clustered sequences

true_barcode_set = ∅
sorted_sequences = sort_by_count(sequences, order=descending)
for seq in sorted_sequences do
    neighborhood = getNeighbors(seq, k-mer index)
    true_barcode_neighbors = true_barcode_set ∩ neighborhood
    if true_barcode_neighbors ≠ ∅ then
        closest_true_barcode = getClosest(seq, true_barcode_neighbors)
        if h(seq, closest_true_barcode) ≤ ε then
            class = hypothesis_test(seq, closest_true_barcode, ρ)
            if class == error_sequence then
                cluster seq with closest_true_barcode
                continue to next iteration
    add seq to true_barcode_set
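The read clustering procedure of Algorithm S1 can be sketched in Python. The dict-based index over non-overlapping k-mer partitions and the caller-supplied `is_error` test are simplifications for this sketch; Shepherd's actual index and statistical test are more elaborate:

```python
from collections import defaultdict

def build_index(sequences, k):
    # k-mer index: (partition position, k-mer) -> set of sequences containing it.
    index = defaultdict(set)
    for s in sequences:
        for i in range(0, len(s), k):
            index[(i, s[i:i + k])].add(s)
    return index

def cluster_reads(counts, k, eps, is_error):
    # counts: sequence -> read count; is_error(seq, barcode): statistical test
    # returning True when seq is classified as an error copy of barcode.
    index = build_index(counts, k)
    clusters = {}  # true barcode -> set of merged sequences
    assign = {}    # sequence -> its true barcode
    for seq in sorted(counts, key=counts.get, reverse=True):
        # Candidate neighbors share at least one k-mer partition with seq.
        neighborhood = set()
        for i in range(0, len(seq), k):
            neighborhood |= index[(i, seq[i:i + k])]
        barcode_neighbors = [b for b in neighborhood if b in clusters and b != seq]
        if barcode_neighbors:
            closest = min(barcode_neighbors,
                          key=lambda b: sum(x != y for x, y in zip(seq, b)))
            d = sum(x != y for x, y in zip(seq, closest))
            if d <= eps and is_error(seq, closest):
                clusters[closest].add(seq)
                assign[seq] = closest
                continue
        clusters[seq] = {seq}  # otherwise seq becomes a new true barcode
        assign[seq] = seq
    return assign
```

Processing sequences in descending count order ensures that high-count true barcodes are established before their low-count error copies are considered.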