Abstract

Motivation

Next-Generation Sequencing has led to the availability of massive genomic datasets whose processing raises many challenges, including the handling of sequencing errors. This is especially pertinent in cancer genomics, e.g. for detecting low allele frequency variations from circulating tumor DNA. Barcode tagging of DNA molecules with unique molecular identifiers (UMI) attempts to mitigate sequencing errors; UMI tagged molecules are polymerase chain reaction (PCR) amplified, and the PCR copies of UMI tagged molecules are sequenced independently. However, the PCR and sequencing steps can generate errors in the sequenced reads that can be located in the barcode and/or the DNA sequence. Analyzing UMI tagged sequencing data requires an initial clustering step, with the aim of grouping reads sequenced from PCR duplicates of the same UMI tagged molecule into a single cluster, and the size of the current datasets requires this clustering process to be resource-efficient.

Results

We introduce Calib, a computational tool that clusters paired-end reads from UMI tagged sequencing experiments generated by substitution-error-dominant sequencing platforms such as Illumina. Calib clusters are defined as connected components of a graph whose edges are defined in terms of both barcode similarity and read sequence similarity. The graph is constructed efficiently using locality sensitive hashing and MinHashing techniques. Calib’s default clustering parameters are optimized empirically, for different UMI and read lengths, using a simulation module that is packaged with Calib. Compared to other tools, Calib has the best accuracy on simulated data, while maintaining reasonable runtime and memory footprint. On a real dataset, Calib runs with far less resources than alignment-based methods, and its clusters reduce the number of tentative false positive in downstream variation calling.

Availability and implementation

Calib is implemented in C++ and its simulation module is implemented in Python. Calib is available at https://github.com/vpc-ccg/calib.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Massively parallel Next-Generation Sequencing (NGS) technologies have paved the way for cost-effective high-throughput sequencing of genomic data. However, an important issue is tackling the level of noise in sequenced data that can have a significant impact on downstream analysis. For example, Illumina HiSeq 2000, one of the leading NGS platforms, has a reported error rate of 0.2–0.4% dominated mostly by substitution errors (Schirmer et al., 2016). This error rate, although very low, is an issue for detecting single nucleotide variations (SNV) in some challenging datasets, such as cancer genomes. Indeed, cancer is a complex disease which involves rapidly evolving cells, often with multiple distinct clones. In order to effectively understand progression of a patient-specific tumor, comprehensive sampling of the tumor DNA at several time points is required. The traditional approach for this would require tissue biopsies at different time points. However, performing tissue biopsy is an invasive, expensive and time-consuming procedure and conducting it at multiple time points is not possible in many cases. Alternatively, it is now possible to sample and sequence circulating cell-free DNA (cfDNA) from the patient’s bodily fluids (e.g. blood), circumventing the aforementioned drawbacks. cfDNA arises in the blood or urine from dying cells as a result of apoptosis or necrosis (Schwarzenbach et al., 2011). A proportion of these cells, and consequently their DNA, may derive from a tumor and is known as circulating tumor DNA (ctDNA). Because the proportion of tumor could be very low [∼0.01% (Lipson et al., 2014)], deep (ultra-deep) sequencing of ctDNA/cfDNA needs to be performed in order to be able to monitor the progression of the tumor in patients. Most of the available approaches using ctDNA/cfDNA are performed through targeting the specific regions of the genome (a panel of cancer genes) that may potentially be involved in tumor progression. Polymerase chain reaction (PCR) techniques are then used to enrich the genomic fragments before capturing the regions of interest. A second round of PCR may be applied to the captured material to amplify the amount of DNA before sequencing.

One of the main challenges involved in working with ctDNA/cfDNA is the low proportion of the ctDNA involved (it can be as low as 0.01%), which could be lower than the sequencing error rate of the commonly used sequencing technologies. Furthermore, all DNA polymerase enzymes are known to have some point mutation error rate. For example, Taq, a popular polymerase used in PCR, has a point mutation error rate of 105–106 mutations per duplicated base pair (Clarke et al., 2001). Thus, errors in the sequenced reads can be introduced during PCR or during sequencing, with a PCR error being more severe than a sequencing error, since errors introduced in a given PCR cycle are propagated to all the descendant products of that cycle. PCR errors have the potential to cause false positive SNV predictions in downstream variation discovery analysis. Available variation discovery tools (DePristo et al., 2011; Garrison and Marth, 2012; Kockan et al., 2017; Rimmer et al., 2014) may have to deal with two major issues with this type of data: (i) significant number of PCR duplicates; and (ii) low alternative allele frequency. These two issues increase the false positive rate of variation prediction significantly (Newman et al., 2016).

Therefore, handling such errors is an important problem in NGS protocols for which several library preparation techniques have been developed (Davidsson et al., 2016; Lou et al., 2013). One such technique is tagging DNA molecules with short DNA barcodes, also known as unique molecular identifiers (UMI) (Kou et al., 2016). Given a set of molecules to be sequenced, the barcoding process works as follows. Barcode tags from a pool of random DNA sequences of a fixed length are ligated to DNA molecules, acting as unique identifiers for those molecules. The barcode tagged molecules undergo PCR amplification, and the resulting PCR products are sequenced using NGS technologies. This process is illustrated in Figure 1. The barcode tags provide a way to distinguish reads originating from different DNA molecules with identical or similar DNA sequences. This motivates a preprocessing step with the goal of grouping reads from the same molecule by clustering them using their barcodes. After this clustering, one can attempt to correct errors introduced in the sequenced reads, using, e.g. an overlap layout consensus approach applied to the reads present within each cluster. If the clustering is accurate, this process can significantly increase the precision in downstream mutations and variations discovery analysis (Alcaide et al., 2017; Kockan et al., 2017; Kukita, 2015; Newman et al., 2016).

An abstract illustration of PCR performed on three UMI tagged molecules for three PCR cycles. (A) Consider three DNA molecules, and a number of three possible UMI tags. The UMI tags are ligated to both ends of the different DNA molecules in a random fashion. (B) In this example, the three molecules undergo three PCR cycles before sequencing. Due to PCR inefficiency, not all molecules are duplicated at each cycle. New copies of duplicated molecules are marked with solid arrow lines. PCR may introduce point mutations, represented by a , in any given cycle to the new copies of the molecules. Note how an error occurring in an early cycle propagates to the final PCR product in higher proportions than a later error. (C) Once the molecules are sequenced by paired-end sequencing, sequencing errors, represented also by a , may also be introduced. Both the original molecule’s sequence as well as the UMI tags are subject to PCR and sequencing errors. The details of this sequencing method are described in Alcaide et al. (2017)
Fig. 1.

An abstract illustration of PCR performed on three UMI tagged molecules for three PCR cycles. (A) Consider three DNA molecules, and a number of three possible UMI tags. The UMI tags are ligated to both ends of the different DNA molecules in a random fashion. (B) In this example, the three molecules undergo three PCR cycles before sequencing. Due to PCR inefficiency, not all molecules are duplicated at each cycle. New copies of duplicated molecules are marked with solid arrow lines. PCR may introduce point mutations, represented by a graphic, in any given cycle to the new copies of the molecules. Note how an error occurring in an early cycle propagates to the final PCR product in higher proportions than a later error. (C) Once the molecules are sequenced by paired-end sequencing, sequencing errors, represented also by a graphic, may also be introduced. Both the original molecule’s sequence as well as the UMI tags are subject to PCR and sequencing errors. The details of this sequencing method are described in Alcaide et al. (2017)

Applying a naive clustering scheme, i.e. clustering reads with identical barcodes, overlooks the fact that barcodes themselves are also subject to sequencing and PCR errors, which will result in clustering inaccuracies of several types. First, if PCR errors are introduced in the barcodes in an early cycle, then the reads from a single molecule will be grouped into two or more clusters, thus reducing the potential to correct sequencing errors. If a sequencing error or a late cycle PCR error is introduced in a barcode, the resulting read is likely to appear as a singleton cluster. The singleton clusters (reads) will be discarded from downstream analysis, thus leading to a potential loss of a significant part of the generated data. Last, if two distinct DNA molecules happen to share a barcode (more prevalent in cases when the pool of unique barcodes is small), they will be incorrectly clustered together if their sequence similarity is not considered during the clustering.

A number of clustering methods have been developed to address this UMI clustering problem. Tools such as ProDuSe (Alcaide et al., 2017) and UMI-tools (Smith et al., 2017) rely on mapping the reads back to a reference genome in order to perform read clustering and correction. Another tool that performs barcoded reads clustering is Du Novo (Stoler et al., 2016). Du Novo clusters reads based on their barcodes alone without mapping to a reference genome. Du Novo proceeds by aligning the reads’ barcodes to themselves and uses that alignment as basis for the clustering. Other tools perform barcoded read clustering based on the similarity between the whole sequences of the reads, including the barcodes; tools such as CD-HIT (Fu et al., 2012), starcode (Zorita et al., 2015) and Rainbow (Chong et al., 2012) cluster reads based on a user-defined distance parameter. Starcode has a wrapper script, starcode-umi (https://github.com/gui11aume/starcode/blob/master/starcode-umi) that is aware of the barcode structure. It uses starcode to cluster the reads on their barcodes and their sequences then combine the results of these two clusterings. Methods based on mapping of the reads to a reference genome are generally computationally intensive. On the other hand, methods using only the barcodes may end up clustering together reads of different molecules that have similar barcodes.

To address the challenges mentioned above, we present Calib, which stands for Clustering without alignment using locality sensitive hashing (LSH) and MinHashing of barcoded reads. Calib is an efficient reference-free UMI clustering tool for barcoded paired-end reads that clusters reads based on both their barcode and sequence similarity. Calib is designed to handle substitution-dominant sequencing errors such as those generated by Illumina sequencing platforms. Calib uses LSH on the barcodes to generate initial clusters that are then split using MinHashing signatures of the reads sequences. Furthermore, Calib package includes a UMI sequencing simulation module that is used for benchmarking its performance against other tools as well as to optimize its default clustering parameters, making its use accessible to the end-user. We evaluated Calib on both simulated and real datasets. Calib outperforms all other tools in accuracy of clustering on the simulated datasets, while maintaining reasonable speed and memory footprint. We also tested the effect of Calib and other clustering tools on a downstream SNV calling pipeline. Our results show that Calib reduces downstream tentative false positive SNVs compared to alignment based clustering.

2 Materials and methods

Calib takes in as input a file of UMI-barcoded paired-end reads. Here, we provide all the definitions that are used throughout this manuscript. A paired-end read is composed of two mates, each being a short DNA sequence. Each mate of a read is prefixed by a tag, where the length of the tag, l, is a constant. Note that the tags of the two mates of a read are not assumed to be identical. The barcode of a read is defined as the concatenation of the tags of its first mate and second mate, respectively. Finally, we call sequence of a mate the sequence obtained from the mate by removing its tag. The sequence of a read is then the pair of sequences of its two mates. Different read sequences are not assumed to be of equal length.

2.1 Algorithm overview

Calib clusters the reads by finding connected components of a graph G that is built from the given input reads. The graph is built based on the similarity between barcodes and sequences of the reads.

To measure the similarity between the barcodes of two reads, we rely on their Hamming distance (i.e. number of mismatches): two barcodes are similar if they are within e Hamming distance from one another, where e is a user-defined threshold. To measure the similarity between the sequences of two reads, while avoiding a costly exact edit distance calculation, we rely on the MinHashing approach. MinHashing was first introduced for web indexing purposes (Broder, 1997) and is used in a variety of bioinformatics applications such as Mash (Ondov et al., 2016) and minimap (Li, 2016). Calib performs MinHashing by splitting each mate sequence into m equally sized, non-overlapping segments. From each segment, Calib extracts the lexicographically smallest k-mer. This k-mer is defined as the minimizer of this segment. Thus, each mate will be associated with an ordered list of m minimizers, each of size k.Figure 2B shows examples of read mates and their extracted minimizers. Two mates are considered similar if at least t of their respective minimizers are identical. Note that k, m and t are user-defined values. Two reads are similar if their barcodes and respective mates are similar according to the definitions above.

Calib clustering of paired-end reads of UMI tagged DNA molecules. (A) In this example, six paired-end reads are given to Calib as input. Each read has two mates. Since each mate comes from the start of the tagged molecule (with tag length of 2 in this example), the first two nucleotides are the tag, and the rest is the sequence. (B) From the two mates of each read, the barcode is extracted. The barcode is a concatenation of both mates’ tags. From each mate, Calib extracts an ordered list of minimizers. In this example, k-mer size, k, is 2; and number of minimizers, m, is 2. The first two reads have identical barcodes and minimizers, and thus will be represented by a single node (node 1). (C) With error tolerance, e, of 1; and tag length, l, of 2 (barcode length of 4), there are four possible templates. When processing each template, all pairs of nodes that have identical barcodes, after removing the bases ignored by the template, will be checked to see if they have enough matching minimizers. In this example, the minimizer threshold, t, is 1. In the first template, only nodes 2 and 3 have identical masked barcodes but they do not have enough matching minimizers so no edge will be added between them. (D) Nodes in a connected component are considered as a cluster. Reads represented by nodes in a cluster are considered to be originating from the same UMI tagged molecule
Fig. 2.

Calib clustering of paired-end reads of UMI tagged DNA molecules. (A) In this example, six paired-end reads are given to Calib as input. Each read has two mates. Since each mate comes from the start of the tagged molecule (with tag length of 2 in this example), the first two nucleotides are the tag, and the rest is the sequence. (B) From the two mates of each read, the barcode is extracted. The barcode is a concatenation of both mates’ tags. From each mate, Calib extracts an ordered list of minimizers. In this example, k-mer size, k, is 2; and number of minimizers, m, is 2. The first two reads have identical barcodes and minimizers, and thus will be represented by a single node (node 1). (C) With error tolerance, e, of 1; and tag length, l, of 2 (barcode length of 4), there are four possible templates. When processing each template, all pairs of nodes that have identical barcodes, after removing the bases ignored by the template, will be checked to see if they have enough matching minimizers. In this example, the minimizer threshold, t, is 1. In the first template, only nodes 2 and 3 have identical masked barcodes but they do not have enough matching minimizers so no edge will be added between them. (D) Nodes in a connected component are considered as a cluster. Reads represented by nodes in a cluster are considered to be originating from the same UMI tagged molecule

Formally, Calib builds the graph G=(V,E), where a node vV corresponds to a set of reads with identical barcodes and minimizers. Calib adds an edge in E between two nodes when they have similar barcodes and minimizers. In the next subsections, we provide in depth details of building the graph G and finding the connected components. Calib’s clustering algorithm is summarized in Figure 2.

2.2 Algorithm details

2.2.1 Constructing the node set

Calib takes its input in FASTQ format. For any given read, Calib extracts and stores its barcode and minimizers. The extraction of minimizers is done in linear fashion by scanning the reads once. Minimizers are stored using 2-bit representation.

Reads with identical barcodes and minimizers are grouped together to be represented by the same node in the graph, as shown in Figure 2B. In order to group the reads, Calib uses a hash table data structure where the key is a concatenation of the barcode and the minimizers. Inserting a new element into hash table is done in constant time thus providing linear time grouping of the reads. Then each group is represented as a node in G.

2.2.2 Constructing the edge set

Two nodes can be adjacent only if their barcodes are similar. To connect all the nodes with similar barcodes, the naive way is to do pairwise comparison between all nodes and calculate the Hamming distance between their barcodes. However, this is computationally expensive as it takes quadratic time in terms of nodes, O(|V|2·l). To speed up the process, Calib uses LSH (Gionis et al., 1999). To perform LSH on the barcodes, we define all possible templates that would result from ignoring e of the 2×l nucleotide positions in a barcode and considering the remaining 2×le nucleotide positions.

There are (2×le) possible templates. For each template, we insert barcodes of all the nodes masked by the template into a hash table, where the key is a masked barcode, and the value is a set of nodes. All the nodes that have barcodes with at most e Hamming distance from one another, with respect to this template, are now grouped together. For each pair of nodes in a group, the pair will be checked if they have enough matching minimizers: at least t out of m left mates minimizers and at least t out of m right mates minimizers must be matching. If the pair has enough matching minimizers, an edge is added to connect them in E.Figure 2C shows an example of LSH templates and the construction of the edge set. Processing all the nodes for a template takes O(|V|·l) time as masking of the barcode requires O(l) and there are |V| insertions into the hash table. Since the l is a small constant, this process takes O(|V|). Once the template is processed, Calib will repeat the process for all the remaining templates. The total time will be O(|V|·(2×le)).

Note that, in the worst case, all nodes can map to the same key thus checking all pairs of nodes for matching minimizers will take O(|V|·|V|). However, this is an unlikely scenario to happen if the length of barcode l is not too small (e.g. <4).

2.2.3 Finding clusters

Finally, to obtain a list of clusters, we would find all the connected components of the graph. To do so, we perform a classical breadth-first search of the graph to identify all connected components. Finally, we report the reads of each connected component as a single cluster. Note that each node is associated with one or more reads. This association was already stored in the first step of the algorithm in a list of size |V|. Having kept this association in memory allows us to report the clusters in O(|R|) time, where R is the set of input reads. Example for this step is shown in Figure 2D.

2.3 Error correction

Calib includes a consensus calling module. The module takes as input the clusters generated by Calib’s clustering algorithm. For each cluster, Calib builds a multiple sequence alignment (MSA) of the reads that belong to that cluster. To efficiently build MSA, Calib uses the SPOA library (Vaser et al., 2017) which provides an implementation of the partial-order alignment (POA) algorithm (Lee et al., 2002). After building the MSA, column-wise majority voting is used to determine the consensus sequence of the cluster.

2.4 Data simulation

Calib includes a module that simulates UMI sequencing experiments. The purpose of this module is to simulate datasets for which the true clustering is known. This module is used for comparing Calib performance against other tools in the Results section. In addition to that, the simulation module is used to optimize Calib’s clustering algorithm parameters. The simulation pipeline aims at controlling a wide range of parameters and random aspects of UMI sequencing. First, in terms of barcodes, our simulation allows for specifying the barcode tag length l and the number of unique barcode tags to generate (bounded above by 4l). Next, we specify the number of molecules to be simulated, together with a length distribution characterized by its mean and standard deviation. To generate these molecules, we require a reference genome with a set of targeted regions (in BED format) with which each generated molecule will overlap. Next we focus on the PCR and sequencing steps: we can set the number of PCR cycles, the PCR efficiency rate and the PCR error rate per duplicated base and the read length and sequencing platform. Calib uses ART Illumina (Huang et al., 2012) to simulate paired-end read sequencing.

The simulation pipeline starts by creating a list of barcode tags of a given length. No two tags in this list are identical. Then, each molecule is generated by picking its location in the targeted regions (chosen uniformly at random) and its length (randomly chosen following a normal distribution characterized by the mean length and standard deviation). The next step attaches barcode tags to molecules. This is done by sampling with replacement a tag to be prepended to each molecule, and another to be appended. The now barcoded molecules are passed to PCR duplication simulator. In each cycle, a random subset of the molecules is selected for duplication. The size of this subset is equal to the number of molecules present multiplied by the efficiency rate. Then, a random subset of the nucleotides of those molecules is chosen to be mutated. The number of mutated nucleotides is equal to the sum of lengths of duplicated molecules multiplied by the PCR error rate. The duplicated molecules are added to the rest of the molecules for subsequent PCR cycles. Finally, every molecule in the PCR product is passed to ART Illumina to produce a paired-end read using the specified platform and read length. Note that every part of the pipeline is fed a random seed to ensure benchmarking reproducibility and to enable generation of different datasets with the same parameters. The simulation pipeline is illustrated in Figure 3.

Illustration of the simulation pipeline. Vertical arrows indicate parameters of the pipeline that can be set-up by the user
Fig. 3.

Illustration of the simulation pipeline. Vertical arrows indicate parameters of the pipeline that can be set-up by the user

To assess accuracy of different clustering tools on simulated data, we are using the standard adjusted Rand index (ARI) statistic (Hubert and Arabie, 1985). ARI compares the concordance between two given clusterings of a set of elements. Let X={Xi|1 ≤ i ≤ c} be a predicted clustering of a set of n data points, where Xi is the set of data points in cluster i of X, and Y={Yj|1 ≤ j ≤ d} be the true clustering of the same data. Note that |Xi|=|Yj|. The intersection of two clusters is ni,j=|XiYj|. Then ARI is defined as:
where E(I)=[i(|Xi|2)j(|Yj|2)]/(n2). ARI has a maximum score of 1 when the predicted clustering is perfectly aligned with the true clustering, while a random clustering is expected to have an ARI close to 0. We use the scikit-learn (Pedregosa et al., 2011) Python package to calculate the ARI scores in our tests.

3 Results

We compared the performance of Calib against both alignment based and alignment-free UMI clustering tools on simulated datasets. The alignment-free tools we tested are CD-HIT (Fu et al., 2012), Rainbow (Chong et al., 2012), starcode-umi (Zorita et al., 2015) and Du Novo (Stoler et al., 2016). We also compared Calib’s clustering against UMI-tools (Smith et al., 2017), an alignment-based tool. We used BWA-MEM (Li, 2013) to align the reads for UMI-tools. All our experiments with Calib and the other alignment-free tools can be easily reproduced using our benchmarking scripts available from Calib’s GitHub repository.

3.1 Testing infrastructure

Simulated datasets were generated using Calib data simulation module described in Section 2. All the tools were run on a machine with 754 GB of available RAM, on an Intel® Xeon® CPU Gold 6130 with clock rate of 2.10 GHz, and CentOS 7.4 as an operating system. All tools were compiled with GCC 5.5.0, and were run on a single thread with the exception of UMI-tools pipeline which was run on 32 threads. Any tool that ran for more than 24 h was terminated.

3.2 Parameter selection

Calib has four different parameters that need to be set: barcode error tolerance, e; k-mer size, k; number of minimizers per mate, m; and threshold of matching minimizers per mate, t. To select the parameters for Calib, we conducted a wide range of experiments that emulate different experimental setups when it comes to barcode length and read length. Using the results of these experiments, we pre-configured Calib with different parameter sets that Calib choose from based on the input data barcode length and inferred read length. Details of those experiments are available in the Supplementary Material and on the GitHub repository. Note that the datasets that were used to select Calib’s parameters were not used as testing datasets for the results presented in this section.

To select the similarity threshold of CD-HIT, we ran it with thresholds 0.80, 0.85, 0.90, 0.95, 0.96, 0.97 and 0.98. With Rainbow we had to choose the mismatch value and whether to run Rainbow cluster splitting module. We tested Rainbow with mismatch values 1 to 9 and with or without using its cluster splitting module. For starcode-umi we tried seq-trim of 50, 75 and 100; umi-d of 0, 1 and 2; seq-d of 3, 5 and 8; umi-cluster-ratio of 1, 3 and 5 and seq-cluster-ratio of 1, 3 and 5. For Du Novo we tried barcode edit distance 1 to 4. Du Novo’s runtime, memory footprint and accuracy were more or less the same regardless of the parameter choice. Finally, for UMI-tools, we selected barcode edit distance of 2 since its user manual recommends increasing the default edit distance of 1 if the barcode length >14. For each of the tools, we present the results for both the best performing parameters and the default parameters. The complete results are available in the Supplementary Material.

3.3 Simulated datasets

We simulated three datasets using Calib’s simulation module. In simulating the datasets we tried to emulate ultra-high-throughput sequencing of targeted regions of a human genome using cfDNA, an emerging protocol in cancer genomics (Newman et al., 2016).

We used the human genome build GRCh38/hg38 as reference genome for our simulation. The panel we used for the targeted sequencing was generated to include the exons of 35 randomly selected genes from the Cancer Gene Consensus gene list (Futreal et al., 2004). The simulated datasets had barcode length of 16 bases (so each tag was of length l = 8). The molecule mean size was 300 nucleotides with standard deviation of 25. We set the number of PCR cycles to 7, PCR duplication efficiency rate to 0.6 and PCR duplication error rate to 5×105 substitutions per duplicated nucleotide. We chose the HiSeq 2500 sequencing machine model for ART, with paired-end read length of 150 bp per mate.

The first dataset is the smallest of the three datasets. It has 100 000 molecules given as input to the PCR duplication step, and 100 unique tags (10 K possible unique barcodes) were used to tag both ends of each molecule; thus there is a high rate of re-use of barcodes to tag different molecules. The final paired-end read count for the first dataset is 2 684 356. The second dataset has 1 000 000 molecules and 5000 unique tags (25 M possible unique barcodes), with read count of 26 843 546. The third dataset has the same number of molecules of 1 000 000, but it has 25 000 unique tags (625 M possible unique barcodes) instead. By simulating these datasets we aim to demonstrate how different tools performances scale under different molecule input sizes and barcode re-use rates. The performance results of all tools are summarized in Table 1.

Table 1.

Results of Calib, CD-HIT, Rainbow, starcode-umi, Du Novo and UMI-tools on the three simulated datasets

Simulated datasetToolTime (min)Memory (GB)ARIParametersNote
100 K molecules 100 tags (2.7 M reads)Calib4.921.110.9995−e2 −k8 −m7 −t2Only default parameters were tested
CD-HITN/AN/AN/A−c0.90Default parameter; exceeded 24 h
70.302.710.6423−c0.95Best accuracy parameter
Rainbow1.020.320.9935−m4|divDefault parameters
1.030.320.9973−m2|divBest accuracy parameters
starcode-umib1.832.420.90720: 3: 3: 3: 50Default parameters
22.392.690.99782: 3: 5: 1: 50Best accuracy parameters
Du Novo16.130.930.1902−d1All parameters scored equally
UMI-toolsa43.94 (41.58)8.260.9923−e2Parameter recommended by manual

1 M molecules 5 K tags (27.8 M reads)Calib37.8611.000.9996−e2 −k8 −m7 −t2Only default parameters were tested
CD-HITN/AN/AN/AN/AAll parameters exceeded 24 h
Rainbow13.532.680.9747−m4|divDefault parameters
13.552.680.9965−m2|divBest accuracy parameters
starcode-umib24.0321.260.90800: 3: 3: 3: 50Default parameters
547.8225.920.99762: 1: 5: 1: 50Best accuracy parameters
Du Novo448.022.580.8954−d1All parameters scored equally
UMI-toolsa443.78 (419.45)18.680.9921−e2Parameter recommended by manual

1 M molecules 25 K tags (27.8 M reads)Calib35.5211.000.9996−e2 −k8 −m7 −t2Only default parameters were tested
CD-HITN/AN/AN/AN/AAll parameters exceeded 24 h
Rainbow13.532.680.9749−m4|divDefault parameters
13.472.680.9966−m2|divBest accuracy parameters
starcode-umib21.0521.260.90800: 3: 3: 3: 50Default parameters
606.4025.910.99932: 1: 5: 1: 50Best accuracy parameters
Du Novo453.442.500.9084−d1All parameters scored equally
UMI-toolsa492.04 (420.02)18.680.9921−e2Parameter recommended by manual
Simulated datasetToolTime (min)Memory (GB)ARIParametersNote
100 K molecules 100 tags (2.7 M reads)Calib4.921.110.9995−e2 −k8 −m7 −t2Only default parameters were tested
CD-HITN/AN/AN/A−c0.90Default parameter; exceeded 24 h
70.302.710.6423−c0.95Best accuracy parameter
Rainbow1.020.320.9935−m4|divDefault parameters
1.030.320.9973−m2|divBest accuracy parameters
starcode-umib1.832.420.90720: 3: 3: 3: 50Default parameters
22.392.690.99782: 3: 5: 1: 50Best accuracy parameters
Du Novo16.130.930.1902−d1All parameters scored equally
UMI-toolsa43.94 (41.58)8.260.9923−e2Parameter recommended by manual

1 M molecules 5 K tags (27.8 M reads)Calib37.8611.000.9996−e2 −k8 −m7 −t2Only default parameters were tested
CD-HITN/AN/AN/AN/AAll parameters exceeded 24 h
Rainbow13.532.680.9747−m4|divDefault parameters
13.552.680.9965−m2|divBest accuracy parameters
starcode-umib24.0321.260.90800: 3: 3: 3: 50Default parameters
547.8225.920.99762: 1: 5: 1: 50Best accuracy parameters
Du Novo448.022.580.8954−d1All parameters scored equally
UMI-toolsa443.78 (419.45)18.680.9921−e2Parameter recommended by manual

1 M molecules 25 K tags (27.8 M reads)Calib35.5211.000.9996−e2 −k8 −m7 −t2Only default parameters were tested
CD-HITN/AN/AN/AN/AAll parameters exceeded 24 h
Rainbow13.532.680.9749−m4|divDefault parameters
13.472.680.9966−m2|divBest accuracy parameters
starcode-umib21.0521.260.90800: 3: 3: 3: 50Default parameters
606.4025.910.99932: 1: 5: 1: 50Best accuracy parameters
Du Novo453.442.500.9084−d1All parameters scored equally
UMI-toolsa492.04 (420.02)18.680.9921−e2Parameter recommended by manual

Note: Best time, memory use and ARI performances on each dataset are marked in bold. Time and memory use are reported using GNU Time 1.9 (user time and maximum resident set size).

aValue in parentheses is the component of user time spent on read alignment. Note that read alignment was performed on 32 threads.

bStarcode-umi has five parameters. In this table, their values are ordered as follows: <umi-d>: <umi-cluster-ratio>: <seq-d>: <seq-cluster-ratio>:<seq-trim>.

Table 1.

Results of Calib, CD-HIT, Rainbow, starcode-umi, Du Novo and UMI-tools on the three simulated datasets

Simulated datasetToolTime (min)Memory (GB)ARIParametersNote
100 K molecules 100 tags (2.7 M reads)Calib4.921.110.9995−e2 −k8 −m7 −t2Only default parameters were tested
CD-HITN/AN/AN/A−c0.90Default parameter; exceeded 24 h
70.302.710.6423−c0.95Best accuracy parameter
Rainbow1.020.320.9935−m4|divDefault parameters
1.030.320.9973−m2|divBest accuracy parameters
starcode-umib1.832.420.90720: 3: 3: 3: 50Default parameters
22.392.690.99782: 3: 5: 1: 50Best accuracy parameters
Du Novo16.130.930.1902−d1All parameters scored equally
UMI-toolsa43.94 (41.58)8.260.9923−e2Parameter recommended by manual

1 M molecules 5 K tags (27.8 M reads)Calib37.8611.000.9996−e2 −k8 −m7 −t2Only default parameters were tested
CD-HITN/AN/AN/AN/AAll parameters exceeded 24 h
Rainbow13.532.680.9747−m4|divDefault parameters
13.552.680.9965−m2|divBest accuracy parameters
starcode-umib24.0321.260.90800: 3: 3: 3: 50Default parameters
547.8225.920.99762: 1: 5: 1: 50Best accuracy parameters
Du Novo448.022.580.8954−d1All parameters scored equally
UMI-toolsa443.78 (419.45)18.680.9921−e2Parameter recommended by manual

1 M molecules 25 K tags (27.8 M reads)Calib35.5211.000.9996−e2 −k8 −m7 −t2Only default parameters were tested
CD-HITN/AN/AN/AN/AAll parameters exceeded 24 h
Rainbow13.532.680.9749−m4|divDefault parameters
13.472.680.9966−m2|divBest accuracy parameters
starcode-umib21.0521.260.90800: 3: 3: 3: 50Default parameters
606.4025.910.99932: 1: 5: 1: 50Best accuracy parameters
Du Novo453.442.500.9084−d1All parameters scored equally
UMI-toolsa492.04 (420.02)18.680.9921−e2Parameter recommended by manual
Simulated datasetToolTime (min)Memory (GB)ARIParametersNote
100 K molecules 100 tags (2.7 M reads)Calib4.921.110.9995−e2 −k8 −m7 −t2Only default parameters were tested
CD-HITN/AN/AN/A−c0.90Default parameter; exceeded 24 h
70.302.710.6423−c0.95Best accuracy parameter
Rainbow1.020.320.9935−m4|divDefault parameters
1.030.320.9973−m2|divBest accuracy parameters
starcode-umib1.832.420.90720: 3: 3: 3: 50Default parameters
22.392.690.99782: 3: 5: 1: 50Best accuracy parameters
Du Novo16.130.930.1902−d1All parameters scored equally
UMI-toolsa43.94 (41.58)8.260.9923−e2Parameter recommended by manual

1 M molecules 5 K tags (27.8 M reads)Calib37.8611.000.9996−e2 −k8 −m7 −t2Only default parameters were tested
CD-HITN/AN/AN/AN/AAll parameters exceeded 24 h
Rainbow13.532.680.9747−m4|divDefault parameters
13.552.680.9965−m2|divBest accuracy parameters
starcode-umib24.0321.260.90800: 3: 3: 3: 50Default parameters
547.8225.920.99762: 1: 5: 1: 50Best accuracy parameters
Du Novo448.022.580.8954−d1All parameters scored equally
UMI-toolsa443.78 (419.45)18.680.9921−e2Parameter recommended by manual

1 M molecules 25 K tags (27.8 M reads)Calib35.5211.000.9996−e2 −k8 −m7 −t2Only default parameters were tested
CD-HITN/AN/AN/AN/AAll parameters exceeded 24 h
Rainbow13.532.680.9749−m4|divDefault parameters
13.472.680.9966−m2|divBest accuracy parameters
starcode-umib21.0521.260.90800: 3: 3: 3: 50Default parameters
606.4025.910.99932: 1: 5: 1: 50Best accuracy parameters
Du Novo453.442.500.9084−d1All parameters scored equally
UMI-toolsa492.04 (420.02)18.680.9921−e2Parameter recommended by manual

Note: Best time, memory use and ARI performances on each dataset are marked in bold. Time and memory use are reported using GNU Time 1.9 (user time and maximum resident set size).

aValue in parentheses is the component of user time spent on read alignment. Note that read alignment was performed on 32 threads.

bStarcode-umi has five parameters. In this table, their values are ordered as follows: <umi-d>: <umi-cluster-ratio>: <seq-d>: <seq-cluster-ratio>:<seq-trim>.

We observe that on all three datasets, Calib has the best performance in terms of accuracy having an almost perfect ARI statistic. This is also valid when we evaluated Calib on a simulated dataset with shorter molecule length (mean of 167 nucleotides) which is similar to reported cfDNA fragment length (Wan et al., 2017). Details of this evaluation are available online on Calib's repository.

Other tools performance varied depending on the parameter choice which emphasizes the importance of the parameter selection done in Calib. Du Novo seems to have difficulty when the barcode re-use rate is high (the first dataset). This is due to the fact that Du Novo does not consider the sequence content of the reads and only clusters on the barcodes. CD-HIT performance was not very good on the simulated datasets regardless of the parameter choice. This can be explained by the fact that multiple molecules will originate from the same genomic region, causing CD-HIT to cluster them together.

3.4 Application to a real dataset

In order to evaluate the performance of Calib on a real dataset, we used Horizon’s genomic DNA standard 701 (HD701) (https://www.horizondiscovery.com/quantitative-multiplex-reference-standard-hd701). This standard is appropriate for NGS library preparation including whole-genome, whole-exome, custom capture and targeted amplicon panels. HD701 contains verified and reported variations with allele frequencies ranging from 1 to 24%. To define the sequencing targets, we used a 36-gene panel. The panel covered in total 238 644 unique genomic positions. The panel contained genes with 254 verified and present SNVs from this genomic standard: 10 of these SNVs have been verified by Horizon Discovery using ddPCR, and the remaining 244 SNVs have been reported by Horizon Discovery to be present in the parental cell lines used by Horizon Discovery to generate the HD701 reference. The list of 254 SNVs is available on Calib's GitHub repository. We used eight base pairs UMI tags to tag the DNA fragments from both sides (16 bases barcodes) in the same fashion as the technique described by Alcaide et al. (2017). After library preparation, we generated 22 212 522 paired-end reads of length 2 × 150 bp using an Illumina NextSeq 500 machine. The number of paired-end reads that passed the quality control is 22 082 242. The dataset is deposited to SRA under accession number SRR7903524.

Since we do not know the ground truth of reads clustering for this real dataset, we cannot assess the clustering tools directly using ARI. However, since the real dataset has a list of verified and reported SNVs on the targeted region, we can assess the effects different clustering tools have on downstream SNV calling. For simplicity, we call any SNV call that is not on the list of verified and reported SNVs a tentative false positive. We ran UMI-tools, Calib, and starcode-umi on the real dataset. The runtime and memory use of these tools are reported in Table 2.

Table 2.

Time and memory footprint of UMI-tools, Calib, and starcode-umi on the real dataset

ToolTime (min)Memory (GB)Parameters
Calib9.651.90−e2 −k8 −m7 −t2
starcode-umib450.449.992: 1: 5: 1: 50
UMI-toolsa149.76 (137.44)14.73−e2
ToolTime (min)Memory (GB)Parameters
Calib9.651.90−e2 −k8 −m7 −t2
starcode-umib450.449.992: 1: 5: 1: 50
UMI-toolsa149.76 (137.44)14.73−e2

Note: Time and memory use are reported using GNU Time 1.9 (user time and maximum resident set size).

aValue in parentheses is the component of user time spent on read alignment. Note that read alignment was performed on 32 threads.

bStarcode-umi parameters order is same as described in Table 1.

Table 2.

Time and memory footprint of UMI-tools, Calib, and starcode-umi on the real dataset

ToolTime (min)Memory (GB)Parameters
Calib9.651.90−e2 −k8 −m7 −t2
starcode-umib450.449.992: 1: 5: 1: 50
UMI-toolsa149.76 (137.44)14.73−e2
ToolTime (min)Memory (GB)Parameters
Calib9.651.90−e2 −k8 −m7 −t2
starcode-umib450.449.992: 1: 5: 1: 50
UMI-toolsa149.76 (137.44)14.73−e2

Note: Time and memory use are reported using GNU Time 1.9 (user time and maximum resident set size).

aValue in parentheses is the component of user time spent on read alignment. Note that read alignment was performed on 32 threads.

bStarcode-umi parameters order is same as described in Table 1.

After the clustering step of each tool was completed, we ran Calib’s error correction module on each of the clusters. Any cluster that had <2 reads was discarded. POA was performed on the mates of reads of each cluster, and column-wise majority voting was used to generate the consensus. Any column in the MSA that did not have a 50%+1 majority was replaced with an N. After that, each cluster’s consensus was treated as a read (one consensus per mate), and aligned to the human reference genome using BWA-MEM (Li, 2013).

We used SiNVICT (Kockan et al., 2017), which is specifically designed to work with ctDNA/cfDNA datasets, to call variations on the aligned consensus reads of each method. In addition to that, we ran SiNVICT on the original reads without performing any clustering or error correction. SiNVICT uses a Poisson distribution to call variations based on the given expected platform sequencing error rate. We ran SiNVICT using three different platform error rate parameters: 102, 103 and 104. Figure 4 depicts the total number of SNV calls made by SiNVICT for each considered clustering method and the number of verified or reported calls for HD701 on the 36-gene capture panel.

SiNVICT calls using different clustering methods on the real dataset. The X axis is in logarithmic scale representing the number of calls. The Y axis shows the choice for SiNVICT expected sequencing error rate. Note that this dataset has 254 verified or reported SNVs
Fig. 4.

SiNVICT calls using different clustering methods on the real dataset. The X axis is in logarithmic scale representing the number of calls. The Y axis shows the choice for SiNVICT expected sequencing error rate. Note that this dataset has 254 verified or reported SNVs

Using the default expected sequencing error rate of 102, we observe that the SNV calling based on using no clustering captures more verified or reported SNVs than the SNV calling after performing clustering. This comes at the cost of calling over three times (1695 versus 332–414) more tentative false positives. This disparity is far more pronounced when we decreased the expected sequencing error rate to 104. Performing no clustering or error correction causes about 35% of 239 K genomic positions of the targeted panel to be called as variants. On the other hand, the clustering tools allow for total number of variations called to be reasonably low (1257–2354) while being able to detect about 90% of the expected variations to be found in HD701. We notice that alignment-free tools (Calib and starcode-umi) reduce the rate of tentative false positive SNV calls by half compared to the alignment-based method, UMI-tools. This might be due to mapping errors in the step before clustering in UMI-tools that can be caused by uncorrected sequencing errors in the reads.

4 Conclusion

Our work demonstrates that it is possible to achieve, using reasonable computational resources, a clustering of large datasets of barcoded DNA reads, while accounting both for PCR errors and sequencing errors in the barcode and in the sequence. Our method Calib is extremely simple, as the clustering step itself consists only in identifying connected components of a graph; the key novelty of our approach is the carefully engineered combination of advanced sequence indexing and comparison techniques (LSH, minimizers, MinHashing) to compute the graph. Our experiments on simulated data show that in various contexts, including a wide range of barcodes re-use rates and dataset sizes, Calib outperforms the other clustering tools in terms of clustering accuracy, with an ARI close to 1, while maintaining a reasonable memory footprint and runtime. On the real dataset Calib results in the lowest number of tentative false positive calls compared to the other tools even at low expected sequencing error rates. Calib’s alignment-free clustering means that it is able to utilize reads that do not map, or reads that map to multiple locations which can be difficult to cluster for alignment-based tools.

Among future directions, the most natural is extending Calib for single-end reads. We hope that Calib becomes an even more powerful, easy-to-use, faster and accurate UMI read clustering package, with important applications in the development of accurate and efficient cfDNA analysis protocols for liquid biopsies in cancer personalized medicine.

Acknowledgements

We thank Dr. L. Chindelevitch for the helpful discussion and B. Cardoen for his invaluable comments on our implementation.

Funding

This work was supported in part by NSERC Discovery Grant [to F.H.]; TFRI NF PPG Project #1062 [to F.H. and C.C.C.]; NSERC Discovery Grant [to C.C.]; and NSERC CREATE Training Program in High-Dimensional Bioinformatics Scholarship [to B.O.].

Conflict of Interest: none declared.

References

Alcaide
 
M.
, et al. (
2017
)
Targeted error-suppressed quantification of circulating tumor DNA using semi-degenerate barcoded adapters and biotinylated baits
.
Sci. Rep.
,
7
,
10574
.

Broder
 
A.Z.
(
1997
).
On the resemblance and containment of documents
. In:
Proceedings of the Compression and Complexity of Sequences 1997
. pp.
21
29
.
IEEE
,
Washington, DC, USA
.

Chong
 
Z.
, et al. (
2012
)
Rainbow: an integrated tool for efficient clustering and assembling RAD-seq reads
.
Bioinformatics
,
28
,
2732
2737
.

Clarke
 
L.
, et al. (
2001
)
PCR amplification introduces errors into mononucleotide and dinucleotide repeat sequences
.
Mol. Pathol.
,
54
,
351
.

Davidsson
 
M.
, et al. (
2016
)
A novel process of viral vector barcoding and library preparation enables high-diversity library generation and recombination-free paired-end sequencing
.
Sci. Rep.
,
6
,

DePristo
 
M.A.
, et al. (
2011
)
A framework for variation discovery and genotyping using next-generation DNA sequencing data
.
Nat. Genet.
,
43
,
491
.

Fu
 
L.
, et al. (
2012
)
CD-HIT: accelerated for clustering the next-generation sequencing data
.
Bioinformatics
,
28
,
3150
3152
.

Futreal
 
P.A.
, et al. (
2004
)
A census of human cancer genes
.
Nat. Rev. Cancer
,
4
,
177
.

Garrison
 
E.
,
Marth
G.
(
2012
)
Haplotype-based variant detection from short-read sequencing
.
arXiv
,
1207
,
3907
.

Gionis
 
A.
, et al. (
1999
).
Similarity search in high dimensions via hashing
. In:
VLDB ‘99 Proceedings of the 25th International Conference on Very Large Data Bases
. pp.
518
529
.
Morgan Kaufmann Publishers Inc
,
San Francisco, CA, USA
.

Huang
 
W.
, et al. (
2012
)
ART: a next-generation sequencing read simulator
.
Bioinformatics
,
28
,
593
594
.

Hubert
 
L.
,
Arabie
P.
(
1985
)
Comparing partitions
.
J. Classif.
,
2
,
193
218
.

Kockan
 
C.
, et al. (
2017
)
SiNVICT: ultra-sensitive detection of single nucleotide variants and indels in circulating tumour DNA
.
Bioinformatics
,
33
,
26
34
.

Kou
 
R.
, et al. (
2016
)
Benefits and challenges with applying unique molecular identifiers in next generation sequencing to detect low frequency mutations
.
PLoS One
,
11
,
e0146638
.

Kukita
 
Y.
, et al. (
2015
)
High-fidelity target sequencing of individual molecules identified using barcode sequences: de novo detection and absolute quantitation of mutations in plasma cell-free DNA from cancer patients
.
DNA Res.
,
22
,
269
277
.

Lee
 
C.
, et al. (
2002
)
Multiple sequence alignment using partial order graphs
.
Bioinformatics
,
18
,
452
464
.

Li
 
H.
(
2013
)
Aligning sequence reads, clone sequences and assembly contigs with bwa-mem
.
arXiv
,
1303
,
3997
.

Li
 
H.
(
2016
)
Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences
.
Bioinformatics
,
32
,
2103
2110
.

Lipson
 
E.J.
, et al. (
2014
)
Circulating tumor DNA analysis as a real-time method for monitoring tumor burden in melanoma patients undergoing treatment with immune checkpoint blockade
.
J. Immunother. Cancer
,
2
,
42
.

Lou
 
D.I.
, et al. (
2013
)
High-throughput DNA sequencing errors are reduced by orders of magnitude using circle sequencing
.
Proc. Natl. Acad. Sci. USA
,
110
,
19872
19877
.

Newman
 
A.M.
, et al. (
2016
)
Integrated digital error suppression for improved detection of circulating tumor DNA
.
Nat. Biotechnol.
,
34
,
547
555
.

Ondov
 
B.D.
, et al. (
2016
)
Mash: fast genome and metagenome distance estimation using MinHash
.
Genome Biol.
,
17
,
132
.

Pedregosa
 
F.
, et al. (
2011
)
Scikit-learn: machine learning in Python
.
J. Mach. Learn. Res.
,
12
,
2825
2830
.

Rimmer
 
A.
, et al. (
2014
)
Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications
.
Nat. Genet.
,
46
,
912
.

Schirmer
 
M.
, et al. (
2016
)
Illumina error profiles: resolving fine-scale variation in metagenomic sequencing data
.
BMC Bioinformatics
,
17
,
125
.

Schwarzenbach
 
H.
, et al. (
2011
)
Cell-free nucleic acids as biomarkers in cancer patients
.
Nat. Rev. Cancer
,
11
,
426
.

Smith
 
T.
, et al. (
2017
)
UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy
.
Genome Res.
,
27
,
491
499
.

Stoler
 
N.
, et al. (
2016
)
Streamlined analysis of duplex sequencing data with Du Novo
.
Genome Biol.
,
17
,
180
.

Vaser
 
R.
, et al. (
2017
)
Fast and accurate de novo genome assembly from long uncorrected reads
.
Genome Res.
,
27
,
737
746
.

Wan
 
J.C.
, et al. (
2017
)
Liquid biopsies come of age: towards implementation of circulating tumour DNA
.
Nat. Rev. Cancer
,
17
,
223
.

Zorita
 
E.
, et al. (
2015
)
Starcode: sequence clustering based on all-pairs search
.
Bioinformatics
,
31
,
1913
1919
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: Bonnie Berger
Bonnie Berger
Associate Editor
Search for other works by this author on:

Supplementary data