TRUmiCount: correctly counting absolute numbers of molecules using unique molecular identifiers

Abstract Motivation Counting molecules using next-generation sequencing (NGS) suffers from PCR amplification bias, which reduces the accuracy of many quantitative NGS-based experimental methods such as RNA-Seq. This is true even if molecules are made distinguishable using unique molecular identifiers (UMIs) before PCR amplification, and distinct UMIs are counted instead of reads: Molecules that are lost entirely during the sequencing process will still cause underestimation of the molecule count, and amplification artifacts like PCR chimeras create phantom UMIs and thus cause over-estimation. Results We introduce the TRUmiCount algorithm to correct for both types of errors. The TRUmiCount algorithm is based on a mechanistic model of PCR amplification and sequencing, whose two parameters have an immediate physical interpretation as PCR efficiency and sequencing depth and can be estimated from experimental data without requiring calibration experiments or spike-ins. We show that our model captures the main stochastic properties of amplification and sequencing, and that it allows us to filter out phantom UMIs and to estimate the number of molecules lost during the sequencing process. Finally, we demonstrate that the phantom-filtered and loss-corrected molecule counts computed by TRUmiCount measure the true number of molecules with considerably higher accuracy than the raw number of distinct UMIs, even if most UMIs are sequenced only once as is typical for single-cell RNA-Seq. Availability and implementation TRUmiCount is available at http://www.cibiv.at/software/trumicount and through Bioconda (http://bioconda.github.io). Supplementary information Supplementary information is available at Bioinformatics online.


Supplementary Information
Florian G. Pflug and Arndt von Haeseler S1 Supplementary Methods S1.1 Computing the distribution of F To find the actual distribution (in terms its density f F (· ; E) with the reaction efficiency E as a parameter) of the normalized family size F for a particular efficiency E we resorted to simulation. We simulated the PCR process for efficiencies from 0.01 to 0.99 (steps of 0.01 up to 0.90, steps of 0.005 up to 0.94, steps of 0.002 up to 0.99). Each time, we simulated 10 9 independent trajectories, and ran each simulation until the expected family size was 10 7 molecules (i.e. for n = 7/ log 10 (1 + E) cycles). At that point the stochasticity further cycles would introduce is negligible and we may thus assumeMn ≈M n+1 ≈ F .
For each efficiency E, we normalized the simulated raw family sizes using Equation (4) to obtain 10 9 independent samples of F . Using kernel density estimation, we then estimated values of the density function f F (λ ; E) of the normalized family size distribution on a grid of 318 values of λ between 0 and 50. The grid points are spaced non-uniformly, being finest (distance 0.0025) around 0 and 1 and getting coarser elsewhere.
This procedure resulted in a 123 × 318 matrix of densities, i.e. f F (λ ; E) evaluated for each combination of one of the 123 simulated efficiencies E, and one of the 318 normalized family sizes λ. Using this (pre-computed and stored) matrix, the density function f F (λ ; E) can be evaluated quickly for arbitrary values of E and λ by two-dimensional polynomial interpolation (Akima, 1996).

S1.2 Numerical method of moments estimates
To obtain method of moments estimates for model parameters D (reads per molecules) and E (reaction efficiency) in the general case T ≥ 0 from the observed meanm and observed variancev of the number of reads per UMI, we must find D and E such that We solve this system of equations with an iterative method that starts with initialization step I and then repeats update step U until the estimateŝ D,Ê and P(C ≥ T ) converge (absolute or relative change less than 10 −4 ).

I:
We start by pretending that T = 0, and set D :=m, U: Using the current model parameter estimatesD andÊ, we compute We then exploit that the uncensored mean (and similarly the variance) can be partitioned into a sum of the (scaled) censored mean and the mean (or variance) terms "missing" from the censored mean, i.e. we compute updated estimatesD of the uncensored mean andv u of the uncensored variance, Given the updated estimates of the uncensored moments, the updated reaction efficiency estimateÊ is computed as in the case T = 0 aŝ

S1.3 Multiple initial copies
If each distinct molecules the sample initially contains R > 1 identical copies (e.g. R = 2 if the initial molecules are double-stranded), each of these copies can be imagined to be amplified by a separate and independent PCR processes. But since the molecules are indistinguishable, these processes cannot be observed individually -we can observe only the (renormalized) sum of the resulting family sizes. These observed normalized family size distribution is thus the average of R independent versions of F , and its variance is thus one R-th of the variance in Equation (6), i.e.
The density of distribution of F for R > 1 is the R-fold self-convolution of the density of F with itself (re-scaled to again have expected value one), and can thus be computed from the pre-computed matrix for the single-molecule case without performing additional simulations. Parameter estimation proceeds just as for R = 1, except that when computing estimate v of VF , we must now account for the reduction of the observed variance of F by a factor of 1

S1.4 Data Analysis
The reads from each of the downloaded sequenced libraries, were mapped (ignoring the barcode part) with NGM v0.5.2 (Sedlazeck et al., 2013) to the reference transcriptome of D. melanogaster (R6.08) respectively E. coli (strain K-12 MG1655). To avoid ambiguities during mapping for genes with multiple isoforms, we filtered the D. melanogaster transcriptome to contain only a single transcript per gene before mapping. For each gene, we picked either the single transcript with a FlyBase score of at least "moderately supported", or the longest transcript (if multiple ones had score "moderately supported" or higher). After mapping the reads, we used the combination of mapping coordinates (both start and end for the paired-end E. coli data, only start for the single-end D. melanogaster data) and barcode (on both ends in the case of E. coli) as UMI. To account for sequencing errors, we merged similar UMIs (barcodes differing at most in one position, mapping coordinates by at most 30 bases for paired-end, 5 for sing-end libraries) using the graph-based algorithm of Smith et al. (2017). For the E. coli data we additionally combined reciprocal UMIs stemming from the two strands of a single template molecule, but stored the read counts for plus-and minus-strand separately (see Shiroguchi et al. (2012)). This yielded, for each of the libraries, a table comprising the gene id, start-end end position, barcode and read-count(s) of each detected UMI. Based on this table, the error-correction thresholds (T = 5 for E. coli, T = 5 for D. melanogaster R1, T = 2 for D. melanogaster R2), and the initial number of molecules (actually, strands) for each UMI (R = 1 for E. coli due to the Y-shaped adapters, R = 2 for D. melanogaster due to secondary strand synthesis before amplification) our algorithm computed library-wide and raw as well as shrunken gene-specific estimates of the reaction efficiency, of the average number of reads per UMI, and of the loss. For the E. coli data, the error-correction threshold was applied to the plus-and minus-strand read counts separately, filtering out UMIs if either count lay below the chosen threshold. This increased the loss of true UMIs, and we modified the definition of the loss accordingly to = 1 − (1 − P(C < T )) 2 (compare to Equation (12)). (Note that in the histograms in Fig. 2A, for a lack of other options, we show plus-and minus-strand counts separately, but omit UMIs where one of the strands is not detected at all). In addition to the gene-specific parameter and loss estimates, our algorithm output the observed number of UMIs n obs g and the estimated total number of UMIs (i.e. transcript molecules) n tot g .

S1.5 Simulation
We determined the residual error of the corrected transcript counts using a simulation approach. We started from the (loss-corrected) estimated transcript counts n tot g and (shrunken) gene-specific estimates for reaction efficiencyÊg and sequencing depthDg of gene g ∈ {1, . . . , K} that we computed for replicate 1 of the D. melanogaster dataset. First we rounded n tot to the next number in the series 10, 30, 100, 300, . . . and used the resulting number as the true number n true g of transcripts of gene g. For each gene g, we then used the amplification+sequencing model (with parameters Eg, Dg and R = 2 meaning double-stranded molecules) to simulate the sequencing of n true g UMIs, which yielded for each gene n true g read counts, one for each UMI. To this list comprising gene id and (for each gene) n true g read counts, we applied our algorithm, using T = 5 and R = 2 as before (but passing along no other information from the first run of the algorithm). The algorithm thus dropped all UMIs with fewer than T = 5 reads, treated the remaining UMIs for each gene g as the observed number of UMIs n obs g , re-estimated the (shrunken) gene-specific losses, and used them to correct n obs g for these losses to arrive at an estimated total transcript count n tot g . Finally, we computed for each gene the relative quantification error as   S4. Total variance of the raw gene-specific loss estimates. Total observed variance was computed for bins containing 20 genes with a similar number n obs g of observed true UMIs. The regression curve s + u/n obs used to infer the optimal gene-specific shrinkage factors λg comprises two components, the variance s of the loss between genes, and the n obs g -dependent error of the (raw) gene-specific loss estimates u/n obs g . See also Gene-specific estimates & corrections.