deML: robust demultiplexing of Illumina sequences using a likelihood-based approach

Motivation: Pooling multiple samples increases the efficiency and lowers the cost of DNA sequencing. One approach to multiplexing is to use short DNA indices to uniquely identify each sample. After sequencing, reads must be assigned in silico to the sample of origin, a process referred to as demultiplexing. Demultiplexing software typically identifies the sample of origin using a fixed number of mismatches between the read index and a reference index set. This approach may fail or misassign reads when the sequencing quality of the indices is poor. Results: We introduce deML, a maximum likelihood algorithm that demultiplexes Illumina sequences. deML computes the likelihood of an observed index sequence being derived from a specified sample. A quality score which reflects the probability of the assignment being correct is generated for each read. Using these quality scores, even very problematic datasets can be demultiplexed and an error threshold can be set. Availability and implementation: deML is freely available for use under the GPL (http://bioinf.eva.mpg.de/deml/). Contact: gabriel.reno@gmail.com or kelso@eva.mpg.de Supplementary information: Supplementary data are available at Bioinformatics online.

However, the computation above is expensive. A potential way to speed it up is to consider certain terms as being negligible. If the best hit for both P7 and P5 stems from the same sampleî. Let the second best hit beĵ for each index. We can assume that remaining pairs are insignificant compared to the probability of pertaining to these two groups, equation 1 becomes: The scaling factor 1 2 is due to the use of two terms in the numerator. The log of this expression is expressed as the Z2 score: If the best hit for both P7 and P5 does not come from the same sample but rather two different ones, namelyî andĵ, equation 1 can be written as : Z2 ≈ 2 1 · p(rp7|I7î) · p(rp5|I5ĵ) p(rp7|I7î) · p(rp5|I5î) + p(rp7|I7ĵ) · p(rp5|I5ĵ) Again, taking the log yields the Z2 score: Z2 ≈ 0.3 + log10(p(rp7|I7î) · p(rp5|I5ĵ)) − log10(p(rp7|I7î) · p(rp5|I5î) + p(rp7|I7ĵ) · p(rp5|I5ĵ)) (5)

Algorithm
For a given observed sequenced index, deML needs to identify all possible index sequences from a user supplied list within a given number of mismatches. To achieve this in a timely fashion, deML builds a prefix tree of the user supplied indices which represent common prefixes as common paths in the tree (see Figure 1). The height of a given node directly indicates the position in the original index string. An advantage of prefix trees is the ability to search with mismatches using recursive calls in the data structure. The call is launched on the root using the string to be searched and the tolerated number of mismatches. The recursive call is performed on the child nodes where the number of tolerated mismatches is decreased by one when one the letter represented by the current node differs or leaving the mismatch count as is otherwise. The query sequence is shortened by one after each function call. The recursion ends when the number of tolerated mismatches falls below zero or a leaf node is reached.
The overall prefix tree algorithm returns all possible indices within a fixed number of tolerated mismatches for downstream computations. Once all the indices have been identified, the likelihood of pertaining to each sample that has been detected is computed. As mentioned in the main text, upon computing sample assignment quality Z1, the number of tolerated mismatches can be set to be lesser than the length of the indices as the contribution of the more divergent indices (edit distance exceeding the number of tolerated mismatches) can be generally considered negligible. As the likelihood of pertaining to samples having indices with low edit distance dwarfs the one to samples with more divergent indices, their contribution can often be safely overlooked. Figure 1: A prefix tree for the following sequences : AGAAT,AGAGG,CATAC,CGCAG,GCATG,TACAC,TACAT. The common prefixes become common paths in the tree.

Testing data
A 245 bases long fragment was amplified from human iPS cells digested in QuickExtract DNA Extraction Solution (epibio) using primers GGCTTAAGTCCTGCTGAGA and AGATAAATATAGAATAAAGCTCATGA. Each 25 l PCR reaction contained Phusion HF master mix (NEB) at 1X, each primer at 0.5 M, 0.024 l of template and the rest was water. The mixture was heated to 98C for 30 seconds, followed by 25 cycles of 98C for 10 seconds, 56C for 10 seconds and 72C for 10 seconds. PhiX DNA was fragmented with Covaris S2 with the 500 bases settings (duty cycle 5%; Intensity 3; cycle per burst 200; time 80s) which gave a fragments that had a mod length of 580 bases as judged by a 2100 Bioanalyzer (Agilent). Indexed Illumina libraries were prepared as described by Kircher et al. (2012) and the indices used are given in Supplementary Table 1. deML relies heavily on the base quality scores to compute the likelihood of pertaining to a given sample for a given read. In theory, the base quality scores reflect the probability of error and can therefore be used to accurately compute the probability of observing the bases from the read given a certain sequence template. It is therefore important for quality scores to accurately reflect the probability of error for a given base. We used freeIbis (Renaud et al. (2013)) using quality score calibration. While plotting the resulting base quality scores against the observed error rate (see Supplementary Figure 2), the ones predicted by freeIbis show significant improvement in terms of correlation upon comparison with the quality scores predicted by the default basecaller provided by the vendor, Bustard. However, we show (see Supplementary Results Section) that the correlation between our assigned confidence and observed false assignment rates as well as our robustness to error also hold true for Bustard basecalled data. Sequence adapters were removed using cutadapt v1.2.1 and reads shorter than 10 basepairs were removed (Martin (2011)). The reads were demultiplexed using deML 1 according to the list of 100 pairs of index sequences that was used. Reads were aligned to a concatenation of the human genome (1000 Genomes) along with decoy sequences (1000 Genomes decoy sequences version 5) and the PhiX genome sequence (sequence provided by Illumina Corp.) using BWA version 0.5.10 (parameters: "-n 0.01" and no seeding). Supplementary Figure

Mapping statistics
As the two potential templates of the sequencing run were human genome PCR product and PhiX controls, we expect most reads to map to either one of those regions. We analyzed each pair of reads and considered that a read can fall within 4 categories: PCR region, PhiX, mapped to a region outside our targets and unmapped. The tally (see Table 2) shows that most pairs are either both PCR product or both PhiX.

Discordant pairs
A small number of read pairs exhibited unexpected mapping patterns. For 179 clusters, the first mate mapped to PhiX and second one mapped to the PCR region. A total of 234 clusters exhibit the converse pattern where the first mate mapped to the PCR region but the second mapped to the PhiX genome. For those 413 (179+234) clusters, we sought to test the possibility that they might have been generated by shifting clusters and therefore have a high probability of error. For a read of length L and where q l is the PHRED quality score for the base at position l, the expected number of mismatches per base to the reference is computed using the following expression: The expression above is the average number of mismatches for any given base and can be multiplied by a sequence length to know how many mismatches are expected over the entire molecule. The expected number of mismatches for this subset was calculated to be 0.0301 or essentially 3.01 mismatches per 100 bases. To assess whether this number is higher in a statistically significant way, 10,000 subsets of 413 cluster were selected at random from the initial BAM file. The distribution for the expected number of mismatches for those randomized subsets were plotted (see Figure 3) against the same number of these discordant pairs. The expected number of mismatches is higher than any of the random subsets (p<0.0001).

Distribution of predictive scores
For reads aligning to the PhiX genome, we consider reads demultiplexed as control to be unequivocally true assignments and human PCR samples to be false assignments. Out of the 8,286,275 unique reads mapping with high mapping quality (MAPQ>=30) to the PhiX genome, 42,089 (0.51%) reads were demultiplexed as one of the human PCR samples. To show that the scores produced by deML can minimize these false assignments, the correlation between these scores and true and false assignments was evaluated. This computation was done on the set of false assignments and a subset of equal size of true assignments.
For the Z0 score (see Figure 4a), the majority of true assignments (green) have a high probability of pertaining to the sample to which they were assigned. False assignments (red) have on average a much lower probability of pertaining to the sample of origin as shown by the higher Z0 score. The density of Z1 score, the probability of pertaining to another sample than the most likely one, was also plotted (see Figure 4b). True assignments (green) have a lower probability of misassignment compared to actual misassignments (red). Table 3. Predictive value of the Z 0 , Z 1 and both scores used in conjunction to classify correct assignments from mis-assignments using the data presented in Figure 1 in the main text.
predictor misclassification out of 84,178 To measure the joint distribution of the Z0 and Z1 for correct and false assignments, the distribution of both scores were projected for the same reads described in the paragraph above. The distribution of Z0 and Z1 scores of false assignments and a subset of equal size of true assignments was plotted (see Figure 5). The color of the individual dots depended on the density of other dots in the vicinity. As expected, reads with a high likelihood of stemming from the PhiX control (Z0) group and with a low likelihood of stemming from another sample (Z1) were enriched for true assignments whereas misassignments were found at the other end of the distribution.

Predictive power of combined scores
To evaluate whether having both Z0 and Z1 scores has better predictive power than solely using a single one, a logistic regression was performed using each score individually and both at once. Using the PhiX data presented in the main text in Figure 1, positive and negative assignments were used as labels and the Z0 and Z1 scores were used as potentially predictive values. The classification was performed using a logistic regression using the glm() function in R version 3.0.1. For all 3 set (Z0, Z1 and Z0, Z1 combined), the number of misclassifications were computed. The lowest number of misclassifications were obtained using both scores in conjunction.

Robustness to sequencing errors
To evaluate the robustness of the demultiplexing to increased error rates, reads with perfect matches to index sequences from a known sample were taken from the original set and mismatches were added using an Illumina error profile. This profile contains sequencer-specific nucleotide substitutions along with quality scores for those. As CASAVA returns only sequences with a maximum of 1 mismatch, the number of sequences with 0 mismatches, 1 mismatch and 2 or more mismatches to the original indices are presented (see Table 4). We ran deML using the default cutoffs (Z0 > 80 and Z1 > 20). For comparison, we ran deindexer (https://github.com/ws6/deindexer 2 , which detects matches to known sample indices using hashes to detect collisions and avoid false assignments. For deindexer, at most 5 mismatches were tolerated as two random sequences of 7 basepairs will exhibit on average 5 mismatches (7 · 3 4 = 5.25). At higher error rates, the number of demultiplexed sequences that could be retrieved by CASAVA substantially as sequences with 1 mismatch or less become a small fraction of the total. However, deML shows greater robustness to increased error rates while keeping a misassignment rate under 0.5% even at very high error rates for sequences meeting the default thresholds. The other software, deindexer, performs well at low error rates but does not scale well to high error rates (see Table 5). At the highest simulated error rate, deML demultiplexes 3.1 times more sequences while maintaining a lower false discovery rate ( 0.44% for deML and 0.56% for deindexer).

Speed versus sensitivity
As the maximum number of allowed mismatches can be specified from the command line, the sensitivity and runtime of the program were evaluated as a function of the number of mismatches (see Table 6). Tolerating more mismatches leads to slower execution speeds but allows greater sensitivity and a more accurate calculation of the Z1 score.
At 3 mismatches, deML was able to detect at least one sample of origin for each sequence. It is worth noting that the quality of this sequencing run used in this study was excellent and allowing additional mismatches was not needed to find at least one sample of origin for each sequence.

Background error rate
As mentioned in the discussion, if we consider only clusters with a high probability of pertaining to their respective read group (Z0 = 0), where both pairs map to the PhiX, the overwhelming majority were demultiplexed as Phix. However, there were 9 clusters (18 sequences in total) which were assigned to the human PCR region samples. In theory, such sequences with indices matching to samples pertaining to PCR regions with perfect matches yet where the forward and reverse read match to the PhiX control should not exist. To investigate whether mixed clusters could have produced such sequences, the expected number of mismatches per base for those 18 sequences were computed.
The same quantity was computed for 10,000 independently sampled subsets of 18 sequences for the entire dataset. A comparison reveals that those sequences do not have an expected number of mismatches per base above those of the background (see Figure 6) thus making the mixed cluster hypothesis unlikely. Supplementary Figure 4: a) Distribution of the Z0 score for reads aligning for the PhiX genome for reads either demultiplexed as control (green) or as human samples (red) b) Distribution for the same reads but for the Z1 score. Supplementary Figure 5: Distribution of true assignments and false assignments and their respective Z0 and Z1 scores for reads aligned to the PhiX depending on whether they were demultiplexed as a human sample (red) or PhiX (green). Reads demultiplexed as being part of the human PCR samples can be considered to be false assignments. A subset using an equal amount of true assignments was used to compare the Z0 and Z1 scores. The intensity of the color indicates the density of the data points for the given category. Data points with a dearth of other points from the same category in the vicinity are dyed as gray.  Figure 6: The distribution of the expected number of mismatches per base for 10,000 sets of 18 randomly chosen sequences mapping to the PhiX genome with Z0 = 0 (red line) versus the ones not demultiplexed as PhiX but as human PCR region (black arrow).

Demultiplexing with default quality scores
We show in the Supplementary Data section that, for the MiSeq run used in this study, predicted quality scores produced by Bustard (the default Illumina basecaller) do not have a perfect correlation to their observed ones. Some groups rectify this discrepancy after basecalling using the Genome Analysis Toolkit (GATK) however, this is not feasible for index sequences (McKenna et al. (2010)). As deML relies on quality scores, we verified whether the algorithm would work equally well for sequence data produced by the default Illumina basecaller. More precisely, the correlation between the false assignment rate and the Z0 and the Z1 scores was evaluated. We demultiplexed the same data but instead of the freeIbis basecalls, the default Illumina basecalls were used. The distribution of the false assignments versus true ones were plotted (see Figure 7). Furthermore, the correlation between the mis-assignment rate and the Z1 score was also measured (see Figure  8). In both cases, the correlation between both scores and the false assignment rates holds. This is a likely consequence of the fact that quality scores produced by Bustard, albeit not having a perfect correlation to their observed error rates, offer a reasonable approximation for the major part of them (quality scores between 30 and 40). Similarly to freeIbis, the quality scores at the lower end of the distribution (less than 20 on the PHRED scale) do not seem to correspond to their observed error rate. As a consequence, the first data point in Figure 8 does not seem to follow well the predicted linear relationship. We also evaluated whether deML would provide the same robustness for the data basecalled with Bustard. In an approach identical to the one used for freeIbis basecalled dataset, mismatches in the indices were added at various rates. The substitutions and quality scores for the mismatching bases were added using a Bustard error profile obtained from control sequences aligned to the PhiX. The number of sequences that could be demultiplexed by deML greatly exceeds the retrievable number of sequences using the default strategy of allowing 1 mismatch, especially at high error rates (see Figure 9). Similarly to results obtained on the data basecalled using freeIbis presented in Table 4 in the Supplementary Results, at the highest error rate, this set also had a low amount of false assignments (8,469 sequences) out of those that passed our default quality thresholds (2,605,363 sequences) for a maximal observed false assignment rate of 0.33%. Observed misassignment rate with respect to binned Z1 score Z1 score per bin Observed misassignment rate (log scale) 1/1 1/100 1/10,000 1/1,000,000 Supplementary Figure 8: Correlation between the Z1 score for reads aligned to the PhiX genome and the observed mis-assignment rate on a log scale for the Bustard basecalled reads. Like the figure in the main document, the line is a linear regression using on all but the first data points. The size of the bins are the same as the ones used for the main figure except that no false assignments were seen for this dataset for a Z1 score above 200 hence no datapoint was reported. Error bars were obtained using Wilson score intervals.

b)
Supplementary Figure 9: a) The edit distance of the simulated indices to the original index sequence as a function of the simulated edit distance to the original indices for Bustard basecalled data. This graph indicates the limits of heuristics using fixed-mismatches like CASAVA. b) For the same dataset, the number of sequences correctly assigned to the original sample for both the ones that passed quality threshold and those that did not. The number of incorrect assignments are also reported for both categories.