An approximate Bayesian approach for mapping paired-end DNA reads to a reference genome

Summary: Many high-throughput sequencing experiments produce paired DNA reads. Paired-end DNA reads provide extra positional information that is useful in reliable mapping of short reads to a reference genome, as well as in downstream analyses of structural variations. Given the importance of paired-end alignments, it is surprising that there have been no previous publications focusing on this topic. In this article, we present a new probabilistic framework to predict the alignment of paired-end reads to a reference genome. Using both simulated and real data, we compare the performance of our method with six other read-mapping tools that provide a paired-end option. We show that our method provides a good combination of accuracy, error rate and computation time, especially in more challenging and practical cases, such as when the reference genome is incomplete or unavailable for the sample, or when there are large variations between the reference genome and the source of the reads. An open-source implementation of our method is available as part of Last, a multi-purpose alignment program freely available at http://last.cbrc.jp. Contact: martin@cbrc.jp Supplementary information: Supplementary data are available at Bioinformatics online.

The following four datasets were constructed using this method, all of which are publicly available at www.cbrc.jp/~anish/paired-end-benchmark.

Evaluating mapping results
In order to decide the correctness of mappings reported by each aligner, we used the Seg-suite package. First we computed the true alignment of each simulated read to the reference. For example, to compute this for the reads in simulated reads1.fa, we executed the following: awk '/>/ {print $3, $4, $5, $2, $6 == "+" ? 0 : -$3}' simulated_reads1.fa | \ seg-sort > simulated_reads1VSsimulated_genome.seg seg-swap simulated_genomeVSreference.seg | seg-sort > refgenomeVSsimulated_genome.seg seg-join refgenomeVSsimulated_genome.seg simulated_reads1VSsimulated_genome.seg |cut -f1,4-7 | \ seg-sort > true_alignments1.seg The first command computes the alignments between the simulated reads and the simulated genome by using the read header information in the fasta file generated by fasta-paired-chunks. The second command prepares the alignments recorded in simulated genomeVSreference.seg for the next command. The third command compares the alignments in the two seg files to obtain the true alignments between the reads and the reference. Next the alignments predicted by the aligners were converted to seg format. The following example is for the sam format, which is the format used by most of the aligners seg-import sam predicted_alignments.sam | seg-sort > predicted_alignments.seg Finally, we counted the reads that were mapped correctly. A mapping was considered correct if the predicted alignment and the true alignment matched at, at least one position. To compute the number of reads in simulated reads1.fastq that were correctly aligned, we executed the following: seg-join -w true_alignments1.seg predicted_alignments.seg |cut -f4 |sort -u| wc -l 3 Aligner settings corresponding to the figures in the main article With Last, we used the following commands for all of the tests. These are the recommended parameters in the Last documentation.
lastdb -m1111110 index reference_genome lastal -Q1 -d108 -e120 -i1 index reads1 > out1.maf lastal -Q1 -d108 -e120 -i1 index reads2 > out2.maf last-pair-probs.py -m 0.1 out1.maf out2.maf > result.maf Execution commands corresponding to Figures 1, 3(a), 4, and 5 of the main article are shown below. For aligners that require the user to input fragment length information, we used 288 ± 135, which is equivalent to 5 standard deviations about the mean fragment length used in the generation of our simulated 100bp dataset.  Execution commands corresponding to Figure 3(b) of main article are shown below. For aligners that require the user to input fragment length information, we used 200 ± 160, which is equivalent to 4 standard deviations about the mean fragment length used in the generation of our simulated 76bp dataset. Execution commands corresponding to Figure 3(c) are shown below. For aligners that require the user to input fragment length information, we used 160 ± 150 which is equivalent to 5 standard deviations about the mean fragment length used in the generation of our 35bp dataset. The same settings were also used in the experiment in Section 3.6/ Figure 6 of the main article.

Parameter testing for select aligners
For some of the aligners, we arrived at the above settings by trying out several combinations of parameters and choosing the ones that provided a good combination of speed and accuracy. We describe below our trials for Bowtie, BWA, and SOAP.

Bowtie
Bowtie can do an end-to-end (align the whole of the read to a region of the reference) or a local alignment.  our datasets, we tried both modes and various presets. We show results for some settings in Figure 1. We chose --local --verySensitive-local for 100bp reads and --verySensitive in the end-to-end mode for 76bp and 35bp reads.

BWA
With BWA, we tried varying the following parameters: (1) n which controls the maximum edit distance (2) l which controls the length of initial portion of the read which to be taken as seed (3) k which controls the maximum differences allowed in the seed, and (4) o which controls the maximum number of gap openings allowed. We show some of the results in Figure 2(a). We observed that increasing n improves the sensitivity, but this comes at an enormous speed cost. Changing the other parameters did not bring about any significant performance boost, therefore we settled for default settings throughout all our tests.

Effect of error profile: a different dataset
To understand the effect of error profile on aligner performance and to make our results more robust, we repeated the experiments in Section 3.2 through Section 3.5 on our second 100bp-read dataset, which was generated using the same random fragments but simulating the sequencing error based on error probabilities from SRR067577. We can see from Figure 3 that the new dataset has significantly more low-quality reads.
The results of mapping this new set of reads to the human reference hg19 (with and without Chromosome 5) and to the rhesus genome rheMac2 are shown in Figure 4. The result of mapping the reads after shuffling the same 40% of the read pairs as in the chromosomal translocations experiment in the main article is shown in Figure 5. Results similar to the ones in the main article were observed.

Effect of trimming
To deal with low quality positions towards the 3' end of a read, which is typical of some of the widely-used sequencing platforms, some aligners provide an option to trim the reads. None of them use trimmming in their default settings. Last uses sequence quality information during alignment and has no mismatch limit, and therefore it does not require explicit trimming of reads. To understand the effect of trimming, we applied BWA-style trimming (using DynamicTrim of the SolexaQA package with a threshold of p = 0.05) to our two 100bp datasets -the error profiles of which can be seen in Figure 3 -and compared the results of mapping trimmed and untrimmed reads ( Figure 6). As can be seen from Figure 3, the dataset generated using quality values from SRR067577 has a higher number of reads with low quality scores, especially towards the 3' position; and it is not surprising that trimming this dataset vastly improves the accuracy of aligners like BWA. Last and Stampy are not affected much by trimming. On the contrary, their performance drops slightly with trimming, likely due to loss of information that comes from trimming.
These datasets are especially conducive to trimming because they have few medium-quality bases. The edit-distance-centric aligners do well when there are very few errors, which includes the case where a read ends with uninformative very-low quality bases that are removed by trimming. Last would presumably be favored when there are many medium-quality bases, which boost the error rate, but convey information that would be lost by trimming. It should also be pointed out that there exist datasets in which low quality bases are not confined to the ends. Trimming is not a solution for such cases, but for Last they pose no particular problems.
7 Last speed-up

Gapless alignment
Last performs gapless alignments significantly faster than gapped alignments. Mapping our 3 datasets of varying read lengths to the human reference in both gapped and gapless mode, we observed that gapless alignment results in no loss in sensitivity while the running time, not including the pairing phase, drops by more than 50% for longer reads (Figure 7).

Using a smaller file format
Since our paired-end module is written in Python, file I/O can be time consuming. By changing the ouptut format of the alignment phase to a leaner file format, we can reduce the running time of the pairing phase by almost 30 % (see Figure 8).

Memory usage
Memory usage is dominated by the index of the reference genome which is loaded into memory. With easy availability of ever-increasing memory, none of the aligners pose any serious problems. Last allows the user to set the value of the prior probability that a read pair is disjoint. It is possible to estimate this probability from the set of reads that map uniquely (However, we have not implemented this feature yet). Depending on the input, estimating this value rather than using the default of 0.01 can improve mapping accuracy. To test this, we reran Last on our shuffled dataset (used in the experiment simulating chromosomal translocations), with the disjoint probability set to exactly the ratio of shuffled reads (0.4). The results are shown in Figure 9. A similar trial with Stampy, in which we changed a similar setting provided by Stampy proved ineffective.
10 The "mapping percentage" metric Many papers use the percentage of reads mapped to evaluate and compare the performance of mapping tools. While it can be a helpful metric, we do not explicitly compare this figure for various reasons. First, if necessary this information can be easily worked out from the curves in Figure 3 of the main text. Secondly, even for a single run of an aligner, the percentage of reads mapped varies depending on the what mapping quality threshold is set. For example, the table below shows for a run of Last the percentage of reads mapped for various mismap probability values. As we relax the threshold stringency, we get greater mapping percentage because we are allowing more lower-quality mappings. For this reason, comparison based on mapped percentage can be misleading. Finally, it is a trivial task to build an aligner that always maps a 100 percent of the reads. Therefore it should be the correctness of the mappings we should be concerned about.