Motivation: RNA-Seq technology is promising to uncover many novel alternative splicing events, gene fusions and other variations in RNA transcripts. For an accurate detection and quantification of transcripts, it is important to resolve the mapping ambiguity for those RNA-Seq reads that can be mapped to multiple loci: >17% of the reads from mouse RNA-Seq data and 50% of the reads from some plant RNA-Seq data have multiple mapping loci.
In this study, we show how to resolve the mapping ambiguity in the presence of novel transcriptomic events such as exon skipping and novel indels towards accurate downstream analysis. We introduce ORMAN (Optimal Resolution of Multimapping Ambiguity of RNA-Seq Reads), which aims to compute the minimum number of potential transcript products for each gene and to assign each multimapping read to one of these transcripts based on the estimated distribution of the region covering the read. ORMAN achieves this objective through a combinatorial optimization formulation, which is solved through well-known approximation algorithms, integer linear programs and heuristics.
Results: On a simulated RNA-Seq dataset including a random subset of transcripts from the UCSC database, the performance of several state-of-the-art methods for identifying and quantifying novel transcripts, such as Cufflinks, IsoLasso and CLIIQ, is significantly improved through the use of ORMAN. Furthermore, in an experiment using real RNA-Seq reads, we show that ORMAN is able to resolve multimapping to produce coverage values that are similar to the original distribution, even in genes with highly non-uniform coverage.
Availability: ORMAN is available at http://orman.sf.net
Supplementary information:Supplementary data are available at Bioinformatics online.
Massively parallel RNA sequencing (RNA-Seq) technologies are replacing microarrays in determining the structure and dynamics of the transcriptome. Analysis of RNA-Seq data helps to uncover many novel alternative splicing events, gene fusions and other variations in RNA transcripts. Unfortunately, there are many RNA-Seq reads that can be mapped to several loci equally well, and it is of key importance to resolve their mapping ambiguity in order to perform a comprehensive and accurate analysis of the whole RNA-Seq data.
There are several mappers that can align RNA-Seq reads to a reference genome or previously known transcript sequences (Au et al., 2010; Trapnell et al., 2009; Wang et al., 2010; Yorukoglu et al., 2012). Owing to the presence of paralogs and homologous regions within a gene, RNA-Seq mappers typically report a fraction of multireads, i.e. reads that map to multiple loci on a reference genome. Based on TopHat mappings on the human reference genome, ∼10% of human RNA-Seq reads are multireads. Similarly, ∼17% of mouse and 50% of some plant RNA-Seq reads are multireads (Li and Dewey, 2011). The presence of multireads complicates the downstream analysis such as determining alternative splicing patterns, gene fusions and other variations.
The common practice for handling multireads is ignoring them in the downstream analysis. This leads to inaccurate estimation of the abundance of expressed transcripts (Li and Dewey, 2011; Nicolae et al., 2011). A simple approach for determining the exact genomic location of a multiread is ‘RESCUE’ (Mortazavi et al., 2008). Here, the initial gene expression values are calculated based on the unique reads that map to them. Each multiread is then assigned to the gene with the fraction equal to the ratio between the gene’s initial expression value and the total expression value of all genes that the multiread maps to. A more complex approach based on expected maximization (EM) (Pasaniuc et al., 2011) is designed to handle mapping ambiguity of a read to two or more homologous genes—for the purpose of determining the expression value of each of these genes and not to determine or quantify isoforms. Finally, RSEM (Li and Dewey, 2011), IsoEM (Nicolae et al., 2011) and iReckon (Mezlini et al., 2013) are EM methods based on statistical generative models for sequencing processes to resolve mapping ambiguity.
Many of the above approaches are designed specially for estimating expression values of known/annotated isoforms. Their performance is highly dependent on the completeness of the isoform database in use. Furthermore, they cannot handle alternative splicing events such as novel exon skipping, alternative 5′ donor and 3′ acceptor sites, intron retention and other structural differences such as insertions or deletions. Some recent computational approaches, in particular IsoLasso (Li et al., 2011), CLIIQ (Lin et al., 2012) and Cufflinks (Trapnell et al., 2010), can identify and quantify unknown isoforms and certain types of transcriptomic variations. Unfortunately, neither IsoLasso nor CLIIQ takes into account multireads, and Cufflinks handles multireads through a simple RESCUE-based approach.
In this article, we show how to resolve the multimapping ambiguity in the presence of novel isoforms involving exon skipping, intron retention and small indels towards accurate downstream analysis. To be mathematically precise, we introduce the notion of a partial transcript—a substring of a potential transcript product of a gene, which satisfies certain conditions (a formal definition is provided in the next section).
The objective of our multiread resolution approach, ORMAN (Optimal Resolution of Multimapping Ambiguity of RNA-Seq Reads), is (i) to compute the minimum number of partial transcripts that cover all the multireads and (ii) to assign each multiread to one of these partial transcripts such that each partial transcript is covered according to the estimated local distribution. We achieve the first objective approximately through a reduction to the standard set cover problem. We achieve the second objective through an integer linear programming formulation, which we handle using available integer linear program (ILP) solvers such as CPLEX, or through greedy heuristics we describe in this article.
We evaluate ORMAN on both simulated and real human RNA-Seq datasets. For the first experiment, we generate paired-end RNA-Seq reads from a random subset of transcripts from the University of California, Santa Cruz (UCSC) database with the expression distribution modelled after a real human dataset. On this simulated data, we show that the performance of state-of-the-art methods for identifying and quantifying transcripts such as CLIIQ, Cufflinks and IsoLasso is typically improved through the use of our multiread resolution approach. Notably, when combined with IsoLasso or CLIIQ, ORMAN gives the most accurate and comprehensive novel isoform detection and quantification pipeline available.
To evaluate ORMAN in a more ‘real world’ setting, we also design an experiment using real RNA-Seq data from a cancer patient (Lapuk et al., 2012). For this experiment, we implant artificial genomic repeats into several genes and compare the performance of ORMAN with that of RESCUE in resolving the multireads mapping to these regions. We show that on this dataset, the multiread assignment by ORMAN approximates the original distributions quite well with a maximum relative error of ≤0.3.
Online databases such as the UCSC browser provide known transcripts from specific gene regions. Let be the set of known transcripts from a gene region GR. Each transcript Ti is a string that can be partitioned into ‘exonic segments’ . We define the ‘gene model’ GM implied by the set of transcripts T as an ordered set of alternating substrings of the gene region called canonical exons and canonical introns. Each exonic segment is a maximal substring of a canonical exon, which is either completely present in a transcript or excluded by that transcript. We refer the readers to Figure 1 for an illustration of a gene model derived from known transcripts.
Given a read R mapping to a gene region GR, the partial transcript PT supported by R is the shortest substring of the gene model that completely covers R, which starts and ends with an exonic segment (or a canonical intron in the case of intron retention). If there is a small insertion or deletion (our method limits the size of each indel to 15 nt) between the read and the reference sequence, we introduce a modified partial transcript with the corresponding indel. Figure 2 illustrates several examples of partial transcripts derived from read mappings to a gene model.
2.1 Combinatorial optimization formulation
The set of partial transcripts present in a sample can be derived from the mappings of RNA-Seq reads to a reference genome with the supply of transcript annotation or to a reference transcriptome. Depending on the given mapping to the reference genome or transcriptome, our objective is to assign each multiread to a single locus on the genome or a transcript. We also need to determine the partial transcript that the multiread should map to. This is done in two phases. In the first phase, we are interested in the minimum number of partial transcripts that could cover all the (multi)reads. In the second phase, we try to distribute the (multi)reads to the set of partial transcripts from the first phase such that the distribution of mappings for each partial transcript follows the most likely distribution.
2.1.1 First phase
Let be the collection of all the partial transcripts derived from mapping results. We also denote by PTi () the set of all reads that support the same partial transcript PTi. In addition, each PTi is assigned a positive weight that is proportional to the number of splicing events, i.e. exon skipping and intron retention events with respect to the known transcript it is associated with. For the partial transcripts without any variations with respect to their associated known transcripts, the weight is 1. Each variation adds a user-defined fixed value to this weight. Our default weight contribution of exon skipping is 100 and of indel is 10 000. Indel events have high weight due to their significantly low relative frequency (Karakoc et al., 2012). Note that, in the case of paired-end reads, each end of a fragment may be assigned to a different partial transcript. In that case, we assign such a pair to the partial transcript formed by taking the union of the exons from the partial transcripts on both ends. The weight of this new partial transcript PTi is assigned as the sum of the weights of the two partial transcripts it is composed of. We aim to determine the minimum-weighted set of partial transcripts that can cover all the reads. This problem can be defined as an instance of the minimum-weighted set cover problem, where sets are represented by the partial transcripts, and reads represent set elements. Because the minimum-weighted set cover problem is NP-hard, we use the standard greedy algorithm, which provides a logarithmic factor approximation guarantee (Chvatal, 1979) to solve this problem and obtain the set of partial transcripts used for the smoothing step.
2.1.2 Second phase
First, we give the formulation of the problem in this phase in terms of an ILP below. We then show the computational complexity of the problem. Finally, we show how to solve the problem in practice.
Let be a set of partial transcripts returned from the first phase. For the partial transcript , we aim to distribute multireads across the partial transcript such that the coverage function of the reads in each partial transcript resembles the most likely distribution. In the case of paired-end reads, we use both ends for the coverage determination. For a read R, let SPT(R) be the set of partial transcripts that R could map to.
Now let lenR be the read length and be the length of the partial transcript PTj. Let Rij ( and ) be indicator variables, where means that we assign Ri to the partial transcript PTj; otherwise . We enforce that Ri can only be assigned to one partial transcript:
Let NRjk ( and ) be the number of reads that cover position k in PTj. Let be the set of the multireads that cover the position k in PTj. In similar manner, we define to be the number of reads that are uniquely mapped and cover the position k in PTj. NRjk could be written as the summation of the number of uniquely mapped reads and multireads that cover the location k:
Let AVjk be the desired number of reads covering the position k in the partial transcript PTj. Because we do not know the original distribution of the reads, we approximate AVjk as follows. First, we find the multimapping region Mk of PTj, which encompasses position k. Next, we calculate the average coverage in the left and right neighbourhoods of Mk (the size of each neighbourhood is set to base pairs, where h is a user-defined parameter). We use the calculated average values as defining points for a line l, which approximates the desired function AV. Then, AVjk is calculated as a value on the line l at the position k. The rationale of this approach lies in the observation that coverage level of a small region is often similar to the level of the immediately neighbouring regions, even when the coverage varies significantly along the entire gene (see Fig. 3).
Let denote the maximum difference between the desired number of reads AVjk per position of partial transcript PTj and the observed number of reads NRjk at any position k. We enforce the following constraints:
The problem of smoothing of the distribution of reads along partial transcripts, named SMOOTH, is provably hard. In addition, it is unlikely to have a constant factor approximation algorithm for the SMOOTH problem. The proofs are described in Supplementary Materials.
2.1.3 Practical implementation
The ILP formulation of ORMAN is solved by IBM ILOG CPLEX. In practice, the running time of the proposed ILP depends on the number of integer and non-integer variables and the number of constraints. The number of integer variables of the provided ILP is proportional to the number of mappings of multireads, which can be in the order of millions. Here we propose a strategy to decompose the original problem into smaller subproblems such that the solution of each smaller one is independent from each other. We create a graph among the partial transcripts returned from the first phase, i.e. . There is an edge between two partial transcripts if there is a multiread r mapping to both of them. It is easy to see that the solution of the ILP corresponding to each connected component in GPT is independent from the solutions of other components. Thus, we can obtain the solution for each component separately using CPLEX. There may still exist some components that could not be solved using CPLEX. In these cases, we propose a heuristic strategy as described in the Supplementary Materials.
3 EXPERIMENTAL RESULTS
We evaluate the performance of ORMAN on both simulated and real datasets. On both types of data, we show that ORMAN resolves mapping ambiguity of multireads accurately and improves the performance of the leading transcript identification and quantification tools.
3.1 Transcript identification and expression quantification in simulated data
First, we focus on quantifying how much ORMAN improves downstream analysis tools. We compare the performance of the leading transcript identification and quantification tools by (i) first running each tool without any pre- or postprocessing (ORIGINAL), and (ii) then running each tool after preprocessing the mappings by ORMAN.
Because there are no real world benchmark datasets that provide comprehensive and accurate information on all transcripts and their abundance levels validated by wetlab techniques, we use simulation data for this evaluation. [Even though the MAQC project (Shi et al., 2006) used RNA-Seq technologies to quantify the expression of a limited number of genes, a significant number of these genes have a single isoform and have unique sequence composition (Li and Dewey, 2011)].
3.1.1 Simulation data
We generated RNA-Seq reads of human transcripts with expression distribution similar to one derived from a real dataset from the GEO database (accession number GSM759513). This dataset comprises paired-end 50-bp RNA-Seq reads of a prostate tissue from Illumina Human BodyMap 2.0 project (Shen et al., 2012). The reference transcriptome has 76 969 transcripts based on the UCSC database. We used TopHat version 2.0.7—with the number of mismatches at most 2—to obtain the mappings of the RNA-Seq reads to the reference sequence (version hg19). We ran IsoEM to quantify the expression profile of the UCSC reference transcriptome and determined that 39 388 of them are highly expressed.
For the simulations, we assigned one random transcript out of all 76 969 transcripts to each one of the expressed transcripts of the prostate dataset. These randomly assigned transcripts represented the expression of 17 956 genes. We then set the expression value of each random transcript to that of the prostate dataset transcript it is associated with. We finally selected 10% of this randomly selected set of the simulated transcripts for the production of novel transcripts; for each such transcript, we randomly skip an exon.
To ensure this transcript is novel, we check whether it is highly similar to other known transcripts. We consider a novel transcript to be highly similar to a known transcript if they have the same number of exons and their percentage sequence similarity is >90%. The novel transcript is then assigned the same abundance level as the original transcript.
We generated 80 million paired-end RNA-Seq reads of 75-bp length from the chosen transcripts. The fragment length is determined based on the normal distribution with a mean of 250 bp and a standard deviation of 25 bp. Each transcript received a number of reads proportional to its predetermined expression level, and each read was picked uniformly at random over all possible starting positions of the transcript. We then randomly introduced sequencing errors in the generated reads according to sequencing error model described in Dohm et al. (2008). This model places the majority of mismatch errors towards the 3′-end of the reads. The error percentage per base was set to be 1%. We used TopHat with the above settings to map the generated reads to the reference genome. Approximately 4% of the generated reads had multiple mapping loci.
3.1.2 Performance evaluation
Our performance evaluation is based on three tools: Cufflinks (version 2.0.2), IsoLasso (version 2.6.0) and CLIIQ (version 0.1.0.2). Cufflinks uses a modified rescue strategy to resolve multireads, whereas the latter two are not capable of resolving multimappings. We run CLIIQ in both its standard mode, where it selects the minimum possible number of isoforms, which minimizes quantification errors, and preference mode, where it prefers known isoforms when there are multiple candidate solutions (abbreviated as CLIIQ_pref below). To measure the relative performance of these tools, we provided the complete UCSC gene annotations and disallowed any novel splice sites while allowing novel exon skipping and intron retention events.
The expression values of transcripts are measured in fragments per kilobase per million mapped reads. For each transcript, we define the ‘relative quantification error’ produced by a given tool as follows. (i) If the known expression value of the transcript is e and the expression value of the transcript reported by the tool is , then the relative quantification error is . (ii) If the tool reports a transcript that is not among the simulated expressed transcripts, the relative quantification error is . (iii) If the tool misses a known expressed transcript, the relative quantification error is 1. Following (Li and Dewey, 2011; Nicolae et al., 2011), we first investigate the proportion of transcripts whose relative quantification error is above a threshold.
For each tool, we also compare how ORMAN affects its performance on detecting novel isoforms. The novel isoforms in our simulation generate reads that are incompatible to any known gene annotations. For existing mapping ambiguity resolving tools that require the full list of known transcripts, these reads might be discarded; hence, novel isoforms with multireads may not be detected. On the other hand, ORMAN allows such reads to be used in the solution. In our experiments, all three tools detect more novel isoforms based on ORMAN mappings as can be seen from Table 1.
In Figure 4, we see that ORMAN improves the performance of IsoLasso and CLIIQ significantly in both modes, which, in comparison with Cufflinks, return fewer incorrectly quantified isoforms for smaller error thresholds. Overall, Figure 4 demonstrates that the combination of ORMAN and CLIIQ_pref provides the best results.
We also report the performance of tools on genes that produce a high proportion of multimapping reads separately. Here, we focus on 3784 genes (expressing 7275 transcripts) to which TopHat mapped reads have the top 20% highest mapping multiplicity (see later).
Figure 5 shows the proportion of transcripts whose relative quantification error is above a threshold on this subset. As before, ORMAN improves the performance of IsoLasso and CLIIQ significantly in both modes, which are better compared with that of Cufflinks.
Next, we consider the performance of each tool in novel isoform detection for those genes that produce multimapping reads. First, we sort all expressed genes according to their mapping multiplicity (i.e. the proportion of the reads that can be mapped to such a gene, which can also be mapped to other genes). Then for genes ranked in the top 10, 20, 30, 40 and 50%, we examine how each tool performs in detecting novel isoforms. Figure 5 demonstrates that, in the case of novel isoforms, all tools benefit from ORMAN mappings. In addition, for those genes whose multiplicity is in top 10% in the sample, ORMAN performs particularly well.
3.2 Multimapping resolution in real RNA-Seq data
It has been known that real RNA-Seq experiments often suffer from various biases resulting in a rather non-uniform coverage across a gene model (Roberts et al., 2011; Wu et al., 2011). Unfortunately, modelling of such complex biases in simulations would be cumbersome. To overcome this problem, we design a controlled experiment with real RNA-Seq reads. For this experiment, we use a previously published RNA-Seq dataset with 51-bp Illumina paired-end reads sampled from a human prostate cancer patient (Lapuk et al., 2012). On this dataset, we introduce artificial repeats in 10 genes based on sequences of other genes. By modifying the sequences of the original reads mapping to the artificial repeats, yet keeping everything else intact, we essentially create a multimapping dataset for which the true coverage distribution is known. We then evaluate ORMAN’s performance in resolving these multireads.
The following section explains the experimental setup in detail. In the next section, we elaborate on the experiment results.
3.2.1 Experimental setup
First, we map the reads using TopHat (version 1.3.2) to the reference sequence (hg19) and Ensembl annotations (GRCh37.62). Next, we randomly select 10 ‘decoy’ genes according to the following rules:
The gene is annotated to have a single transcript based on the Ensembl annotations.
The total gene length (i.e. the sum of all canonical exons) is at least 2000 bp.
The gene is sufficiently expressed in the sample, having an average coverage >100.
The gene is uniquely mappable (i.e. there are no multireads mapping to the gene model).
Similarly, we randomly select 10 ‘replacement’ genes according to the rules 2, 3 and 4 above. Within each replacement gene, we select a 400-bp region to serve as an artificial repeat. This 400-bp sequence is then used to replace the sequence of a region of the same length in the decoy gene. In other words, in each decoy gene, we create an artificial repeat for which the sequence is taken from a randomly chosen replacement gene. The selected genes and the repeat regions are given in Table 2.
|Gene||Chromosome||Strand||Start||End||# of reads||Gene||Chromosome||Strand||Start||End||# of reads|
|Gene||Chromosome||Strand||Start||End||# of reads||Gene||Chromosome||Strand||Start||End||# of reads|
Note: ‘# of reads’ denotes the initial number of reads mapping to the 400-bp region that is used to introduce the artificial repeats.
In the next step, we identify the reads mapping to the coordinates coinciding with the artificial repeat region in each decoy gene. The sequences of these reads are changed according to the new sequence of the decoy gene. All other reads are kept the same. The entire set of reads is then mapped to the new genome reference and the original Ensembl annotations using TopHat with the same parameters.
In this experiment, we compare ORMAN with the modified version of RESCUE as used in Cufflinks (Mortazavi et al., 2008; Trapnell et al., 2010). This modified version calculates the initial gene/transcript abundances first by equally distributing the multireads to each gene they map to. In the second phase, each multiread is distributed in proportion to the relative abundance of each gene as computed in the first phase.
Figure 6 shows the relative error of coverage in the artificial repeat regions after resolution with ORMAN and RESCUE. This measure is calculated as:and corman are the original coverage, raw coverage after the second TopHat mappings and coverage after multiread resolution with ORMAN, respectively. The relative error for RESCUE is defined similarly.
On genes APPBP2, CD164, PPM1H, RCOR1, RYBP, SERPINB6, SSR2, TXNDC16, UQCRC2 and ZBTB42, we see that ORMAN produces lower error values than RESCUE, whereas in the rest of the genes, it produces a higher relative error. On the other hand, the relative error of ORMAN never exceeds 0.3. Furthermore, a closer look on some of the genes suggests that ORMAN is still able to reproduce the look of the original distribution quite well despite the fact that RESCUE has a lower relative error. Figure 7 illustrates two such genes. Note that although both genes have a high variation in coverage, the coverage distribution in the repeat region is close to the original distribution after processing with ORMAN.
In this article, we introduce a combinatorial optimization formulation for resolving mapping ambiguity of RNA-Seq reads. Using a simulated RNA-Seq dataset on humans, we have shown that ORMAN improves the performance of popular computational tools in transcript identification and quantification, especially for genes with novel isoforms. Furthermore, our experiments based on real RNA-Seq reads suggest that the localized approach of ORMAN is able to approximate the original read distribution of the multimapping regions even in genes with highly variable coverage. Although ORMAN’s performance was similar to that of RESCUE in our small-scale experiment, we suspect that datasets that suffer from elevated level of sequencing biases such as severe RNA degradation could benefit even more from our approach.
Funding: This research was supported by NSERC Discovery and Discovery Accelerate Grant to S.C.S., Canadian Cancer Society Innovation Grant to C.C. and S.C.S., Vanier Canada Graduate Scholarship to I.N. and Ebco/Epic Visiting Chairs and ICGC Grant to N.D.
Conflict of Interest: none declared.