Intron-centric estimation of alternative splicing from RNA-seq data

Motivation: Novel technologies brought in unprecedented amounts of high-throughput sequencing data along with great challenges in their analysis and interpretation. The percent-spliced-in (PSI, ) metric estimates the incidence of single-exon–skipping events and can be computed directly by counting reads that align to known or predicted splice junctions. However, the majority of human splicing events are more complex than single-exon skipping. Results: In this short report, we present a framework that generalizes the metric to arbitrary classes of splicing events. We change the view from exon centric to intron centric and split the value of into two indices, and , measuring the rate of splicing at the 5′ and 3′ end of the intron, respectively. The advantage of having two separate indices is that they deconvolute two distinct elementary acts of the splicing reaction. The completeness of splicing index is decomposed in a similar way. This framework is implemented as bam2ssj, a BAM-file–processing pipeline for strand-specific counting of reads that align to splice junctions or overlap with splice sites. It can be used as a consistent protocol for quantifying splice junctions from RNA-seq data because no such standard procedure currently exists. Availability: The C code of bam2ssj is open source and is available at https://github.com/pervouchine/bam2ssj Contact: dp@crg.eu


INTRODUCTION
One major challenge in the analysis of high-throughput RNA sequencing data is to disentangle relative abundances of alternatively spliced transcripts. Many existing quantification methods do so by using considerations of likelihood, parsimony and optimality to obtain a consolidated view of cDNA fragments that map to a given transcriptional unit (Katz et al., 2010;Montgomery et al., 2010;Trapnell et al., 2012). The advantage of such integrative approaches is that they provide robust estimators for transcript abundance by reducing sampling errors, as they effectively consider samples of larger size. In contrast, because all the reads from the same transcriptional unit are combined into one master model, there is no guarantee that the inclusion or exclusion of a specific exon is estimated independently of co-occurring splicing events (Katz et al., 2010;Pan et al., 2008).
The quantification of alternatively spliced isoforms based on the É metric captures more accurately the local information related to splicing of each particular exon (Katz et al., 2010). We follow Kakaradov et al. (2012) in considering only the reads that align to splice junctions ( Fig. 1) and ignoring the reads that align to exon bodies (position-specific read counts are not considered). É is defined as where the factor of two in the denominator accounts for the fact that there are twice as many mappable positions for reads supporting exon inclusion as exon exclusion. Equation (1) defines an unbiased estimator for the fraction of mRNAs that represent the inclusion isoform under the assumption that splice-junction reads are distributed evenly. É can also be derived from the expression values of whole isoforms, for instance, as the abundance of the inclusion isoform as the fraction of the total abundance. However, the non-uniform read coverage not only between but also within transcripts makes such estimates generally detrimental (Kakaradov et al., 2012). The É metric can be generalized beyond the class of singleexon-skipping events by counting inclusion and exclusion reads regardless of exon adjacency ( Fig. 1, dashed arcs). Although this definition helps to reduce the undercoverage bias by taking into account splice junctions that are not present in the reference annotation, it often assigns misleading values to É metric, for instance, in the case of multiple-exon skipping, where the amount of support for exon exclusion does not reflect the true splicing rate of each individual intron.

APPROACH
In this work, we change the view from exon centric to intron centric. Each intron is defined uniquely by the combination of its 5 0 -splice site (D, donor) and 3 0 -splice site (A, acceptor). Denote by nðD, AÞ the number of reads aligning to the splice junction spanning from D to A (Fig. 2) and define where D 0 and A 0 run over all donor and acceptor sites, respectively, within the given genomic annotation set. Because A 0 could be A and D 0 could be D, both 5 ðD, AÞ and 3 ðD, AÞ are real numbers from 0 to 1. The value of 5 ðD, AÞ can be regarded as *To whom correspondence should be addressed. an estimator for the conditional probability of splicing from D to A, i.e. the fraction of transcripts in which the intron D to A is spliced, relative to the number of transcripts in which D is used as a splice site. Similarly, 3 ðD, AÞ is the relative frequency of D-to-A splicing with respect to the splicing events in which A is used.
In the particular case of single-exon skipping (Fig. 1), the values of É, 5 and 3 are related as follows. Denote the upstream and downstream introns of the highlighted exon by ðD 1 , A 1 Þ and ðD 2 , A 2 Þ, respectively. Let Assuming uniform read coverage across the gene (a ' b), we get ! 5 ' ! 3 ' 1 2 and, therefore, That is, in the particular case of single-exon skipping, the value of É is equal to the average of 5 and 3 given that the read coverage is reasonably uniform. If a and b differ significantly, the contribution of 5 and 3 to É is given by the weight factors ! 5 and ! 3 . Similarly, the completeness of splicing index (Tilgner et al., 2012) is split into two indices, 5 ðDÞ and 3 ðAÞ, where and nðXÞ denotes the number of genomic reads (reads mapped uniquely to the genomic sequence) overlapping the splice site X.
Note that 5 depends only on D and 3 depends only on A. The values of 5 and 3 are unbiased estimators for the absolute frequency of splice site usage, i.e. the proportion of transcripts in which D (or A) is used as a splice site, among all transcripts containing the splice site D (or A).

METHODS
To compute 5 , 3 , 5 and 3 for a given donor-acceptor pair, one needs to know five integers, nðD, AÞ, P nðD, A 0 Þ, P nðD 0 , AÞ, nðDÞ and nðAÞ, of which only the first one depends on both D and A, while the rest have a single argument. We developed bam2ssj, a pipeline for counting these five integers directly from BAM input. bam2ssj is implemented in C þþ and depends on SAMtools (Li et al., 2009). The input consists of (i) a sorted BAM file containing reads that align uniquely to the genome or to splice junctions and (ii) a sorted GTF file containing the coordinates of exon boundaries. Each time the CIGAR string (Li et al., 2009) contains xMyNzM, x, z 1, the counter corresponding to the splice junction defined by yN is incremented. One mapped read may span several splice junctions and increment several counters. If the CIGAR string does not contain the xMyNzM pattern, the read is classified as genomic and increments nðXÞ for every splice site X it overlaps. Position-specific counts (Kakaradov et al., 2012) are implemented as a stand-alone utility that is not included in the current distribution. Importantly, bam2ssj counts reads that align to splice junctions in a strand-specific way, i.e. nðD, AÞ, P nðD, A 0 Þ, P nðD 0 , AÞ, nðDÞ and nðAÞ are reported for the correct (annotated) and incorrect (opposite to annotated) strand. We leave further processing of these counts by Equations (2)-(4) to the user.

RESULTS AND DISCUSSION
We validated bam2ssj by counting reads aligning to splice junctions in the whole-cell polyadenylated fraction of Cold Spring Harbor Long RNA-seq data (http://genome.ucsc.edu/ ENCODE/). In total, 8 558 231 343 mapped reads were analyzed in 404 min ('350 000 reads/sec). 1 184 553 724 reads align to splice junctions, of which '1% align to the opposite strand. 1 699 718 327 reads overlap annotated splice junctions, of which '5% map to the opposite strand. The values of nðD, AÞ coincide with those reported by ENCODE in 98.9% of cases (1 163 251 008 reads); all discrepancies were due to the ambiguity of CIGAR translation in the mapper's output. Because RNA-seq data are increasingly processed into the compact BAM form, we propose that bam2ssj be used as a standard operating procedure for counting splice junction reads.
Funding: Grants BIO2011-26205 and CSD2007-00050 Consolider, Ministerio de Educacio´n y Ciencia (Spain). Fig. 1. The percent-spliced-in (PSI, É) metric is defined as the number of reads supporting exon inclusion (a þ b) as the fraction of the combined number of reads supporting inclusion and exclusion (c). The exon of interest is shown in gray. Only reads that span to the adjacent exons (solid arcs) account for Equation (1) D A A D D A Fig. 2. Left: the 5 0 -splicing index, 5 , is the number of reads supporting the splicing event from D to A relative to the combined number of reads supporting splicing from D to any acceptor site A 0 . Right: the 3 0 -splicing index, 3 , is the number of reads supporting the splicing event from D to A relative to the combined number of reads supporting splicing from any donor site D 0 to A. The intron of interest is drawn thick