MITIE: Simultaneous RNA-Seq-based transcript identification and quantification in multiple samples

Motivation: High-throughput sequencing of mRNA (RNA-Seq) has led to tremendous improvements in the detection of expressed genes and reconstruction of RNA transcripts. However, the extensive dynamic range of gene expression, technical limitations and biases, as well as the observed complexity of the transcriptional landscape, pose profound computational challenges for transcriptome reconstruction. Results: We present the novel framework MITIE (Mixed Integer Transcript IdEntification) for simultaneous transcript reconstruction and quantification. We define a likelihood function based on the negative binomial distribution, use a regularization approach to select a few transcripts collectively explaining the observed read data and show how to find the optimal solution using Mixed Integer Programming. MITIE can (i) take advantage of known transcripts, (ii) reconstruct and quantify transcripts simultaneously in multiple samples, and (iii) resolve the location of multi-mapping reads. It is designed for genome- and assembly-based transcriptome reconstruction. We present an extensive study based on realistic simulated RNA-Seq data. When compared with state-of-the-art approaches, MITIE proves to be significantly more sensitive and overall more accurate. Moreover, MITIE yields substantial performance gains when used with multiple samples. We applied our system to 38 Drosophila melanogaster modENCODE RNA-Seq libraries and estimated the sensitivity of reconstructing omitted transcript annotations and the specificity with respect to annotated transcripts. Our results corroborate that a well-motivated objective paired with appropriate optimization techniques lead to significant improvements over the state-of-the-art in transcriptome reconstruction. Availability: MITIE is implemented in C++ and is available from http://bioweb.me/mitie under the GPL license. Contact: Jonas_Behr@web.de and raetsch@cbio.mskcc.org Supplementary information: Supplementary data are available at Bioinformatics online.


Suppl. Section A.1 Read simulation
Based on the genes structure generated as described in Suppl. Section C, we randomly selected 1000 genes from chromosome II with more than three transcripts. The reads were simulated using the flux simulator (Griebel et al., 2012). The gene expression values cg (sum over transcript mRNA copy numbers) for gene g was given by Where m = 10 8 , x1 = 2, 000 and xg is randomly chosen without replacement from the natural numbers from 1 to x1. Therefore, cg values range from 60 to 9, 980 with an average of 567. Gene expression values were then distributed over the transcripts based on a stick breaking process: We randomly permuted the transcripts and the assigned a fraction δ1 of the total mass to the first transcripts, where δ1 is uniformly distributed between zero and one. Figure A shows the distribution of average abundance values we obtain with this strategy in comparison to the distribution of relative abundance values measured on the drosophila modENCODE data set using Cufflinks.
We iterate by assigning a fraction of δi of the remaining mass to the i-th transcript. Read length was set to 75bp, library preparation simulation parameters were chosen to be "random priming" and "chemical fragmentation".
For each of the five samples we obtain approximately 2.85 million fragments for 1000 genes, corresponding to about 57 million read for the whole human genome (assuming 20,000 expressed genes). We note that this is in the same order of magnitude as the D. melanogaster data set comprising 550 million reads in total. The gene structures and simulated reads are available from the MITIE website (www.bioweb. me/mitie).
We simulated sequencing errors by estimating an error model based on an Illumina sequencing run (HepG2 Encode cell lines, ENCODE Project Consortium et al., 2012) The error model computes a mutation probability based on read quality scores, while error positions are assumed to be independent. A set of read quality strings was sampled from the same Illumina run and randomly assigned to a read. Thus, we obtain a read error distribution similar to a given Illumina run. This strategy is implemented in the Palmapper package (De Bona et al., 2008;Jean et al., 2010).

SUPPL. SECTION B SPLICING GRAPH GENERATION
We generate the splicing graphs in four major steps from aligned RNA-Seq reads: 1. Genomic region identification: For genes where at least one transcript was known we used genomic regions starting 50000 bases upstream of the transcript start to 50000 based downstream of the transcript end to account for potentially significantly longer transcript.
We then trimmed the regions if we found a gap in read coverage of more than 100 bases that was also not overlapped by spliced reads. Other parts of the genome were segmented into regions based on a map adding per position coverage, number of spliced reads spanning a position and number of read-pairs spanning a position. Whenever the value of this map exceeds a user defined threshold (default 2) we call a region. We join neighboring regions with a distance of less than 50 bases and finally, we discard regions with fewer than a user defined number of reads (default 50).
2. Segment identification: Given a genomic region, we construct splicing graphs by generating a list of segment boundaries. Boundaries are either splice sites (SS), potential transcription start sites (TSS) and termination sites (TTS). Potential SS positions can originate from spliced reads or annotated transcripts. Analogously, TSS and TTS sites can stem from annotated transcripts or from potential transcript start and end positions inferred from RNA-Seq coverage. We identify possible start and end positions as a) drop of the read coverage to zero or b) steep drops in read coverage. The latter we find by applying a statistical test as follows. For each segment, we use a sliding window of length 60 and compare the number of read starts (ends) in the first half of the window to the corresponding number in the second half of the window in case of TSS (TTS). We apply a binomial test on the obtained counts and call a TSS/TTS site, if the p-value is smaller than 10 −4 .

Exon identification:
We keep segments that a) have more than 5% of their nucleotides covered, b) are part of annotated transcripts, or c) if the removal of segment s does not leave any path between two segments connected by paired-end reads (if available).

Intron identification:
We connect segments based on spliced reads and annotated introns.
See Figure 1 for more details.

SUPPL. SECTION C NUMBER OF PATHS IN HUMAN SEGMENT GRAPH
We merged the four human genome annotations (Ensembl, HAVANNA, ENCODE, Vega, see Harrow et al., 2006;Coffey et al., 2011;Flicek et al., 2012;ENCODE Project Consortium et al., 2012) by first creating the union of all transcripts for each gene. We then merged transcripts having identical splice structure. We consolidated minor deviations in transcript ends by using the mean on the original transcript ends in the resulting transcript. We generated a splicing graph and obtained additional transcript features by integrating evidence from two RNA-Seq libraries for cell lines HepG2 (wgEncodeCshlLongRnaSeqHepg2CellLongnonpolyaAlnRep2.bam) and K562 (wgEncodeCshlLongRnaSeqK562CellPapAlnRep1.bam) from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/ encodeDCC/wgEncodeCshlLongRnaSeq/ ENCODE Project Consortium et al. (2012). The integration was performed using the splice graph generation strategy described in Section 3.1.

SUPPL. SECTION D READ COUNT VARIABILITY WITHIN TRANSCRIPTS
To estimate the variability of read counts falling into segments we investigated human annotated single transcript genes using one RNA-Seq library from the ENCODE project (library for cell line K562 (wgEncodeCshlLongRnaSeqK562CellPapAlnRep1.bam) from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlLongRnaSeq/) ENCODE Project Consortium et al. (2012). We counted read starts falling into 20 randomly selected segments for each transcript and computed mean and variance. Figure C shows a scatter plot of mean versus variance. As discussed in the main manuscript we model the relationship of mean and 10 -2 10 -1 10 0 10 1 10 2 10 3 10 4 10 5

MITIE: Transcript Identification and Quantification in Multiple Samples
Fig. C. Relationship of mean and variance of read starts for randomly selected segments in 10,820 single transcript genes (human hg19). We counted read starts for up to 20 non-overlapping regions of length 30nt and computed mean and variance of read start counts. In red we show the weighted least squares fit.
variance with σ 2 = (1 + η1)µ + η2µ 2 . We estimate parameters η1 and η2 by computing a weighted least squares fit, where the standard least squares is modified using a gaussian distributed weighting (mean zero and standard deviation 5,000) to increase robustness. For η1 = 0 we find that η2 = 0.42 gives the best fit for the observed data (see Figure C red dots). We repeated the analysis for one sample obtained from the TCGA RNA-Seq data collection (https://wiki.nci.nih.gov/display/TCGA/RNASeq+Data+Format+Specification TCGA-DK-A3IM-01A-11R-A20F-07.d0cbd85a-e244-4f99-9adf-71f7afa5f6ec aligned with Star aligner (Dobin et al., 2012)) and obtained an optimal value of η2 = 0.50. We conclude that we consistently observe a significantly higher variability between segments in the same sample than previously observed for the same segment in different samples (cf. Anders and Huber (2010) and Drewe et al. (2012), where values for η2 ranging from 0.1 to 0.2 were estimated). For the data sets simulated using the Flux Simulator we estimate η2 = 0.38.

SUPPL. SECTION E RUNTIME ESTIMATION FOR THE D. MELANOGASTER MODENCODE DATA SET
We performed predictions for regions overlapping with 2000 evaluation genes. These genes have at least two expressed transcripts with different splice structure. We used Cufflinks to quantify transcripts in order not to bias the selection towards MITIE. Using the runtime measurements of the core optimization problem (excluding data processing steps largely depending on the file system and network speed) for those 2000 genes we estimate the total runtime by correcting for the complexity of those examples. We take the number of annotated transcripts as a proxy for the complexity and compute the α-trimmed average runtime for genes with k annotated transcripts. The α-trimmed mean is the mean obtained after discarding the values below the α and above the 100 − α-percentile. We use α = 1%. This is justified since we performed computations on a subset of all genes in order to be able to compute the optimal solution in all cases. However, the very expensive cases are cases where many solutions are plausible. We observed a significant lower performance for these cases and therefore in practice it is plausible to stop computations earlier and stay with a solution, where e.g. we allow only one or zero transcripts in addition to the annotated transcripts. The loss in performance is then at most α and likely much lower. Since there are no genes with less than two transcripts in the set we use the average runtime for genes with two transcripts also for genes with one transcript. We then multiply the number of genes with k transcripts in the annotation with the average runtime observed on this subset and integrate over k to estimate the runtime for a genome wide prediction. We obtain 19. 33, 560.61, 873.39, 1189.58, 1414.62, 1184.68, and 1206.07 CPU hours for one to seven samples, respectively. For Cufflinks we recorded the CPU-time spend in the assemble bundle method in the cufflinks.cpp file using the boost::chrono library. We recorded the time for each bundle and stored it with bundle start and stop coordinates in a separate output file.

SUPPL. SECTION F MITIE+MMO: JOINT-OPTIMIZATION OF TRANSCRIPT ABUNDANCE, STRUCTURE AND MULTIPLE MAPPING LOCATIONS
Given the transcript structures and abundances, we can compute the expected read coverage for segments and exon-exon junctions (C exp r and I exp r ; see Section 3.2). For each alignment, the overlapping intronic and exonic segments are determined. For each segment, the mean coverage is computed with and without the respective alignment. The decision where to finally place, i.e., select, the alignment of the read is made using the NB-loss function (see Section 3.4). It computes the loss between the observed mean coverages (with and without the currently considered alignment) and the expected coverage for the current location (see Section 3.2). For each read, all pairs of possible mapping locations are evaluated computing the loss for putting the alignment to location A and not to location B or vice versa. In each case the loss for both locations is added up to a total loss. The location with the smaller total loss is chosen. This procedure is repeated iteratively for all location pairs, for all multiple mapping reads (repeated up to five times). Furthermore, it is iterated with the core MITIE optimization to obtain updated transcripts and abundance estimations in a EM like fashion (repeated up to five times). For unstranded RNA-Seq protocols this strategy can also be utilized to optimize the strand assignment.

SUPPL. SECTION G TRANSCRIPT IDENTIFICATION WITH EXACT INFORMATION
The regions in Figure 3A summarize genomic positions sharing the same composition of overlapping transcripts. If we assume to know the number of expressed transcripts in advance (3 transcripts in this case), then we know that at least five of the unknowns are equal to zero. If we randomly select 5 unknowns and set them to zero, then this results in a system of four equations and three unknowns which is either infeasible or has exactly one solution. If it is infeasible, we know that the three remaining transcripts cannot explain the read coverage. Otherwise, we found one possible solution. We iterate this by setting all possible permutations of transcripts to zero and count the number of cases the corresponding system of equations has a solution.

SUPPL. SECTION H TRANSCRIPT PREDICTION WITH CUFFLINKS
Model selection for MITIE optimized the F-score on transcript level. Doing the same for Cufflinks results in sub-optimal predictions when samples are merged with Cuffmerge. Thus, the main text shows the Cufflinks+Cuffmerge combination, where the mean of sensitivity and specificity was optimized for Cufflinks. Figure D shows the sensitivity and specificity for all predictions also shown in Figure 5A and in addition the result for the F-score optimized version. We note that if we optimize the same criterion for Cufflinks and MITIE then MITIE predictions significantly outperform Cufflinks predictions in sensitivity as well as in specificity. Furthermore, we note that ranking methods based on the mean of sensitivity and specificity favors trivial and meaningless solutions. One example for such a trivial solution is to select only a single high confidence transcript prediction for the whole genome which if correct results in an mean(SN, SP) of 50%, but a Fscore close to zero. For each of the eight optimized Cufflinks parameters we tried seven values. Thus we performed 56 predictions for each optimization criterion (112 in total). Tested values were equally distributed in log-space from default value divided by five to default value times five. All optimized parameters are shown in Table A We combined the different Cufflinks predictions using Cuffmerge and also optimized the hyper-parameter "-min-isoform-fraction" of the Cuffmerge tool. Sensitivity and specificity of the predictions are shown in (Suppl. Figure D).
For the D. melanogaster data set we performed Cufflinks predictions with default parameters. The parameter values optimized for the human artificial data are not likely to perform better in this setting, since data as well as the evaluation criterion have changed. Performing the same model selection on genome wide data was computationally prohibitive.

SUPPL. SECTION I TRANSCRIPT PREDICTION BASED ON TRINITY GRAPHS
We parsed the graphs and read counts for all components from files "comp*.out" and collapsed linear portions of the graph. We then resolved cycles by removing self loops and cutting each larger cycle at the first node on the path from an initial segment which is also part of the cycle.  We then defined a total order of nodes ran the MITIE optimization. For simple cases with less than 9 paths we did not run the MITIE, but reported all possible paths instead.

SUPPL. SECTION J MODEL SELECTION
Following ideas from Snoek et al. (2012), we used Gaussian Processes (GP) to find optimal hyper-parameters for MITIE on the respective training sets. We employed the GP implementation provided by Rasmusen and Nickisch (2010). We trained a GP to predict the performance of our algorithm by randomly choosing initial parameter vectors and 20 training examples from the training set. As target values for the GP, we chose the F-score on transcript level. In each iteration, we randomly sampled parameter vectors and selected the one for evaluation. As selection criterion, we used the maximal upper confidence bound ucb: ucb = µy + γσy Where µy and σy are the mean and standard deviation of the predictive distribution of the GP. γ is a hyper-parameter of the model selection strategy and was chosen to be 2. We iterated until we had 100 data points. Finally, we selected the parameter combination with maximal lower confidence bound (lbc) to predict on the test set. The lbs can be computed as: The rationale behind this strategy is to explore the space and choose new parameter vectors that might lead to good performance (vectors with high predicted mean and high variance), but finally to select a vector with a high mean and low variance to achieve good performance with high confidence. We optimized the regularization parameters for (1) the number of transcripts (2) Figure 4A in the main manuscript, but including Cufflinks quantification results. We computed the Pearson correlation based on the lower confidence interval reported by Cufflinks, which has higher correlation to the true abundance than the estimated abundance value itself.
penalty term and (4) the exonic segment read count fit. We did not include parameters η1 and η2 into the model selection. We found that choices of 1.2 and 0.0 worked consistently well in all experiments.

Suppl. Section J.1 Transcript Evaluation
We evaluate against a set of annotated genes by first finding all transcripts overlapping with a given gene. We then compute a binary match score for each pair of transcripts according to the criteria described in Section 4.2.3. We then computed a maximal matching to find pairs of annotated and predicted transcripts such that the total number of correct matches is maximal. The number of matching pairs is taken as the number of true positive predictions, when computing sensitivity and specificity of the methods. Clearly, no annotated and no predicted transcript is allowed to be part of more than one such pair. All predicted transcripts not being part of a matching pair are counted as false positives and all annotated transcripts not being part of a matching pair are counted as false negatives.

SUPPL. SECTION K APPROXIMATION OF THE LOSS FUNCTION
Our optimization formulation is restricted to polynomials of degree two. Thus, we cannot directly employ the loss function L. Instead, we utilize a approximation of by fitting the this function with two polynomial of degree two. We fit L for expected values smaller than the observed value V separately from expected values larger than V . We minimize the squared error between the quadratic proxy function and the negative log likelihood function on a grid ranging from −10 to +10 standard deviations of the negative binomial with a step size of V /500 + 0.002. We obtain four coefficients llV (left linear), lqV (left quadratic), rlV (right linear) and rqV (right quadratic). The resulting loss function can be written as: We precompute the coefficients of l for values of V ranging from 1 to 30000. Between the positions of this grid we linearly interpolate the coefficients.

SUPPL. SECTION L CONSTRAINTS
Computing the expected segment count C exp equivalent to C exp s,t,r = Us,t × Wt 8 : We require the transcript abundance Wt,r to be zero for each sample r if transcript t does not have any segments: We sort the transcript in order to reduce the search space by eliminating equivalent solutions corresponding to permutations of transcripts.
The transcript indicator variables It = can be computed as: It is part of the objective function. We enforce all predicted introns to correspond to connections in the graph G = (S, I). This can be done by firstly enforcing that, if segment s is used in transcript t (i.e., Us,t > 0) any of the segments preceding s in the graph is used as well: Secondly, we need to make sure, that no intron (s, s2) is used which is not in G: Us,t + Us 2 ,t <= 1 + Since constraints (10) force any of the segments directly preceding s to be used, there is no need to exclude intron (s, s3) for any s3 > max({i|(s, i) ∈ I}). If s is a potential TSS or TTS site, do not force to use any connected upstream or downstream segment, respectively. Therefore, we have to exclude all invalid connections accordingly. Compute expected intron count C I,exp s 1 ,s 2 ,t ,. Determine if intron is used and if so let C I,exp s 1 ,s 2 ,t be equal to Wt: C I,exp s 1 ,s 2 ,t = Wt × Us 1 ,t × Us 2 ,t × This relationship can be expressed in terms of linear constraints as follows: Ui,t 8 For simplicity we discard the constant scaling factor cg in the following formulation. Replace C exp by C exp cg Paired-end information: Compute binary variable Ps 1 ,s 2 ,t indicating that segment pair (s1, s2) is part of a expressed transcript: Ps 1 ,s 2 ,t = Us 1 ,t * Us 2 ,t * It In terms of linear constraints this can be rewritten as: Ps 1 ,s 2 ,t ≤ 1/3(Us 1 ,t + Us 2 ,t + It) Ps 1 ,s 2 ,t ≥ Us 1 ,t + Us 2 ,t + It − 2 Compute binary variable P any s1,s2 indicating weather segments s1 and s2 confirmed by paired-end reads do not occur together in any transcript: Ps 1 ,s 2 ,t + 1 The term Ns 1 ,s 2 × P any s 1 ,s 2 is part of the objective function, where Ns 1 ,s 2 is the number of paired-end fragments supporting the connection between segments s1 and s2.