Benchmark of long non-coding RNA quantification for RNA sequencing of cancer samples

Abstract Background Long non-coding RNAs (lncRNAs) are emerging as important regulators of various biological processes. While many studies have exploited public resources such as RNA sequencing (RNA-Seq) data in The Cancer Genome Atlas to study lncRNAs in cancer, it is crucial to choose the optimal method for accurate expression quantification. Results In this study, we compared the performance of pseudoalignment methods Kallisto and Salmon, alignment-based transcript quantification method RSEM, and alignment-based gene quantification methods HTSeq and featureCounts, in combination with read aligners STAR, Subread, and HISAT2, in lncRNA quantification, by applying them to both un-stranded and stranded RNA-Seq datasets. Full transcriptome annotation, including protein-coding and non-coding RNAs, greatly improves the specificity of lncRNA expression quantification. Pseudoalignment methods and RSEM outperform HTSeq and featureCounts for lncRNA quantification at both sample- and gene-level comparison, regardless of RNA-Seq protocol type, choice of aligners, and transcriptome annotation. Pseudoalignment methods and RSEM detect more lncRNAs and correlate highly with simulated ground truth. On the contrary, HTSeq and featureCounts often underestimate lncRNA expression. Antisense lncRNAs are poorly quantified by alignment-based gene quantification methods, which can be improved using stranded protocols and pseudoalignment methods. Conclusions Considering the consistency with ground truth and computational resources, pseudoalignment methods Kallisto or Salmon in combination with full transcriptome annotation is our recommended strategy for RNA-Seq analysis for lncRNAs.


Background
Long non-coding RNAs (lncRNAs) are a diverse class of RNA molecules that are >200 nucleotides in length and do not encode proteins [1]. While functional classification is lacking for most lncRNAs, on the basis of their genomic proximity to proteincoding genes and the direction of transcription, lncRNAs are often classified into antisense, intronic, bidirectional, intergenic, or overlapping RNAs [1]. GENCODE, the database that provides annotations for human genes and transcripts, defines >14,000 human lncRNA genes (release 27, https://www.gencodegenes.o rg). Other lncRNA databases including NONCODE [2] and Mi-Transcriptome [3] both collect >60,000 lncRNAs. Compared with protein-coding genes, lncRNAs are shorter, lower-expressed, less evolutionarily conserved, and expressed in a more tissuespecific manner [4]. lncRNAs have recently emerged as an essential class of regulatory elements for many biological processes including imprinting, cell differentiation, and development [5]. They are often disrupted in human diseases including cancer [6].
They may interact with DNA, RNA, and proteins, and exert regulatory roles through a variety of mechanisms. Based on their molecular functions, lncRNA may act as (i) signals, which are indicators of transcriptional activity; (ii) decoys, which bind to and titrate away protein targets such as transcription factors; (iii) guides, which direct regulatory complexes or transcription factors to specific targets and regulate gene expression in cis or trans; and (iv) scaffolds, which serve as central platforms where relevant molecular components in cells are assembled [7].
lncRNAs have been shown to be important in the pathogenesis of human diseases, especially in cancer, and many cancer-relevant lncRNAs have been identified [8,9]. For example, Hox transcript antisense RNA (HOTAIR), one of the most well-characterized lncRNAs, promotes breast cancer metastasis through recruitment of Polycomb chromatin remodeling complex to silence the HOXD gene cluster [10]. In addition, HO-TAIR is overexpressed in breast, liver, lung, and pancreatic cancers [11]. CDKN2B-AS1, an antisense lncRNA encoded by the CDKN2B locus, epigenetically silences nearby tumor suppresser genes and promotes oncogenesis [12]. Telomerase RNA component (TERC), the critical RNA component of telomerase polymerase, serves as a template for the enzyme telomerase reverse transcriptase (TERT) to elongate telomeres. Variants and copy number changes at the TERC locus have been associated with cancer risk and progression [8]. The lncRNA LINC01106 is shown to be differentially expressed in multiple cancer types including lung adenocarcinoma and nasopharyngeal carcinoma [13,14]. Another lncRNA, LINC01123, is among the 5 most significantly up-regulated lncRNAs in intrahepatic cholangiocarcinoma [15].
The discovery of oncogenic and tumor suppressor lncRNAs has led to an increased interest in the investigation of lncR-NAs as potential cancer drug targets and biomarkers. Hence, it is critical to accurately determine lncRNA expression in cancer research. RNA sequencing (RNA-Seq) has been widely used for massive-parallel gene expression quantification. There have been many studies that explore lncRNA expression profile in cancer using publicly available RNA-Seq datasets such as those generated by The Cancer Genome Atlas (TCGA), which provide a rich source of lncRNA expression data in large cancer patient populations [16,17]. Among those studies, the analysis of the lncRNA expression profile of breast cancer samples in TCGA revealed different subtypes of breast cancer and subtype-specific overexpression of HOTAIR [16]. The analysis of 13 cancer types in TCGA revealed highly cancer site-specific lncRNA expression and dysregulation [17].
There are 2 types of RNA-Seq protocols, depending on whether strand specificity information of transcripts is retained in the library preparation step [18]. The standard protocol loses the information regarding which strand the original mRNA template is coming from, which makes it difficult to accurately determine gene expression from overlapping genes. The strandspecific RNA-Seq protocol, such as the deoxyuridine triphosphate (dUTP) method, retains strand origin of transcripts by degrading the second strand in the complementary DNA synthesis step. It has been shown to be more reliable in gene expression quantification and is recommended over the standard protocol [19]. However, the majority of TCGA samples were prepared with non-standard RNA-Seq protocol.
Multiple tools for processing RNA-Seq data have been developed in recent years. While some studies have benchmarked RNA-Seq analysis workflows [20,21], their focus has been primarily on protein-coding genes. There is no accepted gold standard pipeline yet that shows which method performs best to quantify expression of lncRNAs. As the interest in studying lncR-NAs in cancer grows, it is necessary to determine which algorithms perform best in lncRNA expression quantification because it is important to understand the differences and limitations of each of them and to follow the best practices of RNA-Seq analysis.
Because of the lower expression and different properties of lncRNAs with respect to protein-coding genes, we hypothesized that the processing and analysis of RNA-Seq data for lncRNA expression may be subjected to different technical biases and challenges, and that special considerations may be necessary to optimize the pipeline specifically for lncRNAs.
To investigate the performance of different methods on the quantification of lncRNAs as well as the effect of different RNA-Seq library preparation protocols, we applied 5 popular quantification methods, Kallisto [22], Salmon [23], RSEM [24], HTSeq [25], and featureCounts [26], on RNA-Seq samples prepared using a standard protocol (i.e., un-stranded) and a strand-specific protocol. Kallisto and Salmon are so-called pseudoalignment methods because they do not align sequencing reads to the reference genome; instead, they use an expectation maximization algorithm to iteratively assign reads to a set of compatible transcripts to obtain the estimated abundances for all transcripts. The alignment-free feature makes pseudoalignment methods much faster than alignment-based methods such as RSEM, HT-Seq, and featureCounts because the latter require mapping of the sequencing reads to the genome or transcriptome, which takes substantial time and computational resources. Among the alignment-based methods, RSEM aligns reads to the transcriptome using bowtie as the default aligner and obtains transcriptlevel expression, while HTSeq and featureCounts use genomealigned reads to obtain gene-level expression directly. We refer to RSEM as an "alignment-based transcript quantification method" and HTSeq and featureCounts as "alignment-based gene quantification methods." We used 3 aligners, STAR [27], Subread [28], and HISAT2 [29], to map the reads to the genome, before applying HTSeq and featureCounts to count the reads mapped to individual genes.

Data Description
Both un-stranded and reverse-stranded RNA-Seq data from TCGA samples were downloaded from the ISB Cancer Genomics Cloud. The other reverse-stranded dataset was downloaded from NCBI SRA under the accession PRJEB11797. Read quality control was performed with Trim galore [30], with the setting "-q 20 -stringency 3 -gzip -length 20 -paired." Afterwards the reads were mapped to the human transcriptome (both GEN-CODE and GENCODE combined with NONCODE) by STAR and were further processed by RSEM [24] (version 1.3.0) to obtain gene and transcript expression. Stand-specific option was set as "-forward-prob 0.5" for un-stranded samples and "-forwardprob 0" for reverse-stranded samples. RSEM and Polyester [31] were then used to generate 2 sets of simulated RNA-Seq reads. In RSEM simulation, RNA-Seq reads were generated with the command "rsem-simulate-reads," which takes as input abundance estimates, sequencing model parameters, and reference transcripts. The abundance estimates and sequencing model are obtained by running RSEM on the real datasets mentioned above. The total number of simulated reads for each sample is 60 million. The simulated reads were 50 bp (simulated from TCGA samples) or 100 bp (simulated from PRJEB11797 data) paired-end reads. The fragment length distribution is 178 ± 60 (mean ± sd) bp for TCGA samples and 155 ± 51 bp for the other dataset.
In Polyester estimation, RNA-Seq reads were generated with the command "simulate experiment countmat," which takes as input the count matrix of transcripts obtained from the real datasets. Both un-stranded and strand-specific RNA-Seq reads were generated in RSEM and Polyester simulation. The 2 sets of simulated samples with pre-defined gene expression levels serve as the "ground truth" for the evaluation of other pipelines.

Full transcriptome annotation improves the specificity of RNA quantification
We used RSEM [24] to simulate RNA-Seq reads based on 3 RNA-Seq datasets: (i) 100 un-stranded samples from 10 cancer types in TCGA, (ii) 40 reverse-stranded samples in TCGA, and (iii) 62 reverse-stranded samples from a study of Barrett esophagus and esophageal adenocarcinoma (PRJEB11797) [32]. To evaluate the effect that different transcriptome annotations has on the quantification of gene expression, we built 3 transcriptome annotation sets: (i) full annotation with all 58,288 genes in GENCODE release 27, (ii) partial annotation containing only the 19,836 protein-coding genes, and (iii) partial annotation with only the 14,168 lncRNAs (Additional File 1).
Using the lncRNA-only annotation overestimates lncRNA expression compared to full annotation ( Fig. 1, Additional File 2). The overestimation effect using an incomplete transcriptome annotation set can be observed for all the methods when using either un-stranded or reverse-stranded RNA-Seq libraries, although the effect is less drastic for alignment-based methods when using reverse-stranded libraries. The effect of incomplete transcriptome annotation is less obvious for proteincoding genes, but there is still a slight increase of the percentage of expressed genes when using only protein-coding annotation, compared to full annotation (Additional Files 2 and 3). Thus, using a full annotation improves the specificity of RNA quantification; therefore, it was used in the following analysis.

Pseudoalignment methods and RSEM outperform HTSeq and featureCounts for lncRNA expression quantification
Pseudoalignment methods detect expression of more genes than alignment-based methods ( Fig. 2A, Additional File 4). The average percentage of expressed lncRNAs (fragments per kilobase million [FPKM] ≥ 1 in ground truth) in the simulated ground truth ranges between 4.7% and 7.4% for the 3 RNA-Seq datasets, which is very close to the output of Kallisto and Salmon. The alignment-based methods detect fewer lncRNAs compared to the ground truth, especially for un-stranded samples.
The performance of each method was further evaluated at both sample level (Fig. 2B) and gene level (Fig. 2C). For samplelevel evaluation, only expressed lncRNAs (FPKM > 1 in the ground truth) were kept in each sample. The concordance of each method with the ground truth was measured by means of Spearman's correlation, Euclidean distance, median percent error, and linear regression. Gene expression from Kallisto and Salmon yields the highest Spearman's correlation, the lowest Euclidean distance, and the lowest median percent error with respect to the ground truth. The 2 pseudoalignment methods also have the highest level of fitness to the ground truth, in terms of the lowest mean squared error, the highest adjusted R 2 value, and a slope value of close to 1 (Fig. 2B, Additional File 5). A similar trend can also be observed for protein-coding genes in GEN-CODE (Additional File 6A). For gene-level evaluation, a comparison was performed using, for each corresponding dataset, only those lncRNAs with median FPKM > 1 in the ground truth because genes with low read counts are likely to be noise and unlikely to yield reliable results. The number of lncRNAs examined ranges between 464 and 729 for the 3 RNA-Seq datasets. Kallisto and Salmon perform better than alignment-based methods in terms of higher Spearman's correlation, lower Euclidean distance and median percent error with respect to the ground truth, linear regression slope closer to 1, and higher adjusted R 2 value (Fig. 2C, Additional File 7). The fraction of genes for which the estimates are significantly different (percent error > 5%) from the ground truth is significantly larger in HTSeq and feature-Counts than pseudoalignment methods. A similar trend can also be observed for protein-coding genes in GENCODE (Additional File 8A).
Because RSEM cannot be assessed in an unbiased manner using RSEM-simulated datasets, we later used the Polyestersimulated datasets to include RSEM in the benchmark. We simulated 40 samples for both un-stranded and strand-specific protocols, and compared RSEM, pseudoalignemnt methods, and alignment-based gene quantification methods with ground truth. The performance of RSEM was similar to pseudoalignment methods and outperformed HTSeq and featureCounts, in terms of the percentage of expressed lncRNAs detected ( For each of the expressed lncRNAs in any of the 3 datasets, hierarchical clustering was performed to evaluate the similarity of each method's measurement to the ground truth and between each other (Fig. 4). Kallisto and Salmon often clustered together with the ground truth. In addition, the 3 feature-Counts pipelines (STAR+featureCounts, HISAT2+featureCounts, Subread+featureCounts) form another cluster, while pipelines using HTSeq loosely cluster together.
Next, we expanded our analysis and also included lncRNAs from NONCODE, a database collecting 172,216 transcripts from 96,308 lncRNA genes (version 5) [2]. We simulated RNA sequencing reads based on both GENCODE and NONCODE gene annotations and replicated the analysis for lncRNAs in NONCODE. Similar to the results from GENCODE annotation, the 2 pseudoalignment methods outperform alignment-based methods in both sample-level (Additional File 6B) and gene-level comparison (Additional File 8B).

Characteristics of expressed and discordant lncRNAs
Antisense and long intergenic non-coding RNAs (lincRNAs) are the 2 major types of lncRNAs. In un-stranded samples, the mean proportion of antisense lncRNAs in the expressed lncRNAs is 54%, which is much higher than the proportion of antisense lncRNAs in GENCODE (39%) and the expressed antisense lncR-NAs in reverse-stranded libraries (25-48%) (Fig. 5A). More than three-quarters of lncRNAs have only 1 isoform in GENCODE, while they only constitute approximately half of the expressed lncRNAs in the 3 datasets, indicating that lncRNAs with more isoforms are expressed at a higher percentage (Fig. 5B). In addition, shorter lncRNAs (≤1,000 nucleotides) and lncRNAs with 2 exons are expressed at a lower percentage, compared to the distribution in GENCODE (Additional File 9).
We further investigated the features of discordant lncR-NAs (Spearman's correlation <0.7 compared with respect to the ground truth), especially for alignment-based methods, because pseudoalignment methods are highly concordant with the Downloaded from https://academic.oup.com/gigascience/article-abstract/8/12/giz145/5663671 by guest on 20 December 2019 The expression profile of lncRNAs in 1 representative sample from each of the datasets was shown with violin plot (left) and scatter plot (right), which demonstrates the overestimation effect using lncRNA-only annotation compared with full annotation, in all 3 samples for both pseudoalignment and alignment-based methods. In the box plots, the top and bottom of the rectangle represent the third and the first quartiles. The band inside the rectangle is the second quartile (the median). The whiskers above and below the box show the upper and lower fences, which are 1.5 times interquartile range above the third quartile, or 1.5 times interquartile range below the first quartile, respectively. ground truth. In un-stranded samples, the majority of discordant lncRNAs are antisense (Fig. 5C, Additional File 10). Approximately 20-26% of expressed antisense lncRNAs are discordant, while only 7-10% of expressed lincRNA are discordant, indicating that antisense lncRNAs are more susceptible to misquan-tification from alignment-based methods in un-stranded samples. However, in reverse-stranded samples, the percentage of discordant antisense lncRNAs is <2%, whereas the percentage of discordant lincRNAs is still as high as 4-7% (Fig. 5D, Additional Files 10 and 11A). Therefore, compared to un-stranded RNA-Seq, Downloaded from https://academic.oup.com/gigascience/article-abstract/8/12/giz145/5663671 by guest on 20 December 2019  reversed-stranded protocols are better at the quantification of antisense lncRNAs. The inferior performance of lncRNA quantification in un-stranded samples is further reflected when comparing the number of transcripts, transcript length, number of exons, and sequence uniqueness of expressed and discordant lncRNAs among the 3 datasets (Figs 5C and D, Additional Files 11B-D and 12-15). The difference among the breakdown of these lncRNA features is largely due to inaccurate quantification of antisense lncRNAs in un-stranded samples (Additional File 16). For example, of the 102 discordant lncRNAs with only 1 isoform in un-stranded samples, three-quarters are antisense, and of the 105 discordant lncRNAs with high sequence uniqueness (>80% unique sequences), the majority are antisense. In both cases, the number discordant antisense lncRNAs is <3 in reverse-stranded samples. Nevertheless, lncRNAs with very low sequence uniqueness (<20% unique sequences) are quantified poorly in both unstranded and reverse-stranded samples. To summarize, antisense RNAs and lncRNAs with very low sequence uniqueness are quantified poorly by alignment-based methods, especially in un-stranded RNA-Seq samples.

Examples of concordant and discordant lncRNAs
To demonstrate the importance of accurate lncRNA expression quantification, we investigated the expression profile of a number of well-known lncRNAs with important functions in cancer: HOTAIR, CDKN2B-AS1, TERC, LINC01106, and LINC01123 (Fig. 6). HOTAIR and CDKN2B-AS1 are 2 examples in which all methods perform equally or similarly well, although the expression levels called by Kallisto and Salmon are closer to the ground truth. Note that this is not the case for the other 3 lncRNAs. TERC, involved in cancer progression and risk, was accurately called by Kallisto and Salmon mostly in reverse-stranded datasets, whereas alignment-based methods using featureCounts and HTSeq did not correctly pick up this lncRNA. Similarly, the lincRNA LINC01106, differentially expressed in lung adenocarcinoma and nasopharyngeal carcinoma, and LINC01123, differentially expressed in intrahepatic cholangiocarcinoma, also showed a similar pattern, in which their expression was called accurately by Kallisto and Salmon in both un-stranded and reverse-stranded samples but not by the other methods.

Discussion
In this work, we compared the performance of popular RNA-Seq processing pipelines for the quantification of gene expression. In particular, we focus on cancer samples and lncRNAs, which have not yet been studied thoroughly in previous RNA-Seq benchmarking studies. An increasing number of studies are using TCGA RNA-Seq data to study lncRNA expression profiles and identify potential lncRNA biomarkers [17,33]. These public resources provide rich opportunities for studying the expression and function of lncRNAs in cancer in a cost-effective way. It is thus critical to choose the right method for accurate expression quantification of lncRNAs.
The 2 pseudoalignment methods, Kallisto and Salmon, outperformed alignment-based gene quantification methods HT-Seq and featureCounts at both sample-level and gene-level comparison, regardless of the choice of library type (un-stranded vs reverse-stranded), aligners (STAR, Subread, HISAT2) or transcriptome annotation (GENCODE and NONCODE). Further evaluation of the methods, including RSEM, on datasets generated by Polyester showed that RSEM's performance was similar to that of Kallisto and Salmon. Pseudoalignment methods detected more lncRNAs in each sample, which is similar to those levels in the ground truth for the simulated datasets. They were also highly concordant with the ground truth in terms of having the highest Spearman's correlation and the lowest Euclidean distance, especially at sample-level comparisons. When each method was linearly regressed with the ground truth, almost all the points from Kallisto and Salmon fell on the diagonal line, with very few outliers (Additional File 5). The superior performance of Kallisto and Salmon could be observed for both un-stranded and reversestranded samples, for both lncRNAs and protein-coding genes in GENCODE. Furthermore, it also held true when different transcriptome annotation was used in the analysis, because a similar pattern was observed for lncRNAs in the analysis with GEN-CODE and NONCODE transcriptome annotation.
Because both Kallisto and Salmon performed highly concordantly with the ground truth, they were also highly concordant with each other, as previously reported [34,35]. They cluster together in the method similarity matrix before clustering with the ground truth (Fig. 4). However, Kallisto was faster than Salmon, used less memory (Fig. 7), and performed better at sample-and gene-level comparison when examining Spearman's correlation, Euclidean distance, mean squared error, and adjusted R 2 values as compared with the ground truth, especially for the 2 reverse-stranded datasets (Additional File 7).
On the contrary, alignment-based methods HTSeq and fea-tureCounts underestimated the expression of lncRNAs. They detected the expression of fewer lncRNA genes and they had far more discordant genes compared to the ground truth. In the simulated datasets, the expressed lncRNAs are mainly antisense and lincRNAs. There are more antisense lncRNAs expressed in the un-stranded samples, compared to the composition of lncRNA types in the GENCODE annotation (Fig. 5A). However, >20% of the expressed antisense lncRNAs in un-stranded samples are discordant, which is much higher than the discordance rate of expressed lincRNAs in the same dataset. This further confirms that un-stranded RNA-Seq protocols do not perform well for expression quantification for overlapping ge-   nomic features such as antisense lncRNAs. The quantification of antisense lncRNA expression is improved greatly in reversestranded samples, with <2% discordance rate of expressed antisense lncRNAs. The differences observed for the Spearman's correlation in the breakdown of lncRNA features (e.g., number of transcripts, transcript length, number of exons, etc.) in unstranded and reverse-stranded datasets can be largely ascribed to the features of the large numbers of discordant expressed antisense lncRNAs in the un-stranded dataset (Fig. 5, Additional File 16). Because antisense lncRNAs contribute to the majority of discordant lncRNAs in un-stranded samples, reverse-stranded protocol is recommended for future RNA-Seq experiments. Alternatively, if only un-stranded samples are available, it becomes vital to choose the right method such as Kallisto or Salmon for the analysis because these 2 pseudoalignment methods are seldom affected by the type of lncRNAs or the type of RNA-Seq protocols and have very few discordant lncRNAs, even for antisense lncRNAs, or lncRNAs with low sequence uniqueness.
When the comparison was restricted to only the alignmentbased gene quantification methods (Figs 2 and 3), featureCounts performed slightly better than HTSeq and also took much shorter CPU time (Fig. 7). Subread performed slightly better than the other 2 aligners for un-stranded samples.
It is worth noting that HTSeq is the default workflow for TCGA data stored in the Genomic Data Commons data portal. Because the majority of TCGA RNA-Seq samples were prepared with un-stranded protocol, it is recommended to use pseudoalignment methods for analysis. We have reprocessed the TCGA RNA-Seq datasets using Kallisto with GENCODE annotation (version 27). The results are deposited online for the wider research community studying lncRNAs in TCGA samples, and we also provide a web interface to investigate and visualize gene expression in these samples.
The comparison in this analysis was performed at the gene level. If the goal is to examine transcript-level expression, HT-Seq and featureCounts are not suitable for the purpose. They are developed explicitly for gene-level read counts. When they count the reads mapped to transcripts rather than genes, reads mapped to exons shared by several transcripts will then be considered ambiguous and discarded by default. Kallisto, Salmon, and RSEM are able to produce both transcript-and gene-level expression output.
We further addressed the problem of using partial transcriptome annotation (Fig. 1, Additional File 3). Full transcriptome annotation is always recommended for RNA-Seq analysis when it is available to improve accuracy. When studying organisms with poorly annotated transcriptome, it is advisable to assemble and reconstruct the transcriptome first. Furthermore, with the advent of long-read sequencing technologies, more novel transcripts are expected to be identified even for well-annotated transcriptome such as the human. Several methods have been developed for transcriptome assembly and reconstruction, including Cufflinks [36], Trinity [37], TransPS [38], and DRUT [39].
One limitation of this study is that the simulated datasets are based on polyA-selected RNA-Seq. It would be helpful to evaluate RNA-Seq methods that capture more lncRNAs. However, this study focuses on using existing RNA-seq datasets for profiling lncRNAs in cancer. More importantly, in cancer research most of the available datasets, including TCGA, were generated using polyA-selected RNA-Seq. Thus, our study still provides valuable guidelines for researchers studying lncRNAs in cancer. Another limitation is that only simulated data were used in the study. Simulated data may not capture the complexity of real data and true experimental variability. A more comprehensive approach to complement the simulated data with experimental data should be considered in further benchmark studies [40]. More vigorous evaluation using real expression data from other platforms and experimental validations such as reverse transcriptase PCR can be carried out in future.
Reads from the lncRNAs that cannot be quantified accurately by HTSeq and featureCounts are either aligned poorly to the genome, or they can be properly aligned, but HTSeq and fea-tureCounts cannot determine where to assign the reads because of overlapping annotation with other genes (Additional File 17). The superior performance of pseudoalignment methods and RSEM might be due to the expectation maximization algorithm that they deploy, which focuses on the difficulty of accurate quantification for reads that cannot be uniquely aligned to the genome or cannot be uniquely assigned to genes. An RNAseq experiment can be regarded as the statistical problem of random sampling of subsequences (i.e., reads) from spliced transcripts of different length. Several of these transcripts may share the same exact exons, bringing uncertainty to the reads drawn from those shared exons. Thus, it is important to properly model this statistical problem and to capture and resolve such uncertainty as accurately as possible. We speculate that pseudoalignment methods and RSEM methods are able to model this problem more accurately by iteratively assigning reads to a transcript or a set of transcripts with a certain probability. Other methods using an expectation maximization algorithm such as IsoEM [41] might achieve similar accuracy to pseudoalignment methods and RSEM. Furthermore, the speed improvement achieved by pseudoalignment rather than alignment-based methods (Fig. 7) allows the use of more robust statistical inference techniques such as bootstrapping.

Potential Implications
In summary, considering the consistency with the ground truth, flexibility at both gene-and transcript-level analysis, and the computational resource use, pseudoalignment methods Kallisto and Salmon are recommended for RNA-Seq analysis for lncR-NAs, with Kallisto performing slightly better than Salmon. The full transcriptome annotation including protein-coding genes, lncRNAs, and others is also the recommended strategy for RNA-Seq analysis.
The large amount of data produced by next-generation sequencing techniques has posed great challenges for fast and scalable data analysis. Our study findings imply that for RNA-Seq datasets, incorporating pseudoalignment methods into the analytical framework can achieve high accuracy with minimum computing requirements. Moreover, with more and more RNA-Seq datasets specifically for studying lncRNAs becoming available, our work also lays the basis for a more comprehensive evaluation of tools for lncRNA expression quantification.

Definitions
Here we clarify and define relevant terms. (i) Genes and transcripts: a transcript, sometimes also referred to as an isoform, is composed of exons. An exon is any part of a gene that will encode a part of the final mature RNA. A gene is a collection of transcripts. Transcripts of the same gene often share exons. In this study the analysis was performed at the gene level. (ii) Expressed genes: when describing a single sample, "expressed genes" refer to genes with FPKM ≥ 1 in that particular sample. When describing multiple samples in a dataset, "expressed genes" refer to the genes with median FPKM ≥ 1 across the cohort of samples. (iii) Discordant genes: these are defined as genes whose Spearman's correlation of FPKM values with the ground truth is <0.7, when compared across a cohort of samples.

Reference genome and transcriptome
Human transcriptome version GENCODE release 27 (GTF file and transcriptome fasta file) were downloaded from the GEN-CODE FTP site. The GENCODE release 27 collects 58,288 genes and 200,401 transcripts, among which are 19,836 protein-coding genes (Additional File 1). If lncRNA is defined as non-coding, 3 overlapping non-coding RNA, antisense, bidirectional promoter lncRNA, lincRNA, macro lncRNA, sense intronic, and sense overlapping, there are 14,168 lncRNAs. The primary assembly of human genome GRCh38 was also downloaded from the same site.
The NONCODE database [2] collecting 172,216 transcripts from 96,308 lncRNA genes (version 5) was downloaded from their website and merged with GENCODE to create a new set of transcriptome annotation. Both the GENCODE version and GEN-CODE combined with NONCODE version were used to analyze the datasets and simulate 2 sets of ground truth for comparison.
The indexes for Kallisto and Salmon were built using the transcriptome fasta file. The indexes for RSEM and STAR were built using transcriptome GTF file and GRCh38 genome sequences. The indexes for Subread and HISAT2 were built using GRCh38 genome sequences.
Gene features (number of transcripts, number of exons, transcript length) were generated from GTF file using in-house script. Unique k-mers of genes were generated using script from Computational Genomics Analysis and Training (CGAT).

Pipelines
Nine pipelines were applied to process the datasets, including 2 pseudoalignment methods, Kallisto and Salmon; RSEM with bowtie as the aligner; and a combination of read aligners (STAR, Subread, HISAT2) and quantification tools (HTSeq and feature-Counts).
Kallisto, version 0.44.0, quant mode. Default parameters were applied. Stand-specific option was set as "-rf-stranded" for reverse-stranded samples.
Salmon, version 0.9.1, quant mode. Default parameters were applied. Stand-specific option: "-l A." Kallisto and Salmon measure the expression level of each transcript by default. To get gene-level expression results, the package tximport [42] was used.
HTSeq, version 0.7.2. Mapped reads from STAR, Subread, and HISAT2 were counted for each gene according to the GTF file from either GENCODE or GENCODE+NONCODE annotations. Stand-specific option was set as "-s no" for un-stranded samples and "-s yes" for reverse-stranded samples.
featureCounts, version 1.6.1. Mapped reads from STAR, Subread, and HISAT2 were counted for each gene according to the GTF file from either GENCODE or GENCODE+NONCODE annotations. Stand-specific option was set as "-s 2" for reverse-stranded samples. HTSeq and featureCounts output read count for each gene. FPKM values were generated from read counts with inhouse scripts.
RSEM, version 1.3.0. The command "rsem-calculateexpression" was used together with bowtie aligner to obtain transcript-and gene-level quantification.
To compare the computational resources required by each tool, RNA-Seq reads from 5 original samples were chosen and processed with each tool, with the number of threads set at 4.

Sample-level and gene-level comparison
The gene expression measured by each tool was compared with ground truth at both sample level and gene level. In sample-level comparison, gene expression from a single sample was compared with ground truth for each method. In gene-level comparison, gene expression for a single gene across the samples was compared with ground truth for each method.

Clustering and heat maps
Hierarchical clustering was performed for the sample method matrix. Euclidean distance and average linkage were used for both the columns and rows. The clustering dendrogram was cut into 4 groups and the number of times that every 2 methods are clustered together was counted and used to construct a similarity matrix.

Statistical tests
The Mann-Whitney U test was used to test the difference between 2 continuous variables, and χ 2 test was used to test the difference of 2 ratios. Unless otherwise specified, all the tests have P-values <0.001; thus, they are not explicitly explained in the main text.