Evaluation of tools for long read RNA-seq splice-aware alignment

Abstract

Motivation: High-throughput sequencing has transformed the study of gene expression through RNA-seq, a technique now routinely used in various fields, such as genetic research and diagnostics. The advent of third-generation sequencing technologies providing significantly longer reads opens up new possibilities. However, the high error rates common to these technologies set new bioinformatics challenges for the gapped alignment of reads to their genomic origin. In this study, we have explored how currently available RNA-seq splice-aware alignment tools cope with increased read lengths and error rates. All tested tools were initially developed for short NGS reads, but some have claimed support for long Pacific Biosciences (PacBio) or even Oxford Nanopore Technologies (ONT) MinION reads.

Results: The tools were tested on synthetic and real datasets from two technologies (PacBio and ONT MinION). Alignment quality and resource usage were compared across different aligners. The effect of error correction of long reads was explored, both using self-correction and correction with an external short-read dataset. A tool was developed for evaluating RNA-seq alignment results. This tool can be used to compare the alignment of simulated reads to their genomic origin, or to compare the alignment of real reads to a set of annotated transcripts. Our tests show that while some RNA-seq aligners were unable to cope with long error-prone reads, others produced overall good results. We further show that alignment accuracy can be improved using error-corrected reads.

Availability and implementation: https://github.com/kkrizanovic/RNAseqEval, https://figshare.com/projects/RNAseq_benchmark/24391

Supplementary information: Supplementary data are available at Bioinformatics online.

The table has four entries for dataset 6, representing different results of error correction of dataset 5. Error correction was performed twice: once using Illumina reads from dataset 0, and once using self-correction. During the error correction process, only some of the reads are successfully corrected, resulting in a more accurate but smaller dataset. To make the corrected dataset comparable to the original one, error-corrected reads were joined with the original reads that were not corrected.
- 6a1 – Error correction using Illumina reads, dataset containing corrected reads only
- 6a2 – Error correction using Illumina reads, dataset containing all reads
- 6b1 – Self-correction, dataset containing corrected reads only
- 6b2 – Self-correction, dataset containing all reads

When looking only at corrected reads (datasets 6a1 and 6b1), self-correction and correction using Illumina reads have the same mean match rate. However, self-correction managed to correct a much larger number of reads, so when joined with uncorrected reads (dataset 6b2) it has a higher mean match rate than correction using Illumina reads. Because of the higher mean match rate, and consequently lower error rate, the self-corrected dataset containing both corrected and uncorrected reads (6b2) was chosen for testing.
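The joining step described above can be sketched as follows. This is a minimal illustration only: the FASTA parsing helper, the in-memory dict representation and the matching of reads by ID are assumptions, not the actual RNAseqEval implementation.

```python
def read_fasta(path):
    """Parse a FASTA file into an {id: sequence} dict.
    (Helper assumed for illustration; real pipelines may use FASTQ.)"""
    seqs, name = {}, None
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]
                seqs[name] = []
            elif name:
                seqs[name].append(line)
    return {k: "".join(v) for k, v in seqs.items()}

def join_corrected(original, corrected):
    """Prefer the error-corrected copy of each read; keep the
    original sequence for reads the corrector dropped."""
    merged = dict(original)
    merged.update(corrected)
    return merged
```

The merged dict then has the same number of reads as the original dataset, which is what makes datasets 6a2 and 6b2 directly comparable to dataset 5.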

Supplement table 2.
Aligner evaluation on synthetic datasets – precision. Table 3 in the main part of the paper displays results as a percentage of the total number of reads, representing a recall value for each measure. Here, the same measurements are displayed as a percentage of the number of aligned reads, thus representing a precision value.
Looking at recall values, STAR showed significantly worse results than BBMap and GMAP; looking at precision values, however, STAR's results are comparable to, and often better than, those of the other two aligners.
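The relationship between the two views can be made explicit. A minimal sketch (the function name and the example counts are illustrative, not taken from the paper's results):

```python
def recall_precision(n_total, n_aligned, n_correct):
    """Recall measures correct alignments against all reads;
    precision measures them against aligned reads only."""
    recall = n_correct / n_total
    precision = n_correct / n_aligned if n_aligned else 0.0
    return recall, precision
```

An aligner that aligns only a fraction of the reads but aligns them well, as STAR does here, can therefore score low on recall yet high on precision.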
The statistics for all reads are displayed as a percentage of aligned reads, and the statistics for split reads as a percentage of aligned split reads. The table shows the percentage of reads that were aligned (without assessing the accuracy), the match rate of aligned reads, the percentage of reads for which the beginning, the end and all inner exon boundaries are accurately aligned (correct), the percentage of reads that overlap all exons of the read origin (hit all) and the percentage of reads that overlap at least one exon of the read origin (hit one). All metrics are calculated with an allowed error of 5 base pairs. Looking at recall values, GMAP clearly displayed the best results of the three aligners tested in detail; looking at precision values, the best aligner is not apparent: BBMap manages to hit more exons, but GMAP better aligns spliced reads.
The table shows the percentage of reads that were aligned (without assessing the accuracy), the percentage of reads that overlap at least one exon (exon hit) and the percentage of reads that overlap one or more exons in a sequence corresponding to a gene annotation (contiguous exon alignment). Exon hit and contiguous exon alignment values are displayed both as a percentage of the total number of reads (recall) and as a percentage of the number of aligned reads (precision). The table also shows the number of expressed genes and the average match rate of aligned reads. Match rate is calculated as the percentage of aligned bases that are equal to the corresponding bases on the reference.

Synthetic datasets were created from the following organisms:
- Saccharomyces cerevisiae S288C (baker's yeast)
- Drosophila melanogaster r6 (fruit fly)
- Homo sapiens GRCh38.p7 (human)

Reference genomes for all organisms were downloaded from http://www.ncbi.nlm.nih.gov.
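As a concrete illustration of the match rate definition above, a minimal sketch; the input representation (a list of aligned base pairs with "-" for gaps) is an assumption made for illustration, not how the evaluation tool reads alignments:

```python
def match_rate(pairs):
    """Match rate: percentage of aligned bases that are equal to
    the corresponding bases on the reference. `pairs` is a list
    of (read_base, ref_base) tuples; "-" marks a gap, and gapped
    positions are excluded since they are not aligned bases."""
    aligned = [(q, r) for q, r in pairs if q != "-" and r != "-"]
    if not aligned:
        return 0.0
    matches = sum(q == r for q, r in aligned)
    return 100.0 * matches / len(aligned)
```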

PBSIM is a genomic read simulator, taking as input a reference sequence and a set of simulation parameters (e.g., coverage, read length, error profile). To generate RNA-seq reads, PBSIM was applied to a set of transcripts generated from a particular genome using gene annotations downloaded from https://genome.ucsc.edu/cgi-bin/hgTables. To make the datasets as realistic as possible, real datasets were analyzed and used to determine simulation parameters. Real gene expression datasets were used to select a set of transcripts for simulation (downloaded from http://bowtie-bio.sourceforge.net/recount/; the core (human), nagalakshmi (yeast) and modencodefly (fruit fly) datasets were used).
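Generating transcript sequences from a genome and its annotations can be sketched as follows, assuming 0-based half-open exon coordinates; the function names and the genome-as-dict representation are illustrative assumptions, not taken from the paper's pipeline:

```python
def revcomp(seq):
    """Reverse-complement a DNA sequence."""
    comp = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(comp)[::-1]

def transcript_seq(genome, chrom, exons, strand="+"):
    """Splice a transcript out of a genome ({chrom: sequence} dict)
    given 0-based half-open exon coordinates sorted along the
    chromosome; minus-strand transcripts are reverse-complemented."""
    seq = "".join(genome[chrom][start:end] for start, end in exons)
    return revcomp(seq) if strand == "-" else seq
```

The resulting transcript sequences can then be fed to PBSIM as if they were small reference sequences, which is what turns a genomic read simulator into an RNA-seq read simulator.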

Simulated data preparation
Simulated datasets were generated using the following workflow:
1. Analyze real datasets to determine error profiles.
2. Filter annotations (keep only primary assembly information) and unify chromosome names.
3. Separate annotations for genes with one isoform and genes with alternative splicing, keeping up to 3 isoforms randomly for each gene with alternative splicing.
4. Generate a transcriptome from processed annotations and a reference genome.
5. Analyze gene expression data and determine gene coverage histogram (see Figure).
6. Approximate gene coverage histogram with 3 points to determine coverage and number of genes in the simulated dataset (see Figure). Scale coverages proportionally down to make a smaller dataset, more suitable for testing.
7. Extract 6 subsets of sequences from the generated transcriptome, 3 for genes with single splicing and 3 for genes with alternative splicing. Each set contains a number of transcripts corresponding to the number of genes from the previous step.
8. Using PBSIM, simulate reads on each generated subset of the transcriptome, using coverages determined in step 6 and error profiles determined in step 1.
9. Combine generated reads into a single generated dataset.
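Step 6 of the workflow can be sketched as follows. The exact binning used in the paper is not specified, so splitting the sorted coverages into equally sized expression bins is an assumption made purely for illustration:

```python
def three_point_approximation(coverages, n_points=3):
    """Approximate a gene coverage distribution with a few
    (mean_coverage, gene_count) points by splitting the sorted
    per-gene coverages into equally sized bins and taking each
    bin's mean coverage. One plausible realization of step 6."""
    cov = sorted(coverages)
    size = len(cov) // n_points
    points = []
    for i in range(n_points):
        lo = i * size
        hi = len(cov) if i == n_points - 1 else lo + size
        bin_cov = cov[lo:hi]
        points.append((sum(bin_cov) / len(bin_cov), len(bin_cov)))
    return points
```

Each returned point then directly supplies PBSIM with one (coverage, number of genes) pair for a transcriptome subset.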
For simplicity, we rounded the coverage and number of genes from each transcriptome subset. For example, the table below shows the numbers used to generate dataset 2 (D. melanogaster). The annotation includes roughly 23,000 genes with a single isoform and 3,000 genes with alternative splicing. Rounding up the ratio, we decided to simulate 1/10 of the genes with alternative splicing and 9/10 without. We considered that each gene undergoing alternative splicing gave rise to three different isoforms with equal expression.
For the simulation of PacBio reads, PBSIM parameters (read length, error probability by type, etc.) were set to match those of dataset 5, containing reads of insert (see Supplement table 1).
For the simulation of MinION ONT reads, PBSIM parameters (read length, error probability by type, etc.) were set to match those for MinION reads from an R9 chemistry dataset obtained from the Loman lab website (http://lab.loman.net/2016/07/30/nanopore-r9-data-release). Only 2d read statistics were used.

Figure. Data preparation step 6: approximating gene expression with three points, applied to dataset 2. Three points were chosen as a compromise between achieving a simple simulation and realistic datasets.

Supplement note 2. Commands used to run each tested tool
All tested tools were run with default parameters, except STAR.

BBMap:
BBMap will create an index if it does not already exist. It was run with default parameters.

TopHat2:
First, an index was created using the script bowtie2-build.
Then, TopHat2 was run with default parameters.

Hisat2:
First, an index was created using the script hisat2-build.
Then, Hisat2 was run with default parameters.

GMAP:
First, an index was created using the script gmap_build.
Then, GMAP was run with default parameters.
On baseline Illumina dataset 0.1, GSNAP was run in addition to GMAP, also with default parameters.

Supplement note 3. Long read low error dataset
To better investigate the poor performance of some RNA mapping tools, we created an additional synthetic dataset containing long reads, but very few errors. It was used to determine which tools have trouble with longer reads and which with a higher error rate.
Simulation parameters used with PBSIM are given below.

Supplement note 5. Description of metrics used for the evaluation
Due to the nature of the data, different metrics were used to evaluate mapping tools on real and on synthetic datasets. For simulated data, the exact origin of each read is known, and the corresponding alignment can be compared to it to examine how closely it matches. Read origins are determined from the simulation files generated by the simulation tool (PBSIM). For real data, the origins of reads are unknown, so reads are evaluated by comparing them to a set of known annotations to find the annotation each read matches best. The mapping quality is then calculated by comparing an alignment to that annotation.

Synthetic dataset evaluation.
The results of aligner evaluation on synthetic datasets are given in Table 3 in the main manuscript. For each aligner and dataset, the following metrics are shown:
• Aligned – the percentage of reads for which the aligner produced an alignment, without examining whether the alignment is correct.
• Match rate – the percentage of aligned bases that are equal to the corresponding bases on the reference. This is calculated across all alignments.
• Correct – the percentage of reads for which the beginning and the end match the beginning and the end of the read origin, and for which the alignment also matches internal exon boundaries. To be considered correct, the read beginning and end, as well as exon boundaries, must be within 5 bases of the origin. This metric enumerates reads that are considered correctly mapped.
• Hit all – the percentage of reads that overlap all exons from the read origin. Each overlap must be at least 5 bases to be considered valid.
• Hit one – the percentage of reads that overlap at least one exon from the read origin. Each overlap must be at least 5 bases to be considered valid.
• Split reads – the number of reads whose origin contains more than one exon.
• "Correct, split", "Split hit all" and "Split hit one" – the same as the metrics defined above, but calculated only for split reads.
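The boundary-based metrics above can be sketched as follows; the function names and the representation of exons as (start, end) pairs are illustrative assumptions, not the evaluation tool's actual interface:

```python
def is_correct(aln_exons, true_exons, tol=5):
    """'Correct' metric: the alignment's start, end and all
    internal exon boundaries must fall within `tol` bases of the
    read's true origin. Exons are (start, end) pairs sorted along
    the reference."""
    if len(aln_exons) != len(true_exons):
        return False
    # Flatten to a boundary list: start1, end1, start2, end2, ...
    aln = [x for exon in aln_exons for x in exon]
    true = [x for exon in true_exons for x in exon]
    return all(abs(a - t) <= tol for a, t in zip(aln, true))

def hit_all(aln_exons, true_exons, min_overlap=5):
    """'Hit all' metric: every exon of the read origin must be
    overlapped by at least `min_overlap` bases of the alignment."""
    def overlap(a, b):
        return max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return all(any(overlap(t, a) >= min_overlap for a in aln_exons)
               for t in true_exons)
```

Replacing `all` with `any` in `hit_all` yields the "Hit one" metric, so the three overlap-based metrics share the same machinery.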