rnaSPAdes: a de novo transcriptome assembler and its application to RNA-Seq data

Abstract Background The possibility of generating large RNA-sequencing datasets has led to development of various reference-based and de novo transcriptome assemblers with their own strengths and limitations. While reference-based tools are widely used in various transcriptomic studies, their application is limited to the organisms with finished and well-annotated genomes. De novo transcriptome reconstruction from short reads remains an open challenging problem, which is complicated by the varying expression levels across different genes, alternative splicing, and paralogous genes. Results Herein we describe the novel transcriptome assembler rnaSPAdes, which has been developed on top of the SPAdes genome assembler and explores computational parallels between assembly of transcriptomes and single-cell genomes. We also present quality assessment reports for rnaSPAdes assemblies, compare it with modern transcriptome assembly tools using several evaluation approaches on various RNA-sequencing datasets, and briefly highlight strong and weak points of different assemblers. Conclusions Based on the performed comparison between different assembly methods, we infer that it is not possible to detect the absolute leader according to all quality metrics and all used datasets. However, rnaSPAdes typically outperforms other assemblers by such important property as the number of assembled genes and isoforms, and at the same time has higher accuracy statistics on average comparing to the closest competitors.


Background
While reference-based methods for RNA-Seq analysis [1,2,3,4,5,6] are widely used in transcriptome studies, they are subjected to the following constraints: (i) they are not applicable in the case when the genome is unknown, (ii) their performance deteriorates when the genome sequence or annotation are incomplete, and (iii) they may miss unusual transcripts such as fusion genes or genes with short unannotated exons. To address these constraints, de novo transcriptome assemblers [7,8,9,10,11] have emerged as a viable complement to the reference-based tools. Although de novo assemblers typically generate fewer complete transcripts than the reference-based methods for the organisms with accurate reference sequences [12], they may provide additional insights on aberrant transcripts.
While the transcriptome assembly may seem to be a simpler problem than the genome assembly, RNA-Seq assemblers have to address the complications arising from highly uneven read coverage depth caused by variations in gene expression levels. However, this is the same challenge that we have addressed while developing SPAdes assembler [13,14], which originally aimed at single-cell sequencing. Similarly to RNA-Seq, the Multiple Displacement Ampli cation (MDA) technique [15], used for genome ampli cation of single bacterial cells, results in a highly uneven read coverage. In the view of similarities between RNA-seq and single-cell genome assemblies, we decided to test SPAdes without any modi cations on transcriptomic data. Even though SPAdes is a genome assembler and was not optimized for RNA-seq data, in some cases it generated decent assemblies of quality comparable to the state-ofthe-art transcriptome assemblers.
To perform the benchmarking we have used rnaQUAST tool [16], which was designed for quality evaluation of de novo assemblies with the support of reference genome and its gene database. For the comparison, we selected a few representative metrics such as (i) total number of assembled transcripts (contigs), (ii) reference gene database coverage, (iii) number of 50% / 95%-assembled genes/isoforms, (iv) number of misassemblies and (v) duplication ratio. The detailed description for these metrics can be found in the Supplementary material. Table 1 demonstrates comparison between di erent assembly tools on publicly available Mouse RNA-Seq dataset. All transcriptome assemblers were launched with default parameters, SPAdes was run in single-cell mode due to the uneven coverage depth of RNA-Seq data. Table 1 shows that SPAdes generates more 50% / 95%-assembled genes than any other tool and has comparable gene database coverage. At the same time, SPAdes produces rather high number of misassembled transcripts, which can be explained by the fact that algorithms for genome assembly tend to assemble longer contigs and may incorrectly join sequences corresponding to di erent genes when working with RNA-Seq data. In addition, SPAdes generates the same number of 95%-assembled genes and isoforms, which emphasizes the lack of isoform detection step.
Benchmarking on other datasets also showed that SPAdes successfully deals with non-uniform coverage depth and produces relatively high number of 50% / 95%-assembled genes in most cases. However, it also con rmed the problem of large amount of erroneous transcripts as well as relatively low number of fully reconstructed alternative isoforms in SPAdes assemblies. Based on the obtained statistics we decided to adapt current SPAdes algorithms for RNA-Seq data with the goal to improve quality of generated assemblies and develop a new transcriptomic assembler called rnaSPAdes. In this manuscript we describe major pipeline modi cations as well as several algorithmic improvements introduced in rnaSPAdes that allow to avoid misassemblies and obtain sequences of alternatively spliced isoforms.
To perform su cient benchmarking of rnaSPAdes and other transcriptome assemblers mentioned above, we assembled several simulated and publicly available real RNA-Seq datasets from the organisms with various splicing complexity. For the generated assemblies we present quality assessment reports obtained with di erent de novo and reference-based evaluation approaches. In addition, based on these results we discuss strengths and disadvantages of various assembly tools and provide insights on their performance.

Data Description
To compare rnaSPAdes performance with the state-of-the-art transcriptome assemblers we selected 2 simulated and 6 real publicly available RNA-Seq datasets ( Table 2) with di erent (i) read length, (ii) library size, (iii) strand-speci city and (iv) organism splicing complexity. Simulated data was generated using RSEM simulator [1] based on real Human and Mouse datasets used in this study (the exact commands are provided in the Supplementary material section S3).
All downloaded public datasets were analysed using FastQC [17]. The reports showed that no dataset contains adapters or overrepresented sequences. Human large dataset was quality-trimmed using Trimmomatic [18] due to quality drop towards reads ends. All other datasets were assembled without additional preprocessing. Although 8 datasets used in this manuscript may not represent all kinds of transcriptomic data, they are su cient for comparing di erent assembly tools and detecting their strengths and disadvantages.
To make the results reproducible, we also provide software versions and command lines used in this study in the Supplementary material (section S3).
In addition to statistics provided by di erent tools, we decided to compute faction of 95%-assembled genes relative to the number of genes detected by a reference-based method. For this purpose we used genes assembled by kallisto [25] that have nucleotide coverage > 5. Coverage values were computed using estimated fragment counts. While it remains unclear how to select a proper coverage threshold for this experiment, number of genes/isoforms with coverage > 5 appeared to be the best upper bound estimate for most of the datasets (see Supplementary Table S15 for details). Using fraction of assembled genes instead of raw numbers allows to conveniently visualize the data in the same plot, compute average values across all datasets and, at the same time, estimate how well de novo assemblers perform relatively to the reference-based tool.

Evaluating assemblers on simulated data
To simulate RNA-Seq dataset we used RSEM simulator [1], which allows to generate reads based on the real RNA-Seq data. For this purpose we selected Human and Mouse datasets ( Table 2). Table 3 shows short quality assessment report for Human simulated data. Complete evaluation reports for both simulated dataset are presented in the Supplementary material (Tables S1 and S2). Table 3 shows that rnaSPAdes produces the highest number of 95%-assembled genes and isoforms, with Trinity and RNA-Bloom being the closest competitors. Trinity and RNA-Bloom also have the highest gene database coverage, while rnaSPAdes and Trans-ABySS are just slightly behind (1.5% di erence at most). However, both Trinity and RNA-Bloom seem to produce a lot of excessive sequences resulting in high duplication ratios (1.74 and 1.93 respectively), and Trinity also appears to be somewhat inaccurate in terms of misassembled sequences (5 times more than rnaSPAdes). Among the tools with high number of assembled genes and isoforms, Trans-ABySS and SOAPdenovo-Trans are the most accurate (126 and 198 misassemblies respectively), rnaSPAdes and RNA-Bloom follow with 309 and 358 of misassembled contigs respectively. Although IDBA also generates an accurate assembly (174 misassemblies), it appears to be fragmented (small number of 95%-assembled genes and isoforms). Although both BinPacker and Bridger produce comparable amount of assembled genes and isoforms, they have the largest number of misassemblies (> 3500). Bin-Packer also has the highest duplication ratio (2.19).
Since RSEM simulator provides read count for each particular gene, we also computed the number of assembled genes reported by rnaQUAST depending on their read coverage (Fig.  1). The gure demonstrates that rnaSPAdes, SPAdes and Trinity outperform other tools on low-abundant transcripts, with rnaSPAdes reaching the highest fraction of total 95%assembled genes (52.2%).

Evaluating assemblers on real RNA-Seq data
For comparison on real RNA-Seq reads we selected 4 nonstranded and 2 strand-speci c datasets (Table 2). Short report for Human assemblies is shown in Table 4, while complete reports for all data are presented in the Supplementary material (Tables S3-S8 respectively). In addition, we added BUSCO reports (Supplementary Figure S2) and presented various metrics as bar plots ( Figure 2, Supplementary Figures S3-S5). Table 4 indicates, that while all assemblies have more than ten thousand of 50%-assembled genes, amount of 95%assembled genes signi cantly di ers. RnaSPAdes, RNA-Bloom and Trinity are the best according to 95%-assembled genes and isoforms. Among these three assemblers, rnaSPAdes dominates with 16% and 31% more 95%-assembled genes than RNA-Bloom and Trinity respectively. Although both RNA-Bloom and Trinity have the highest database coverage, they also have very high duplication ratio (≥ 2). In addition, Trinity (along with BinPacker and Bridger) generate signi cant amount of misassemblies (> 5000). SOAPdenovo-Trans and Trans-ABySS produce accurate assemblies according to these parameters, but generate 2043 and 2250 fewer 95%-assembled genes than rnaSPAdes. IDBA-trans also has rather small number of misassembled contigs (1015), but outputs a very frag- mented assembly with the lowest number of 95%-assembled genes/isoforms. Figure 2 demonstrates fraction of 95%-assembled genes in all generated assemblies and mean values for each assembler across all datasets. RnaSPAdes, Trinity and RNA-Bloom show stable performance across di erent datasets and have the highest fraction of 95%-assembled genes on average (0.5, 0.438 and 0.406 respectively). While genomic SPAdes also has high value on average (0.397), it is mostly achieved by decent performance of simulated data. Figure 2 shows that the fraction of 95%-assembled genes for simulated datasets is typically higher comparing to the respective values for real data, most likely due to the absence of sequencing artifacts. Additionally, de novo assemblies of complex organisms, such as H. sapiens and M. musculus, tend to have lower fractions of assembled genes in comparison to C. elegans and A. thaliana. For example, Human large dataset has smaller values than the ones for Worm assemblies, although the later one have almost 3x lower coverage.

Computational performance
To compare selected assemblers in terms of computational performance, we measured their running time and RAM consumption on two largest datasets using system utilities rather than using log les. As Table 5 indicates, SOAPdenovo-Trans is at least 3 times faster than any other assembler and have one of the lowest memory requirements (less than 30 GB for both datasets). Trans-ABySS and rnaSPAdes have comparable performance, with rnaSPAdes being slightly faster and more greedy regarding RAM consumption. Other assemblers typically have longer running time (at least 2 times more than rnaSPAdes in most cases) and higher memory requirements. Among all tools, BinPacker, Bridger and Trinity have the the highest peak RAM, e.g. more than 100 GB of Arabidopsis dataset.

Discussion
Quality reports provided in this manuscript (Tables 3 and 4) and Supplementary material (Tables S1-S8, Figures S1-S5) contain large variety of metrics that re ect completely di erent assembly properties, importance of which may vary depending on the further analysis and the entire pipeline being used. We believe that one of the key features of the de novo transcriptome assembler is the ability to correctly capture the entire transcript into a single contig (e.g. re ected by the number of 95%-assembled genes/isoforms, contig recall). On the other hand, such metrics as gene database coverage, number of covered reference proteins and nucleotide recall do not re ect this signi cant property, since they account all contigs mapped to a speci c gene or protein and do not include assembly contiguity. For example, high database coverage or nucleotide recall can be achieved by a very fragmented assembly (or even raw reads), which, indeed, does not suit well for further referencefree analysis.
Below we attempt to summarize these results and highlight strong and weak points of di erent assemblers.

Comparison between SPAdes and rnaSPAdes
In comparison to the original version of SPAdes, rnaSPAdes dominates by the majority of metrics. More precisely, it has signi cantly better assembly completeness metrics: 26% higher average fraction of 95%-assembled genes, 18% larger database coverage, 30% higher contig recall reported by REF-EVAL and 18% more detected BUSCOs. It also shows 18% higher contig precision on average, better reference coverage metrics reported by Transrate (50% / 95%-covered reference proteins, reference coverage) and typically fewer misassemblies (except for Corn SS and Human large datasets). Due to aggressive overlap removal stage, SPAdes always has smaller mean duplication ratio (2% vs 32% for rnaSPAdes), fewer duplicated BUSCOs (1% vs 16% on average), percentage of uncovered bases (2% vs 19%) and higher nucleotide precision (0.66 vs 0.56).
Simulated Mouse dataset is the only one where original SPAdes generates more assembled genes and isoforms than rnaSPAdes. Detailed investigation showed that the key reasons are the low coverage of this data (11 million reads) and its arti cial nature (rnaSPAdes assembles more genes on real Mouse data). By using small k = 21 during the rst iteration SPAdes manages to assemble extremely low-covered genes, where overlaps between reads are short. Pitfalls of using small k-mer sizes in transcriptome assembly are discussed in the Methods section.
Finally, due to the exclusion of BayesHammer error correction module [26] and using only two k-mer sizes, rnaSPAdes pipeline appears to be about twice faster and consumes less RAM than usual SPAdes.

Assembly completeness
In comparison to other assemblers, rnaSPAdes shows the highest fraction of 95%-assembled genes and isoforms (0.5 and 0.32 respectively). Trinity (0.44 and 0.3) and RNA-Bloom (0.41 and 0.28) are ranked the second and the third according to these metrics (Supplementary Figure S3). These numbers correlate with the percentage of detected BUSCOs, for which rnaS-PAdes also has the best average value across all datasets (74%), followed by Trinity (72%), Trans-ABySS (71%) and RNA-Bloom (68%).
The same assemblers typically form the top four according to various coverage metrics, such as database coverage provided by rnaQUAST, reference coverage, number of 50% / 95%-covered reference proteins and number of reference sequences with CBBR hits reported by Transrate (Supplementary Figures S3 and S5 Nucleotide and contig recall metrics reported by Detonate generally support the conclusions stated above (Supplementary Figure S4). Thus, Trinity and rnaSPAdes have the best average nucleotide recall values (0.86 and 0.84 respectively). The max-imal mean contig recall, however, is reported for RNA-Bloom (0.097), followed by Trinity (0.089), Trans-ABySS (0.087) and rnaSPAdes (0.079). To compute contig metrics Detonate keeps only the most reliable alignments with mapped fraction more than 99% (for both assembled and reference sequence). In contrast, rnaQUAST assigns contigs to known genes/isoforms and then counts ones that have at least X% covered by a single assembled contig. However, no cuto is applied for mapped fraction of the assembled sequences in rnaQUAST. This di erence between algorithms might explain the absence of perfect correlation between contig recall and number of 95%-assembled isoforms.

Assembly correctness
According to the number of misassembled contigs, the most accurate contigs are typically produced by SOAPdenovo-Trans, Trans-ABySS and IDBA-tran (see Supplementary gure S3d). Among these three, IDBA-tran, however, produces the most fragmented assemblies with the lowest average fraction of 95%-assembled genes equal to 0.18. Supplementary gure S3d also suggests that the highest numbers of misassemblies often belong to BinPacker, Bridger, RNA-Bloom and Trinity.
IDBA-tran, usual SPAdes and SOAPdenovo-Trans tend to provide assemblies with the smallest amount of duplicated sequences, which is con rmed by rnaQUAST duplication ratio (average values are 1.02, 1.02 and 1.07 respectively), percentage of duplicated BUSCOs (0.8%, 1% and 4.7%), fraction of uncovered bases reported by Transrate (0.018, 0.019 and 0.076) and Detonate's nucleotide precision (0.68, 0.66 and 0.66). Highest contig precision equal to 0.133, however, belongs to rnaS-PAdes, followed by 0.129 for SOAPdenovo-Trans. The most duplicated assemblies according to these metrics are produced by RNA-Bloom, Trinity and BinPacker. In comparison to other assemblers, they have signi cantly higher mean duplication ratio (2.5, 1.77 and 1.71 respectively) and fraction of duplicated BUSCOs (40.6%, 31.4% and 29.7%), as well as lowest average nucleotide precision (0.37, 0.46 and 0.46). As to rnaSPAdes, according to duplication metrics and misassemblies, it neither fails, nor dominates, showing moderate average duplication ratio of 1.32 and fraction of duplicated BUSCOs equal 16.7%.
Indeed, beside completeness-related metrics, such as number of assembled genes and isoforms, metrics discussed above should be also considered during transcriptome quality evaluation, since erroneous and duplicated sequences may negatively a ect further transcriptome analysis, such as gene annotation.

Read-based scores
According to the read-based scores reported by Transrate and Detonate RSEM EVAL which represent how well the assembly corresponds to the initial reads, rnaSPAdes also shows good results. Regarding the average Transrate contig score, conventional SPAdes has the highest average score equal to 0.31, followed by IDBA-trans and SOAPdenovo-Trans both having 0.17, and rnaSPAdes with 0.16. As to Detonate score, rnaSPAdes has the best average (-3.45 · 10 9 ) with RNA-Bloom (-3.46 · 10 9 ) and Trinity (-3.84 · 10 9 ) being slightly behind. RNA-Bloom and Trinity, however, have the lowest Transrate average scores among all tools (0.026 and 0.084 respectively). Vice versa, SPAdes, IDBA and SOAPdenovo-Trans, which are the top three assemblers according to mean Transrate score, have the lowest three RSEM EVAL scores. Based on the complete quality reports presented in the Supplementary material, it appears that Transrate score mostly correlates with correctness-related metrics and is negatively a ected by the presence of duplicated sequences, which explains highest average score for standard SPAdes. In contrast, RSEM-EVAL score seems to correlate with assembly completeness metrics.

Conclusion
Although every transcriptome assembler presented in this study has its own bene ts and drawbacks, the trade-o between assembly completeness and correctness can be signicantly shifted by modifying the algorithms' parameters. For example, various thresholds for transcripts ltration in rnaS-PAdes (Table S14 in the Supplementary material) result in assemblies with di erent properties. Also, varying k-mer size or incorporating iterative de Bruijn graph construction in rnaS-PAdes may signi cantly a ect the assembly quality (Tables S9-S11 in the Supplementary material). Thus, the parameters of the assembly algorithms can be varied in order to achieve the desired completeness or correctness characteristics and make the method to dominate by a certain group of metrics.
While the developed algorithm, rnaSPAdes, typically shows stable results across analyzed RNA-Seq datasets and often allows to capture more genes and isoforms than any other tool, there is no clear winner according to all metrics. Thus, the selection of the assembler may be varied depending on the goals of the particular research project and the sample preparation protocols being used, as well as secondary parameters, such as usability and computational performance. Even with the aid of specially developed tools, such as Transrate, DETONATE, BUSCO and rnaQUAST, the choice of a suitable assembly tool remains a non-trivial problem and may require additional benchmarks in each particular case.

Potential implications
Although the developed approach was initially designed for RNA-Seq data obtained from a single organism, it can be potentially applied for metatranscriptome assembly of samples collected from bacterial communities. Indeed, metatranscriptome assembly does not require reconstructing complex alternatively spliced isoforms, but implies other computational challenges, such as repetitive patterns in di erent genes (including homologous genes from various strains) and extreme di erences in mRNA quantities [27,28], which are caused by both -varying expression levels and abundances of di erent species. Improving the assembly algorithms, as well as designing an appropriate pipeline for quality evaluation of metatranscriptomic assemblies, are the possible further implications of this work.
Recently emerged long read protocols for mRNA sequencing allow to capture full-length transcripts without the assembly [29]. However, high error rate of Oxford Nanopore and PacBio sequencers prevents using output reads directly as complete transcripts. Typically, mapping to the reference genome, additional error-correction by short accurate Illumina reads or consensus construction is performed to obtain and further analyze high-quality sequences [30,31,32,33,34]. Combining rnaSPAdes with previously developed hybridSPAdes approach for joint assembly of short and long reads [35] may result into a viable alternative to the existing methods for processing long error-prone RNA reads.
In addition, benchmarking reports presented in this work can be used by the researchers for selecting the appropriate assembly method that meets their speci c criteria and for better understanding of transcriptome assembly quality evaluation, such as, for example, correlation of di erent metrics.

Methods
Most of the modern de novo genome assembly algorithms for short reads rely on the concept of the de Bruijn graph [36]. While the initial study proposed to look for an Eulerian path traversing the de Bruijn graph in order to reconstruct genomic sequences, it appeared to be rather impractical due to the presence of complex genomic repeats and sequencing artifacts, such as errors and coverage gaps. Instead, genome assemblers implement various heuristic approaches, most of which are based on coverage depth, graph topology and the fact that the genome corresponds to one or more long paths traversing through the graph [37,14]. Indeed, the later observation is not correct for the case of transcriptome assembly, in which RNA sequences correspond to numerous shorter path in the graph. Thus, to enable high-quality assemblies from RNA-Seq data the majority of procedures in the SPAdes pipeline have to undergo major alterations.
SPAdes genome assembler consists of the following major steps: (i) construction of the condensed de Bruijn graph, (ii) graph simpli cation, which implies removing chimeric and erroneous edges, (iii) mapping read pairs to the assembly graph, and (iv) repeat resolution and sca olding using aligned paired reads with exSPAnder algorithm [38,39]. While graph construction and mapping paired reads do not depend on the dataset type and requires no change for RNA-Seq data, graph simpli cation and repeat resolution procedures strongly rely on the properties of genomic sequences and thus require signi cant modi cations and novel functionality for de novo transcriptome assembly. Below we describe the key changes introduced in rnaSPAdes.

Simpli cation of the de Bruijn graph in rnaSPAdes
During the graph simpli cation stage erroneous edges are removed from the de Bruijn graph based on various criteria in order to obtain clean graph containing only correct sequences (further referred to as an assembly graph). In the SPAdes pipeline the simpli cation process includes multiple various procedures that can be classi ed into three types: (i) trimming tips (dead-end or dead-start edges), (ii) collapsing bulges (alternative paths) and (iii) removing erroneous connections (chimeric and other false edges). In this section we present alterations introduced in rnaSPAdes simpli cation pipeline. We also provide comparison between initial and improved simpli cation procedures on several RNA-Seq datasets in the Supplementary material (Table S10).

Trimming tips
In the de Bruijn graph constructed from DNA reads the major fraction of tips (edges starting or ending at a vertex without other adjacent edges) typically correspond to sequencing errors and thus have to be removed. Since only a few tips are correct and either represent chromosome ends or formed by coverage gaps, the existing genome assemblers implement rather aggressive tip clipping procedures [37,13] assuming that coverage gaps appear rather rarely. However, in the de Bruijn graph built from RNA-Seq data a signi cant amount of tips correspond to transcripts' ends and thus have to be preserved. In order to keep correct tips and obtain full-length transcripts, rnaSPAdes uses lower coverage and length thresholds for tip trimming procedure than SPAdes (see details below).
In some cases, tips originate from sequencing errors in multiple reads from highly-expressed isoforms and thus may have coverage above the threshold. While genome assemblers may also exploit relative coverage cuto to remove such tips, in transcriptome assembly this approach may result in trimming correct tips corresponding to the ends of low-expressed isoforms. However, erroneous tips typically have a small di erence from the correct sequence without errors (e.g. 1-2 mismatches). To address this issue, we align tips to the alternative (correct) edges of the graph (Fig. 3a) and trim them if the identity exceeds a certain threshold (similar procedure is implemented in truSPAdes, which was designed for True Synthetic Long Reads assembly [40]). In case when two tips correspond to the starts/ends of an alternatively spliced isoforms, it is highly unlikely for them to have similar nucleotide sequences (Fig. 3b). Such tips are preserved during graph simpli cation procedure thus allowing to restore isoforms that di er only by starting or terminating exons.
Another speci cs of RNA-seq datasets is the large number of low-complexity regions that originate from poly-A tails resulting from polyadenylation at the ends of mRNAs. In order to avoid chimeric connections and non-informative sequences, we also remove low-complexity edges from the de Bruijn graph (see exact criterion below).  Below we summarize all conditions used in tip clipping procedure, parameters for which were optimized based on our analysis of various RNA-seq datasets. We de ne l T as the length of the tip that is being analyzed and c T as its mean kmer coverage, and c A as the k-mer coverage of the alternative edge (which is presumably correct) A tip is removed if any of the following conditions is true: • l < 2 · k and c T ≤ 1 (short tips with very low coverage); • l < 4 · k, c T < c A /2 and the Hamming distance between the tip and the alternative edge does not exceed 3 (the tip containing sequencing errors); • the tip contains more than 80% of A/T nucleotides (low complexity tip).

Collapsing bulges
A simple bulge (two edges sharing starting and terminal vertices) in the de Burijn graph may correspond to one of the following events: (i) a sequencing error, (ii) a heterozygous mutation or another allele di erence, or (iii) an alternative splicing event (typical for transcriptomic data). The rst two cases are characterized by the bulge edges having similar lengths and sequences. However, edges formed by sequencing errors are typically short and have signi cantly di erent coverage depth, since it is unlikely for the same error to occur numerous times at the same position (Fig. 3c). Vice versa, in the case of allele di erence bulge edges usually have similar coverage. Thus, genome assembly algorithms for bulge removal typically rely on the coverage depth [37,13]. Since the most typical di erence between two alternatively spliced isoforms of the same gene is the inclusion/exclusion of a an exon (usually short), edges of the bulge originated from these isoforms have di erent lengths (Fig. 3d). At the same time, since the expression levels may vary for such isoforms, the coverage depth may signi cantly di er. To avoid missing alternatively spliced isoforms in the assembly, rnaSPAdes does not use any coverage threshold for bulge removal and collapses only bulges consisting of edges with the similar lengths (less than 10% di erence in length).

Removing chimeric connections
While undetected tips and bulges formed by sequencing errors result in mismatches and indels in the assembled contigs, chimeric reads (typically corresponding to a concatenation of sequences from distant regions of the original molecules) may trigger more serious errors, such as incorrect junctions in the resulting contigs (often referred to as misassemblies). In conventional genome assembly chimeric edges usually have low coverage and thus can be easily identi ed [37]. Single-cell datasets, however, feature multiple low-covered genomic regions and elevated number of chimeric reads, which result in numerous erroneous connections having higher coverage depth than correct genomic edges. Similarly, since true edges representing low-expressed isoforms in the transcriptome assembly also have relatively low coverage depth, cleaning the graph using coverage threshold will result in multiple missing transcripts in the assembly.
To detect chimeric connections in single-cell assemblies SPAdes implements various algorithms, which mostly rely on the assumption that each chromosome corresponds to a long contiguous path traversing through the de Bruijn graph [14]. Since this assumption does not hold for transcriptomes consisting of thousands isoforms, we had to disable most procedures for the chimeric edge detection in SPAdes and implement a new erroneous edge removal algorithm that addresses the speci cs of chimeric reads in RNA-seq data sets.
Our analysis revealed that most of the chimeric connections in RNA-seq data can be divided into two groups: single-strand chimeric loops and double-strand hairpins. In the rst case, a chimeric junctions connects the end of a transcript sequence with itself (Fig. 4a). The erroneous hairpin connects correct edge with its reverse-complement copy (Fig. 4b) and potentially may result in chimeric palindromic sequence in the assembly. To avoid misassemblies, rnaSPAdes detects chimeric loops and hairpins by analyzing the graph topology rather than nucleotide sequences or coverage.
While it remains unclear whether these chimeric reads are formed during transcription, RNA-seq sample preparation or sequencing, similar chimeric connections have been observed in the context of single-cell MDA. E.g., when a DNA fragment is ampli ed by MDA, the DNA polymerase moves along DNA molecule and copies it, but sometimes (as described in [15]), the polymerase may jump to a close position (usually on the opposite DNA strand) and proceed to copy from the new position.

Removing isolated edges
Another type of excessive edges that appear in the assembly graph are isolated edges, i.e. that have no adjacent edges. They Since e 2 connects a vertex and its reverse-complement, e 2 (the reverse-complement of e 2 ) also connects these two vertices.
typically appear in the regions of extremely low-coverage (including DNA contamination), where overlaps between neighboring reads are smaller than k-mer size, or originate from reads containing zero correct k-mers due to multiple sequencing errors. The rst type of isolated edges can possibly be connected with other edges by gap closing procedure (described below). The second type, on the other hand, may result in excessive erroneous sequences in the assembly or even create ambiguities during gap closing. Thus, during graph simpli cation we additionally remove isolated edges that have low coverage (< 2) and have length smaller or equal to read length.

Selecting optimal k-mer sizes
One of the key techniques that allows SPAdes to assemble contiguous genomic sequences from the data with non-uniform coverage depth is the iterative de Bruijn graph construction. During each next iteration, SPAdes builds the graph from the input reads and sequences obtained at the previous iteration, simpli es the graph and provides its edges as an input to the next iteration that uses larger k-mer size. Assembly graph obtained at the nal iteration is used for repeat resolution and sca olding procedures, which exploit read-pairs and long reads [38,35]. In this approach small k-mer sizes help to assemble low-covered regions where reads have short overlaps, and large k values are useful for resolving repeats and therefore obtaining less tangled graph. Although this method seems to be useful for restoring low-expressed isoforms from RNA-Seq data, our analysis revealed that it appears to be the main reason of the high number of misassembled contigs in SPAdes assemblies. Below we describe how these false junctions are formed. When two transcripts (possibly from di erent genes) have a common sequence in the middle, they form a typical repeat structure in the de Bruijn graph (Fig. 5a), which may further be resolved, e.g. using paired reads. However, if a common sequence appears close to the ends of the transcripts (Fig. 5b), edges e 2 and e 3 appear to be rather short and may be trimmed as tips (since coverage depth often drops near the transcripts ends), or may not be present at all. In this case, the remaining edges e 1 , e and e 4 will be condensed into a single edge corresponding to chimeric sequence.
Indeed, since small k-mer size results in a higher chance of creating such kind of chimeric junction, we decided to modify the parameters for the iterative graph construction. In rnaS-PAdes we decided to use only two k values: smaller one for restoring low-covered regions with insu cient overlaps between reads and larger one for obtaining less tangled graph.
To estimate the optimal k values, we ran rnaSPAdes on several RNA-Seq datasets with various read lengths sequenced from organisms with di erent gene complexity. Since it requires tremendous amount of time to try all possible pairs of k-mer sizes on multiple datasets, we rst estimated upper k value used for the main iteration, and then selected lower k with the xed upper one. We assembled a number of datasets using only a single kmer size and selected the best assemblies according to number of assembled genes, database coverage and number of misassemblies. Although it may be not possible to choose a single best k value simultaneously for multiple datasets, nearly optimal k-mer size was estimated as half of the read length (more precisely, the largest odd number that does not exceed read_length/2 -1). The smaller k value was estimated in a similar manner with the xed upper k-mer size. Optimal lower k was considered based on number of additional assembled genes and misassemblies. Experiments showed that small k values (e.g. below 29) tend to dramatically increase the number of erroneous contigs due to the higher probability of two transcripts sharing the same k-mer. Thus, the lower k-mer size was estimated approximately as read_length/3 with the minimal possible value set to 29. Although estimated k values may not provide the best assembly for every dataset, they typically appear to be a good trade-o between the number of recovered genes and generated errors (see Supplementary Tables S7-S9).
In this work rnaSPAdes was launched with the default k values. Indeed, rnaSPAdes keeps the possibility to set the k-mer sizes manually. In addition, we introduced the --fast option, which forces the assembler to use only a single k value (the optimal upper k). Typically, assemblies obtained with a single k capture fewer genes and isoforms (especially low-covered), but also have smaller number of misassembled contigs (see Supplementary material for comparison).
In order to preserve correct connections that could be restored using only small k-mer sizes, we carefully examined low-expressed transcripts that were not completely assembled using default k-mer sizes. The analysis revealed that the majority of such fragments can be joined by the small overlap, which is often con rmed by the read-pairs. To perform the gap closing procedure rnaSPAdes glues two tips if one of the following conditions is true: • tips have an exact overlap of length at least L ov and are connected by at least N ov read pairs; • tips are connected by at least N min read pairs.
where the default parameters are L ov = 8 bp, N ov = 1 and N min = 5. Although these parameters seem to be slightly adhoc, such gap closing procedure appears to be a viable alternative to using small k values and allows to restore more lowexpressed transcripts without increasing the number of misassemblies. Using smaller thresholds for gap closing often create false connections and increase the amount of erroneous tran-scripts, while larger values for these parameters result in a smaller increase of reconstructed sequences.

Adapting repeat resolution algorithms
Genomic repeats present one of the key challenges in the de novo genome assembly problem. Although mRNA sequences typically do not contain complex repeats, transcriptome assembly has a somewhat similar problem of resolving alternatively spliced isoforms and transcripts from paralogous genes. Repeat resolution and sca olding steps in SPAdes genome assembler are implemented in the exSPAnder module [38], which is based on simple path-extension framework. Similar to other modules of SPAdes, exSPAnder was designed to deal with highly uneven coverage and thus can be adapted for isoform detection procedure when assembling RNA-Seq data.
The key idea of the path-extension framework is to iteratively construct paths in the assembly graph by selecting the best-supported extension edge at every step until no extension can be chosen. The extension is selected based on the scoring function that may exploit various kinds of linkage information between edges of the assembly graph (di erent scoring functions are implemented for di erent types of sequencing data). A situation when a path cannot be extended further is usually caused by the presence of long genomic repeat or a large coverage gap. The extension procedure starts from the longest edge that is not yet included in any path and is repeated until all edges are covered.
More formally, a path extension step can be de ned as follows. For a path P and its extension edges e 1 , . . . , e n (typically, edges that start at the terminal vertex of P) the procedure selects e i as a best-supported extension if i. Score P (e i ) > C · Score P (e j ) for all j = i ii. Score P (e i ) > Θ where C and Θ are the algorithm parameters, and Score P (e i ) is a score of edge e i relative to path P (described in [38]).
In contrast to genome assembly, in which there is usually only one true extension edge, in transcriptome assembly multiple correct extensions are possible due to the presence of alternatively spliced isoforms. Thus, the modi ed procedure is capable of selecting several edges e k 1 , . . . e km among all possible extensions e 1 , . . . , e n , which satisfy the following conditions: Namely, all correct extension edges must have a score close to the maximal one (C = 1.5 by default), and the second condition remains the same. Afterwards, the algorithm extends path P by creating new paths (P, e k 1 ), . . . , (P, e km ), which are then extended independently. Since the scoring function implemented in exSPAnder does not strongly depend on the coverage depth, there is no danger that highly-expressed isoforms will be preferred over the low-expressed ones.
Finally, to avoid duplicating sequences in the genome assemblies, exSPAnder performs rather aggressive overlap removal procedure. However, since alternatively spliced isoforms may di er only by a short exon, in order to avoid missing similar transcripts the modi ed overlap detection procedure removes only exact duplicates and sub-paths.

Exploiting coverage depth
Varying coverage depth may seem to be an additional challenge for de novo sequence assembly, but can be also used as an advantage in some cases. For instance, if two alternatively spliced isoforms of the same gene have di erent expression levels, they can be resolved using coverage depth even when the read-pairs do not help (e.g. shared exon is longer than the insert size). Although using coverage values becomes more complicated when a gene has multiple di erent expressing isoforms, our analysis of several RNA-Seq datasets revealed that such cases are rather rare and most of the genes have one or two expressing isoforms within a single sample.
To exploit the coverage depth we decided to add a simple, but reliable path-extension rule. Let the path P = (e 1 , e 2 , e 3 ) have extension edges e and e (Fig. 6a), such that cov(e) > cov(e ) and cov(e 2 ) > cov(e 2 ) , where cov(e) denotes the k-mer coverage of edge e. To select a correct extension the algorithm detects a vertex closest to the end of path P that has two incoming alternative edges, one of which is included in P and another is not (e 2 and e 2 in this example). Since edge e 2 ∈ P has higher coverage than the alternative edge e 2 / ∈ P, we select extension edge e as the one with the higher coverage. However, if both isoforms have similar coverage, this simple approach may chose a false extension (since the coverage depth is rarely perfectly uniform even along a small region). Thus the di erence in coverage should be signi cant enough to distinguish between the isoforms. More formally, the following conditions should be satis ed: i. cov(e) > ∆ · cov(e ) ii. cov(e 2 ) > ∆ · cov(e 2 ) iii. Ω > cov(e 2 )/cov(e) > 1/Ω iv. cov(e) > C min where the default values of the algorithm parameters are ∆ = 2, Ω = 10 and C min = 2. The rst two conditions ensure that the extension edges (e and e ) and alternative edges (e 2 and e 2 ) have signi cant coverage di erence, the third one requires the coverage depth to remain relatively persistent along the path and the latter one prevents the algorithm from resolving low-covered isoforms (which may result in a misassembly). In general case, this procedure also utilizes only the last pair of alternative edges, and is applied only in case when the path has two possible extension edges and conventional read-pair extender fails to extend the path. (b) Two transcripts T 1 and T 2 (red and blue bold dashed lines respectively) share a reverse-complement sequence and thus can be resolved using strand-speci c reads. manuscript and managed the project. All authors read and approved the nal manuscript.