IDP-denovo: de novo transcriptome assembly and isoform annotation by hybrid sequencing

Abstract Motivation In the past years, the long read (LR) sequencing technologies, such as Pacific Biosciences and Oxford Nanopore Technologies, have been demonstrated to substantially improve the quality of genome assembly and transcriptome characterization. Compared to the high cost of genome assembly by LR sequencing, it is more affordable to generate LRs for transcriptome characterization. That is, when informative transcriptome LR data are available without a high-quality genome, a method for de novo transcriptome assembly and annotation is of high demand. Results Without a reference genome, IDP-denovo performs de novo transcriptome assembly, isoform annotation and quantification by integrating the strengths of LRs and short reads. Using the GM12878 human data as a gold standard, we demonstrated that IDP-denovo had superior sensitivity of transcript assembly and high accuracy of isoform annotation. In addition, IDP-denovo outputs two abundance indices to provide a comprehensive expression profile of genes/isoforms. IDP-denovo represents a robust approach for transcriptome assembly, isoform annotation and quantification for non-model organism studies. Applying IDP-denovo to a non-model organism, Dendrobium officinale, we discovered a number of novel genes and novel isoforms that were not reported by the existing annotation library. These results reveal the high diversity of gene isoforms in D.officinale, which was not reported in the existing annotation library. Availability and implementation The dataset of Dendrobium officinale used/analyzed during the current study has been deposited in SRA, with accession code SRP094520. IDP-denovo is available for download at www.healthcare.uiowa.edu/labs/au/IDP-denovo/. Supplementary information Supplementary data are available at Bioinformatics online.

. The algorithms for pseudo-reference generation.

Note 3: Optimization of gap length cutoff for a possible alternative exon usage event
Errors in the assembled transcript sequences can cause small gaps (i.e., indel) in transcript alignment to pseudo-reference sequences. Therefore, we need to find a gap length cutoff to distinguish alternative exon usage events from error-caused gaps.
We investigated the length distribution of alternative exon usage in the Ensembl annotation library (version 79) (Cunningham, et al., 2015). A toy example is shown below ( Figure S2). The lengths within gaps between adjacent exons were labeled in black above the gaps. In this example, we can get the lengths of alternative exon usage as 85, 70, 35, 50 and 15 bp. Considering all genes and transcripts in the whole Ensembl annotation library (version 79), we found 95% alternative exon usage were ≥43 bp.
Since error rates of LRs and SRs are 10-20% and <1%, respectively, the probability of an error-caused gap ≥43 bp is very low, so we set 43 bp as the gap length cutoff to determine if a gap in alignment indicates an alternative exon usage event.

Note 4: Dataset for performance evaluation
To assess the performance of IDP-denovo, human RNA-seq data produced on the Illumina and Pacific Biosciences (PacBio) platforms from GM12878 cell line (SRP036136) (Tilgner, et al., 2014) were used. Read quality was checked with FastQC, and 10 bp at the end were trimmed. LRs from the PacBio platform were corrected by short reads (SRs) from the Illumina platform by the error correction tool LSC (Au, et al., 2012). Note 5: Parameter settings of existing SR-scaffold assembly

algorithms and precision-recall statistics
To choose an optimal SR-scaffold assembly algorithm, we applied performance evaluation granularities (Li, et al., 2014) including precision (fraction of matched nucleotides of assembled transcripts), recall (fraction of matched nucleotides of reference transcripts), and F 1 score (harmonic mean of precision and recall) to pick out the SR-alone algorithm with the best performance for SR-scaffold assembly. We tested the existing tools Trinity (version 2.1.1) (Grabherr, et al., 2011), SOAPdenovo-Trans (version 1.03) (Xie, et al., 2014), Bridger (version r2014-12-01) (Chang, et al., 2015), Trans-ABySS (version 1.5.3) (Robertson, et al., 2010), and Oases (version 0.2.9), which is an assembly pipeline with input of preliminary assembly by SRs from Velvet (version 1.2.10) (referred to herein as "Velvet+Oases") (Schulz, et al., 2012;Zerbino and Birney, 2008). Next, we aligned the assembled SR-scaffolds to the reference genome (GRCh38) by GMAP (Wu and Watanabe, 2005), and the precision, recall and F 1 score of each tool were calculated. The length and coverage cutoff of k-mer were set as 31 and 10 for SOAPdenovo-Trans, Trans-ABySS, and Velvet+Oases; they were set as defaults for Trinity and Bridger. Velvet+Oases showed the best performance among the five SRalone methods. Therefore, as the first step of IDP-denovo, de novo assembly is applied to SRs by the assembly algorithm Velvet+Oases, with assembly parameters length and coverage cutoff of k-mer set as 31 and 10, respectively. PacBio LRs were aligned to the human genome by GMAP, and 697,247 LRs that could be annotated by Ensembl gene annotation (version 79) were used in evaluation. Assembled transcripts that cover all splice sites in annotation are considered as full-length gene isoforms.

SR-scaffolds and missed by SR-scaffolds but covered by LRs
The boxplot below shows differences of their abundances: the transcripts missed by SR-scaffolds but covered by LRs have significantly lower FPKM than those covered by SR-scaffolds ( Figure S3, p-value< 2.2e-16). It indicates that LRs can rescue lowly expressed transcripts that are missed by the SR assembly method.

Figure S3. Comparison of abundances between transcripts covered by SR-scaffolds and those missed by SR-scaffolds but covered by LRs.
To reduce effects of highly expressed transcripts and to avoid skewed distribution of transcript expression, we log10-tranformed FPKM of transcripts (FPKM ≥ 1e-6) from those covered by SR-scaffolds (80.54%, n= 17,243) and those only by LRs (66.60%, n=13,780), with FPKM computed by StringTie (Pertea, et al., 2015) with SR coverage, then performed t-test between these two groups (Littlejohn, et al., 2014;Xu and Su, 2015;Zwiener, et al., 2014). Outliers are not included. There are some assembled transcripts with unappreciable FPKM: FPKM was estimated by SRs, and thus transcripts only assembled by LRs may have unappreciable FPKM; for very low-expressed SRassembled transcripts, they may be caused by incorrect SR-assembly or incorrect abundance estimation of FPKM.

Note 7: The influences of SR and LR coverage on assembly accuracy
To investigate the influences of SR and LR coverage, the output transcripts from GM12878 dataset by IDP-denovo were binned according to their SR coverage (estimated by StringTie) and LR coverage (number of aligned LRs) separately, with the roughly equal numbers of transcripts in each bin, and the accuracy metrics at average, including precision, recall and F 1 score, were evaluated ( Figure S4).
The transcript accuracy improves with increasing coverage of SRs or LRs.
1) Influence of SR coverage: High SR coverage aids in assembly of accurate SRscaffolds (step a1 in Figure 1 in main text), while low SR coverage can lead to low-accuracy assembly that further prevents long reads from being aligned correctly to extend SR-scaffolds.
2) Influence of LR coverage: For the regions uncovered by SR-scaffolds, LRs are used for extension to get full-length transcripts (step a3 in Figure 1 in main text).
High LR coverage is helpful to generate accurate consensus from error-prone LRs.
Therefore, either high SR or high LR coverage contributes to accurate transcript assembly by IDP-denovo. Figure S4. The influences of SR and LR coverage on assembly accuracy, with metrics of precision, recall and F 1 score.

Seq data and evaluation granularities
Two existing assembly methods with Hybrid-Seq data were tested: 1) Trinity (version 2.1.1), which can integrate LRs into de novo assembly on SRs to improve assembly of isoforms with complex structures, and 2) a hybrid de novo transcriptome assembly pipeline proposed by the Roulin group (referred to herein as "Roulin's pipeline") (Roulin, et al., 2014), which assembles Roche 454 LRs and Illumina SRs separately, followed by clustering and removal of redundant contigs with usearch (Edgar, 2010) and CAP3 (Huang and Madan, 1999). Parameters were set as default for these two methods. The performance of IDP-denovo was compared to these two existing assembly methods by GM12878 data with precision-recall statistics (Li, et al., 2014) mentioned above and sensitivity (the number of reconstructed full-length transcripts) that were described in the previous study (Chang, et al., 2015).

Note 9: Evaluation strategies for k-mer clustering
To optimize the performance of the k-mer-based clustering method, 94,506 LRs from the GM12878 (Tilgner, et al., 2014) dataset that were annotated with genes in Chr19 (chromosome 19) in Ensembl database by alignment with GMAP, were used as training data. Four typical measures of clustering performance, including the Jaccard Index, precision, recall, and F-measure, were applied (Bao, et al., 2011;Chen, et al., 2006).
Let a be the number of pairs that are from the same class and grouped into the same cluster. Let b be the number of pairs that are from the same class but grouped into different clusters. Let c be the number of pairs that are from different classes but grouped into the same cluster. The Jaccard Index is computed as a/(a+b+c). Precision is computed as a/(a+c) and recall as a/(a+b). F-measure is computed as 2x precision/(precision+recall).
The optimal values of these measures were obtained when k = 15 and C threshold = 0.05 for all LRs from chr19 as well as unaligned LRs from chr19 (Table S1). Therefore, these parameter settings were used to cluster unaligned LRs.

Note 10: Evaluation strategy for annotation of gene isoform structures
The annotation analysis was applied to clusters with at most 30 sequences, which comprised of 89.67% of all clusters. To examine the accuracy of isoform structure annotation, transcript sequences from a cluster were aligned to the reference genome by GMAP. A gap in alignment is supposed to be an alternative exon usage event. The 5' end splice site from the reported skipped exon corresponds to the nearest annotated 5' end splice site in alignment. Identification error is defined as the difference between the positions of the predicted 3' end splice site by IDP-denovo and the 3' end splice site reported by reference-alignment.

Note 11: Comparison to abundance estimated by StringTie
SR and LR abundance indices reported by IDP-denovo were compared to FPKM reported by StringTie (Pertea, et al., 2015) for each annotated isoform on a naturallogarithmic scale, if all these values were positive. 5,967 isoforms were included. We calculated Spearman and Pearson correlation coefficients between the SR abundance index and the FPKM estimated by StringTie, as well as those between LR abundance index and FPKM estimated by StringTie.

Note 12: Application of IDP-denovo to Dendrobium officinale
To demonstrate application of IDP-denovo to non-model organisms, Illumina and PacBio data of D. officinale were used (accession number SRP094520). Read quality was checked with FastQC, and 13 bp at the end were trimmed. LRs from the PacBio platform were corrected by SRs from the Illumina platform by LSC. The two SR-scaffold assembly parameters of Velvet+Oases, length and coverage cutoff of k-mer, were set as 31 and 10, respectively. A previously published annotation library and a draft assembly of D. officinale genome (Yan, et al., 2015), polished by an assembled transcriptome (Wu, et al., 2016), were used to evaluate the IDP-denovo output.
Annotation was performed by alignment of assembled sequences to the draft assembly by GMAP with aligned sequences with minimal alignment length of 30 nts were annotated, while the best alignment reported no overlap to annotated loci, the second best alignment was considered if the alignment length was at least 70% of the best alignment. Transcripts unaligned to annotated loci in the draft genome were considered as novel transcripts from novel genes.

Supplementary references
FastQC: A quality control tool for high throughput sequence