NanoSplicer: accurate identification of splice junctions using Oxford Nanopore sequencing

Abstract Motivation Long-read sequencing methods have considerable advantages for characterizing RNA isoforms. Oxford Nanopore sequencing records changes in electrical current when nucleic acid traverses through a pore. However, basecalling of this raw signal (known as a squiggle) is error prone, making it challenging to accurately identify splice junctions. Existing strategies include utilizing matched short-read data and/or annotated splice junctions to correct nanopore reads but add expense or limit junctions to known (incomplete) annotations. Therefore, a method that could accurately identify splice junctions solely from nanopore data would have numerous advantages. Results We developed ‘NanoSplicer’ to identify splice junctions using raw nanopore signal (squiggles). For each splice junction, the observed squiggle is compared to candidate squiggles representing potential junctions to identify the correct candidate. Measuring squiggle similarity enables us to compute the probability of each candidate junction and find the most likely one. We tested our method using (i) synthetic mRNAs with known splice junctions and (ii) biological mRNAs from a lung-cancer cell-line. The results from both datasets demonstrate NanoSplicer improves splice junction identification, especially when the basecalling error rate near the splice junction is elevated. Availability and implementation NanoSplicer is available at https://github.com/shimlab/NanoSplicer and archived at https://doi.org/10.5281/zenodo.6403849. Data is available from ENA: ERS7273757 and ERS7273453. Supplementary information Supplementary data are available at Bioinformatics online.

1. the location of mapped splice junction (a pair of mapped 5' and 3' splice sites) and 2.
subsequences from both exons. NanoSplicer takes as input mapped long reads in BAM/SAM format, from which the location of mapped splice junctions can be quickly extracted. Then, we identify subsequences from both exons as follows. To ensure a JWR contains information about all its candidate splice junctions, we choose the subsequences to be included in the JWR after obtaining all candidate junctions. Section 2.2 in the main text describes how NanoSplicer selects candidate splice junctions and supplementary section 1.2.1 provides a detailed procedure to identify nearby canonical candidate splice junctions (one option for candidate splice junctions).
Fig. S1: Candidate splice junctions, candidate junction motifs, and junction within reads (JWRs) A: All potential 5' and 3' canonical splice sites within 10nt of the mapped splice sites. B: Construct canonical candidate splice junctions by considering all 5'-3' combinations of the splice sites identified in A. C: Panel shows the start and end positions of a JWR and its candidate junction motifs for a list of candidate splice junctions Once we obtain all candidate splice junctions, we identify the 5' and 3' splice sites farthest away from each other among all 5' and 3' splice sites in the candidate junctions (e.g., leftmost 5' splice site and rightmost 3' splice site in supplementary Fig. S1C). Then, we use the 15th nucleotide preceding the 5' splice site and the 15th nucleotide after the 3' splice site as the first and last nucleotides of the subsequences for the JWR, respectively.

Obtaining candidate squiggles
Section 2.2 in the main text describes each step of obtaining candidate squiggles: constructing a list of candidate splice junctions, and then for each candidate, identifying a candidate junction motif and predicting its expected candidate squiggle. Here, we will provides details on the iden-tification of nearby canonical splice junctions (one option for candidate splice junctions) (supplementary section 1.2.1), the selection of the start and end positions of each candidate junction motif (supplementary section 1.2.2), and the prediction of an expected candidate squiggle using the "expected current level model" in Tombo (supplementary section 1.2.3).

Canonical ("GT-AG") splice junctions as candidate splice junctions
For each JWR, to identify nearby canonical splice junctions (introns that start with "GT" and end with "AG"), we first search for all potential 5' splice sites followed by "GT" within 10nt of the mapped 5' splice site and all potential 3' splice sites after "AG" within 10 nt from the mapped 3' splice site (supplementary Fig. S1A). Then, we consider all possible combinations of these potential 5' and 3' splice sites, creating canonical splice junctions (supplementary Fig. S1B).
If the transcript strand information is recorded in the input BAM/SAM file (i.e. 'ts' tag is present in the BAM/SAM file), we search for canonical candidate splice junctions on the transcript strand of the reference genome. Otherwise, the "GT-AG" pattern from both strands will be included as candidates.

Start and end positions of candidate junction motifs
Each candidate junction motif for a JWR extends 5' and 3' from the candidate splice junction to a common location which is the start and end positions of the JWR (see supplementary Fig. S1C and supplementary section 1.1). This ensures each candidate has the same nucleotide sequence (and squiggle signal) at the beginning and end, ensuring differences between candidate squiggles are solely due to the various splice junctions utilised.
In practice, the start and end of the junction squiggle may not perfectly match that of JWR (or start and end positions of candidate junction motifs). Thus, we include 6 nucleotides at each side of the candidate junction motif as a buffer (see supplementary Fig. S1C), and allow the junction squiggle to be aligned to a part of the candidate junction motifs (see supplementary section 1.4.2).

Prediction of an expected candidate squiggle from a candidate junction motif
Nanopore sequencing works by recording changes in electrical current when a DNA or RNA molecule traverses through a pore. During the transmission of the DNA/RNA molecule, multiple nucleotides occupy the pore simultaneously. Thus, the current level is influenced by a k-mer (consecutive nucleotides with length of k). We predict a candidate squiggle for each candidate junction motif using an "expected current level model" in Tombo v1.5. The expected current level model provides the mean and standard deviation of the current level for each 6-mer; see https://nanoporetech.github.io/tombo/model\ training.html for the procedure in Tombo to compute these means and standard deviations.
For simplicity, in the main text, we use the third nucleotide in a 6-mer to represent the whole 6-mer by following Tombo's procedure to match nucleotides to 6-mers. For example, to predict an expected squiggle for a motif GCAG, we first include 2 and 3 nucleotides present before and after the motif, respectively, yielding an extended motif, such as, AGGCAGCTG. Then, we convert this extended motif into four 6-mers (AGGCAG, GGCAGC, GCAGCT and CAGCTG), resulting in four mean-standard deviations, each of which is the mean-standard deviation of current levels for each 6-mer.
Note that the means and standard deviations provided by the expected current level model in Tombo are computed using normalised squiggles; see https://nanoporetech.git hub.io/tombo/resquiggle.html for details on the normalisation procedure in Tombo.
NanoSplicer obtains junction squiggles using the "resquiggle" tool in Tombo which performs the same normalisation procedure. This enables the junction squiggles to be comparable to candidate squiggles.

Junction squiggle preprocessing
We occasionally observe transient current spikes in junction squiggles which are a few current measurements far away from a typical range of squiggle current measurements (between -3 and 3 for squiggles normalised using Tombo; see https://nanoporetech.github.io/t ombo/resquiggle.html for details on the normalisation procedure in Tombo). Supplementary Fig. S2 visualises a squiggle corresponding to a long read from the synthetic data (see section 3 in the main text or supplementary section 2.1 for details of the data), showing a typical range of normalised current measurements, as well as examples of the spikes. NanoSplicer removes those spikes from junction squiggles by eliminating current measurements whose absolute values are bigger than a user-specified threshold T . In the analysis of this paper, we used T = 4 to remove spikes from junction squiggles which are normalised by Tombo.

Dynamic time warping (DTW) in NanoSplicer
In NanoSplicer, we adapt DTW to align a junction squiggle to each of its candidate squiggles.
DTW is an efficient algorithm for aligning two sequences which may vary in speed. Formal backgrounds on DTW can be found in Keogh and Ratanamahatana (2005). Here, supplementary section 1.4.1 presents our implementation of DTW in NanoSplicer which places restrictions on standard DTW, and supplementary section 1.4.2 details these restrictions.
1.4.1 Align junction squiggle to candidate squiggles using DTW NanoSplicer aligns a junction squiggle to each of its candidate squiggles using DTW. For each alignment, NanoSplicer treats the junction squiggle as observations from a model that has the means and standard deviations from each candidate squiggle as parameters. Then, we use a negative log-likelihood as a measure of distance in DTW. Let x = (x 1 , . . . , x K ) be the junction represent the m-th candidate squiggle, where µ m i and σ m i are the mean and standard deviation of the current level for the i-th k-mer in its junction motif with length L m . Note that for simplicity, the main text describes "the mean and standard deviation for each nucleotide" instead of "for kmer"; see supplementary section 1.2.3 for the definition of k-mer and the relationship between a nucleotide and a k-mer. Our implementation of DTW for the alignment is the same for all M candidate squiggles, so we will describe it for a single candidate squiggle and drop the superscript m for the rest of supplementary section 1.4.
where f is the probability density function of the modified normal distributions (see supplementary section 1.5 for the definition). Let a = (a 1 , . . . , a K ) denote an alignment between x and c, where a k indicates the index of the mean-standard deviation in c where x k is aligned. For example, a k = j indicates that x k and (µ j , σ j ) are aligned. Using DTW, NanoSplicer finds the alignment that minimises a distance between x and c, defined by K k=1 d a k ,k , among alignments satisfying the restrictions described in the following section 1.4.2.

DTW restrictions in NanoSplicer
Here, we detail the restrictions that are imposed on alignments when NanoSplicer searches for the alignment with minimum distance between x and c. Supplementary Fig. S3 shows examples of a valid alignment (satisfying all restrictions) and an invalid alignment (breaching the restrictions). . , x K in its column and candidate squiggle (µ 1 , σ 1 ), . . . , (µ L , σ L ) in its row, respectively. An alignment is represented by a path on the matrix. Specifically, a path passing the grid at i-th row and j-th column indicates that (µ i , σ i ) and x j are aligned. The diagonal pink band is the area where valid paths are restricted to be inside. The green path shows a valid path in NanoSplicer DTW implementation. The black line represents an invalid path, and the red "X"s show where each of the restrictions described in supplementary section 1.4.2 is breached: 1) x 1 is aligned to multiple elements in the candidate squiggle; 2) The continuity is breached; 3) The monotonicity is breached; 4) A path is not allowed to go out of the pink band; 5) The minimum dwell time t (t = 4 in this example) is not satisfied as the number of current measurements aligned to (µ 4 , σ 4 ) is less than t (= 4).
• No warping in the junction squiggle: We treat each current measurement of the junction squiggle as an observation from a model that has means and standard deviations of each candidate squiggle as parameters. Each single observation has only one meanstandard deviation in the model. Thus, each measurement in x is aligned to only one mean-standard deviation in the candidate squiggle.
• Warping band: An alignment a = (a 1 , . . . , a K ) can be represented by a path (a 1 , 1), . . . , (a K , K) in an L × K matrix, where L is the length of the candidate squiggle. Paths are constrained to be within a diagonal band with bandwidth w × L, where w ∈ (0, 1] is a user-specified value. See supplementary Fig. S3 for examples of paths and band. Specifically, (a k , k) has to be within the band for all k ∈ {1, . . . , K}. In our paper, we used w = 0.4. The motivations of this warping band restriction include: -The junction squiggle alignment is expected to cover most nucleotides in the candidate junction motif. Thus, we prevent current measurements in the junction squiggle from being aligned to only a small proportion of the candidate squiggle.
-In practice, the start and end of the junction squiggle may not perfectly match that of the candidate squiggles. Thus, we include more nucleotides at each side of the candidate junction motif as a buffer (see supplementary section 1.2.2 for more details), and then allow the method to search for the best start and end positions of the junction squiggle alignment among that near the two ends of the candidate squiggle (i.e., path does not have to start at (1,1) or end at (L, K) on the matrix.).
-It speeds up the DTW algorithm which searches for the best alignment (see Keogh and Ratanamahatana (2005) for more explanation).
• Minimum dwell time: Very short dwell times (i.e., the duration of a translocation event) may not reflect typical translocation events, potentially causing misleading results in NanoSplicer. Thus, NanoSplicer restricts the number of current measurements in x 1 , . . . , x K aligned to (µ j , σ j ) to ones bigger than or equal to t for all j ∈ 1, . . . , L. We used t = 4 in the analysis of the paper. In our NanoSplicer model in section 2.4.2 in the main text, we used modified normal distributions which have flat tails, making our methods robust to measurements that match none of the M candidate squiggles. This section describes the modified normal distribution with µ, σ, C, and α as parameters. The probability density function of the modified normal distribution can be represented by:

Modified normal distributions
where q α and q 1−α are the α and 1 − α quantiles of the normal distribution with mean µ and standard deviation σ. Supplementary Fig. S4 shows the shape of this density function. In the analysis of this paper, we used α = 0.01 (see supplementary section 2.10.2 for the performance assessment with different values of α) and C = 4 to ensure all observed current measurements and their summary values are between −C and C. standard deviations of the summary is to assume that summary values from the same squiggle (i.e., the same read) share the same standard deviation and to estimate the shared standard deviation by using multiple summary values from an entire squiggle. Specifically, for each (entire) squiggle, we already have its alignment to its basecalled read (from the "resquiggle" tool in Tombo during the step of obtaining junction squiggles; see section 2.1 in the main text). Let L denote the number of nucleotides in the basecalled read. We first partition current measurements in the (entire) squiggle into L sections by using their alignments to the read (i.e., consecutive measurements aligned to the same nucleotide belong to the same section). Then, we compute compute the summaryỹ = (ỹ 1 , . . . ,ỹ L ), whereỹ i summarises information in the squiggle at its i-th section. The standard deviation of the summary values from this squiggle Fig. S5: Toy example of alignments for junction squiggle, junction squiggle segmentation, and alignments for segmented junction squiggle a: Example of alignments between a junction squiggle x = (x 1 , x 2 , . . .) and 3 candidate squiggles c 1 , c 2 , c 3 . b: The alignments are represented by a matrix A. c: Segments are defined by consecutive current measurements whose columns in A are the same. d: Current measurements in the junction squiggle are summarised into y. The candidate squiggles of y are denoted by c 1 s , c 2 s , c 2 s and the matrix A s represents the alignments between y and the candidate squiggles c 1 s , c 2 s , c 2 s ; see Supplementary sections 1.6.1 and 1.6.2 for details.
can be estimated byσ where µ i is the mean ofỹ i which can be obtained from the "expected current level model" in Tombo.

Alignments for the summary from junction squiggle segments
The M alignments between the junction squiggle x = (x 1 , . . . , x K ) and the M candidate squig-

Practical implementation of junction squiggle segmentation
Segments with only a small number of measurements yield less stable summary values. However, NanoSplicer model does not incorporate the uncertainty of the summary, and so these summary values possibly cause less optimal performance of NanoSplicer. Thus, we filter out summary values that are computed by ≤ s current measurements -we used s = 3 in the analysis of the paper.
As the number of candidates increases, the number of segments will likely increase as well, as each candidate will contain unique mean-standard deviation elements that require further segmentation. This often leads to increasing numbers of segments with ≤ s measurements (see candidate c m , we can compute the candidate-specific summary y m using the candidate-specific segments. Then, we approximate the assignment probability in equation (3) of the main text using candidate-specific summaries, y 1 , . . . , y M . Specifically, equation (3) in the main text can be written as: where Θ = (c 1 s , . . . , c M s , A s ). We then approximate the likelihood ratio P(y|z=m,c m s ,As) P(y|z=b,c b s ,As) using Here, A s m,b represents alignments of y m to c m s and c b s . Fig. S6: Junction squiggle segmentation A: Example of alignments between a junction squiggle and 3 candidate squiggles. Each current measurement in the junction squiggle (blue) is vertically aligned with its corresponding mean-standard deviation in each candidate squiggle. The black dotted lines are the boundaries of the segments. B: Left and right panels show candidatespecific segments for candidates 3 and 2, respectively (candidate 1 used as a "base candidate").

Squiggle information quality
The NanoSplicer model assumes that the junction squiggle corresponds to the location of the JWR, however Tombo can potentially align the JWR to the incorrect squiggle location. Therefore, we add a step to filter out junction squiggles that do not have a high quality alignment to any candidate squiggle, suggesting they emanate from an off-target read subsequence. First we measure the alignment quality between a junction squiggle and each of its candidate squiggles using the average log likelihood over the nucleotides of the candidate squiggle. Then, we compute the maximum of these alignment qualities across M candidates, referred to as squiggle information quality (SIQ). In practice, we restrict our splice junction identification to JWRs with SIQ bigger than a threshold to ensure their junction squiggles have high quality alignments at least one of their candidate squiggles. Here, we detail each step to computing SIQ.
Step 1: Measure the alignment quality between a junction squiggle and each of its candidate squiggles. Let x and c m denote the junction squiggle and the m-th candidate squiggle with length L m , respectively. We first partition current measurements in x into L m sections using the alignment between x and c m (i.e., consecutive measurements aligned to the same mean-standard deviation belong to the same section). Then, we compute the summary y m = (y m 1 , . . . , y m Lm ), where y m n summarises information in x at its n-th section. Then, we measure the alignment quality between x and c m using where f is the probability density function of the modified normal distribution (see supplementary section 1.5), and µ m i , σ m i are the mean and standard deviation in the candidate squiggle c m s aligned to y m n (see supplementary section 1.6.1 for details on the candidate squiggle c m s of the summary values).
Step 2: Compute the maximum of the alignment qualities across M candidates.
Step 1 provides S 1 , . . . , S M that measure the alignment quality between the junction squiggle and the M candidate squiggles. Then, we define squiggle information quality (SIQ) by and we restrict our splice junction identification to JWRs with SIQ bigger than a threshold to ensure their junction squiggles have high quality alignments at least one of their candidate squiggles.

Prior mixing proportion as a function of nucleotide composition near splice sites
It has been reported that among the canonical Eukaryotic splice junctions (i.e. "GT-AG" splice junctions), "GTC" or "GTT" is less frequent than "GTA" or "GTG" as intron patterns right after 5' splice sites; "AAG" or "GAG" is less frequent than "CAG" or "TAG" as the 3 nucleotides proceeding 3' splice sites (Irimia and Roy, 2008). For example, it's been reported that the relative frequency of "GTA" or "GTG" at the 5' splice site is larger than 94% in human, mouse, S.
cervisiae and S. pombe (Ast, 2004). Also, our empirical analysis of GENCODE Comprehensive gene annotation (v39) (Frankish et al., 2021) shows that the relative frequency of "GTA" or "GTG" at the 5' splice site and the relative frequency of "CAG" or "TAG" at the 3' splice site are 93.0% and 92.0%, respectively. Minimap2 (Li, 2018) provides an option to exploit these different frequencies of nucleotide composition near splice sites. NanoSplicer also enables users to exploit that information by modelling the prior mixing proportion as a function of nucleotide composition near splice sites.
In our biological RNA data analysis (see section 4 in the main text), we incorporated this information as follows. Let "GT + " and "GT − " denote the more frequent (i.e."GTA" or "GTG") and less frequent (i.e. "GTC" or "GTT") intron patterns at 5' splice sites, respectively; similarly, "AG + " and "AG − " denote the more and less frequent ones at 3' splice sites. Then, we model the mixing proportion as h if the m-th candidate junction is non-canonical, 1 if the m-th candidate junction has "GT − -AG − ", r if the m-th candidate junction has "GT + -AG − " or "GT − -AG + ", r 2 if the m-th candidate junction has "GT + -AG + ".
We used r = 9 in our biological RNA data analysis, corresponding to a priori assumption that 1) GT + at the 5' site and AG + at the 3' site are nine times more likely to be splice sites than GT − and AG − , respectively; and 2) these propensities at the 5' and 3' splice sites are independent; see supplementary section 2.10.3 for NanoSplicer performance assessments for difference choices of r. We used h = 1 in the biological RNA data analysis, a priori assuming that non-canonical junctions have the same propensity to be a splice junction as canonical ones with "GT − -AG − ".
The values for these hyperparameters (h and r) or the function for the prior mixing proportions can be modified for different datasets (e.g., datasets from different organisms) based on prior knowledge on the general propensity of a candidate to be a splice junction. Ultimately, methods that learn the prior mixing proportions from the data at hand may provide a better performance (left for future work).

Prior mixing proportion when annotated splice junctions are included as candidates
Methods such as FLAIR (Tang et al., 2020) require a set of splice junctions, either from annotations or from matched short reads, to be provided to guide their corrections. While NanoSplicer does not require those input splice junctions, it provides the option to utilise them, for example, in organisms where comprehensive annotations are available. In our NanoSplicer applications with the option to use the human GENCODE v39 annotations (see sections 4.3 and 4.4 in the main text and supplementary section 2.17.2), we considered the following two types of annotated splice junction defined by using the information in the human GENCODE v39 annotations and included them as candidates: • Type 1 annotated splice junction in NanoSplicer: splice junction that appears in annotated transcripts in the human GENCODE v39 annotations • Type 2 annotated splice junction in NanoSplicer: splice junction that is not a type 1 annotated splice junction, but consists of splice sites appearing in annotated transcripts in the human GENCODE v39 annotations (see Fig. S7 for examples) Then, we model the mixing proportion as P(Z = m) ∝ q if the m-th candidate junction is a type 1 annotated splice junction, (8) and, otherwise, r if the m-th candidate junction has "GT + -AG − " or "GT − -AG + ", r 2 if the m-th candidate junction has "GT + -AG + ".
We used q = 81 in our analysis (sections 4.3 and 4.4 in the main text and supplementary section 2.17.2), giving the same weight to type 1 annotated splice junctions and "GT + -AG + " junctions (as we used r 2 = 81 in the analysis). Thus, we gave a higher weight for junctions appearing in the annotated transcripts (type 1), but treated type 2 annotated splice junctions as other candidates (whose weights are determined by nucleotide compositions near splice sites). The squiggles (fast5 files) of the reads included in our data analysis are available from ENA (Accession: ERS7273453).
When running Guppy, we used configuration files dna r9.4.1 450bps hac.cfg for the synthetic RNA dataset and dna r9.4.1 450bps hac prom.cfg for biological RNA dataset to get fastq files.
2.3 Nanopore data : Mapping of basecalled reads 2.3.1 Mapping nanopore reads to a reference genome (splice-aware) NanoSplicer requires mapped nanopore reads as input. We used minimap2 to map nanopore reads to a reference genome (both synthetic and biological data).
Synthetic RNA nanopore reads: Using minimap2 (version 2.17-r943-dirty), we mapped the synthetic RNA reads to the sequin genome v2.4 with the following set of parameters: "-ax splice -t 32 -k 15 -ub --eqx --secondary=no --sam-hit-only --splice-flank=no". We activated "--eqx" option in minimap2 to output the CIGAR string with mismatch information. By default, the CIGAR string in the output BAM file uses "M" to indicate a match or mismatch. With "--eqx", the matches and mismatches are denoted as "=" and "X", respectively. We used the mismatch information when calculating the junction alignment qualitiy (see supplementary section 2.6). We deactivated the "splice flank" sequence preference option ("--splice-flank=no") as this preference is not present in sequins; see supplementary section 1.8 for details on the intron sequence preference.
We ran minimap2 with the following set of parameters: "-ax splice -t 32 -k 15 -ub --eqx --secondary=no --sam-hit-only". The setting is identical to what we used in the synthetic RNA data above, except for the removal of " --splice-flank=no".

Mapping (synthetic) nanopore reads to the sequin isoforms
In our synthetic data analysis (section 3 in the main text), we defined a ground truth splice junction for each JWR by choosing one among the known 745 sequin splice junctions. To identify a 1-1 correspondence between each read and a sequin isoform, we mapped the reads to the sequin isoforms using minimap2. We ran minimap2 with command "minimap2 -ax mapont -t 32 --secondary=no --sam-hit-only", which is the setting for mapping nanopore reads to a reference without splicing. With the parameter "--secondary=no", the resulting BAM file did not contain any secondary alignments. Furthermore, to remove the potential ambiguity, we filtered out the reads with supplementary alignment(s) to a different isoform.
2.4 Short read data: pre-processing, mapping, and identifying splice junctions supported by the mapped short reads In short-read sequencing data, base qualities near the ends of reads tend to be poor. All leading and trailing bases below a Phred quality score of 20 were removed using Trimmomatic v0.36 (Bolger et al., 2014). We mapped the short reads to GRCh38 using STAR v2.7.3a (Dobin et al., 2013) with the following parameters: "--outSAMtype BAM SortedByCoordi-nate --twopassMode Basic". To improve the mapping quality, we provided the GENCODE Comprehensive gene annotation (v39) (Frankish et al., 2021) to guide the mapping. We used pysam v0.15.2 (https://github.com/pysam-developers/pysam) to find splice junctions supported by the mapped short reads, as well as the number of reads mapped to each of them.

Comparing splice junctions supported by short and long-reads mapping
In the biological RNA data analysis in section 4 of the main text, we used short reads to define a ground truth for each JWR -treating a splice junction with at least 3 mapped short reads within 10 bases of the mapped splice sites as a ground truth for that JWR. Thus, in this section, we compared the number of unique splice junctions identified by both short and long-read mapping and their overlap.
We first identified splice junctions from the short-and long-read mapping results (see sup- were supported by < 5 long-reads; see Fig. S8B. Moreover, these 37,384 JWRs had lower junction alignment qualities (JAQs) than those confirmed by short reads (Fig. S8C), suggesting many were potential mapping errors. Therefore, many of these long-read only junctions were expected to be candidates for correction with Nanosplicer. This was indeed the case, with the number of long-read only junctions decreasing by 12,452 after NanoSplicer correction (from 37,384 to 24,932; see Fig. S8A). In contrast, almost all (98.8%) of long-read junctions confirmed by short-read were retained after Nanosplicer correction. Note that our overall results encompass these long-read only junctions (after NanoSplicer correction), which are part of the 2.9% Nanosplicer error rate we reported in Table 1. As detailed in supplementary section 2.11, most Nanosplicer errors and hence long-read only junctions are "completely missed JWRs" where the initial mapping placed the JWR far from the true splice junction and hence NanoSplicer was unable to include the true junction as a candidate.
Performing the opposite comparison, among the 22,383 unique splice junctions identified by short reads, 16,138 were confirmed by long-read junctions and 6,245 were not; see Fig. S8A.
The 6,245 junctions found only in short reads tend to be supported by a low number of shortreads ( Fig. S8D), suggesting they are low-abundance junctions, which are most likely not identified in long-reads due to differences in sequencing depth.

Junction alignment quality (JAQ)
Basecalling errors can result in low quality alignments between the JWR and the reference genome. Therefore we hypothesized that the initial mapping would perform poorly for JWRs with high basecalling errors and that the advantage of NanoSplicer will be greatest for these. To .
We obtained the number of matched, mismatched, inserted or deleted bases from the CIGAR string in the mapping result (SAM/BAM file). In the CIGAR string from splice-aware mappers,  NanoSplicer failed to run on 0.49% of all JWRs in the synthetic dataset and 0.26% in the biological dataset due to the following two reasons.

Non-analysable JWRs
1. 0.41% and 0.21% of the total JWRs in the synthetic and biological datasets, respectively: Mapped exons were too short, so NanoSplicer could not construct candidate junction motifs (supplementary Fig. S9). These JWRs are potentially related to spurious splice junction mapping. Only 20.5% and 58.4% of these JWRs (the synthetic and biological datasets, respectively) are initially mapped to true splice junctions.
2. 0.08% and 0.05% of the total JWRs in the synthetic and biological datasets, respectively: Tombo failed to align squiggles to reads, so NanoSplicer could not locate junction squiggles. These JWRs may have low squiggle quality. 86.2% and 93.4% of these JWRs (the synthetic and biological datasets, respectively) are initially mapped to true splice junctions. The initial mapping accuracy for these JWRs are lower than that for JWRs for which we successfully ran NanoSplicer (92.2% and 95.7% for the synthetic and biological datasets, respectively).
Note that these JWRs were omitted when we reported the number of JWRs in NanoSplicer output and calculated the accuracy of the NanoSplicer in the main text and all the following supplementary sections. . Then, assuming that most JWRs are well aligned to correct squiggle locations, we choose -0.8 as a SIQ threshold that identifies junction squiggles whose SIQ values much smaller than a majority of SIQs in the distribution (5.4% and 4.5% of all JWRs < -0.8 for synthetic and biological data, respectively). In supplementary section S2.5.1, we assess NanoSplicer performance for different choices of a SIQ threshold and discuss how the choice of thresholds involves trade-offs between accuracy and the ability to identify splice junctions.
2.8 Contribution of SIQ, assignment probability and NanoSplicer correction to the accuracy improvement    Table S1 shows the number of JWRs in each junction alignment quality bin for all JWRs, JWRs after SIQ filtering, and JWRs after SIQ and strongest assignment probability filtering, including completely missed JWRs. Supplementary Table S2 shows more detailed impact of NanoSplicer correction: the number of JWRs which were incorrectly mapped and then corrected by NanoSplicer, as well as JWRs which were correct in the initial mapping but wrongly identified by NanoSplicer. Fig. S11: Contribution of SIQ, assignment probability and NanoSplicer correction to the accuracy improvement. A: Accuracy of NanoSplicer and the initial mapping in the synthetic RNA dataset. B: Accuracy of NanoSplicer and the initial mapping in the biological RNA dataset.  Table S2: Contribution of NanoSplicer correction to the initial mapping result. The table presents, for each junction alignment quality (JAQ) bin, the number of JWRs after SIQ and strongest assignment probability filtering, and among them, the number of JWRs which were incorrectly mapped and then corrected by NanoSplicer, as well as JWRs which were correct in the initial mapping but wrongly identified by NanoSplicer.

JWRs filtered by SIQ and/or assignment probability are enriched in completely missed JWRs
Completely missed JWRs are JWRs whose mapped splice junctions are far away (> 10 nucleotides away in the analysis of the paper) from any true splice junctions, so their true splice junctions are not included as candidate junctions. Here, we assess how well SIQ and/or assignment probability in NanoSplicer recognise and filter out these JWRs. Supplementary Table S3 shows the proportions of JWRs and completely missed JWRs after SIQ and/or assignment probability filtering (these proportions have been computed from the numbers in supplementary Table S1). For each proportion, we performed the hypothesis testing for H 0 : p 1 = p 2 versus H 1 : p 1 < p 2 , where p 1 is the proportion of completely missed JWRs after filtering and p 2 is the proportion of non-completely missed JWRs after filtering. For most cases, the SIQ and/or assignment probability filtered out more JWRs for completely missed JWRs than non-completely missed JWRs (p-value < 10 −15 for all cases except for the case with "*" in supplementary Table S3 which has p-value < 10 −9 ), supporting that SIQ and/or assignment probability filtering help identify JWRs without true junctions as candidates.
2.10 Assessing NanoSplicer performance for different choices of SIQ and strongest assignment probability thresholds, quantiles in the modified normal distribution, and a hyperparameter in the mixing proportions In our analysis in the paper, we chose -0.8 as a SIQ threshold, 0.8 as a strongest assignment

Squiggle information quality (SIQ) and strongest assignment probability threshold
We used randomly selected 100K JWRs from the synthetic data to assess NanoSplicer performance for different choices of SIQ and strongest assignment probability thresholds. Supple-

α for quantiles in the modified normal distributions
Our NanoSplicer model (section 2.4.2 in the main text) uses the modified normal distributions which have flat tails, making our methods robust to measurements that match none of the M candidate squiggles. Supplementary section 1.5 introduces details of these distributions which have α as a parameter to define q α and q 1−α in the distributions. Here, we used the randomly selected 100K JWRs from the synthetic data to assess NanoSplicer performance for different choices of α. Supplementary Fig. S13 show the accuracy of NanoSplicer and the initial mapping for varying α values (0.05, 0.01, 0.001, 0). Note that the modified normal distributions with α = 0 are equivalent to normal distributions. The use of the modified normal distributions (α > 0) helps improve NanoSplicer correction of the initial mapping results, particularly for JWRs with JAQ > 0.8. There was no significant difference among the results of α > 0.

Hyperparameter r in the mixing proportions
In our biological RNA data analysis, we used r = 9 in the prior mixing proportions, corresponding to a priori assumption that 1) GT + at the 5' site and AG + at the 3' site are 9 (=r) times more likely to be splice sites than GT − and AG − , respectively; and 2) these propensities at the 5' and 3' splice sites are independent (see supplementary section 1.8 for details). Here, we assess NanoSplicer performance for different choices of r using randomly selected 100K JWRs from the biological data. Supplementary Fig. S14 shows accuracy of NanoSplicer and the initial mapping for varying r values between 1 to 96. Note that r = 1 corresponds to a priori assumption that nucleotide composition near splice sites is not related to the propensity of a candidate to be a splice junction. Modelling a prior mixing proportion as a function of nucleotide composition near splice sites (r > 1) helps improve NanoSplicer performance. NanoSplicer performance is robust to difference values of r -at all values of r between 1 to 98, NanoSplicer accuracy is higher than that of the initial mapping (95.6%).

JWRs remaining wrong after NanoSplicer correction
Among 3,066,881 JWRs which passed the SIQ and strongest assignment probability filtering in synthetic data analysis [1,724,883 JWRs in biological data analysis], 4.3% [2.9%] of the JWRs still remain wrong after NanoSplicer correction. In this section, we summarise several reasons for these JWRs with NanoSplicer errors.
The majority of these JWRs (66.0% and 66.2% for synthetic and biological data sets, respectively) are completely missed JWRs (supplementary Table S4). Completely missed JWRs occur when the initial mapping is too distant (>10 bases) from the true site for it to be included in the list of candidates and all candidates which can be assessed are incorrect. While Nanosplicer filters out many of these completely missed JWRs (see supplementary section 2.9), those not filtered out make up the bulk of the NanoSplicer errors.
In addition to the above, a smaller proportion of the errors are from JWRs whose squiggles do not have enough information to identify the correct splice junctions. While we implement multiple filtering steps (SIQ and strongest assignment probability) to remove most squiggles that fall into this category, there are trade-offs between accuracy and the ability to identify splice junctions. The current analysis used -0.8 and 0.8 as threshold values for the SIQ and strongest assignment probability filtering. We find that 20.2% [21.2%] and 28.0% [23.2%] of the incorrect JWRs in synthetic data analysis [biological data analysis] have SIQ from -0.5 to -0.8 or a strongest assignment probability from 0.8 to 0.99, respectively; see supplementary   Table S4. So these JWRs passed our filtering, but with lower levels of confidence in their accuracy. Supplementary Fig. S12 shows that the use of more stringent threshold values leads to higher accuracy by filtering out these JWRs, but at the expense of many correctly identified JWRs.
We observed that NanoSplicer performs poorly on these JWRs because our model does not

Number of JWRs Proportion
Completely  Table S4: JWRs remaining wrong after NanoSplicer correction take into account the uncertainty of the summary values, (note that NanoSplicer partitions an observed squiggle into multiple segments, combines noisy measurements of the squiggle into a more stable summary value at each segment, and use the summary values as data in the NanoSplicer model; see section 2.4.1 in the main text for details). One potential way to incorporate the uncertainty is to exploit the likelihood approximation as described in Shim et al. (2021), yielding likelihoods expressed by the estimates of model parameters and their standard errors.
We leave it for future work.

More examples of splice junction correction with NanoSplicer
To demonstrate how squiggles can provide extra information to identify splice junctions, Sup-  Orange line shows the splice junction identified by the initial mapping of the JWR. Insertionbasecalled nucleotides in read that were not part of genome alignment. Deletion -nucleotides in the reference genome that are not part of the read. A2 & B2: Alignments between the junction squiggle for the JWR (blue) and the corresponding candidate squiggles from A1 & B1 (orange and green). Each junction squiggle current measurement is vertically aligned with its assigned mean-standard deviation in the candidate squiggle. The junction motifs for each candidate are shown at the top of each panel. Each nucleotide in the motifs is aligned with its corresponding squiggle position. Panels focus on the junction squiggle areas that distinguish between the candidates (grey background; the absolute difference in log likelihood between the two candidate models is bigger than 1.35; see Supplementary section 2.13 for details).

Additional example in synthetic RNA data
In the reference genome in Supplementary Fig. S15A1, there are two "GT" 5' splice motifs right next to each other, leading to two candidate splice junctions. Based on the ground truth of the synthetic RNA, two exonic bases "GT" preceding the 5' splice site were deleted from the nanopore read, resulting in the JWR initially mapped to the wrong 5' splice site with a 2-base shift. We compared the junction squiggle of this JWR to the candidate squiggles obtained from the true splice junction and the splice junction matching the initial mapping. Supplementary   Fig. S15A2 shows the alignments between the junction squiggle and the two candidate squiggles respectively. The squiggle from the true candidate splice junction is visually a closer match to the junction squiggle. NanoSplicer quantified this squiggle similarity, giving an assignment probability of 0.99 to the true candidate for this JWR.

Additional example in biological RNA data
In the reference genome in Supplementary Fig. S15B1, there are two "AG" 3' splice motifs 4 bases apart at the 3' splice site, leading to two candidate splice junctions. One of the candidate splice junctions is supported by the short read data and is assumed to be the true one. However, the nanopore read in supplementary Fig. S15B1 was mapped to the other candidate splice junction due to basecalling errors near the splice junction (the "A" and "GAAG" preceding and after the true splice junction were basecalled as only "G" in the nanopore read). We compared the junction squiggle of this JWR to the squiggles obtained from both candidate splice junctions.
Supplementary Fig. S15B2 shows the alignments between the junction squiggle and the two candidate squiggles respectively. The shape of the squiggle for the true candidate is clearly a better match for the junction squiggle. NanoSplicer quantified this squiggle similarity, leading to an assignment probability of 0.999 to the true candidate for this JWR.

Identification of junction squiggle areas that distinguish between two candidates
In Fig. 3 in the main text and Fig. S15 for examples of NanoSplicer correcting wrongly mapped JWRs, we highlighted junction squiggle areas that distinguish between two candidates using grey background. We identified those areas as follows. Let y n represent a summary value at the n-th segment of the junction squiggle. Suppose y n is aligned to (µ t , σ t ) and (µ o , σ o ) in the candidate squiggles c t s and c o s , respectively. Summary values from the same squiggle (i.e., the same read) share the same standard deviation, so σ t = σ o = σ. Assuming that c t s is the candidate squiggle identified by NanoSplicer, we denote µ t by µ o +μ. Then, y n ∼ N * (µ o , σ) for the candidate squiggle c o s and y n ∼ N * (µ o +μ, σ) for the candidate squiggle c t s , where N * denotes the modified normal distributions. Then, the n-th segment is junction squiggle areas that distinguish between the two candidates if where f is the probability density function of the modified normal distributions (see Supplementary section 1.5 for details) and χ 2 1,α represents the 1 − α quantile for the χ 2 distribution with 1 degree of freedom. We used α = 0.1 in all figures of this paper.

Run time
The NanoSplicer analysis pipeline includes following steps: 1. Identify the locations of JWRs from mapped reads; 2. Calculate junction alignment quality (JAQ) for each JWR; 3. Perform splice junction identification for each JWR. In steps 2 and 3, the entire analysis is parallelizable because the analysis for each JWR is independent. Multiprocessing is available in NanoSplicer (https://github.com/shimlab/NanoSplicer). Table S5 shows the run time for each step in our analysis of synthetic and biological datasets. Table S6 shows the run time of FLAIR and StringTie2 for biological data analysis. Their run time is shorter compared to NanoSplicer, although this is not unexpected given the higher computational burden of NanoSplicer analysis. A higher computational burden is a common tradeoff of methods based on probabilistic models such as NanoSplicer. Note that splice junction identifications in FLAIR and StringTie2 are not based on probabilistic models. However, due to the flexible nature of NanoSplicer, it can be run on JWRs of interest and not just transcriptomewide. For example, it can be focused only on JWRs below a user-specified JAQ threshold, or on genes and genomics regions of interest. This will greatly shorten run time, while still providing the information required by the user.   Table S6: FLAIR and StringTie2 run time (CPU time) We ran FLAIR (splice site correction step) and StringTie2 on 758,330 mapped reads (BAM file) in the biological data. Note that the run time for FLAIR is for "flair correct" module, which is the module for the splice site correction in the FLAIR package.

Impact of including common non-canonical junctions as candidates
NanoSplicer provides the option of including "GC-AG" and "AT-AC" junctions as candidates, as they are the most prevalent non GT-AG junctions in mammals (∼0.7% and ∼0.05% of splice junctions (Burset, 2000)). Using this option, we have added all GC-AG/AT-AC junctions within 10 bases of each JWR to the candidates initially selected by using the default option ("default candidates"; see section 4 in the main text for details) and re-applied NanoSplicer to the biological data. In this section, we summarise the impact of including GC-AG/AT-AC junctions as candidates by comparing the performance of NanoSplicer with the default candidates and the default candidates + GC-AG/AT-AC junctions; see Table S7.
Adding GC-AG/AT-AC junctions as candidates significantly improved the identification of JWRs with non-canonical ground truth (accuracy increased from 49.7% to 83.8%), because 98.1% of those JWRs in the dataset used GC-AG or AT-AC junctions.
However, the overall impact on NanoSplicer accuracy from including common non-canonical junctions is slight (from 97.1% to 97.2%). Therefore, we don't include GC-AG/AT-AC junctions as candidates by default because it roughly doubles the runtime of NanoSplicer (runtime for each JWR increased from 3 to 6 sec) without much benefit on overall accuracy. Even so, if accurate identification of GC-AG and AT-AC junctions is important, users can use the option to do so, which we would recommend when users need higher accuracy for non GT-AG junctions or are analyzing genes that utilise non GT-AG junctions.
Overall accuracy (calculation from 100K randomly selected JWRs) Accuracy of JWRs whose ground truths are non-canonical splice junctions (8,861 JWRs in total) Accuracy for JWRs whose ground truths are GC-AG/AT-AC junctions (8,689 JWRs in total)  Table S7: Performance of NanoSplicer with the default candidates and the default candidates + GC-AG/AT-AC junctions The numbers in the brackets represent the numbers of JWRs after SIQ and strongest probability filtering in NanoSplicer.
2.16 Improved performance of NanoSplicer in rarer junction identification compared to methods that correct splice junctions using information from other mapped reads Methods such as StringTie2 (Kovaka et al., 2019) correct splice junctions using information from other mapped reads (e.g., nearby splice junctions supported by high read counts), potentially leading less abundant junctions to go undetected. Compared to those methods, NanoS-plicer enables its identification to be independent of other reads, possibly having the advantage in rarer junction identification. To demonstrate this advantage, we compared NanoSplicer with StringTie2 at all sites where we could ascertain if rarer splice junctions were being replaced.
Specifically, using the biological short read dataset (see supplementary sections 2.1 and 2.4 for details of the dataset), we first considered splice junctions with at least 3 mapped short reads to be "true". We then focused on 492 sites where there are multiple nearby true splice junctions.
That is, each site has multiple true junctions whose 5' or 3' splice sites are ≤10nt away from that of another junctions in the site. We will refer to these multiple junctions as a "junction set". To assess the ability of each method to identify rarer junctions, we compared the number of a "fully detected junction set", defined by a junction set where all true junctions are detected by each method using the biological nanopore sequencing dataset (see supplementary sections 2.1, 2.2, and 2.3 for details of the dataset). Among the 492 junction sets, StringTie2 fully detected 10 junction sets (2%) while NanoSplicer was successful for 273 (55.5%). This result illustrates the benefits of NanoSplicer, whose performance is independent of other reads, in identifying rare junctions. We observed similar results when true splice junctions were defined by using different minimum numbers of short reads (supplementary Table S8).  Table S8: The number of junction sets fully detected by NanoSplicer and StringTie2. This table shows, for each minimum number of short reads to define true splice junctions, the number of junction sets constructed by the true splice junctions and the numbers of fully detected junction sets by NanoSplicer and StringTie2. StringTie2 has been applied with two different settings: 1. default long read setting (command "stringtie -L") whose results have been presented in the main text; 2. high sensitivity setting (command "stringtie -L -f 0 -m 30 -t -E 10") which uses relaxed thresholds for the final output to increase the sensitivity of splice junction detection.
Note that NanoSplicer identifies splice junctions only for JWRs whose squiggles are informative, limiting the potential usage of its output in splice junction "quantification" (see main text section 5 for more discussion). Therefore, we assessed the ability of each method to identify rarer junctions using the number of fully detected junction sets, rather than performing splice junction quantification. respectively). However, all isoforms assembled by StringTie2 using these mapped long reads support only the major splice junction, leading to the minor splice junction being undetected (supplementary Fig. S16C).
Supplementary Fig. S16D shows splice junctions identified by NanoSplicer using the same dataset (see main text section 4 for the details). NanoSplicer identified both splice junctions (the major and minor splice junctions are identified at 20 and 7 JWRs, respectively; NanoSplicer did not provide identification results for 5 JWRs due to their poor squiggle information quality; see main text section 4 for the details). This supports that NanoSplicer has a good potential for identifying rare splice junctions as its performance is not affected by other reads. 2.17 Improved performance of NanoSplicer when there are multiple nearby annotated junctions, or the true junctions utilise unannotated splice sites Methods such as FLAIR (Tang et al., 2020) require a set of splice junctions, either from annotations or from matched short reads, to be provided to guide their corrections. While NanoSplicer does not require those input splice junctions, it provides the option to utilise them, for example, in organisms where comprehensive annotations are available. We compared the output of FLAIR with NanoSplicer, including human GENCODE v39 annotations as input, to understand the relative performance of each on different categories of JWRs. Note that we used only annotations as input because the matched short reads have been used to define the ground truths for assessment of the performance (see section 4 in the main text).
We first review differences in functionalities between NanoSplicer and FLAIR in supple- . Note that except for the "annotation only" mode, NanoSplicer includes the default candidate (GT-AG junction + mapped junction).
Then, we assess the relative performance of each approach on the following four categories of JWRs ( Figure S17 and Table S9).
A. JWR whose ground truth is unannotated, i.e., being novel (6,385 JWRs): For JWRs in this category, FLAIR and NanoSplicer with "annotation only" mode output no identified junctions if there is no annotated junction within 10nt. Otherwise, they correct the JWRs to annotated junctions. Thus, they fail to identify unannotated ground truths, resulting in an accuracy of 0%. However, NanoSplicer with "no annotation" or "with annotation" mode still includes default candidates, enabling it to identify ground truths if they are included as default candidates and supported by squiggle matching. NanoS-plicer in these two modes identified ground truths for ∼94% of the JWRs in this category, leading to the improved accuracy compared to the initial mapping ( Figure S17).
B. JWR whose ground truth is annotated and is the only annotated junction within 10nt (1,780,475 JWRs): For JWRs in this category, FLAIR and NanoSplicer with "annotation only" mode correct all JWRs to ground truths because they are the only annotated junctions within 10nt, resulting in an accuracy of 100%. However, NanoSplicer with "no annotation" or "with annotation" mode considers other (unannotated) junctions as default candidates if they exist within 10nt. Moreover, NanoSplicer with "no annotation" can only consider the ground truths as candidates if they are included as default candidates (i.e., mapped splice junctions and all GT-AG junctions). Nevertheless, NanoSplicer in these two modes identified ground truths with a high accuracy of ∼99%, resulting in the improved accuracy compared to the initial mapping ( Figure S17).
C. JWR whose ground truth is annotated and that has multiple annotated junctions within 10nt -the ground truth is not the only annotated junction within 10nt (40,982 JWRs): JWRs in this category have multiple nearby annotated junctions for the correction of FLAIR and NanoSplicer with "annotation only" mode, and one of them is the ground truth. FLAIR chooses one closest to the mapped junctions by default, while NanoSplicer with "annotation only" mode uses information in squiggles for correction.
NanoSplicer with "with annotation" mode also has multiple candidates (including the annotated ones) and identifies one using squiggle information. Thus, NanoSplicer with "annotation only" and "with annotation" modes provides increased accuracy compared to FLAIR ( Figure S17). Fig. S18 shows an example of JWRs in this category where the ground truth is identified by NanoSplicer with "annotation only" or "with annotation" mode, but missed by FLAIR. The JWR in this example has 2 annotated splice junc-tions within 10nt of the mapped junction and FLAIR identified the wrong one which is closer to the initially mapped junction. However, the expected squiggle from the ground truth is visually a closer match to the junction squiggle and NanoSplicer quantified this squiggle similarity, successfully identifying the ground truth with an assignment probability of 0.988. NanoSplicer with "no annotation" can only consider the ground truths as candidates if they are included as default candidates (i.e., mapped splice junctions and all GT-AG junctions). NanoSplicer in this mode shows a lower accuracy compared to NanoSplicer with other modes and FLAIR, and shows similar accuracy compared to the initial mapping ( Figure S17).
D. Completely missed JWR (52,169 JWRs): Completely missed JWRs are those without junctions supported by short reads within 10nt, having no ground truths. Any identification by NanoSplicer and FLAIR for the completely missed JWRs will be incorrect, but both methods can filter out some of them, affecting overall accuracy. NanoSplicer with "annotation only" mode and FLAIR identify splice junctions only for JWRs that have annotated junctions within 10nt, and most ground truths are annotated junctions in this dataset (see the number of JWRs in each category). Thus, they filtered out more completely missed JWRs compared to NanoSplicer with other modes (Table S9). Note that FLAIR filtered out more JWRs (so more completely missed JWRs) compared to NanoSplicer with "annotation only" mode because FLAIR removes entire reads from its identification of splice junctions if they contain at least one JWR without annotated junctions within 10nt.
In summary, compared to FLAIR, NanoSplicer provides improved performance when there are multiple annotated junctions within 10nt of the JWR, or the ground truth utilised unannotated splice junctions ( Figure S17). Particularly, Nanosplicer with "annotation only" mode  Table S9: Numbers of (all) JWRs and completely missed JWRs where each method provides identified splice junctions after its own filtering outperforms FLAIR when there are multiple nearby annotated junctions while giving equivalent results otherwise. Overall accuracy of each method depends on the relative proportion of JWRs in each category. Also, the relative proportion is heavily dependent on the quality of the annotation of an organism or biological system. For example, for a non-model organism or in disease-causing altered mRNA processing where suitable annotations may not be available, the proportion of the category A is large, leading to increased overall accuracy of NanSplicer with "no annotation" or "with annotation" mode.
Comparison of NanoSplicer and FLAIR for JWRs with multiple ground truths within 10nt: The 1,880,011 JWRs belonging to the four categories above have at most one ground truth within 10nt. To assess the performance of each method for JWRs with multiple ground truths within 10nt, we performed an analysis similar to one in Supplementary section 2.16.
Specifically, we assessed the performance of each method by using the number of fully detected junction sets (out of the 492 junction sets; see Supplementary section 2.16 for the definitions of junction set and fully detected junction set); see Table S10 for results of the analysis. NanoSplicer with "annotation only" mode and FLAIR cannot fully detect any of 287 junction sets that include at least one unannotated junction, resulting in 141 and 139 fully detected junction sets, respectively. NanoSplicer with "no annotation" and "with annotation" modes fully detected more junction sets (273 and 277, respectively) due to their ability to identify unannotated junctions. Fig. S18: Example of JWRs in the category C where the ground truth is identified by NanoSplicer with "annotation only" or "with annotation" mode, but missed by FLAIR A: JWR mapping. The reference genome sequence is shown in purple. The mapped nanopore read (blue) shows the basecalled nucleotides of the JWR and how they were aligned to the reference genome. The green and orange lines show the locations of two annotated splice junctions (1 and 2). Annotated junction 1 (orange line) is the splice junction identified by the initial mapping and FLAIR and the annotated junction 2 (green line) is the ground truth splice junction (i.e., the only nearby splice junction supported by short reads). Insertion -basecalled nucleotides in the read that are not part of genome alignment. B: Alignments between the junction squiggle for the JWR (blue) and the corresponding candidate squiggles from A (orange and green). Each junction squiggle current measurement is vertically aligned with its assigned mean-standard deviation in the candidate squiggle. The junction motifs for each candidate are shown at the top of each panel. Each nucleotide in the motifs is aligned with its corresponding squiggle position. Panels focus on the junction squiggle areas that distinguish between the candidates (grey background; the absolute difference in log-likelihood between the two candidate models is bigger than 1.35; see Supplementary section 2.13 for details).  Table S10: The number of junction sets fully detected by NanoSplicer and FLAIR. This table shows the number of junction sets constructed by the true splice junctions (junction sets consisting of annotated junctions and junction sets containing at least one unannotated junction) and the numbers of fully detected junction sets by NanoSplicer with three different modes and FLAIR. Note that "fully detected junction set" is defined by a junction set where all true junctions are detected by each method.
Fig. S19: JWRs filtered out by SIQ and strongest assignment probability are unlikely to be a random subset of all JWRs. In the synthetic data analysis (section 3 in the main text), we identified 649 unique sequins splice junctions as ground truths for 3,333,887 JWRs (not included the completely missed JWRs). We grouped the JWRs based on their ground truths, leading to 649 groups of JWRs. JWRs in each group share the same ground truth. Out of the 3,333,887 JWRs, 349,703 JWRs have been filtered out by using SIQ and strongest assignment probability in our analysis of section 3. For each group, we computed the proportion of the filtered JWRs. Panel A shows the empirical distribution of that proportions constructed using all 649 groups (blue) and panel B shows those distributions for different sets of groups with different ranges in their size (i.e., the number of JWRs). To compare those distributions with that when a random subset of JWRs is filtered out, we randomly selected 349,703 JWRs from the 3,333,887 JWRs, filtered out them, computed the proportions of the filtered JWRs, and plotted similar empirical distributions (in orange in panels A and B). As expected, these distributions for the random filtering (orange) have modes around 0.1, showing similar proportions across groups (except for groups with 0-10 JWRs which have noisy proportion estimates). Compared to that (i.e., expectation by chance), the proportions after the SIQ and strongest assignment probability filtering showed higher variations (i.e., wider distributions), supporting that JWRs filtered out by these filtering criteria are unlikely to be a random subset of all JWRs. Note that the group with 0-10 JWRs have noisy proportion estimates, showing distributions different from the expectation.