An expectation–maximization framework for comprehensive prediction of isoform-specific functions

Abstract

Motivation: Advances in RNA sequencing technologies have achieved an unprecedented accuracy in the quantification of mRNA isoforms, but our knowledge of isoform-specific functions has lagged behind. There is a need to understand the functional consequences of differential splicing, which could be supported by the generation of accurate and comprehensive isoform-specific gene ontology annotations.

Results: We present isoform interpretation, a method that uses expectation–maximization to infer isoform-specific functions based on the relationship between sequence and functional isoform similarity. We predicted isoform-specific functional annotations for 85 617 isoforms of 17 900 protein-coding human genes spanning a range of 17 430 distinct gene ontology terms. Comparison with a gold-standard corpus of manually annotated human isoform functions showed that isoform interpretation significantly outperforms state-of-the-art competing methods. We provide experimental evidence that functionally related isoforms predicted by isoform interpretation show a higher degree of domain sharing and expression correlation than functionally related genes. We also show that isoform sequence similarity correlates better with inferred isoform function than with gene-level function.

Availability and implementation: Source code, documentation, and resource files are freely available under a GNU3 license at https://github.com/TheJacksonLaboratory/isopretEM and https://zenodo.org/record/7594321.

Executing the E step of the algorithm is computationally challenging, and therefore we repeatedly split the isoforms into 200 random subsets and use them to guide the search instead of the full log-likelihood. Here, a value on the x-axis corresponds to one optimization step, i.e. a random partition of all isoforms into 200 sets and optimization of the GO term assignments within each set. The y-axis shows the sum of likelihood changes divided by the number of log-likelihood terms over all 200 sets, starting from the difference between the value of the objective after the second step and its value after the first step. The E step terminates when the sum of changes over its last 25 partitions does not exceed a small threshold, after which the M step optimizes the parameters that map the number of shared GO terms to the normalized alignment score. The figure was generated for the optimization of GO Molecular Function+Interpro2GO.

The x-axis shows between 0 and 5 shared terms since the number of pairs that share a given number of terms is smaller when breaking down to the 3 subontologies than when the terms are combined. Left: for isoforms, the greatest increase in correlation is for BP and CC, with some increase for MF. Right: for GO gene-level annotation there is a modest increase in correlation for the 3 subontologies.

Table S1: Availability of predictions or executable code of previous approaches to isoform prediction. The table shows published algorithms and indicates whether predictions made by the algorithm are available ("Predictions?") or whether a script or program is available that could be used to generate predictions for the algorithm ("Executable?"). We followed links from the original publications and searched for updated links using standard internet search engines. In some cases, following the original links produces a 404 Page Not Found error (e.g., [1,2]). In others, the original papers did not provide predictions or code (e.g., [6]).
(*) We did not test DeepIsoFun, because it was presented by the same group as DIFFUSE, which is a later paper and was reported to outperform DeepIsoFun.
(**) IsoFun, Diso-Fun and DMIL-IsoFun require a license to MATLAB [10,15,13]; FINER provided predictions for 471 GO terms but none of its predictions matched entries in our gold standard [16]. No open-source version is available.
(***) Unable to run provided code.
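
The partition-and-stop schedule described in the E-step convergence caption above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: `optimize_subset` is a hypothetical callback that optimizes GO term assignments within one subset and returns the likelihood change together with the number of log-likelihood terms it covered, and the `threshold` and `max_steps` defaults are placeholders, since the caption only speaks of "a small threshold".

```python
import random
from collections import deque


def random_partition(items, n_sets, rng):
    """Shuffle the items and deal them into n_sets disjoint subsets."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [shuffled[i::n_sets] for i in range(n_sets)]


def run_e_step(isoforms, optimize_subset, n_sets=200, window=25,
               threshold=1e-4, max_steps=10_000, seed=0):
    """Repeat random partitions until the recent normalized changes are small.

    optimize_subset(subset) is a hypothetical callback returning
    (likelihood_change, n_loglik_terms) for that subset.
    Returns the number of optimization steps (partitions) performed.
    """
    rng = random.Random(seed)
    recent = deque(maxlen=window)  # changes of the last `window` partitions
    steps = 0
    while steps < max_steps:
        total_change, total_terms = 0.0, 0
        for subset in random_partition(isoforms, n_sets, rng):
            change, n_terms = optimize_subset(subset)
            total_change += change
            total_terms += n_terms
        # Sum of changes divided by the number of log-likelihood terms.
        recent.append(total_change / max(total_terms, 1))
        steps += 1
        if len(recent) == window and sum(recent) <= threshold:
            break
    return steps
```

With a callback that never improves the objective, the loop stops as soon as the 25-partition window is full; with one that always reports a large gain, it runs until `max_steps`.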

Supplementary Note 1
A reduction is a method used in theoretical computer science to transform one problem into another problem. In this section, we show that the graph 3-coloring problem, which is NP-complete, can be transformed into the E step of the isoform function assignment problem as posed in the main manuscript (we will call it isoform-GO-assignment for brevity). Although it remains to be proved, it is widely believed that no polynomial-time algorithms exist for finding solutions to NP-complete problems. Colloquially speaking, NP-complete problems belong to a class of problems that are difficult to solve efficiently. The purpose of this proof is to motivate the need for a heuristic (approximation) at the E step of the EM algorithm described in the main text. Given a graph $G(V, E)$, where $V$ is the set of vertices and $E \subset V \times V$ is the set of edges, a $k$-coloring assigns to each node $v \in V$ a label $l_v \in \{1, 2, \ldots, k\}$ such that if $(u, v) \in E$ then $l_v \neq l_u$.
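
As a concrete reading of the k-coloring definition above, the following minimal sketch checks whether a labeling is a valid k-coloring. The function name and the dictionary-based representation of labels are illustrative choices, and the sketch assumes every edge endpoint has a label.

```python
def is_valid_k_coloring(edges, labels, k):
    """Return True if `labels` is a valid k-coloring of the graph.

    edges  : iterable of (u, v) node pairs
    labels : dict mapping each node to a label in {1, ..., k}
    """
    # Every label must come from the allowed set {1, ..., k}.
    if any(not 1 <= label <= k for label in labels.values()):
        return False
    # Adjacent nodes must receive different labels.
    return all(labels[u] != labels[v] for u, v in edges)
```

For a triangle, `is_valid_k_coloring([("a", "b"), ("b", "c"), ("a", "c")], {"a": 1, "b": 2, "c": 3}, 3)` holds, while any labeling that reuses a label on two of its nodes fails the check.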
Let $G(V, E)$ be an instance of 3-coloring, i.e. any input to the 3-coloring problem. We perform the following polynomial-time construction of an instance of isoform-GO-assignment:
1. For each node $v \in V$ we create one isoform.
2. Assign the isoforms to genes arbitrarily. Every gene has the same set of 3 GO terms.
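
The proof that follows also relies on observed sequence similarities and on model parameters fixed "by the construction". The listed steps do not state them, but the values quoted in the proof appear to determine them. The derivation below is an inference from those quoted values, not a statement taken from the construction itself; the symbols $s(n)$ (predicted similarity of a pair sharing $n$ GO terms) and $a_{ij}$ (observed similarity) are introduced here only for illustration.

```latex
\begin{align*}
  % Quadratic form used in the proof for a pair sharing n GO terms
  s(n) &= \beta_0 + \beta_1 n + \beta_2 n^2 \\
  % Values quoted in the proof
  s(0) &= \beta_0 = -1, \qquad
  s(1) = \beta_0 + \beta_1 + \beta_2 = 1, \qquad
  s(2) = \beta_0 + 2\beta_1 + 4\beta_2 = 3 \\
  % Solving the linear system
  \Rightarrow\quad \beta_0 &= -1, \quad \beta_1 = 2, \quad \beta_2 = 0,
  \qquad\text{so}\qquad s(3) = -1 + 2 \cdot 3 = 5 \\
  % Observed similarities implied by the proof
  a_{ij} &=
  \begin{cases}
    -1 & \text{if } (i, j) \in E \\
    \phantom{-}0 & \text{otherwise}
  \end{cases}
\end{align*}
```

The value $s(3) = 5$ agrees with the absolute difference of 5 that the proof reports for pairs sharing three GO terms, which supports this reading.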
Claim 1. There is a 3-coloring for the graph if and only if there is an isoform-GO-assignment where the sum of absolute differences between predicted and observed sequence similarities is $\binom{|V|}{2} - |E|$, i.e. the sum is equal to the number of vertex pairs that do not have an edge between them.
Proof. First direction: Assume that there is a 3-coloring for $G$. We assign GO terms as follows: for each isoform $i$, assign the GO term with the index of the color of its corresponding node in the graph. Since nodes that have an edge between them do not share a color, the corresponding isoforms will not share a GO term, and the predicted sequence similarity between them will be $\beta_0 = -1$. By the construction this is exactly the observed sequence similarity score, so the sum of differences for these node pairs will be 0. For node pairs that do not have an edge between them, the corresponding isoforms can share either one GO term or zero GO terms. So the predicted sequence similarity score will be either 1 or $-1$, in both cases an absolute difference of 1 from the observed sequence similarity of 0. The total sum of absolute differences is then $\binom{|V|}{2} - |E|$, which is the number of these node pairs.
Second direction: Now assume that there is an isoform function assignment such that the sum of absolute differences between predicted and observed sequence similarities is $\binom{|V|}{2} - |E|$. First, we will show that the sum of absolute differences over isoform pairs $i$ and $j$ such that $(i, j) \notin E$ is at least $\binom{|V|}{2} - |E|$. If $i$ and $j$ share zero or one GO terms, then as we have seen in the other direction of the proof, the absolute difference between the predicted and observed sequence similarity is 1. If they share two GO terms, then the predicted similarity is $\beta_0 + \beta_1 \cdot 2 + \beta_2 \cdot 2^2 = 3$, and the absolute difference is $|3 - 0| = 3$. Similarly, if they share three GO terms the absolute difference is 5. Since, as we have shown, the minimal absolute difference is 1, and since there are $\binom{|V|}{2} - |E|$ such isoform pairs, the sum of absolute differences between predicted and observed sequence similarity scores for these isoform pairs is at least $\binom{|V|}{2} - |E|$.
Since absolute differences are non-negative, and since the total sum of absolute differences in the solution is $\binom{|V|}{2} - |E|$, all other differences must be 0. Since the observed sequence similarity between all other pairs of isoforms, i.e. those for which $(i, j) \in E$, is $-1$, this is also the predicted sequence similarity score between them, and this can only happen if they do not share a GO term. Now, assign to each node the color that corresponds to the index of the GO term that was assigned to its corresponding isoform. If the isoform was assigned more than one GO term, arbitrarily select one term/color from those that were assigned to it. By the construction, no two adjacent nodes will be assigned the same color. This completes the proof.
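
Claim 1 can be checked by brute force on very small graphs. The sketch below is an illustration only and rests on several assumptions that are not stated explicitly in the construction: the parameters $\beta_0 = -1$, $\beta_1 = 2$, $\beta_2 = 0$ (the only values consistent with the predicted similarities $-1$, $1$, $3$ quoted in the proof), observed similarities of $-1$ for adjacent pairs and $0$ otherwise, and the requirement that every isoform receives a non-empty subset of the 3 GO terms. All function names are hypothetical.

```python
from itertools import combinations, product
from math import comb

GO_TERMS = (0, 1, 2)
BETA = (-1, 2, 0)  # assumed: inferred from the values quoted in the proof

# Every isoform is assumed to receive a non-empty subset of the 3 GO terms.
NONEMPTY_SUBSETS = [frozenset(s) for r in (1, 2, 3)
                    for s in combinations(GO_TERMS, r)]


def predicted_similarity(n_shared):
    """Quadratic map from the number of shared GO terms to similarity."""
    b0, b1, b2 = BETA
    return b0 + b1 * n_shared + b2 * n_shared ** 2


def min_l1_cost(n_nodes, edge_list):
    """Smallest sum of |predicted - observed| over all GO assignments."""
    edges = {frozenset(e) for e in edge_list}
    pairs = list(combinations(range(n_nodes), 2))
    best = float("inf")
    for assignment in product(NONEMPTY_SUBSETS, repeat=n_nodes):
        cost = 0
        for u, v in pairs:
            observed = -1 if frozenset((u, v)) in edges else 0  # assumed
            shared = len(assignment[u] & assignment[v])
            cost += abs(predicted_similarity(shared) - observed)
        best = min(best, cost)
    return best


def is_3_colorable(n_nodes, edge_list):
    """Exhaustively test all 3^n colorings."""
    return any(all(c[u] != c[v] for u, v in edge_list)
               for c in product(range(3), repeat=n_nodes))


def claim_target(n_nodes, edge_list):
    """Number of vertex pairs without an edge: C(|V|, 2) - |E|."""
    return comb(n_nodes, 2) - len(edge_list)
```

Under these assumptions, a triangle with one pendant node is 3-colorable and its minimum cost equals the target from Claim 1, whereas the complete graph on four nodes is not 3-colorable and its minimum cost stays strictly above the target. The search enumerates $7^{|V|}$ assignments, so it is only usable for a handful of nodes, which is in keeping with the hardness argument.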

Notes
1. The reduction can be slightly changed such that isoforms can be left without any GO term assigned to them in the definition of the GO assignment problem. To obtain this, we connect each node/isoform to $|E|$ new nodes that are connected only to it and have sequence similarity 1 to it; it is then easy to see that in the optimal solution each isoform is assigned a GO term.
2. We used the $L_1$ norm for the difference between predicted and observed sequence similarities in the definition of the GO assignment problem, but the same reduction can be done with the $L_2$ norm with minimal changes.