DMIL-IsoFun: predicting isoform function using deep multi-instance learning

3 Results and validation

In this section, we first compare the performance of DMIL-IsoFun against six state-of-the-art methods on predicting the functions of maize isoforms. We then analyze the respective contributions of each subnet of DMIL-IsoFun and of the sequence similarity and co-expression network, and examine its ability to differentiate isoforms associated with photosynthesis. We further test DMIL-IsoFun on the human genome.

3.1 Performance comparison with the existing methods

The 2020 B73 v5 genome assembly project provides annotations that enable direct performance evaluation and comparison at the isoform level, instead of the typical approximate gene-level evaluation obtained by aggregating isoform-level predictions. We randomly partition the isoforms into a training set (80%) and a validation set (20%) for 10 independent rounds, and ensure that the isoforms of the same gene are assigned to the same set in each round. We compare DMIL-IsoFun against miSVM (Eksi et al., 2013), iMILP (Li et al., 2014b), DeepIsoFun (Shaw et al., 2019), DIFFUSE (Chen et al., 2019), IsoFun (Yu et al., 2020) and Disofun (Wang et al., 2020). All the input parameters are set as suggested by the authors, or optimized within the suggested ranges. The parameter values for DMIL-IsoFun are given in Supplementary Table S4.
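The gene-aware partition described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_by_gene`, the greedy fill strategy and the toy identifiers are assumptions made for the example. The key point is that genes, rather than isoforms, are shuffled, so sibling isoforms never end up on different sides of the split.

```python
import random
from collections import defaultdict


def split_by_gene(isoform_to_gene, train_fraction=0.8, seed=0):
    """Split isoforms into training/validation lists without splitting a gene."""
    # Group isoform IDs by their parent gene.
    gene_to_isoforms = defaultdict(list)
    for isoform, gene in isoform_to_gene.items():
        gene_to_isoforms[gene].append(isoform)

    # Shuffle genes (not isoforms), so sibling isoforms always stay together.
    genes = sorted(gene_to_isoforms)
    random.Random(seed).shuffle(genes)

    # Greedily fill the training set until it holds about train_fraction
    # of all isoforms; every remaining gene goes to the validation set.
    target = train_fraction * len(isoform_to_gene)
    train, valid = [], []
    for gene in genes:
        bucket = train if len(train) < target else valid
        bucket.extend(gene_to_isoforms[gene])
    return train, valid


# Toy mapping (hypothetical IDs): 5 genes with 2 isoforms each.
toy = {f"g{g}-T{i:03d}": f"g{g}" for g in range(5) for i in (1, 2)}
train, valid = split_by_gene(toy, seed=1)
```

Repeating the call with ten different `seed` values would give ten independent rounds in the spirit of the protocol above.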

For a comprehensive evaluation, we use four widely used evaluation metrics: AUROC, AUPRC, F_max and S_min (Jiang et al., 2016). AUROC and AUPRC are widely adopted for binary classification; we compute them for each GO term and report the average over all terms. AUROC is the area under the receiver operating characteristic curve. AUPRC is the area under the precision-recall curve, which is more sensitive to class imbalance than AUROC. F_max is the maximum harmonic mean of precision and recall across all possible thresholds on the predicted isoform-term association matrix $Z \in \mathbb{R}^{n \times |T|}$ (Jiang et al., 2016). S_min uses the information-theoretic analogs of precision and recall based on the GO hierarchy to measure the minimum semantic distance between the predictions and the ground truth across all possible thresholds (Jiang et al., 2016; Zhou et al., 2019). The first two metrics are term-centric and the last two are gene (isoform)-centric. These metrics quantify the performance of isoform function prediction from different perspectives; as such, it is difficult for one approach to consistently outperform the others across all the metrics. Note that, unlike the other metrics, a smaller value of S_min indicates better performance.
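To make the isoform-centric F_max concrete, the sketch below follows the CAFA-style definition (Jiang et al., 2016) as we understand it: at each threshold, precision is averaged over isoforms with at least one predicted term, recall is averaged over all annotated isoforms, and the best harmonic mean over thresholds is returned. The function name `f_max`, the 0.01 threshold grid and the choice to skip isoforms without any annotation are assumptions of this example, not details taken from the paper.

```python
def f_max(scores, labels, thresholds=None):
    """Isoform-centric F_max.

    scores[i][j]: predicted score in [0, 1] of isoform i for GO term j.
    labels[i][j]: 1 if isoform i is annotated with term j, else 0.
    """
    if thresholds is None:
        thresholds = [k / 100 for k in range(1, 100)]  # assumed grid
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for score_row, label_row in zip(scores, labels):
            n_true = sum(label_row)
            if n_true == 0:
                continue  # isoforms without annotations are skipped
            predicted = [s >= t for s in score_row]
            n_pred = sum(predicted)
            tp = sum(1 for p, y in zip(predicted, label_row) if p and y)
            recalls.append(tp / n_true)
            if n_pred > 0:
                # Precision only counts isoforms with at least one prediction.
                precisions.append(tp / n_pred)
        if not precisions or not recalls:
            continue
        precision = sum(precisions) / len(precisions)
        recall = sum(recalls) / len(recalls)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

For example, `f_max([[0.9, 0.1], [0.2, 0.8]], [[1, 0], [0, 1]])` evaluates to 1.0, because a threshold of 0.5 separates the annotated terms perfectly.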

From the average results in Table 1, we can see that DMIL-IsoFun almost always outperforms the compared methods across the four evaluation metrics. More specifically, DMIL-IsoFun improves the AUROC, S_min and F_max of the second-best method (DIFFUSE) by at least 63.3%, 29.6% and 40.8%, respectively, which supports the effectiveness of DMIL-IsoFun in leveraging isoform sequences, RNA-seq datasets and gene–isoform relations to differentiate the functions of individual isoforms. Like DMIL-IsoFun and DIFFUSE, DeepIsoFun also builds on deep neural networks, but it is outperformed by the former two. This is because DeepIsoFun solely uses isoform expression data and initializes each isoform with all the annotations of its gene, without accounting for the important gene–isoform relation. For similar reasons, iMILP also loses to DIFFUSE and DMIL-IsoFun. miSVM takes into account the gene–isoform relations, but it is still outperformed by the other methods because it relies solely on isoform expression data. DIFFUSE also leverages the sequence and co-expression information, but it does not model the gene–isoform relation well. As a result, it loses to DMIL-IsoFun by a large margin. Another possible cause is that DMIL-IsoFun, by using a GCN, combines sequence and co-expression network data more effectively than DIFFUSE. The performance margin between DMIL-IsoFun and DIFFUSE suggests that our choice of a GCN for refining isoform-level annotations is more effective than the CRF approach adopted by DIFFUSE. Both Disofun and IsoFun integrate the gene-level interactions and the co-expression network, but they neglect the isoform sequence data, which encode important functional sites and domains that help differentiate the functions of individual isoforms. As such, they often lose to DIFFUSE and DMIL-IsoFun.

We observe that the AUPRC values of MF and BP are significantly lower than that of CC. This is because, among the 387 GO terms and 31 040 isoforms retained for the experiments, the 32 CC terms have an average of 1054 annotations per term, the 128 MF terms have an average of 1023 annotations per term and the 227 BP terms have an average of 919 annotations per term. On average, a CC term thus has more annotated isoforms than the others: 14.7% more than a BP term and 3% more than an MF term. The number of CC terms used in the experiments is also the smallest. The prediction of CC terms is therefore relatively easier than that of BP and MF terms. We further used a signed-rank test to check the difference between DMIL-IsoFun and each compared method; all the P-values are smaller than 0.01. In summary, these results and comparisons demonstrate the effectiveness of leveraging deep MIL and GCNs for isoform function prediction.
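The relative improvements over DIFFUSE quoted above can be rechecked from Table 1 with a few lines of arithmetic. The sketch below copies the DIFFUSE and DMIL-IsoFun rows from the table and reports, for each metric, the smallest relative gain over the three GO branches; for S_min the sign is flipped because lower is better. The dictionary layout and the helper name `min_relative_gain` are choices made for this example only.

```python
# Values copied from the DIFFUSE and DMIL-IsoFun rows of Table 1,
# in the order (CC, MF, BP).
diffuse = {
    "AUROC": (0.516, 0.502, 0.505),
    "S_min": (0.801, 2.307, 2.561),
    "F_max": (0.512, 0.289, 0.396),
}
ours = {
    "AUROC": (0.854, 0.846, 0.825),
    "S_min": (0.564, 1.259, 1.576),
    "F_max": (0.721, 0.705, 0.724),
}


def min_relative_gain(metric, lower_is_better=False):
    """Smallest relative improvement of DMIL-IsoFun over DIFFUSE across branches."""
    gains = []
    for base, new in zip(diffuse[metric], ours[metric]):
        change = (base - new) if lower_is_better else (new - base)
        gains.append(change / base)
    return min(gains)


gain = {
    "AUROC": min_relative_gain("AUROC"),
    "S_min": min_relative_gain("S_min", lower_is_better=True),
    "F_max": min_relative_gain("F_max"),
}
for metric, value in gain.items():
    print(f"{metric}: at least {100 * value:.1f}%")
```

With the rounded table entries, the three minima come out at roughly 63%, 30% and 41%, in line with the figures quoted in the text up to rounding.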

Table 1.

Experimental results of predicting GO annotations of individual isoforms of maize

Methods	CC				MF				BP
	AUROC	AUPRC	$S_{\min} ↓$	F_max	AUROC	AUPRC	$S_{\min} ↓$	F_max	AUROC	AUPRC	$S_{\min} ↓$	F_max
miSVM	0.470	0.492	2.218	0.417	0.505	0.074	1.259	0.063	0.528	0.033	1.007	0.107
iMILP	0.628	0.494	2.265	0.554	0.530	0.119	3.351	0.089	0.579	0.044	3.351	0.106
IsoFun	0.557	0.467	2.030	0.674	0.561	0.149	3.445	0.250	0.529	0.099	3.445	0.347
Disofun	0.595	0.510	2.112	0.381	0.526	0.123	3.451	0.177	0.579	0.055	4.615	0.180
DeepIsoFun	0.604	0.554	0.869	0.408	0.576	0.202	2.275	0.303	0.552	0.178	2.501	0.420
DIFFUSE	0.516	0.553	0.801	0.512	0.502	0.179	2.307	0.289	0.505	0.177	2.561	0.396
DMIL-IsoFun	0.854	0.789	0.564	0.721	0.846	0.265	1.259	0.705	0.825	0.093	1.576	0.724

Note: The data in boldface are the statistically best result per column by pairwise t-test.

Besides the isoform-level evaluation, we report the gene-level evaluation results of DMIL-IsoFun and the compared methods in Supplementary Table S3. Moreover, we applied DMIL-IsoFun and the other methods to the human dataset and report the results (approximate evaluation at the gene level) in Supplementary Table S4. We observe that DMIL-IsoFun again achieves significantly better results than the competing methods across genomes and evaluation measures.

3.2 Further analysis

3.2.1 Analyzing the effects of model components

To investigate which components of DMIL-IsoFun contribute to its improved performance, we perform an ablation study by removing components from the model and measuring how its performance is affected. We introduce two variants: (i) DMIL-IsoFun-GCN uses only one-hot encoding to encode the isoform sequence and the GCN to fuse sequence and expression data, along with the known isoform-level annotations; (ii) DMIL-IsoFun-CNN directly uses the annotations of isoforms initialized by MILCNN, without using the RNA-seq data or the GCN. Following the same experimental configuration, we list the prediction results of these variants in Table 2. We observe that the results of DMIL-IsoFun-CNN are the lowest. This indicates the necessity of using the GCN and RNA-seq datasets to further differentiate the annotations of individual isoforms initialized by the MILCNN subnet. DMIL-IsoFun-GCN ranks second, which again confirms the power of the GCN in merging sequence and expression data to explore the non-linear relationship between isoforms and GO terms. In practice, the AUROC of DMIL-IsoFun-GCN decreases by at least 12.2%, and the AUPRC drops by at least 25.1% when not using the sequence data, which is consistent with a previous study showing that sequence data contain important functional sites and domains that help differentiate the functions of individual isoforms (Chen et al., 2019). From the ablation study, we can conclude that both the MILCNN and the GCN subnets contribute to the improved performance of isoform function prediction.

Table 2.

Prediction results of DMIL-IsoFun and its variants

		AUROC	AUPRC	$S_{\min} ↓$	F_max
CC	DMIL-IsoFun	0.854	0.789	0.564	0.721
	DMIL-IsoFun-GCN	0.693	0.693	0.898	0.540
	DMIL-IsoFun-CNN	0.508	0.508	0.953	0.307
MF	DMIL-IsoFun	0.846	0.265	1.259	0.705
	DMIL-IsoFun-GCN	0.711	0.193	1.817	0.361
	DMIL-IsoFun-CNN	0.499	0.105	2.562	0.086
BP	DMIL-IsoFun	0.825	0.093	1.576	0.635
	DMIL-IsoFun-GCN	0.684	0.071	2.833	0.436
	DMIL-IsoFun-CNN	0.486	0.041	3.692	0.109

Note: DMIL-IsoFun-GCN only uses the GCN, i.e. the features of isoform nodes use one-hot encoding; DMIL-IsoFun-CNN directly uses the MILCNN, i.e. this variant does not use co-expression data.

3.2.2 Validation of predicted isoform functions

We further take ‘DNA binding’ (GO: 0003677), ‘zinc ion binding’ (GO: 0008270) and ‘phosphatidylinositol phosphate kinase activity’ (GO: 0016307) as testbeds, since each of these terms annotates isoforms spliced from the same gene (Breuza et al., 2016). Table 3 lists the known annotations of 10 isoforms of 4 genes. DMIL-IsoFun correctly differentiates 9 out of 10, a higher accuracy than that of the other methods. We observe that the two deep neural network-based models (DeepIsoFun and DIFFUSE) are inclined to assign the same GO term to all isoforms of a multi-isoform gene, since they initialize all annotations of a gene without differentiation and do not capture the gene–isoform relation well. In contrast, DMIL-IsoFun takes this important relation into account and differentiates the initial annotations using MILCNN. We also see that the other four methods (Disofun, IsoFun, iMILP and miSVM) are biased toward negative predictions. This is because the positive annotations of isoforms are much fewer than the negative ones, and these methods do not take into account the intrinsic class imbalance of isoform function prediction. For a similar reason, DIFFUSE and DeepIsoFun also make more negative predictions than DMIL-IsoFun, which explicitly accounts for the class imbalance. We also observe that DMIL-IsoFun has a higher recall than the other methods, owing to its handling of class imbalance. Overall, these case studies confirm that DMIL-IsoFun can differentiate the GO annotations of individual isoforms spliced from the same gene.

Table 3.

Known and predicted positive/negative (✓/×) annotations of individual isoforms for each compared method

GO term	Gene	Isoform	Known annotations	DMIL-IsoFun	DIFFUSE	DeepIsoFun	Disofun	IsoFun	iMILP	miSVM
DNA binding (GO: 0003677)	Zm00001e036212	Zm00001e036212-T001	×	×	×	×	✓	×	×	×
		Zm00001e036212-T002	✓	✓	×	×	✓	×	×	×
Zinc ion binding (GO: 0008270)	Zm00001e026664	Zm00001e026664-T001	×	×	×	×	×	×	×	×
		Zm00001e026664-T002	✓	✓	×	×	×	×	×	×
	Zm00001e012033	Zm00001e012033-T001	×	×	×	×	×	×	×	×
		Zm00001e012033-T002	✓	✓	×	✓	×	×	×	×
Phosphatidylinositol phosphate kinase activity (GO: 0016307)	Zm00001e012593	Zm00001e012593-T001	✓	✓	✓	×	×	✓	×	✓
		Zm00001e012593-T002	✓	✓	✓	×	×	×	×	×
		Zm00001e012593-T003	×	✓	×	×	×	×	×	×
		Zm00001e012593-T004	×	×	×	×	×	×	×	×
Accuracy	—	—	—	9/10	7/10	6/10	5/10	6/10	5/10	6/10
Recall	—	—	—	5/6	2/6	1/6	1/6	1/6	0/6	1/6
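The accuracy row of Table 3 can be recomputed directly from the ✓/× columns. In the sketch below, each method's column is transcribed as a string over the ten isoform rows (top to bottom), with `1` standing for ✓ and `0` for ×; this encoding and the helper name `n_correct` are conventions chosen for the example.

```python
# Columns of Table 3, read top to bottom over the ten isoform rows
# ("1" = positive annotation, "0" = negative annotation).
known = "0101011100"
predictions = {
    "DMIL-IsoFun": "0101011110",
    "DIFFUSE": "0000001100",
    "DeepIsoFun": "0000010000",
    "Disofun": "1100000000",
    "IsoFun": "0000001000",
    "iMILP": "0000000000",
    "miSVM": "0000001000",
}


def n_correct(predicted, truth):
    """Number of isoforms whose predicted label matches the known label."""
    return sum(p == t for p, t in zip(predicted, truth))


accuracy = {m: n_correct(col, known) for m, col in predictions.items()}
for method, hits in accuracy.items():
    print(f"{method}: {hits}/{len(known)}")
```

Running it reproduces the accuracy counts listed in the table for all seven methods.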

In the B73 v5 genome assembly data, the protein produced by ‘Zm00001e042100-T001’ is a component of the psaA/psaB protein involved in the photosystem (Jiao et al., 2005). This isoform is unique to plants (e.g. maize and Arabidopsis) and participates in photosynthesis, which corresponds to GO: 0015979. Among the nine terms positively annotated to ‘Zm00001e042100-T001’, DMIL-IsoFun correctly identifies six (see Table 4), more than any of the compared methods. This case study suggests that DMIL-IsoFun can more effectively integrate multi-type data to differentiate the GO annotations of isoforms at a finer granularity.

Table 4.

Prediction of the compared methods on maize isoform (Zm00001e042100-T001) with respect to nine positive annotations

GO terms	Ours	DIFFUSE	DeepIsoFun	Disofun	IsoFun	iMILP	miSVM
GO: 0015979	✓	×	×	×	×	×	×
GO: 0009579	✓	×	×	×	✓	×	×
GO: 0016021	×	×	×	✓	×	×	×
GO: 0031224	×	✓	×	×	✓	×	×
GO: 0044425	×	×	×	✓	×	×	×
GO: 0009987	✓	✓	✓	✓	✓	✓	×
GO: 0044237	✓	✓	✓	✓	✓	✓	✓
GO: 0008152	✓	✓	✓	✓	✓	✓	✓
GO: 0016020	✓	×	×	×	×	×	×
Recall	6/9	4/9	3/9	5/9	5/9	3/9	2/9

4 Discussion

The differentiation of the functions of alternatively spliced isoforms helps explain proteome complexity and various complex diseases at a higher resolution than canonical gene-level analysis. In this article, we introduced DMIL-IsoFun, a method that merges genomics and transcriptomics data to identify the functions of individual isoforms spliced from the same gene. DMIL-IsoFun builds on the principle that the functions of a gene are aggregated from its isoforms, and that isoforms with similar sequences and co-expression share similar functions. DMIL-IsoFun first introduces a MIL CNN to extract the feature vectors of isoform sequences and to initialize the annotations of individual isoforms using gene–isoform relations; it then adapts the GCN to account for class-imbalanced data and further differentiate the annotations of individual isoforms. DMIL-IsoFun significantly outperforms state-of-the-art methods for predictions at both the gene and isoform levels. In the future, we will study how to reliably combine multiple heterogeneous gene-level, transcript-level and phenotype data sources to further improve the performance of DMIL-IsoFun, and to explore isoform–disease associations.
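The principle that a gene's functions are aggregated from its isoforms can be illustrated with a small multi-instance sketch, in which a gene is a bag and its isoforms are the instances. The example below uses max pooling, the standard multi-instance assumption that a bag is positive for a term if at least one instance is; this pooling rule, the function name and the toy scores are assumptions of the illustration and are not taken from the description of DMIL-IsoFun.

```python
def gene_scores_from_isoforms(isoform_scores, isoform_to_gene):
    """Aggregate per-term isoform scores into gene scores by max pooling."""
    gene_scores = {}
    for isoform, scores in isoform_scores.items():
        gene = isoform_to_gene[isoform]
        if gene not in gene_scores:
            gene_scores[gene] = list(scores)
        else:
            # Element-wise maximum over the isoforms seen so far.
            gene_scores[gene] = [
                max(a, b) for a, b in zip(gene_scores[gene], scores)
            ]
    return gene_scores


# Toy example (hypothetical IDs and scores): two GO terms, one gene
# with two isoforms that each carry a different function.
toy_scores = {"geneA-T001": [0.9, 0.1], "geneA-T002": [0.2, 0.7]}
toy_map = {"geneA-T001": "geneA", "geneA-T002": "geneA"}
pooled = gene_scores_from_isoforms(toy_scores, toy_map)
```

Here the gene inherits the first term from one isoform and the second term from the other, which is the situation an isoform-level predictor is meant to disentangle.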

Funding

This work was supported by the National Natural Science Foundation of China [61872300, 62031003].

Conflict of interest: none declared.

References

Bengio Y. et al. (2003) A neural probabilistic language model. J. Mach. Learn. Res., 3, 1137–1155.

Breuza L. et al.; The UniProt Consortium. (2016) The UniProtKB guide to the human proteome. Database, 2016, bav120.

Chen H. et al. (2019) DIFFUSE: predicting isoform functions from sequences and expression profiles via deep learning. Bioinformatics, 35, i284–i294.

Dessimoz C., Škunca N. (2017) The Gene Ontology Handbook. Humana Press, New York, NY, USA.

Eksi R. et al. (2013) Systematically differentiating functions for alternatively spliced isoforms through integrating RNA-seq data. PLoS Comput. Biol., 9, e1003314.

Graveley B.R. (2001) Alternative splicing: increasing diversity in the proteomic world. Trends Genet., 17, 100–107.

Gray C.B. et al. (2017) CaMKIIδ subtypes differentially regulate infarct formation following ex vivo myocardial ischemia/reperfusion through NF-κB and TNF-α. J. Mol. Cell. Cardiol., 103, 48–55.

Greene A.L. et al. (2000) Overexpression of SERCA2b in the heart leads to an increase in sarcoplasmic reticulum calcium transport function and increased cardiac contractility. J. Biol. Chem., 275, 24722–24727.

He K. et al. (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell., 37, 1904–1916.

Jiang Y. et al. (2016) An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biol., 17, 184.

Jiao S. et al. (2005) Biochemical and molecular characterization of photosystem I deficiency in the NCS6 mitochondrial mutant of maize. Plant Mol. Biol., 57, 303–313.

Kipf T.N., Welling M. (2017) Semi-supervised classification with graph convolutional networks. In: ICLR, pp. 1–10.

Langfelder P., Horvath S. (2008) WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics, 9, 559.

Li H.D. et al. (2014a) The emerging era of genomic data integration for analyzing splice isoform function. Trends Genet., 30, 340–347.

Li W. et al. (2014b) High-resolution functional annotation of human transcriptome: predicting isoform functions by a novel multiple instance-based label propagation method. Nucleic Acids Res., 42, e39.

Lin T.Y. et al. (2020) Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell., 42, 318–327.

Luo T. et al. (2017) Functional annotation of human protein coding isoforms via non-convex multi-instance learning. In: ACM KDD, pp. 345–354.

Melamud E., Moult J. (2009) Stochastic noise in splicing machinery. Nucleic Acids Res., 37, 4873–4886.

Mittendorf K.F. et al. (2012) Tailoring of membrane proteins by alternative splicing of pre-mRNA. Biochemistry, 51, 5541–5556.

Park C.Y. et al. (2013) Functional knowledge transfer for high-accuracy prediction of under-studied biological processes. PLoS Comput. Biol., 9, e1002957.

Shaw D. et al. (2019) DeepIsoFun: a deep domain adaptation approach to predict isoform functions. Bioinformatics, 35, 2535–2544.

Smith L.M., Kelleher N.L. (2018) Proteoforms as the next proteomics currency. Science, 359, 1106–1107.

Teng M. et al. (2016) A benchmark for RNA-seq quantification pipelines. Genome Biol., 17, 1–12.

Ver Heyen M. et al. (2001) Replacement of the muscle-specific sarcoplasmic reticulum Ca2+-ATPase isoform SERCA2a by the nonmuscle SERCA2b homologue causes mild concentric hypertrophy and impairs contraction-relaxation of the heart. Circ. Res., 89, 838–846.

Wang K. et al. (2020) Differentiating isoform functions with collaborative matrix factorization. Bioinformatics, 36, 1864–1871.

Westenbrink B.D. et al. (2015) Mitochondrial reprogramming induced by CaMKII mediates hypertrophy decompensation. Circ. Res., 116, e28–e39.

Yang X. et al. (2016) Widespread expansion of protein interaction capabilities by alternative splicing. Cell, 164, 805–817.

Yu G. et al. (2020) Isoform function prediction based on bi-random walks on a heterogeneous network. Bioinformatics, 36, 303–310.

Yu G. et al. (2021) Imbalance deep multi-instance learning for predicting isoform–isoform interactions. Int. J. Intell. Syst., 36, 2797–2824.

Zhao Y. et al. (2020) A literature review of gene function prediction by modeling gene ontology. Front. Genet., 11, 400.

Zhou G.J. et al. (2020) Predicting functions of maize proteins using graph convolutional network. BMC Bioinformatics, 21, 420.

Zhou N. et al. (2019) The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol., 20, 244.

Zhou Z.H. et al. (2012) Multi-instance multi-label learning. Artif. Intell., 176, 2291–2320.