Biolinguistic graph fusion model for circRNA–miRNA association prediction

Abstract Emerging clinical evidence suggests that the sophisticated associations between circular ribonucleic acids (circRNAs) and microRNAs (miRNAs) are a critical regulatory factor in various pathological processes and play a critical role in most intricate human diseases. Nonetheless, verifying these associations through wet-lab experiments is error-prone and labor-intensive, and most existing computational methods for predicting novel circRNA–miRNA associations (CMAs) rely on only a single type of correlation data. Considering the inadequacy of existing machine learning models, we propose a new model named BGF-CMAP, which combines the gradient boosting decision tree with natural language processing and graph embedding methods to infer associations between circRNAs and miRNAs. Specifically, BGF-CMAP extracts sequence attribute features with Word2vec and interaction behavior features with two homogeneous graph embedding algorithms, large-scale information network embedding and graph factorization. Comprehensive experimental analysis showed that BGF-CMAP predicted the complex relationships between circRNAs and miRNAs with an accuracy of 82.90% and an area under the receiver operating characteristic curve of 0.9075. Furthermore, 23 of the top 30 predicted miRNA-associated circRNAs were confirmed by relevant experiments, supporting the claim that BGF-CMAP is superior to other models. BGF-CMAP can serve as a helpful model and provide a theoretical basis for the study of CMA prediction.


INTRODUCTION
Although the existence of circular ribonucleic acids (circRNAs) was first identified as early as the 1970s [1,2], they were long not considered a crucial element in ribonucleic acid (RNA) expression analysis in human physiology. After 2013 [3,4], a new batch of circRNAs was discovered and quantified thanks to high-resolution, high-throughput RNA-seq data, especially end-to-end reading and deep sequencing. According to the published literature, there are three types of identification methods for circRNAs: ab initio prediction [3], methods based on RNA-seq comparison tools and algorithms, such as segemehl [5], and tools specifically designed to find circRNAs, such as CIRI [6]. These studies show that a circRNA is an endogenous, single-stranded, noncoding RNA (ncRNA) molecule with a covalently closed-loop structure, generated by back-splicing between a downstream 5′ splice site and an upstream 3′ splice site of RNA transcripts [3,7]. CircRNAs have vital biological functions, and identifying circRNA-disease associations is critical for the treatment of diseases [8][9][10]. Extensive experimental analysis has shown that microRNA (miRNA) is an ncRNA and an indispensable part of gene regulation in eukaryotes [11][12][13]. A growing body of evidence reveals that circRNAs act as miRNA sponges [14,15] and, as validated by biological experiments, can be used as a new class of biomarker [16][17][18]. Despite this, exploring circRNA-miRNA interactions through wet-lab experiments is commonly labor-intensive. To alleviate this problem, many computational methods have been employed to speed up the identification of circRNA-miRNA associations (CMAs) [19,20].
Recent research has shown that rapidly emerging computational models provide well-grounded solutions for CMA prediction, achieving higher accuracy (Acc.) while efficiently handling large datasets with machine learning techniques. For instance, Lan et al. [21] put forward the NECMA model to predict CMAs using the inner product and neighborhood regularization logic matrix decomposition. Qian et al. [22] developed the CMIVGSD model, based on singular value decomposition and graph variational autoencoders, to predict miRNA-associated circRNAs. Guo et al. [19] designed WSCD, which uses a convolutional neural network and a deep neural network (DNN) to predict CMAs based on structural DNN embedding. Wang et al. [23] integrated node similarity and multi-modal information through feature fusion to infer association scores for circRNA-miRNA pairs.
Additionally, most existing CMA prediction models completely neglect the correlated biological information contained in the sequences, its influence on miRNA function and its role in the complex associations with circRNAs. Accordingly, the following limitations are worth addressing: (i) a rational feature fusion model is needed to fuse multi-modal heterogeneous information and capture the interactions between circRNAs and different miRNAs; (ii) appropriate strategies for sequence and interaction information should be considered to learn good representations; and (iii) experimental cross-reactions and noise influences need to be resolved.
Encouraged by the above analysis and by recent association prediction methods in related fields [24][25][26], we put forward BGF-CMAP, a machine learning framework for predicting CMAs based on graph representation learning [27][28][29], which employs a homogeneous embedding fusion model of deep learning and factorization. Specifically, we first extract the potential biological attribute features from the circRNA sequence using word2vec [30,31], a word embedding method from natural language processing. Second, multi-source behavior feature information, which reflects the impact of interactions on the circRNA-miRNA relationship [32], is obtained by building a heterogeneous graph from validated interaction pairs and the same number of unlabeled samples, yielding a dependable molecular association network. This network is then input into a fusion model of large-scale information network embedding (LINE) [33,34] and graph factorization (GF) [35] to generate low-dimensional embedding vectors. Finally, a gradient boosting decision tree (GBDT) [36] classifier is used to infer potential CMAs. The framework of the BGF-CMAP model is shown in Figure 1. Supplementary data are available online at https://github.com/look0012/BGF-CMAP.

Prediction performance of the BGF-CMAP model
In the evaluation, we applied 5-fold cross-validation (CV) to our constructed dataset to assess the ability of the BGF-CMAP model to predict potential CMAs in terms of Acc., sensitivity (Sen.), specificity (Spe.), precision (Pre.), Matthews correlation coefficient (MCC), area under the receiver operating characteristic (ROC) curve (AUC) and area under the precision-recall curve (AUPR). All experimental results are shown in Table 1, with the mean of each metric in boldface. BGF-CMAP achieved a mean AUC of 0.9075 with an SD of 0.0016; the AUCs of the five folds were 0.9067, 0.9103, 0.9076, 0.9070 and 0.9061, respectively. In Figure 2, the AUC is the area under the ROC curve plotted in panel A, and the AUPR is the area under the precision-recall curve plotted in panel B; the two curves approach the upper left and upper right corners of their panels, respectively, enclosing a large area. Overall, these statistics indicate that the model performs competitively and, by effectively predicting potential CMAs, can provide solid evidence for an advanced understanding of the circRNA-miRNA relationship.
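As a quick consistency check on the fold-level figures quoted above, the following minimal sketch recomputes the mean and SD of the five AUC values. It assumes the reported SD is the sample standard deviation (n − 1 denominator), which the text does not state; the variable names are illustrative only.

```python
from statistics import mean, stdev

# Fold-level AUCs as listed in the text.
fold_aucs = [0.9067, 0.9103, 0.9076, 0.9070, 0.9061]

mean_auc = mean(fold_aucs)
# Assumption: SD is the sample standard deviation (n - 1 denominator).
sd_auc = stdev(fold_aucs)

print(f"mean AUC = {mean_auc:.4f}, SD = {sd_auc:.4f}")
# mean AUC = 0.9075, SD = 0.0016
```

Under that assumption the recomputed values agree with the reported mean of 0.9075 and SD of 0.0016.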

Comparison of different feature extraction strategies
The purpose and advantage of combining sequence information with interaction information is to capture both the linear and nonlinear relationships between features, which makes the model more robust and accurate. To test whether fusing the two graph embedding methods (LINE and GF) improves performance over using either method alone, we compared BGF-CMAP with two single-embedding variants, named BGF-CMAP-LINE and BGF-CMAP-GF. All three models were run through the same 5-fold CV experiment on the same dataset; the comparison is summarized in the histogram in Figure 2C, and the specific values are shown in Table 1. According to Table 1, the average Acc., Sen., Spe., Pre., MCC and AUC of the BGF-CMAP-LINE model were 3.65, 3.32, 5.77, 5.33 and 9.1% and 0.0342 lower than those of BGF-CMAP, respectively. Table 1 also indicates that the average values of the BGF-CMAP-GF model are lower than those of BGF-CMAP. Figure 2C likewise shows the advantage of BGF-CMAP in prediction performance. In conclusion, the feature extraction of BGF-CMAP is more effective than either of the two single-source feature extraction strategies.

Comparison with Laplacian eigenmaps model
To assess how much the proposed model benefits from using the biological attribute feature and the multi-source behavior feature as model attributes, we compared it with a variant whose behavior topology feature vectors are generated by the Laplacian eigenmaps (LAP) method. For fairness and consistency, we replaced the graph embedding of the proposed fusion model with the low-dimensional embedding vectors generated by LAP during the experiment, and the other parts of the model remained unchanged. Specifically, we extracted the interaction behavior features and the sequence attribute features using LAP and Word2vec, respectively, and then fed these two features into the GBDT classifier for model training. Using our dataset, we trained the LAP-based model with 5-fold CV to obtain the values summarized in Table 2. As seen from Table 2, BGF-CMAP achieved the better result: its Acc., Spe., Pre., MCC and AUC are higher than those of the LAP model by 8.91, 22.6, 14.4 and 16.18% and 0.0814, respectively. This result suggests that the fusion of LINE and GF used by the proposed model can effectively build vector representations for training the computational model, which contributes to improving prediction performance. The ROC and AUPR curves of the LAP model are shown in Figure 2E and F for comparison with those of BGF-CMAP.

Comparison with diverse classifier models
To select an optimal classification method for the extracted features, we compared diverse classifier models to assess their impact on the performance of BGF-CMAP. Specifically, we kept the biological attribute feature and behavior topology feature extraction methods unchanged and only substituted the GBDT model with alternative classifiers.

Comparison with other state-of-the-art models
As research on CMAs has intensified, many scholars have proposed different approaches to predicting them. To evaluate the predictive performance of BGF-CMAP fairly, we compared it with these methods. The proposed pipeline has several competitive advantages, as shown in Figure 1. Specifically, the biggest advantage of word2vec is that it encodes words with similar meanings into vectors that lie close together, so that the encoding has semantic characteristics. GF uses an approximate factorization of the adjacency matrix as the embedding. LINE extends this approach and attempts to preserve 1st- and 2nd-order proximities, aiming to map each node in the graph into a low-dimensional vector space and thereby capture the similarities and relationships between nodes. The edge sampling method of LINE overcomes the problem of node embedding aggregation, which can easily occur in the traditional stochastic gradient method, and improves both the efficiency and the quality of the result [33]. Other models in this research field do not integrate these algorithms, which may explain why they are less competitive. To make a precise comparison, we collected the AUC and AUPR scores reported for prior models and list them in Table 4 together with our model; the prior models are the recently published CMA prediction methods CMIVGSD, WSCD, KGDCMI [23] and SGCNCMI [20]. Table 4 shows that BGF-CMAP achieved the highest AUC and AUPR scores, which were 0.0133 and 0.0233 higher than those of the second-best model, SGCNCMI, and exceeded the mean values of the other three methods by about 0.0198 and 0.0372, respectively. The P-value derived from Table 4 is 0.0270, which is <0.05, so the difference between BGF-CMAP and the other models is statistically significant. Therefore, this comparison indicates that BGF-CMAP can provide competitive theoretical guidance for further academic research.

Case studies
To further examine the effectiveness of BGF-CMAP in identifying new candidate circRNAs for miRNAs, we conducted case studies by training the model with known circRNA-miRNA pairs and scoring all unknown CMAs with the trained model. Candidate interaction pairs were then ranked by score in descending order, and the predictions were verified against relevant research literature or correlation experiments. The prediction outcomes are presented in Table 5, from which we can see that only 7 of the top 30 miRNA-related circRNA pairs have not been verified in the recent literature. To validate the reliability, robustness and predictive ability of our model, we also tested BGF-CMAP on two datasets from the recent literature, CMI-9905 [23] and a set of 9589 CMA pairs (called CMI-9589) [22]. The experimental outcomes are shown in Figure 3. Our model achieved AUC values of 0.8984 and 0.9019 on these datasets, which shows its strong applicability. In addition, we conducted another case study using one of the independent datasets mentioned above. The model was used to predict all unknown CMAs, and 8 of the top 10 scoring CMAs have been reported in the relevant literature, as shown in Table 6. Overall, the case studies suggest that BGF-CMAP has strong predictive performance for potential CMAs and that these circRNA candidates are promising targets for follow-up wet-lab experiments, which helps to reduce manual errors.
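The ranking step of the case study can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `top_k_candidates`, the pair identifiers and the scores are all hypothetical, and the sketch only assumes that the trained model yields one score per circRNA-miRNA pair.

```python
def top_k_candidates(scores, known_pairs, k):
    """Rank pairs without a known association by predicted score, highest first."""
    unknown = {pair: s for pair, s in scores.items() if pair not in known_pairs}
    return sorted(unknown, key=unknown.get, reverse=True)[:k]


# Hypothetical scores for four (circRNA, miRNA) pairs.
scores = {
    ("circ_A", "miR_1"): 0.97,   # already known, so excluded from the ranking
    ("circ_A", "miR_2"): 0.81,
    ("circ_B", "miR_1"): 0.93,
    ("circ_B", "miR_2"): 0.12,
}
known_pairs = {("circ_A", "miR_1")}

top2 = top_k_candidates(scores, known_pairs, k=2)
print(top2)
```

Each returned pair would then be checked by hand against the literature, as described above for the top 30 and top 10 lists.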

DISCUSSION
In this paper, we proposed BGF-CMAP, a graph representation learning model for predicting CMAs that is based on a fusion of two homogeneous embedding methods. We first generated low-dimensional embedding vectors from the sequence information using word embedding and then constructed a low-dimensional representation with the fusion model combining LINE and GF, while retaining the graph topology and the attributes of nodes. Next, we obtained the training features of the fusion vectors by combining the biological attribute features with the behavior features.
The excellent predictive performance of the BGF-CMAP model is due to several advantages, which are described below. In particular, the model takes into consideration the attribute features and behavior features provided by circRNA and miRNA sequences, whereas previous approaches have made less use of this information. More specifically, (i) compared with previous methods, which focus on extracting features from CMA information alone, the sequence information is used to enhance the feature expression. (ii) It can perform genome-wide quantitative testing and analysis; virtually every read can characterize its corresponding RNA sequence well, and there are no experimental cross-reactions or noise influences. Furthermore, the sequence information is combined with the interaction information to obtain reliable feature vectors of behavior attributes and to increase the quality of feature extraction.
Nevertheless, the following limitations of the BGF-CMAP model are worth resolving in the future: (i) sequence information is not fully exploited by the proposed model, and the additional information hidden in it needs further study; (ii) the application of this framework requires parameter settings, which may lead to some minor errors compared with the actual experiment; and (iii) the mechanism of multi-source heterogeneous fusion needs further exploration. In light of the above discussion, we intend to study larger feature dimensions and multi-source heterogeneous fusion models in order to further improve CMA prediction.

Dataset
To assess the performance of the BGF-CMAP model with high-reliability sequence information of circRNAs and miRNAs, we used the experimentally verified CircBank database [37] and the miRBase database [38,39] as the sources of our high-quality dataset. CircBank uses circRNA data from the circBase database (http://www.circbase.org/), encompassing information for comprehensive analysis and processing, especially the complete circular RNA sequences on its website. It offers circRNA annotation details, mature sequences, protein-coding potential, IRES prediction, m6A modification status, circRNA conservation and miRNA-circRNA interaction information, among other features. First, we removed redundant CMAs from the CircBank database; second, we extracted CMAs from published journals, checked their names and sequences against the standard circBase and miRBase databases and finally merged them into our constructed database. The built dataset can therefore be described as $D = D^{+} \cup D^{-}$. In this study, we first identified 20 208 pairs of experimentally validated CMAs involving 3569 circRNAs and 1152 miRNAs from the built dataset. Here, $D^{+}$ and $D^{-}$ denote the sets of 20 208 positive and 4 091 280 negative samples, respectively, and $D$ denotes their union. Second, we established the adjacency matrix $DM$ of the built dataset with the dimension of 3569 × 1152. When miRNA $m(i)$ is not associated with circRNA $c(j)$, the element $DM_{i,j}$ of $DM$ is set to 0; otherwise, it is set to 1.
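The dataset arithmetic above can be checked with a short sketch. It is illustrative only: the function names are hypothetical, rows are assumed to index circRNAs and columns miRNAs (matching the stated 3569 × 1152 dimension), and drawing negatives uniformly at random from the zero entries is an assumption about how "the same number of unlabeled samples" might be selected.

```python
import random


def build_adjacency(n_circ, n_mi, positive_pairs):
    """Binary association matrix: 1 for a validated pair, 0 otherwise."""
    matrix = [[0] * n_mi for _ in range(n_circ)]
    for circ_idx, mi_idx in positive_pairs:
        matrix[circ_idx][mi_idx] = 1
    return matrix


def count_negatives(n_circ, n_mi, n_positive):
    """Number of circRNA-miRNA pairs with no validated association."""
    return n_circ * n_mi - n_positive


def sample_negatives(matrix, n_samples, seed=0):
    """Assumption: negatives are drawn uniformly from the zero entries."""
    rng = random.Random(seed)
    zeros = [(i, j) for i, row in enumerate(matrix)
             for j, value in enumerate(row) if value == 0]
    return rng.sample(zeros, n_samples)


# Sizes quoted in the text: 3569 circRNAs, 1152 miRNAs, 20 208 positives.
print(count_negatives(3569, 1152, 20208))  # 4091280

# Toy example with 3 circRNAs, 2 miRNAs and 2 validated pairs.
toy = build_adjacency(3, 2, [(0, 1), (2, 0)])
toy_negatives = sample_negatives(toy, n_samples=2)
```

The count reproduces the 4 091 280 negative samples stated for the full matrix.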

Evaluation criteria
For the evaluation, we used several metrics to assess the prediction capability of the proposed BGF-CMAP model, including Spe., Pre., Sen., MCC and Acc. These evaluation indices are defined in terms of the confusion-matrix counts; for example,
$$\text{Acc.} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{6}$$
where TP denotes true positives, TN true negatives, FP false positives and FN false negatives. Additionally, we visualized the ROC curves by computing the TP and FP rates generated by BGF-CMAP and calculated the average AUC, together with the AUPR to account for class imbalance. A 5-fold CV was also applied to reduce overfitting and to assess the performance of the proposed BGF-CMAP.
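Only the Acc. formula survives in this section, so the sketch below uses the standard textbook definitions of the other indices (Sen. = TP/(TP+FN), Spe. = TN/(TN+FP), Pre. = TP/(TP+FP), and the usual MCC formula). These are assumed to match the paper's definitions; the function name and the example counts are hypothetical.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Standard confusion-matrix metrics (assumed to match the paper's definitions)."""
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Acc": (tp + tn) / (tp + tn + fp + fn),
        "Sen": tp / (tp + fn),   # sensitivity, i.e. recall
        "Spe": tn / (tn + fp),   # specificity
        "Pre": tp / (tp + fp),   # precision
        # MCC is conventionally set to 0 when the denominator vanishes.
        "MCC": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In a 5-fold CV setting, such a function would be applied to each fold's confusion matrix and the results averaged.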

Attribute feature extraction by Word2vec
Word2vec was proposed by Google [30]. It is a widely used word embedding model that exploits the relationships between words appearing in sentences to map words from a higher-dimensional space to lower-dimensional word vectors [40,41]. Analogously, a node is treated as a word and a sequence of nodes is treated as a sentence, so that the word2vec algorithm can be applied to the sequence analysis of circRNA and miRNA node vector representations.
Here, word2vec embeds the extracted circRNA and miRNA sequence features in a vector space using the continuous bag-of-words (CBOW) model [24,42], which is used to characterize the node features in this study. CBOW predicts the target word from the n words before and after it; in its objective function, $\omega$ is the weight matrix, $\omega_t$ stands for the target word and $\omega_{t-n}, \cdots, \omega_{t+n}$ are the context words.
In this study, we trained the word2vec model on the circRNA and miRNA sequences with the gensim Python package [43] and obtained a 64-dimensional vector for each sequence.
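To make the CBOW setup concrete, the sketch below builds the (context, target) training pairs that such a model consumes. Splitting a sequence into overlapping k-mers is an assumption made here for illustration, since the text does not specify how sequences are tokenized into words; the function names `kmers` and `cbow_pairs` and the example sequence are hypothetical.

```python
def kmers(sequence, k=3):
    """Assumed tokenization: split an RNA sequence into overlapping k-mer 'words'."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def cbow_pairs(tokens, n=2):
    """For each target word, collect up to n context words on each side."""
    pairs = []
    for t, target in enumerate(tokens):
        context = tokens[max(0, t - n):t] + tokens[t + 1:t + 1 + n]
        pairs.append((context, target))
    return pairs


tokens = kmers("AUGGCU", k=3)
pairs = cbow_pairs(tokens, n=1)
for context, target in pairs:
    print(context, "->", target)
```

In practice, token lists like these would be passed as sentences to a word2vec implementation configured for CBOW and a 64-dimensional output, rather than processed by hand.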

Behavior feature extraction by homogeneous graph embedding
Homogeneous graph embedding [44,45] is a graph representation learning method that aims to retain the graph topology when learning low-dimensional representations of vertices. It is also known as network embedding or non-attributed graph embedding and includes methods based on random walks, deep learning and matrix decomposition. Therefore, GF and LAP from the matrix factorization family and LINE from the deep learning family are selected for this study.
The widely accepted network representation learning model LINE [46] learns low-dimensional dense vectors by preserving 1st- and 2nd-order proximity using a multilayer perceptron, with a separate objective function for each order. Here, $V = \{v_1, \ldots, v_n\}$ and $E = \{e_{ij}\}_{i,j=1}^{n}$ denote the vertices and edges of the graph $G(V, E)$, where $e_{ij}$ goes from $v_i$ to $v_j$; $d(\cdot, \cdot)$ represents the distance between two distributions; and $P_1(\cdot, \cdot)$ and $P_2(\cdot \mid v_i)$ are the distributions used in the 1st- and 2nd-order objectives, respectively.
To reduce the storage and construction cost for the downstream classifier, GF is selected to factorize the adjacency matrix of the graph. Its objective function is
$$f(Y, Z, \lambda) = \frac{1}{2} \sum_{(i,j) \in E} \left( Y_{ij} - \langle Z_i, Z_j \rangle \right)^2 + \frac{\lambda}{2} \sum_{i} \lVert Z_i \rVert^2,$$
where $Y$, $Z$ and $\lambda$ denote the weighted adjacency matrix, the factor matrix and the regularization parameter, respectively. For behavior feature extraction, a fusion graph embedding model based on an ensemble of LINE and GF is proposed to combine the advantages of deep learning and factorization, as shown in Figure 4.
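As a small illustration of graph factorization, the sketch below evaluates the commonly used GF objective, half the squared reconstruction error over observed edges plus an L2 penalty on the factors, for a toy graph. This standard form is assumed to correspond to the objective referred to above; the function names, the toy edge and the embeddings are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def gf_objective(edges, Z, lam):
    """Assumed GF objective.

    edges: dict mapping (i, j) to the edge weight Y_ij
    Z:     list of node embedding vectors (rows of the factor matrix)
    lam:   regularization parameter
    """
    fit = sum((w - dot(Z[i], Z[j])) ** 2 for (i, j), w in edges.items())
    reg = sum(dot(z, z) for z in Z)
    return 0.5 * fit + 0.5 * lam * reg


# Toy graph: one edge of weight 1 between nodes 0 and 1.
edges = {(0, 1): 1.0}
Z = [[1.0, 0.0], [1.0, 0.0]]   # <Z_0, Z_1> = 1 reconstructs the edge exactly

print(gf_objective(edges, Z, lam=0.0))  # 0.0
print(gf_objective(edges, Z, lam=0.5))  # 0.5
```

With a perfect reconstruction only the regularization term remains, which is why the second value equals (0.5 / 2) × (1 + 1).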
To check whether the fusion model (BGF-CMAP) is convincing, we use LAP as a comparison model; it is a matrix factorization method that factorizes a high-dimensional matrix to obtain the embeddings. In particular, Laplacian eigenmaps [47] minimizes the objective function
$$\phi(Y) = \frac{1}{2} \sum_{i,j} \lVert Y_i - Y_j \rVert^2 W_{ij} = \operatorname{tr}\left( Y^{\top} L Y \right),$$
where $Y_i$ and $Y_j$ are the embeddings of nodes $i$ and $j$, $Y$ denotes the embedding of the graph, and $W_{ij}$ and $L$ are the edge weight and the Laplacian of the graph $G$, respectively. In this paper, the OpenNE library is used to carry out the behavior feature extraction with the graph embedding models. After training the models described above, we obtain a 64-dimensional vector per node.
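The Laplacian eigenmaps objective rests on a standard identity: the weighted sum of squared embedding distances, ½ Σ_ij W_ij ‖Y_i − Y_j‖², equals tr(Yᵀ L Y) with L = D − W. The sketch below checks this numerically on a toy graph; the weights, the one-dimensional embedding and the function names are hypothetical.

```python
def pairwise_form(W, Y):
    """0.5 * sum_ij W_ij * ||Y_i - Y_j||^2 over all ordered node pairs."""
    n = len(W)
    total = 0.0
    for i in range(n):
        for j in range(n):
            dist_sq = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
            total += W[i][j] * dist_sq
    return 0.5 * total


def trace_form(W, Y):
    """tr(Y^T L Y) with the graph Laplacian L = D - W."""
    n, dim = len(W), len(Y[0])
    degree = [sum(row) for row in W]
    L = [[(degree[i] if i == j else 0.0) - W[i][j] for j in range(n)]
         for i in range(n)]
    return sum(Y[i][c] * L[i][j] * Y[j][c]
               for c in range(dim) for i in range(n) for j in range(n))


# Toy symmetric weight matrix (a path graph 0 - 1 - 2) and 1-D embedding.
W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 2.0],
     [0.0, 2.0, 0.0]]
Y = [[0.0], [1.0], [3.0]]

print(pairwise_form(W, Y), trace_form(W, Y))  # 9.0 9.0
```

Minimizing this quantity under a scale constraint pulls strongly connected nodes together in the embedding space, which is the behavior the LAP baseline relies on.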

GBDT classifier for prediction
Jerome H. Friedman proposed GBDT [48][49][50], one of the best decision tree-based algorithms, which can be used for classification, regression or feature filtering. In classification, GBDT iterates multiple times, fitting each new weak classifier to the residual of the previous ones. The loss minimized when fitting the weak classifier can be expressed as
$$\hat{\theta}_m = \arg\min_{\theta_m} \sum_{i} L\left( y_i, F_{m-1}(x_i) + T(x_i; \theta_m) \right).$$
Here, $F_{m-1}(x)$ represents the model obtained after the previous iteration, $T(x_i; \theta_m)$ is the weak classifier (a decision tree) with parameters $\theta_m$, and $m$ denotes the iteration number.
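The residual-fitting idea behind gradient boosting can be sketched with a deliberately simplified example. The code below is not the classifier configuration used in this work: it assumes squared loss, a single one-dimensional feature and decision stumps as weak learners, and all names and data are hypothetical.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Pick the threshold split on a 1-D feature that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x <= threshold else right_value


def gradient_boost(xs, ys, rounds=20, learning_rate=0.5):
    """Each round fits a stump to the residuals of the current ensemble."""
    base = mean(ys)
    trees = []

    def predict(x):
        return base + learning_rate * sum(tree(x) for tree in trees)

    for _ in range(rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        trees.append(fit_stump(xs, residuals))
    return predict


# Toy data: the label switches from 0 to 1 between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = gradient_boost(xs, ys)
print(round(model(1), 3), round(model(6), 3))
```

Each round shrinks the remaining residual, so the ensemble prediction moves steadily toward the labels on both sides of the split.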
The main computational costs of the proposed model lie in the feature extraction phase using graph embedding and in the training stage of the GBDT classifier. First, in the graph embedding feature extraction stage, the time complexity of both LINE and GF can be approximated as $O(|E|d)$, where $|E|$ is the number of edges in the graph and $d$ is the embedding dimension of each node; that is, the algorithm requires roughly $|E| \cdot d$ computational operations. Second, in the GBDT training stage, the time complexity of constructing each decision tree is $O(Nf\log N)$, where $N$ is the number of samples and $f$ is the number of features. GBDT usually iterates for multiple rounds, and each round constructs a decision tree. Hence, the overall time complexity of the GBDT training stage is approximately $O(TNf\log N)$, where $T$ is the number of iterations. Therefore, the total time complexity of BGF-CMAP can be approximated as the sum of $O(|E|d)$ and $O(TNf\log N)$, which is approximately equal to $O(TNf\log N)$. In terms of space complexity, the graph embedding methods LINE and GF have relatively low requirements, mainly determined by the storage of nodes, edges and adjacency lists. The space complexity of the GBDT model, on the other hand, is usually high: when the dataset is large, the decision trees are deep and the number of leaf nodes is large, the memory required by the model increases accordingly. Therefore, we take the space complexity of GBDT as the approximate space complexity of the model. After reasonable parameter tuning, the model was implemented and run using PyCharm Community Edition 2021.1 x64 on a machine equipped with an Intel(R) Core(TM) i7-12700H CPU and 16 GB RAM. The average time for the training and prediction stages of the GBDT classifier on this machine is approximately 16 min 21 s.

Key Points
• The complex association between circRNA and miRNA is a key regulatory factor in various pathological processes and plays a key role in most complex human diseases.

FUNDING
Natural Science Foundation of Guangxi (Grants 2022JJD170019, 2023JJA170118 and 2021AC19394); Natural Science Foundation of

Figure 1. The workflow of the BGF-CMAP model. A. Construction of nodes and node topology features by pre-processing a heterogeneous association graph. B. The low-dimensional representation vectors of circRNA and miRNA node sequences and topologies are learned by the word embedding and graph embedding models. C. The flowchart of prediction by GBDT.

Figure 2. Performance evaluation of various models. (1) A and B show the AUC and AUPR achieved by the proposed BGF-CMAP. A. The AUC is the area under the ROC curve plotted in panel A. B. The AUPR is the area under the curve enclosed by precision and recall. (2) C. Comparison of the prediction capability of different feature extraction strategies. (3) D. Results of 5-fold CV obtained by different classifier models. (4) E and F show the AUC and AUPR achieved by the LAP method. E. ROC curves yielded by LAP on the dataset using 5-fold CV. F. AUPR curves yielded by LAP on the dataset using 5-fold CV.

Table 3: Mean results obtained by different classifier models

Table 4: The AUC and AUPR scores of various models

The substitute classifiers were k-nearest neighbors (KNN), logistic regression (LR), rotation forest (RF), support vector machine (SVM) and the AdaBoost algorithm. Table 3 lists the mean values of the 5-fold CV experiments achieved by these models on the same dataset, which are also shown as a bar chart in Figure 2D. As summarized in Table 3, AdaBoost ranked second in Acc., Sen., Spe., Pre., MCC and AUC, but was 1.54, 1.26, 1.82, 1.76 and 3.09% and 0.0189 lower than BGF-CMAP, respectively. The comparison in Figure 2D likewise shows that BGF-CMAP gives the best result. In a nutshell, these results demonstrate that BGF-CMAP with the GBDT classifier outperforms the other classifier models.

Table 6: 10 CMA pairs predicted by BGF-CMAP

• Merging multiple sources of information, BGF-CMAP incorporates graph representation learning (LINE and GF) and deep learning (CBOW) to extract the attribute features and behavior features provided by circRNA and miRNA sequences more fully than previous approaches.
• We proposed BGF-CMAP, which adopts a homogeneous embedding fusion model and word2vec, training the model on interaction features combined with sequence information to predict CMAs.
• Abundant prediction experiments demonstrate the dependability of the model, and the comparisons indicate that it can support further academic research.