Abstract

MicroRNAs (miRNAs) silence genes by binding to messenger RNAs, whereas long non-coding RNAs (lncRNAs) act as competitive endogenous RNAs (ceRNAs) that can relieve miRNA silencing effects and upregulate target gene expression. The ceRNA association between lncRNAs and miRNAs has been a research hotspot due to its medical importance, but it is challenging to verify experimentally. In this paper, we propose a novel deep learning scheme, i.e. the sequence pre-training-based graph neural network (SPGNN), that combines pre-training and fine-tuning stages to predict lncRNA–miRNA associations from RNA sequences and the existing interactions represented as a graph. First, we utilize a sequence-to-vector technique to generate pre-trained embeddings based on the sequences of all RNAs during the pre-training stage. In the fine-tuning stage, we use a graph neural network to learn node representations from the heterogeneous graph constructed from lncRNA–miRNA association information. We evaluate our proposed scheme SPGNN on our newly collected animal lncRNA–miRNA association dataset and demonstrate that combining the $k$-mer technique and the Doc2vec model for pre-training with the Simple Graph Convolution network for fine-tuning is effective in predicting lncRNA–miRNA associations. Our approach outperforms state-of-the-art baselines across various evaluation metrics. We also conduct an ablation study and hyperparameter analysis to verify the effectiveness of each component and parameter of our scheme. The complete code and dataset are available on GitHub: https://github.com/zixwang/SPGNN.

INTRODUCTION

MicroRNAs (miRNAs) are a class of non-coding RNAs with a length of approximately 22 nucleotides. The seed sequence (nucleotides 2-8 at the 5' end) of a miRNA can bind to the 3'-untranslated region (3'UTR) or 5'-untranslated region (5'UTR) of target messenger RNAs (mRNAs), resulting in the silencing of the corresponding genes [1, 2]. Long non-coding RNAs (lncRNAs) are RNAs that contain more than 200 nucleotides and do not encode proteins [3]. Recently, many studies [4, 5] have reported that lncRNAs can act as competitive endogenous RNAs (ceRNAs) of miRNAs to relieve the silencing effect of miRNAs, thereby upregulating the expression of miRNA target genes. Such lncRNAs are also called miRNA sponges. For example, Zhu et al. [4] reported that lncRNA SOX2OT could regulate the gene ICAM1 by sponging miR-215-5p. Similarly, Yu et al. [5] demonstrated that lncRNA LINC00924 could upregulate the gene NDRG2 to inhibit epithelial-mesenchymal transition by sponging miR-6755-5p in HBV-related hepatocellular carcinoma.

The ceRNA association between lncRNAs and miRNAs has become a popular research topic due to its medical importance. Techniques such as quantitative reverse transcription-PCR [6], microarray [7] and HITS-CLIP (high-throughput sequencing of RNA isolated by crosslinking immunoprecipitation) [8] are usually used to verify ceRNA associations. Although the lncRNA–miRNA associations obtained through experimental verification are highly credible, these commonly used experimental methods still have some drawbacks. First, all of these techniques are labor- and material-intensive. Moreover, most of them verify the lncRNA–miRNA associations of interest one by one, which increases the likelihood of failure and may lead to negative results despite significant investments of manpower and resources. Thus, multiple methods based on computing and artificial intelligence have been proposed for the prediction of lncRNA–miRNA associations.

These predictive methods can provide direction and improve the success rate of experiments. For example, Kang et al. [9] proposed a method based on dual-path parallel ensemble pruning for lncRNA–miRNA association prediction. However, this work mainly focused on lncRNA–miRNA associations in plants. According to the ceRNA hypothesis [10], the crosstalk between RNAs is related to the complementary pairing of their sequences. Thus, it is of vital importance to take the sequence information of RNAs into account when predicting lncRNA–miRNA associations. Huang et al. [11] computed the sequence similarity of lncRNAs and miRNAs through Needleman-Wunsch pairwise sequence alignment [12] and constructed a model for lncRNA–miRNA association prediction. Recently, with the rapid development of natural language processing (NLP) technologies, some researchers have started to pre-train the sequence information of various biomedical entities such as proteins, RNAs and molecules using NLP models to obtain vector representations of these entities, which are then fed into subsequent prediction models. For example, Yu et al. [13] proposed a pre-trained model named preMLI to recover lncRNA–miRNA associations. In particular, preMLI [13] first generated embeddings of lncRNAs and miRNAs with its rna2vec model to obtain vector representations of RNAs, which were then fed into a deep feature mining model consisting of convolutional neural networks and bidirectional gated recurrent units for predicting lncRNA–miRNA associations. However, this method relied solely on the sequence information of lncRNAs and miRNAs as input and did not consider other crucial information such as associations verified in the laboratory. This limitation prevented the optimization of parameters during the training process, which ultimately hindered its effectiveness.

Besides, as RNA can be represented as a sequence, multiple NLP methods can be used to pre-train the sequence information of biomedical entities [14]. One such method is Word2vec [15], which learns word embeddings by capturing the context of words in text. It utilizes a neural network architecture to predict the likelihood of a word given its surrounding context and has been demonstrated to be effective in sequence embedding tasks. Another method is Doc2vec [16], an extension of Word2vec that learns embeddings for entire documents rather than individual words. The Transformer [17] is an architecture that uses self-attention mechanisms to weigh the importance of different timesteps in the input sequence and has likewise proven effective for sequence embedding. Our research shows that applying sequence embedding techniques to RNA sequences as the pre-training component can enhance the performance of lncRNA–miRNA association prediction.

Although sequential embedding pre-training models have demonstrated superior performance in association prediction, they still suffer from the limitation that each lncRNA or miRNA is assumed to be independent of the others. To address this limitation, we also analyze various graph embedding learning techniques for predicting lncRNA–miRNA associations, where we regard each lncRNA or miRNA as a node in a graph and the associations among them as edges. Specifically, graph embedding learning [18] represents the nodes of a graph in a low-dimensional vector space, which enables the utilization of conventional machine learning techniques on graph-structured data like lncRNA–miRNA associations. By using graph embedding techniques, we can effectively capture both the structural and feature-based information between nodes and use models such as graph neural networks (GNNs) [19-21] to perform classification or regression tasks. Studies on graph embedding can generally be categorized into two groups: unsupervised and supervised techniques. Unsupervised techniques, such as DeepWalk [22], Node2Vec [23] and LINE [24], learn node embeddings by maintaining both the local and global structural information of the graph without node features. However, as we aim to recover the associations between RNAs by leveraging both their available relations and features, the supervised setting is more appropriate.

Supervised methods such as the Graph Convolutional Network (GCN) [20] and Simple Graph Convolution (SGC) [25] are more suitable for predicting lncRNA–miRNA associations. GCN aggregates information from the surrounding nodes of each node by utilizing a graph convolutional layer, whereas GraphSAGE [19] utilizes a similar layer but samples a fixed-size neighborhood for each node, which enables it to handle larger graphs efficiently by avoiding excessive computational costs. The Graph Attention Network (GAT) [26], in contrast, uses an attention mechanism to gather information from the surrounding nodes, which allows it to focus on important neighbors and disregard less significant ones. SGC, a variation of GCN, simplifies the process by using a linear operation to gather information from surrounding nodes, resulting in fewer parameters and lower computational costs. GNN-FiLM [27] modulates the feature-wise information acquired by the model according to the input graph structure, enabling the application of different linear transformations to different features of the input data, which improves the interpretability of the model. GATv2Conv [21], an extension of GAT, modifies the order of operations in the attention computation so that the attention weights become dynamic rather than static, allowing the model to learn more expressive attention patterns over the input data. More recently, the Graph Isomorphism Network [28], EdgeConv [29] and Efficient Graph Convolution [30] have been proposed as extensions of GCN. In our study, we evaluate several well-known and effective methods for learning from graph data and find that a combination of pre-training with RNA sequence embedding and fine-tuning with graph embedding is the most effective for the task.

To summarize, we develop a novel graph-based pre-training scheme, i.e. sequence pre-training-based graph neural network (SPGNN), to discover the associations between lncRNAs and miRNAs. Specifically, our proposed scheme first incorporates a sequence-to-vector technique to generate the pre-trained embeddings based on the sequence of all RNAs during the pre-training stage. In the subsequent fine-tuning stage, SPGNN uses different GNNs as its fine-tuning models for the final prediction. To evaluate the performance of SPGNN, we conduct experiments on our collected animal RNA dataset, which is described in Section Materials. In our experiments, we aim to demonstrate the advantages of SPGNN over other state-of-the-art models.

The contributions of our paper can be summarized as follows:

  1. We propose to use the sequence-to-vector technique to capture the latent sequential information embedded in the nucleic acid sequence of each RNA, hence generating meaningful representations during the pre-training stage.

  2. We propose a novel SPGNN scheme that can leverage the pre-trained embeddings by fine-tuning these embeddings through a general graph neural network.

  3. Our intensive experiments, ablation study and hyperparameter analysis demonstrate the effectiveness of each stage and show that SPGNN can outperform state-of-the-art baselines.

METHODS

In this section, we first provide a detailed description of our collected dataset in Section Materials. Afterwards, we present an overview of SPGNN (Section Overview) followed by the details of our proposed scheme.

Materials

LncACTdb [31] is a comprehensive database of experimentally supported associations among ceRNAs and the corresponding personalized networks contributing to precision medicine. All lncRNA–miRNA associations collected in this database were experimentally verified. We obtained 1057 lncRNA–miRNA associations from LncACTdb 3.0 [31], covering 284 lncRNAs and 520 miRNAs. Sequences of lncRNAs were obtained from LNCipedia [32] and NONCODE [33]. LNCipedia is a public database of lncRNA sequences and annotations, containing 127 802 transcripts and 56 946 genes. NONCODE is a comprehensive database for the collection and annotation of non-coding RNAs, especially lncRNAs, in animals. Sequences of miRNAs were obtained from miRBase, a searchable database of published miRNA sequences and annotations.

Table 1

Statistics of our dataset

Entity type    Num    Edge type       Num
lncRNA         284    lncRNA–miRNA    1057
miRNA          520

Overview

In this section, we describe our proposed scheme SPGNN, which generates pre-trained embeddings for lncRNAs and miRNAs and fine-tunes them with general graph neural networks to recover the unseen associations between them. First, in Section Preliminary, we present the notations used in this article and briefly describe our research objective. Second, in Section Pre-training stage, we demonstrate the pre-training stage of SPGNN, where we use the $k$-mer method to split all RNA sequences into fragments of equal length and then use an NLP-based document embedding technique to generate pre-trained embeddings. Third, in Section Fine-tuning stage, we describe a general fine-tuning method that incorporates a GNN to leverage the pre-trained embeddings obtained in Section Pre-training stage. Finally, we present how to train SPGNN and predict unseen associations in Section Prediction. Figure 2 gives a brief overview of the SPGNN scheme.

Preliminary

To initialize our task, we first need to define all entities and their associated information. Hence, we denote $\mathcal{V}_{lnc}$ and $\mathcal{V}_{mi}$ as the sets of lncRNAs and miRNAs, respectively. Given all the associations between lncRNAs and miRNAs, we use $\mathcal{G}$ to represent the association graph, i.e. $\mathcal{G} = (\mathcal{V},\mathcal{E})$, where $\mathcal{V}$ is the set of all RNAs ($\mathcal{V} = \mathcal{V}_{lnc} \cup \mathcal{V}_{mi}$) and $\mathcal{E}$ is the set of associations between RNAs. As the collected associations between lncRNAs and miRNAs do not have specific directions, our association graph $\mathcal{G}$ is undirected. As every RNA in $\mathcal{V}$ has a unique sequence, we use $\boldsymbol{\mathrm{c}}_r$ ($\boldsymbol{\mathrm{c}}_r \in \mathcal{C}$) to denote the nucleic acid sequence of $r$, which is a miRNA or lncRNA. In the following sections, we encode each RNA into a low-dimensional embedding $\boldsymbol{\mathrm{e}}_r$ by incorporating its associated RNAs and its sequence information.

Pre-training stage

The first stage of our approach, the pre-training stage, aims to fully exploit the sequence information of all RNAs with the sequence embedding technique, which represents the sequences of RNA as compact numerical vectors that capture the underlying patterns and relationships in the sequences and preserve their biological significance. Effective utilization of RNA sequence information is crucial for the downstream fine-tuning stage that uses these representations for graph embedding learning.

Before embedding the RNA sequences, a $k$-mer counting technique is applied to all lncRNA and miRNA sequences. This method divides each RNA sequence into overlapping segments of fixed length $k$. By breaking down the RNA sequence into $k$-mers, we can apply the sequence embedding algorithm to smaller segments of equal length, which is more efficient than using the entire RNA sequence as input. Figure 1 illustrates the concept of $k$-mers. We treat the small $k$-mer RNA segments as words and the whole RNA sequences as documents when applying natural language processing techniques.
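To make the splitting concrete, here is a minimal Python sketch; the kmers helper is illustrative rather than part of our released implementation, and the example fragment follows Figure 1:

```python
def kmers(seq: str, k: int = 3) -> list:
    """Split an RNA sequence into overlapping k-mers (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# A fragment of miR-141-3p (cf. Figure 1); the exact fragment is illustrative.
print(kmers("UAACACUGUCU", k=3))
# ['UAA', 'AAC', 'ACA', 'CAC', 'ACU', 'CUG', 'UGU', 'GUC', 'UCU']
```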

Figure 1

The 3-mers of a part of the RNA sequence of miR-141-3p.

Figure 2

A detailed overview of our proposed SPGNN scheme, which includes the $k$-mer processing, pre-training and fine-tuning stages. In this figure, 3-mers are used as an example.

After applying the $k$-mer technique to partition each RNA sequence into overlapping subsequences of length $k$ that reflect its local composition and structure, we use the Doc2vec method [16] in the pre-training stage. Doc2vec is a natural language processing technique that represents variable-length text documents as fixed-length vectors in a high-dimensional space. In particular, Doc2vec is an extension of the popular Word2vec algorithm [15], which learns word embeddings from large corpora, and it allows the creation of document-level embeddings of text. By using Doc2vec, we can create document-level embeddings of RNA sequences that capture both their local and global features, making it a strong alternative to other embedding methods.

The main intuition behind the Doc2vec method in the pre-training stage is to associate a vector representation with the $k$-mer segment list of each RNA sequence in a corpus. In this task, we use the Distributed Memory (DM) approach to train the Doc2vec model. The DM model averages the sequence vector and the $k$-mer RNA segment vectors to obtain features that predict the next $k$-mer segment in a sequence, which can be seen as an extension of the Continuous Bag-of-Words model in Word2vec. By doing so, both local and global features of the RNA sequence are captured. The following equation shows how to embed the RNA sequence using the DM approach:

$$ \mathbf{y}_{t} = \text{softmax}\left(\mathbf{b} + \sum_{i=0}^{C} \mathbf{U}_{i} \mathbf{W}_{t-i} + \mathbf{V} \mathbf{D}\right), $$
(1)

where $\mathbf{y}_{t}$ is the output vector that predicts the next $k$-mer segment given the context segments and the sequence vector, $\text{softmax}(\cdot)$ is a function that normalizes the input to a probability distribution, $\mathbf{b}$ is the bias term, $\mathbf{U}_{i}$ is the vector representation of the $i$th $k$-mer segment in the RNA sequence being processed, $\mathbf{W}_{t-i}$ is the weight of the $i$th context $k$-mer segment, $\mathbf{V}$ is the weight of the entire RNA sequence and $\mathbf{D}$ is the feature vector representing the RNA sequence being processed.
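In practice, this pre-training step can be implemented with an off-the-shelf Doc2vec library. The following sketch uses gensim under the assumptions that rna_sequences is a mapping from RNA identifiers to their nucleic acid sequences and that kmers is the splitting helper sketched earlier; the hyperparameters mirror the defaults reported later in Section Experimental setting:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# One TaggedDocument per RNA: its k-mer list plays the role of the words,
# and the RNA identifier is the document tag.
corpus = [TaggedDocument(words=kmers(seq, k=3), tags=[rna_id])
          for rna_id, seq in rna_sequences.items()]

# dm=1 selects the Distributed Memory variant described above.
model = Doc2Vec(corpus, dm=1, vector_size=100, min_count=1, epochs=100)

# The learned document vector of an RNA becomes its pre-trained embedding.
emb = model.dv["hsa-miR-141-3p"]  # illustrative identifier
```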

In principle, the $k$-mer technique and the Doc2vec method are powerful and flexible tools for representing and analyzing RNA sequences in a high-dimensional vector space during the pre-training stage. It is noteworthy that BERT [34], a pre-training method based on the Transformer architecture, has gained substantial attention in the field of natural language processing, demonstrating remarkable performance in sequence embedding. In addition, the RNABERT model [35], which is based on the BERT architecture, has been proposed for RNA sequence embedding. A comprehensive comparison of our proposed model, which leverages both the $k$-mer technique and the Doc2vec method, with the RNABERT model is presented in the Experiments and Results section.

To conclude, the pre-training stage in our SPGNN scheme employs a $k$-mer technique to handle variable-length RNA sequences and utilizes pre-trained vector representations generated by the Doc2vec model to extract sequence information and accelerate the subsequent analysis process.

Fine-tuning stage

In the pre-training stage, we derived RNA sequence embeddings from the sequences of lncRNAs and miRNAs. Based on these, we construct the lncRNA–miRNA association graph by using the known associations between lncRNAs and miRNAs as edges and the corresponding RNA sequence embeddings as node features. To obtain node representations that accurately reflect the intricacies of the graph's structure and capture the complex associations between lncRNAs and miRNAs, we utilize a GNN in this stage. The GNN generates node representations that depend on the graph structure and incorporate information from neighboring nodes, ensuring the accuracy and robustness of the overall prediction process.

In general, the GNN framework utilizes the $\text{AGGREGATE}$ and $\text{UPDATE}$ functions to produce an updated node embedding. The $\text{AGGREGATE}$ function takes node embeddings from the current node's neighborhood and outputs a message, $\mathbf{m}_{\mathcal{N}(u)}$, based on the combined information. The function can be written as:

$$ \mathbf{m}_{\mathcal{N}(u)}^{(l)} = \text{AGGREGATE}^{(l)} \left(\{\mathbf{h}_v^{(l)}, \forall v \in \mathcal{N}(u)\}\right), $$
(2)

where $\mathbf{m}_{\mathcal{N}(u)}^{(l)}$ is the aggregate representation of the neighbors of node $u$ in layer $l$, $\mathcal{N}(u)$ is the set of neighbors of node $u$ and $\mathbf{h}_v^{(l)}$ is the representation of node $v$ in layer $l$. $\text{AGGREGATE}^{(l)}(\cdot)$ is the aggregation function used to combine the representations of the neighbors of node $u$ in layer $l$.

Then, the |$\text{UPDATE}$| function combines the current message with the previous node embedding to generate the updated embedding. It can be expressed as:

$$ \begin{aligned} \mathbf{h}_{u}^{(l)} &= \text{UPDATE}(\mathbf{h}_{u}^{(l-1)}, \mathbf{m}_{\mathcal{N}(u)}^{(l-1)}) \\ &= \sigma(\mathbf{W}_{\text{self}}^{(l)}\mathbf{h}_{u}^{(l-1)}+\mathbf{W}_{\text{neighbor}}^{(l)}\mathbf{m}_{\mathcal{N}(u)}^{(l-1)}), \end{aligned} $$
(3)

where $\mathbf{h}_{u}^{(l)}$ and $\mathbf{h}_{u}^{(l-1)}$ are the representations of node $u$ in layers $l$ and $l-1$, respectively, $\mathcal{N}(u)$ is the set of neighbors of node $u$, $\mathbf{m}_{\mathcal{N}(u)}^{(l-1)}$ is the aggregate representation of the neighbors of node $u$ in layer $l-1$, $\mathbf{W}_{\text{self}}^{(l)}$ and $\mathbf{W}_{\text{neighbor}}^{(l)}$ are trainable parameter matrices for layer $l$, and $\sigma(\cdot)$ is a non-linear activation function (e.g. ReLU or tanh). The $\text{UPDATE}(\cdot)$ function defines the way the representation of node $u$ is updated. Note that when $l=0$, the initial embeddings at the start of the iterations are set to the input features of all nodes. The final output of the GNN is used as the embedding of each node in the graph.
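As an illustration of this framework, the following PyTorch sketch implements Equations (2) and (3) with one concrete choice of aggregator, the mean over neighbors; it is a generic example rather than the exact layer of any particular GNN used in our study:

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """One generic GNN layer: mean AGGREGATE (Eq. 2) + linear UPDATE (Eq. 3)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.w_self = nn.Linear(in_dim, out_dim, bias=False)      # W_self
        self.w_neighbor = nn.Linear(in_dim, out_dim, bias=False)  # W_neighbor

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: [N, in_dim] node features; adj: dense [N, N] adjacency matrix.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        m = (adj @ h) / deg  # AGGREGATE: mean of neighbor embeddings
        return torch.relu(self.w_self(h) + self.w_neighbor(m))  # UPDATE
```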

Our proposed SPGNN approach is a general framework in the sense that it can be extended to different GNNs. Therefore, by combining our proposed pre-training stage with different GNNs, we may obtain different performances depending on the actual backbone GNN. In the scope of this paper, we incorporate six commonly used graph-based models: SGC [25], GCN [20], GraphSAGE [19], GAT [26], GNN-FiLM [27] and GATv2Conv [21].

Some of these variants, including SGC, GCN and GraphSAGE, use convolutional operations to update node representations based on the features of their neighbors. The others, including GAT, GNN-FiLM and GATv2Conv, use attention mechanisms to assign different weights to different neighbors based on their relevance to the central node. Any of these six variants can serve as the fine-tuning approach for predicting lncRNA–miRNA associations. Comparison details are given in the Experiments and Results section.

Prediction

After we employ a GNN architecture for fine-tuning and obtain the embedding of each node, we use a dot predictor function to predict the presence or absence of edges between lncRNA and miRNA nodes in a graph. The dot predictor computes a score for each candidate edge by taking the dot product of the embeddings of its two endpoint nodes.

We use this dot predictor function to predict associations in lncRNA–miRNA networks using two sets of graphs: positive and negative samples. Pairs of lncRNA and miRNA with an established association are categorized as positive samples, and lncRNA–miRNA pairs lacking a known association are considered negative samples. To obtain robust and reliable predictions, we randomly sample as many negative samples as there are positive samples. We calculate the link scores for both sets of graphs as follows:

$$ \text{Score}_{pos} = \text{DotPredictor}(\mathcal{G}_{pos}, h) $$
(4)
$$ \text{Score}_{neg} = \text{DotPredictor}(\mathcal{G}_{neg}, h), $$
(5)

where $\mathcal{G}_{pos}$ and $\mathcal{G}_{neg}$ are graphs containing the positive and negative links, respectively. Both graphs use the same node features $h$ produced by the GNN in the fine-tuning stage.

Then, we compute the loss over $\text{Score}_{pos}$ and $\text{Score}_{neg}$ using binary cross-entropy with logits, a loss function commonly used in binary classification tasks. It is defined as:

$$ L(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^N \left[ y_i \log \sigma(\hat{y}_i) + (1 - y_i) \log (1 - \sigma(\hat{y}_i)) \right], $$
(6)

where $y_i$ is the true label of the $i$th example in $\text{Score}_{pos}$ and $\text{Score}_{neg}$, $\sigma(\cdot)$ is the sigmoid function, $\hat{y}_i$ is the predicted probability of the positive class for the $i$th example and $N$ is the total number of examples in the positive and negative sets.
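A minimal PyTorch sketch of the scoring and loss computation in Equations (4)-(6) is given below; the node embeddings h come from the fine-tuning stage, and pos_edges/neg_edges are assumed index tensors standing in for the positive and negative graphs:

```python
import torch
import torch.nn.functional as F

def dot_scores(h: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
    """Edge score = dot product of the two endpoint embeddings (Eqs. 4-5)."""
    src, dst = edges  # edges: [2, E] LongTensor of node indices
    return (h[src] * h[dst]).sum(dim=-1)

score_pos = dot_scores(h, pos_edges)  # known lncRNA-miRNA associations
score_neg = dot_scores(h, neg_edges)  # equally many sampled non-associations

# Binary cross-entropy with logits (Eq. 6): positives labelled 1, negatives 0.
scores = torch.cat([score_pos, score_neg])
labels = torch.cat([torch.ones_like(score_pos), torch.zeros_like(score_neg)])
loss = F.binary_cross_entropy_with_logits(scores, labels)
```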

Finally, the lncRNA–miRNA pairs are ranked according to their computed association probabilities. Notably, the probability of association for a given lncRNA–miRNA pair can be seen as an indicator of their interrelation.

EXPERIMENTS AND RESULTS

In this section, we showcase the lncRNA–miRNA association prediction result with our proposed scheme SPGNN and demonstrate the effectiveness of our scheme.

Evaluation metrics

We use the F1 score, the Area Under the Receiver Operating Characteristic Curve (AUC), Average Precision (AP) and Normalized Discounted Cumulative Gain (NDCG) [36] as the metrics to evaluate the performance of lncRNA–miRNA association prediction results.

The F1 score is a widely used metric for binary classification, which considers both precision and recall to evaluate a model’s accuracy. By combining these two metrics into a single value, the F1 score offers a balanced assessment of a model’s performance. The precision, recall, and F1 score are calculated using the following formulas:

$$ \text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} $$
(7)
$$ \text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} $$
(8)
$$ \text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$
(9)

AUC is a metric to assess the accuracy of a binary classification model by comparing the true positive rate (TPR) and false positive rate (FPR). The TPR and FPR are calculated as:

$$ \text{TPR} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} $$
(10)
$$ \text{FPR} = \frac{\text{False Positive}}{\text{False Positive} + \text{True Negative}} $$
(11)

The receiver operating characteristic (ROC) curve displays TPR against FPR at different classification thresholds, and the AUC represents the area under the ROC curve, providing a comprehensive evaluation of the classifier’s performance at all thresholds.

AP is another performance metric used to assess the accuracy of the classifier in our task, which is calculated as the weighted average precision at each threshold in the precision-recall curve. The weight is determined by the increase in recall from the previous threshold.

The NDCG metric is based on the idea of computing the discounted cumulative gain (DCG) of the ranked items and normalizing the DCG score to take into account the ranking positions of the relevant items. The DCG score is calculated by adding up the relevance of the items, discounted according to their ranking positions. The NDCG score is obtained by dividing the DCG score by the best possible DCG score achievable through a perfect ranking. NDCG provides a more comprehensive evaluation of model performance than metrics such as precision and recall.
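For completeness, one common formulation of DCG and NDCG (the paper does not fix the exact variant; the standard logarithmic discount is assumed here) is:

$$ \text{DCG@}n = \sum_{i=1}^{n} \frac{rel_i}{\log_2(i+1)}, \qquad \text{NDCG@}n = \frac{\text{DCG@}n}{\text{IDCG@}n}, $$

where $rel_i$ is the relevance of the item at rank $i$ and $\text{IDCG@}n$ is the DCG of the ideal ranking.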

We evaluate the performance of our method using 5-fold cross-validation. This involves randomly dividing the observed lncRNA–miRNA associations into five equal subsets, using one as the test set and the rest as the training set. We then randomly select the same number of unobserved edges in the lncRNA–miRNA association graph as negative samples for the training and test sets. This process is repeated five times, with each of the five subsets serving as the test set in turn, and the results are averaged to obtain a single estimate.
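A sketch of this evaluation protocol with scikit-learn is shown below; pos_pairs and the train_and_score helper, which would train SPGNN on one fold and score the held-out pairs, are hypothetical placeholders:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import (f1_score, roc_auc_score,
                             average_precision_score, ndcg_score)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_results = []
for train_idx, test_idx in kf.split(pos_pairs):
    # train_and_score (hypothetical) trains on the training folds plus sampled
    # negatives and returns labels / predicted probabilities for the test fold.
    y_true, y_prob = train_and_score(pos_pairs, train_idx, test_idx)
    fold_results.append([f1_score(y_true, y_prob > 0.5),
                         roc_auc_score(y_true, y_prob),
                         average_precision_score(y_true, y_prob),
                         ndcg_score([y_true], [y_prob])])
print(np.mean(fold_results, axis=0))  # averaged F1, AUC, AP and NDCG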

Experimental setting

We implement SPGNN based on the PyTorch framework and the PyTorch Geometric package in Python for creating graph-based models. We adopt the Adam optimizer to train our model, with the learning rate set to 0.01. The random seed in Python is set to 42 to ensure consistency across the random operations used in the model.

Our implementation of the pre-training stage for RNA sequence embedding employs a fixed $k$-value of 3 in the $k$-mer method, following previous works such as [37-39]. As a result, both lncRNA and miRNA sequences are processed into three-character $k$-mers. We mainly investigate the Doc2vec model and compare it with other methods in terms of accuracy and efficiency to demonstrate the necessity and effectiveness of pre-training in our proposed scheme SPGNN. The following hyperparameters of the Doc2vec model are used as defaults in our experiments to obtain the best possible performance. The RNA embedding vector size is set to 100, which is the length of the RNA sequence embeddings. The minimum count is set to 1, meaning that $k$-mers that appear only once in the corpus are still included in the vocabulary. The Doc2vec model is trained for 100 epochs, i.e. 100 iterations over the entire corpus. After acquiring the embedding vector of each node, we normalize it to enhance the classification performance of our graph neural network.

For the fine-tuning stage of the SPGNN scheme, we use the SGC model as the default architecture, which consists of two SGC layers with a ReLU activation function in between. We use 100 input node features as the default setting in our experiments and assign 128 and 64 as the numbers of hidden and output node features, respectively. In our SGC model, we set the number of hops to $K=2$ for both layers, meaning that each node aggregates information from its two-hop neighbors. This architecture is selected as the default for the fine-tuning stage due to its widespread usage and efficacy in node representation learning on graph-structured data.
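A sketch of this default two-layer SGC architecture in PyTorch Geometric might look as follows; the class name is illustrative:

```python
import torch
import torch.nn as nn
from torch_geometric.nn import SGConv

class SGCEncoder(nn.Module):
    """Two SGC layers with ReLU in between, matching the default setting."""

    def __init__(self, in_dim: int = 100, hid_dim: int = 128, out_dim: int = 64):
        super().__init__()
        self.conv1 = SGConv(in_dim, hid_dim, K=2)   # aggregate 2-hop neighbors
        self.conv2 = SGConv(hid_dim, out_dim, K=2)

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)  # final node embeddings
```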

Examination of pre-training

To evaluate the contribution of pre-training in our scheme, we compare several pre-training methods that are commonly used to initialize node embeddings for GNNs. The Random Embedding method and the Adjacency Matrix Embedding method do not take any RNA sequence information into account: the former initializes the node embeddings with random values drawn from a Gaussian distribution, and the latter derives node embeddings from the adjacency matrix that represents the relationships between nodes in the lncRNA–miRNA association graph. We then compare three different sequence pre-training methods within our SPGNN scheme: SPGNN-RNABERT, SPGNN-Text2vec and SPGNN-Doc2vec. SPGNN-RNABERT uses a pre-trained RNABERT model to represent the lncRNAs and miRNAs in our dataset. SPGNN-Text2vec employs a "bag-of-words" representation of the $k$-mers of the RNA sequence, so that each lncRNA or miRNA sequence is transformed into a numerical representation based on the frequencies of the $k$-mers that appear in the sequence. SPGNN-Doc2vec is our default pre-training model, as described in Section Experimental setting.

The results in Table 2 reveal that the SPGNN-Doc2vec pre-training method is the most effective among all the pre-training methods across the evaluation metrics, with an F1 score of 0.760, AUC of 0.844, AP of 0.850 and NDCG of 0.972, which are substantially higher than those of Random Embedding (0.702, 0.733, 0.712, 0.934). These results demonstrate the advantage of our proposed scheme SPGNN in capturing the complex information in RNA sequences over Random Embedding and other methods that do not consider RNA sequence information.

Table 2

Performance comparison of different pre-training methods fine-tuned with SGC on the lncRNA–miRNA association prediction task. The best results for the four evaluation metrics (F1, AUC, AP and NDCG) are highlighted in bold

Pre-train                     Fine-tune    F1       AUC      AP       NDCG
Random Embedding              SGC          0.702    0.733    0.712    0.934
Adjacency Matrix Embedding    SGC          0.658    0.724    0.789    0.961
SPGNN-Text2vec                SGC          0.687    0.750    0.802    0.963
SPGNN-RNABERT                 SGC          0.750    0.838    0.847    0.972
SPGNN-Doc2vec                 SGC          0.760    0.844    0.850    0.972
Table 3

Comparison of the fine-tuning results of various GNN methods on the evaluation metrics of F1, AUC, AP and NDCG. The pre-training method used is SPGNN-Doc2vec. The results show that SPGNN-Doc2vec fine-tuned with SGC outperforms the other GNN methods in terms of the evaluation metrics

Pre-train        Fine-tune            F1       AUC      AP       NDCG
SPGNN-Doc2vec    Without fine-tune    0.455    0.500    0.509    0.855
SPGNN-Doc2vec    GAT                  0.615    0.627    0.581    0.870
SPGNN-Doc2vec    GCN                  0.747    0.842    0.849    0.971
SPGNN-Doc2vec    GraphSAGE            0.554    0.534    0.583    0.886
SPGNN-Doc2vec    GNN-FiLM             0.508    0.527    0.599    0.884
SPGNN-Doc2vec    GATv2Conv            0.636    0.611    0.568    0.857
SPGNN-Doc2vec    SGC                  0.760    0.844    0.850    0.972

Examination of fine-tuning

Then, we evaluate different fine-tuning methods for our SPGNN scheme. As described in Section Experimental setting, the default model for the fine-tuning stage is a two-layer SGC model. To ensure a thorough analysis, we implement several other two-layer GNN models for comparison, namely GCN, GAT, GraphSAGE, GNN-FiLM and GATv2Conv, each of which shares the same input and output node embedding vector sizes as the default SGC model.

In addition to comparing various fine-tuning methods, we also include an SPGNN-Doc2vec pre-training model without a fine-tuning stage, which serves as a useful benchmark for determining the relative effectiveness of the fine-tuning stage.

The comparison of various fine-tuning GNN methods with SPGNN-Doc2vec pre-training is presented in Table 3, and the performance of each pre-training and fine-tuning method is illustrated in Figure 3. We find that the most effective fine-tuning method for our proposed SPGNN scheme is the SGC model, which is able to capture the graph association information between different lncRNAs and miRNAs.

Figure 3

Evaluation metrics (F1, AUC, AP and NDCG) for different pre-training and fine-tuning methods on the lncRNA–miRNA association prediction task. (Top) A comparison of F1, AUC, AP and NDCG scores for five different pre-training methods. (Bottom) A comparison of F1, AUC, AP and NDCG scores for seven different fine-tuning methods. Both panels show that SPGNN-Doc2vec is the best pre-training method and SGC is the best fine-tuning method under the evaluation metrics used.

Method comparison

After obtaining satisfactory results with our SPGNN scheme by combining the $k$-mer technique, the Doc2vec model and fine-tuning with SGC, we assess the performance of our model against a state-of-the-art model termed preMLI [13], which was also proposed to predict associations between lncRNAs and miRNAs. Table 4 compares preMLI and our SPGNN-Doc2vec-SGC method on our dataset across the evaluation metrics. The preMLI method achieves an F1 score of 0.657, an AUC of 0.690, an AP of 0.698 and an NDCG of 0.930. Our SPGNN-Doc2vec-SGC method surpasses preMLI on all metrics by a large margin, clearly indicating its superiority in prediction accuracy.

Table 4

The comparison of preMLI and our SPGNN-Doc2vec-SGC method. SPGNN-Doc2vec-SGC outperforms preMLI in all metrics

Method               F1       AUC      AP       NDCG
preMLI               0.657    0.690    0.698    0.930
SPGNN-Doc2vec-SGC    0.760    0.844    0.850    0.972

Analysis of parameters

In this section, we analyze the influence of two important parameters, the $k$-value in $k$-mers and the RNA embedding vector size in the pre-training stage, on the performance of the model. We tune these parameters and report the F1, AUC, AP and NDCG values in Tables 5, 6 and 7.

Table 5

The evaluation metrics F1, AUC, AP and NDCG for different $k$-values, which can affect the performance of the $k$-mer algorithm. The best-performing $k$-value in terms of F1, AUC, AP and NDCG is highlighted in bold

Embedding size    K-value    Pre-train        Fine-tune    F1       AUC      AP       NDCG
100               3          SPGNN-Doc2vec    SGC          0.760    0.844    0.850    0.972
100               4          SPGNN-Doc2vec    SGC          0.756    0.840    0.847    0.971
100               5          SPGNN-Doc2vec    SGC          0.755    0.839    0.847    0.971
100               6          SPGNN-Doc2vec    SGC          0.756    0.840    0.846    0.971
100               7          SPGNN-Doc2vec    SGC          0.757    0.843    0.848    0.972
100               8          SPGNN-Doc2vec    SGC          0.758    0.843    0.848    0.972
Table 6

The evaluation metrics for lncRNA–miRNA association prediction using different $k$-values for lncRNAs and miRNAs

lncRNA K-value    miRNA K-value    F1       AUC      AP       NDCG
8                 3                0.751    0.843    0.849    0.972
8                 4                0.751    0.842    0.848    0.971
8                 5                0.751    0.843    0.849    0.972
12                3                0.759    0.843    0.849    0.972
12                4                0.760    0.843    0.849    0.972
12                5                0.759    0.843    0.849    0.972
16                3                0.749    0.843    0.848    0.972
16                4                0.759    0.843    0.849    0.972
16                5                0.759    0.843    0.848    0.972
Table 7

The evaluation metrics F1, AUC, AP and NDCG for different RNA embedding vector sizes. The best-performing RNA embedding vector size in terms of F1, AUC, AP and NDCG is highlighted in bold

Embedding size    K-value    Pre-train        Fine-tune    F1       AUC      AP       NDCG
50                3          SPGNN-Doc2vec    SGC          0.761    0.841    0.846    0.971
100               3          SPGNN-Doc2vec    SGC          0.760    0.844    0.850    0.972
150               3          SPGNN-Doc2vec    SGC          0.757    0.844    0.850    0.972
200               3          SPGNN-Doc2vec    SGC          0.757    0.844    0.850    0.972

Examination of k-value

We explore the impact of varying the value of $k$ from 3 to 8 in $k$-mers, with the RNA embedding vector size in the pre-training stage set to 100. The choice of the $k$-value can affect the quality of the RNA sequence representations. A small value of $k$ may result in an embedding that captures more sequence information of lncRNAs and miRNAs but also includes more noise, while a large value of $k$ may sacrifice some of the finer details but lead to a more robust embedding.

The results of the experiment are presented in Table 5, which shows that the highest performance is achieved with a $k$-value of 3: the F1, AUC, AP and NDCG values are 0.760, 0.844, 0.850 and 0.972, respectively. When $k$ is set between 4 and 8, the values of the four evaluation metrics show little variation, indicating that the parameter $k$ has a relatively small impact on the results.

We have also conducted experiments exploring the use of different $k$-values for miRNAs and lncRNAs, motivated by their distinct characteristics: a miRNA is a short RNA sequence with an average length of about 20 nucleotides, whereas lncRNAs are much longer non-coding RNA molecules. Three $k$-values (3, 4 and 5) are considered for miRNAs, combined with three $k$-values (8, 12 and 16) for lncRNAs. The results in Table 6 indicate that the choice of $k$-value for both lncRNAs and miRNAs has a small impact on the performance of the prediction task: no significant variations are observed across the evaluation metrics. This implies that the prediction of lncRNA–miRNA associations is robust to variations in the $k$-values within a certain range.

Examination of embedding size in the pre-training stage

We also investigate the influence of the RNA embedding vector size in the Doc2vec model during the pre-training stage by varying the embedding size from 50 to 200, with the $k$-value in $k$-mers set to 3. As shown in Table 7, the best performance is achieved when the embedding size is 100, 150 or 200. It is worth noting that, based on the experimental results, the impact of the embedding size on the results is also relatively small.

To summarize our findings, the results of our experiments demonstrate the impact of various parameters, such as the $k$-value and the RNA embedding vector size, on the performance of our model. Our analysis reveals that optimal performance is attained with an embedding size of 100, 150 or 200 and a $k$-value of 3, with the embeddings pre-trained via Doc2vec and fine-tuned using SGC. Nevertheless, the influence of these parameters on the results is relatively minor.

Case study

To further verify the capability of SPGNN to predict potential lncRNA–miRNA associations, we conducted a case study focusing on three specific RNAs (SNHG14, TUG1 and miR-214). In this study, we selected the best-performing model trained with SPGNN, which generates node features for each RNA, and ranked candidate partners by the correlation values derived from these node features.

SNHG14 is an lncRNA associated with a variety of diseases. We used SPGNN to predict potential miRNAs associated with SNHG14; the results are presented in Table 8. Among the top-10 predicted miRNAs, supporting research evidence can be found in PubMed for 5 of them: miR-107, miR-203, miR-217, miR-185-5p and miR-320a. For example, Liu et al. [40] found that SNHG14 could promote migration and invasion of clear cell renal cell carcinoma by sponging miR-203. Besides, Xu et al. [41] found that SNHG14 contributes to the development of hepatocellular carcinoma via sponging miR-217.

Table 8

The top-10 predicted results of SNHG14

Rank    miRNA         PMID
1       miR-141       Unknown
2       miR-107       33176720
3       miR-203       29312804
4       miR-217       32581548
5       miR-185-5p    33928771
6       miR-363-3p    Unknown
7       miR-222-3p    Unknown
8       miR-9         Unknown
9       miR-448       Unknown
10      miR-320a      36204642

TUG1 is another lncRNA involved in a variety of diseases. Similarly, we used SPGNN to predict potential miRNAs associated with TUG1; the results are presented in Table 9. Among the top-10 predicted miRNAs, supporting research evidence can be found in PubMed for 7 of them: miR-320a, miR-9-5p, miR-143-3p, miR-214, miR-143, miR-145 and miR-21. For example, Tan et al. [42] found that lncRNA TUG1 promotes bladder cancer malignant behaviors by regulating the miR-320a/FOXQ1 axis. Besides, Yao et al. [43] found that TUG1 knockdown repressed the viability, migration and differentiation of osteoblasts by sponging miR-214.

Table 9

The top-10 predicted results of TUG1

Rank    miRNA         PMID
1       miR-424-5p    Unknown
2       miR-320a      34920123
3       miR-9-5p      35785160
4       miR-143-3p    36386842
5       miR-214       35126706
6       miR-143       31264280
7       miR-145       36794657
8       miR-206       Unknown
9       miR-101-3p    Unknown
10      miR-21        30468492

In addition to the above two lncRNAs, we also conducted a case study on a miRNA. miR-214 is an important miRNA involved in the occurrence and development of many diseases. We used SPGNN to predict potential lncRNAs associated with miR-214; the results are presented in Table 10. Among the top-10 predicted lncRNAs, supporting research evidence can be found in PubMed for 4 of them: SNHG3, LINC00665, ZFAS1 and SNHG5. For example, Xi et al. [44] found that SNHG3 is involved in the initiation and progression of multiple human cancers by sponging miR-214. Besides, Wan et al. [45] found that LINC00665 could accelerate hepatocellular carcinoma growth and the Warburg effect by sponging miR-214.

Table 10

The top-10 predicted results of miR-214

Rank    lncRNA       PMID
1       SNHG3        35030969
2       VCAN         Unknown
3       LINC00665    34804162
4       HMGA2        Unknown
5       FGD5-AS1     Unknown
6       FN1          Unknown
7       ZFAS1        35386284
8       SNHG5        33433357
9       TMPO-AS1     Unknown
10      TTN-AS1      Unknown

CONCLUSION

miRNAs are short non-coding RNAs that silence genes by binding to their mRNAs. lncRNAs are RNA molecules that do not encode proteins and can act as ceRNAs or sponges of miRNAs, which can relieve the silencing effect of miRNAs and upregulate the expression of target genes. Experimental techniques, such as quantitative reverse transcription-PCR and HITS-CLIP, are used to verify ceRNA associations, but they are labor-intensive and material-intensive. Multiple computational methods have been proposed for the prediction of lncRNA–miRNA associations, including sequence alignment-based models, pre-trained models using natural language processing technologies, and deep feature mining models. However, some of these methods only consider sequence information and do not incorporate associations verified in the laboratory into training, limiting their ability to optimize parameters during the training process.

To address these problems, we present a novel graph-based pre-training scheme, SPGNN, which is designed to discover associations between lncRNAs and miRNAs. Our scheme consists of two stages: pre-training and fine-tuning. In the pre-training stage, we use a sequence-to-vector technique that incorporates a $k$-mer technique to handle RNA sequences of varying lengths and employs the Doc2vec model to generate pre-trained embeddings for all RNAs based on their nucleic acid sequences. In the fine-tuning stage, we use the SGC network as our fine-tuning model to predict the associations between lncRNAs and miRNAs using the pre-trained embeddings.

Our experiments on our newly collected animal RNA dataset show that our scheme could effectively capture the latent sequential information embedded in RNA sequences and leverage it for graph association prediction. We also demonstrate that our scheme can outperform state-of-the-art baselines on various evaluation metrics. Moreover, we conduct an ablation study and a hyperparameter analysis to verify the effectiveness of each component and parameter of our scheme.

Our work contributes to the field of bioinformatics by providing a new scheme that utilizes RNA sequence information and graph information to predict associations between lncRNAs and miRNAs. We also provide insights into how different sequence embedding techniques and GNNs can be applied to graph-based problems in bioinformatics. In the future, we aim to extend our approach to other types of RNAs and biological networks, thereby facilitating the discovery of new associations and enhancing our understanding of the underlying mechanisms of various complex biological processes.

Key Points
  • We design a novel graph-based pre-training scheme, namely sequence pre-training-based graph neural network (SPGNN), to discover associations between lncRNAs and miRNAs. It includes two stages: pre-training and fine-tuning.

  • In the pre-training stage of our SPGNN scheme, a $k$-mer technique is utilized to handle RNA sequences of varying lengths, and pre-trained vector representations are generated using the Doc2vec model to extract sequence information and accelerate the subsequent analysis process.

  • In the fine-tuning stage, we incorporate GNN models that leverage the embeddings generated during the pre-training stage. Among several commonly used GNN models, SGC is the most suitable model to predict the associations between lncRNAs and miRNAs.

  • We conduct an extensive set of experiments to evaluate the effectiveness of each stage of our proposed approach, as well as to determine the optimal hyperparameters for our model. Our results demonstrate that each stage of our approach is essential to achieving high performance in predicting the associations between lncRNAs and miRNAs.

ACKNOWLEDGMENTS

This research was supported by the MBZUAI-WIS project and the MBZUAI start-up funding.

FUNDING

This research was supported by the MBZUAI-WIS project and the MBZUAI start-up funding.

DATA AVAILABILITY

All data and code have been released on GitHub.

Author Biographies

Shangsong Liang is an assistant professor at the Mohamed bin Zayed University of Artificial Intelligence. He received his Ph.D. degree from the University of Amsterdam in 2014. His expertise lies in the fields of machine learning, information retrieval and data mining. He worked as a (visiting) postdoctoral research scientist at the University of Massachusetts Amherst, University College London and King Abdullah University of Science and Technology, and has extensively published his work in top-tier conferences and journals, including SIGIR, KDD, WWW, CIKM, AAAI, WSDM, NeurIPS, TKDE and TOIS. He is the recipient of an Outstanding Reviewer Award at SIGIR 2017.

Zixiao Wang is currently a Ph.D. student in Machine Learning at the Mohamed bin Zayed University of Artificial Intelligence. He holds a Bachelor's degree from Huazhong University of Science and Technology and a Master's degree from Mohamed bin Zayed University of Artificial Intelligence. His research interests primarily revolve around Graph Neural Networks and bioinformatics.

Siwei Liu is a postdoctoral fellow at Mohamed bin Zayed University of Artificial Intelligence, UAE. He received his B.S. degree and Ph.D. degree from the University of Glasgow in 2018 and 2023, respectively. His research interests include graph neural networks, recommender systems and AI4Science.

Shiyang Liang works at the 944 Hospital of Joint Logistic Support Force of the Chinese PLA. He received his Master's degree in internal medicine and his Bachelor's degree in clinical medicine from Air Force Medical University. His research interests primarily revolve around Graph Neural Networks, AI4Science and digestive system diseases.

Jingjie Wang is currently a professor of gastroenterology at Tangdu Hospital, Air Force Medical University. His research interests primarily revolve around functional gastrointestinal disorders and hepatic disease.

Zhaohan Meng is currently a Ph.D. candidate at the University of Glasgow. He received his B.S. degree from the University of Glasgow in 2023. His research interests include knowledge graphs, natural language processing and AI4Science.


Author notes

Zixiao Wang, Shiyang Liang and Siwei Liu contributed equally to this work.
