Abstract

Since miRNAs can participate in the posttranscriptional regulation of gene expression, they may provide ideas for the development of new drugs or become new biomarkers for drug targets or disease diagnosis. In this work, we propose an miRNA–disease association prediction method based on meta-paths (MDPBMP). First, an miRNA–disease–gene heterogeneous information network was constructed, and seven symmetrical meta-paths were defined according to different semantics. After constructing the initial feature vector for the node, the vector information carried by all nodes on the meta-path instance is extracted and aggregated to update the feature vector of the starting node. Then, the vector information obtained by the nodes on different meta-paths is aggregated. Finally, miRNA and disease embedding feature vectors are used to calculate their associated scores. Compared with the other methods, MDPBMP obtained the highest AUC value of 0.9214. Among the top 50 predicted miRNAs for lung neoplasms, esophageal neoplasms, colon neoplasms and breast neoplasms, 49, 48, 49 and 50 have been verified. Furthermore, for breast neoplasms, we deleted all the known associations between breast neoplasms and miRNAs from the training set. These results also show that for new diseases without known related miRNA information, our model can predict their potential miRNAs. Code and data are available at https://github.com/LiangYu-Xidian/MDPBMP.

Introduction

With the development of genomics and bioinformatics, noncoding RNAs (ncRNAs) have gradually gained more attention from scientists [1]. Many ncRNAs have been confirmed to play an important role in the regulation of human disease genes, especially microRNAs (miRNAs; [2, 3]). MiRNAs are a class of noncoding single-stranded RNA molecules encoded by endogenous genes that are ~22 nucleotides in length and bind to target mRNAs mainly through sequence-specific base pairing and participate in posttranscriptional gene expression regulation [4]. The identification of miRNA–disease associations is of great significance for medical practitioners to develop new therapeutic approaches targeting miRNAs [5, 6]. According to different predictive strategies, the existing methods can be divided into four categories: machine learning-, information dissemination-, scoring function- and matrix transformation-based methods.

For machine learning-based methods [7–12], Fu et al. [13] used a stacked autoencoder to extract high-level features of nodes from miRNA and disease similarity networks, concatenated the feature vectors of an miRNA and a disease to form the feature vector of the miRNA–disease pair and fed it into a three-layer neural network to predict miRNA–disease associations. Chen et al. [14] proposed a prediction model called EGBMMDA based on XGBoost (extreme gradient boosting machine). Based on statistical measurement, graph theory measurement and matrix decomposition theory, the model defines 18 measurement indices and concatenates all index values to obtain the feature vector of the miRNA–disease pair. The collection of negative samples is a major problem in this type of method.

For information dissemination-based methods [15, 16], You et al. [17] used a depth-first search algorithm to infer potential miRNA–disease associations (PBMDA). Chen et al. [18] proposed a novel computational model of Bipartite Network Projection for MiRNA-Disease Association prediction (BNPMDA). This type of method relies on the connectivity of the network, has a preference for nodes with more known connections, and ignores isolated nodes.

For scoring function-based methods, Chen et al. [19] proposed the concepts of super disease and super miRNA and developed a new predictive model, Super-Disease and MiRNA for potential MiRNA-Disease Association prediction (SDMMDA). Xie et al. [20] constructed transfer weights based on known miRNA and disease similarity and correctly configured the initial information. Then, a two-step bipartite network algorithm was used to infer potential miRNA–disease associations (WBNPMD, Weighted Bipartite Network Projection for MicroRNA-Disease association prediction). These methods rely on the definition of similarity, and at present there is no established evaluation method for judging how accurate a given definition of similarity is.

For matrix transformation-based methods, Chen et al. [21] proposed a semi-supervised model, Regularized Least Squares for MiRNA-Disease Association (RLSMDA), to predict miRNA–disease associations using the regularized least squares method, which can meet the condition of model training without negative samples. Based on the hypothesis that miRNA and disease-related information can be revealed through distribution semantics, Peng et al. [22] proposed an improved ILRMR (Improved Low-Rank Matrix Recovery method) model based on low-rank matrix restoration, which also does not require negative samples in the prediction process. The matrix-based method also has the problem of how to verify the rationality of the definition of similarity.

In addition, there are also many methods based on deep learning. For example, Fu et al. [23] proposed the MAGNN (metapath aggregated graph neural network) model, which includes three parts: node content transformation, intra-metapath aggregation and inter-metapath aggregation. The existing prediction models based on meta-paths do not consider the characteristic information of miRNA and disease nodes, ignore the information carried by the intermediate nodes of meta-paths and lack sufficient types of meta-paths.

Therefore, inspired by Fu et al. [23], we propose a novel miRNA–disease association prediction method based on meta-paths (MDPBMP). This approach not only extracts the characteristic information of the miRNA and disease nodes themselves but also effectively captures the information carried by the intermediate nodes on the meta-path. Compared with the other methods, MDPBMP obtained the highest area under curve (AUC) value. The introduction of disease–gene associations enriches the biological information of the nodes, and the selection of multiple meta-paths helps to better extract structural features in the network, thereby improving the accuracy of association predictions.

Materials and methods

Data sources

miRNA–disease associations

HMDD [24] is a commonly used miRNA–disease association database. All miRNA–disease associations collected in this database were experimentally verified. We downloaded the miRNA–disease associations provided by the latest version (v3.2) and the old version (v2.0) [25] from the HMDD database. We merged the two datasets and deleted duplicate miRNA–disease associations and ultimately obtained 18 732 miRNA–disease associations, including 894 diseases and 1206 miRNAs.

Disease–gene associations

DisGeNET [26] is a knowledge management platform that integrates and standardizes disease-related gene and variant data from multiple data sources. The latest version (v7.0) contains 1 134 924 gene–disease associations, involving 21 671 genes and 30 170 diseases. We matched the 894 diseases involved in the miRNA–disease associations with the 30 170 diseases involved in the gene–disease associations in DisGeNET and obtained 184 020 gene–disease associations, including 360 diseases and 17 119 genes. To ensure the reliability of the data, only those gene–disease associations with a DisGeNET score >0.1 and an evidence index >0.5 were selected. Ultimately, 16 985 gene–disease associations were obtained, including 328 diseases and 6022 genes.
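
The filtering step described above can be illustrated with a minimal sketch. The records, the field names `score` and `ei`, and the helper `filter_associations` are hypothetical and chosen only for illustration; they are not taken from the released code or from the DisGeNET file format.

```python
# Hypothetical gene-disease records; field names are assumptions for illustration.
records = [
    {"disease": "d1", "gene": "g1", "score": 0.30, "ei": 1.0},
    {"disease": "d1", "gene": "g2", "score": 0.05, "ei": 1.0},  # score too low
    {"disease": "d2", "gene": "g1", "score": 0.40, "ei": 0.4},  # evidence index too low
    {"disease": "d3", "gene": "g3", "score": 0.20, "ei": 0.9},  # disease not in the miRNA data
]
mirna_diseases = {"d1", "d2"}  # diseases that appear in the miRNA-disease associations


def filter_associations(records, diseases, min_score=0.1, min_ei=0.5):
    """Keep associations for matched diseases with score > 0.1 and evidence index > 0.5."""
    return [
        r for r in records
        if r["disease"] in diseases and r["score"] > min_score and r["ei"] > min_ei
    ]


kept = filter_associations(records, mirna_diseases)
```

Both thresholds are strict inequalities, matching the conditions stated in the text.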

miRNA functional similarity

The miRNA functional similarity data were downloaded from MISIM v2.0 [27] provided by Li et al. The database contains 1044 miRNAs, 1042 of which overlap with the miRNAs involved in miRNA–disease associations (see Section ‘miRNA–disease associations’). Only miRNA–miRNA associations with similarity >0.5 were retained, and we obtained 35 032 miRNA–miRNA interactions.

Our MDPBMP model

The overall framework of the MDPBMP model is shown in Figure 1.

Figure 1

The overall framework of the MDPBMP model. (A) The three types of data were downloaded from different databases and used to construct the miRNA–disease–gene three-layer heterogeneous network. (B) Seven meta-paths were defined based on three types of nodes and associations in the heterogeneous network. (C) (i) One-hot coding is used to construct initial feature vectors for different types of nodes, and then linear transformation is used to project the initial feature vectors of all nodes into the same vector space. (ii) The information carried by all nodes on the meta-path instances is extracted, and then the multi-head attention mechanism is used to aggregate the information extracted from the meta-path instances. (iii) The idea of graph attention layer is used to aggregate the information extracted from different meta-paths. (D) The association probabilities of miRNAs and diseases are calculated by their feature vectors, and the prediction results are verified using three other miRNA–disease association databases.

Defining meta-paths

The information network can be regarded as a directed graph |$Q=(V,E)$|, where |$V$| is the set of nodes |$v\in V$|, and |$E$| is the set of edges |$e\in E$|. Let |$A$| denote the set of node types and |$R$| denote the set of edge types. The network schema |${T}_Q=(A,R)$| is used to describe the meta-structure of the heterogeneous network |$Q=(V,E)$|, which can clearly show the number of node types and edge types in the complex heterogeneous information network. The meta-path |$P$| is defined on the network schema |${T}_Q=(A,R)$|, and the specific form is |${A}_1\overset{R_1}{\to }{A}_2\overset{R_2}{\to}\cdots \overset{R_{l-1}}{\to }{A}_l$|, which can be abbreviated as |${A}_1{A}_2\cdots{A}_l$|. In essence, the meta-path describes a composite relation |$R={R}_1\circ{R}_2\circ \cdots \circ{R}_{l-1}$| between node type |${A}_1$| and node type |${A}_l$|, where |$\circ$| represents the composition operation between the relations. Different meta-paths often have different semantics. Given a meta-path |$P$|, if there is a meta-path instance |$p=({v}_1,{v}_2,\cdots, {v}_l)$| between two nodes that obeys meta-path |$P$|, then the types of all nodes on path |$p$| must exist in set |$A$|, and the type of each edge |${e}_i=\langle{v}_i,{v}_{i+1}\rangle$| in path |$p$| is the same as the corresponding |${R}_i$| in meta-path |$P$|.

Obviously, the miRNA–disease–gene three-layer heterogeneous network is an undirected graph, so the information carried in the graph can be considered to be symmetrical. If the disease |${d}_1$| is connected to the disease |${d}_2$| according to a certain meta-path |$P$|⁠, then there is still a path that satisfies the meta-path |$P$| from the disease |${d}_2$| to the disease |${d}_1$|⁠. This requires that the selected meta-paths in this heterogeneous information network are all symmetrical, that is, they satisfy |$P={P}^{-1}$|⁠. Based on this premise, all meta-paths with lengths >5 can be composed of meta-paths with lengths equal to 2, 3, 4 and 5. Based on the above two conditions, the three types of nodes and the relationships in the network, we defined seven types of meta-paths. The meta-paths are defined as follows:

  • (i) MM: miRNA–miRNA. The edges between miRNAs are derived from miRNA functional similarity data, so the miRNAs at both ends of the meta-path instance are similar.

  • (ii) MDM: miRNA–disease–miRNA. This meta-path indicates that miRNAs related to the same disease should be similar.

  • (iii) MDGDM: miRNA–disease–gene–disease–miRNA. The two diseases in this meta-path instance are related to the same gene, so the two diseases should be similar. Then, miRNAs related to similar diseases should also be similar.

  • (iv) DGD: disease–gene–disease. This meta-path indicates that diseases related to the same gene should be similar.

  • (v) DMD: disease–miRNA–disease. This meta-path indicates that diseases related to the same miRNA should be similar.

  • (vi) DMMD: disease–miRNA–miRNA–disease. This meta-path indicates that diseases related to similar miRNAs should be similar.

  • (vii) GDMDG: gene–disease–miRNA–disease–gene. The two diseases in this meta-path instance are related to the same miRNA, so the two diseases should be similar. Then, genes related to similar diseases should also be similar.
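
As an illustration of the definitions above, the following sketch checks the symmetry condition for the seven meta-paths and enumerates meta-path instances on a toy network. The node names, the edge list and the helper functions are hypothetical; the node type is encoded in the first letter of each name purely for convenience.

```python
from collections import defaultdict

# Toy undirected heterogeneous network (hypothetical nodes).
# The first letter gives the node type: m = miRNA, d = disease, g = gene.
edges = [("m1", "d1"), ("m2", "d1"), ("m2", "d2"),
         ("d1", "g1"), ("d2", "g1"), ("m1", "m2")]

adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

META_PATHS = ["MM", "MDM", "MDGDM", "DGD", "DMD", "DMMD", "GDMDG"]


def is_symmetric(meta_path):
    """A meta-path is symmetric when it equals its own reverse."""
    return meta_path == meta_path[::-1]


def instances(meta_path, start):
    """Enumerate all instances of `meta_path` that begin at node `start`."""
    types = meta_path.lower()
    if start[0] != types[0]:
        return []
    paths = [[start]]
    for node_type in types[1:]:
        # Extend every partial path by each neighbor of the required type.
        paths = [p + [nxt] for p in paths
                 for nxt in sorted(adj[p[-1]]) if nxt[0] == node_type]
    return [tuple(p) for p in paths]


mdm_from_m1 = instances("MDM", "m1")
```

In this sketch an instance may return to its starting node (for example m1, d1, m1); whether such instances are kept is a design choice that the text does not specify.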

Learning node feature vectors based on meta-paths

Building the initial characteristics of the node
We use one-hot encoding [28] to set an initial feature vector for each type of node in the network. The feature vectors of miRNA |$m\in{V}_M$|, disease |$d\in{V}_D$| and gene |$g\in{V}_G$| are |${x}_m\in{\mathbb{R}}^{1206}$|, |${x}_d\in{\mathbb{R}}^{894}$| and |${x}_g\in{\mathbb{R}}^{6022}$|, respectively. |${V}_M,{V}_D$| and |${V}_G$| represent the miRNA node set, disease node set and gene node set in the miRNA–disease–gene three-layer heterogeneous network. To be able to process these three types of nodes in the same framework, we use a specific linear transformation for each type of node to map them into the same vector space. For any miRNA |$m\in{V}_M$|, its feature vector |${x}_m$| is transformed as follows:
$$\begin{equation} {h}_m^{\prime }={W}_M\cdot{x}_m \end{equation}$$
(1)
where |${W}_M\in{\mathbb{R}}^{n\times 1206}$| is the parameter weight matrix of the miRNA node, and |${h}_m^{\prime}\in{\mathbb{R}}^n$| is the hidden feature vector of miRNA |$m$| after the initial feature vector of miRNA |$m$| is mapped to the n-dimensional vector space. Similarly, for each disease and gene node, their initial feature vectors are mapped into the n-dimensional vector space in the same way:
$$\begin{equation} {h}_d^{\prime }={W}_D\cdot{x}_d \end{equation}$$
(2)
$$\begin{equation} {h}_g^{\prime }={W}_G\cdot{x}_g \end{equation}$$
(3)
where |${x}_d$| and |${x}_g$| are the initial feature vectors of disease and gene nodes, respectively. |${W}_D\in{\mathbb{R}}^{n\times 894}$| is the parameter weight matrix of the disease node. |${W}_G\in{\mathbb{R}}^{n\times 6022}$| is the parameter weight matrix of the gene node. |${h}_d^{\prime}\in{\mathbb{R}}^n$| and |${h}_g^{\prime}\in{\mathbb{R}}^n$| are the hidden feature vectors obtained after the initial feature vectors of disease and gene nodes are mapped to the n-dimensional vector space.
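
Equations (1)–(3) can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the weight matrices are randomly initialised here rather than trained, and the hidden dimension of 64 and the helper names `one_hot` and `project` are chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64  # hidden dimension, assumed for this example
num_nodes = {"M": 1206, "D": 894, "G": 6022}  # node counts per type from the data section

# One weight matrix per node type (W_M, W_D, W_G), random here instead of learned.
W = {t: rng.normal(scale=0.01, size=(n, size)) for t, size in num_nodes.items()}


def one_hot(index, size):
    """Initial feature vector x with a single 1 at position `index`."""
    x = np.zeros(size)
    x[index] = 1.0
    return x


def project(node_type, index):
    """Type-specific linear map h' = W_type · x, as in Equations (1)-(3)."""
    return W[node_type] @ one_hot(index, num_nodes[node_type])


h_m = project("M", 5)
h_d = project("D", 5)
```

Because the input is one-hot, the product simply selects one column of the weight matrix, so the transformation acts as a learnable embedding lookup for each node.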
Internal aggregation of meta-paths
Given a meta-path |$P$|⁠, there are generally multiple corresponding meta-path instances in a heterogeneous network. We denote an instance of the meta-path |$P$| as |$P(v,u)$|⁠, with |$v$| as the starting node and |$u$| as the ending node. The idea of maximum pooling in the convolutional neural network is used to encode the meta-path instances. A single vector of the meta-path instance with the same dimension as the hidden feature vector of the node is generated. The specific method is to select the maximum value of the feature vector of all nodes in the meta-path instance in each dimension, represented by the following formula:
$$\begin{equation} {h}_{P\left(v,u\right),i}=\max \left(\left\{{h}_{t,i}^{\prime },\forall t\in P\left(v,u\right)\right\}\right)\kern0.5em i=1,2,\cdots, n \end{equation}$$
(4)
where |$t$| is any node on the meta-path instance |$P(v,u)$|, |${h}_{t,i}^{\prime }$| represents the value of the hidden feature vector of |$t$| in the ith dimension and |${h}_{P(v,u),i}$| represents the value of the vector |${h}_{P(v,u)}\in{\mathbb{R}}^n$| in the ith dimension. In a complex heterogeneous network, meta-path |$P$| often has multiple instances connecting |$v$| and |$u$|. We only consider one of them, represented by |$P(v,u)$|.
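As a small worked example of Equation (4), the sketch below encodes one DGD instance by taking the element-wise maximum over its nodes. The three-dimensional hidden vectors and the function name `encode_instance` are hypothetical.

```python
import numpy as np


def encode_instance(hidden, instance):
    """Element-wise maximum over the hidden vectors of all nodes on the instance (Eq. 4)."""
    return np.max(np.stack([hidden[t] for t in instance]), axis=0)


# Hypothetical hidden vectors with n = 3.
hidden = {
    "d1": np.array([0.2, -1.0, 0.5]),
    "g1": np.array([0.9, 0.0, -0.3]),
    "d2": np.array([-0.4, 0.7, 0.1]),
}
h_inst = encode_instance(hidden, ("d1", "g1", "d2"))  # -> [0.9, 0.7, 0.5]
```

Each dimension of the result comes from whichever node has the largest value there, so the intermediate gene node contributes the first coordinate in this example.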
After encoding each meta-path instance of |$P(v,u)$| into a single vector, the vectors with the same starting node need to be aggregated. |${N}_v^P$| is used to represent the set of end nodes of path |$P$|⁠, i.e. |$u\in{N}_v^P$|⁠, where |$v$| is the starting node of |$P$|⁠. |${N}_v^P$| is called the neighbor of node |$v$| based on meta-path |$P$|⁠. The different types and lengths of edges for each meta-path lead to their different contributions to the starting node, so they need to be classified according to the types of meta-paths. Considering that the information obtained from different meta-path instances may contribute differently to the starting node, we adopt the idea of a graph attention layer to perform weighted summation on the instances of meta-path |$P$|⁠. A normalized weight |${\alpha}_{vu}^P$| is first calculated for each meta-path instance as follows:
$$\begin{equation} {e}_{vu}^P= LeakyReLU\left({a}_P^T\cdot \left[{h}_v^{\prime}\left\Vert{h}_{P\left(v,u\right)}\right.\right]\right) \end{equation}$$
(5)
$$\begin{equation} {\alpha}_{vu}^P=\frac{\exp \left({e}_{vu}^P\right)}{\sum_{s\in{N}_v^P}\exp \left({e}_{vs}^P\right)} \end{equation}$$
(6)
where |${a}_P^T\in{\mathbb{R}}^{2n}$| is the attention vector to be trained for meta-path |$P$|⁠. |$\Vert$| indicates the splicing of vectors. LeakyReLU is a nonlinear activation function. The importance of a meta-path instance |$P(v,u)$| to the starting node |$v$| is denoted by |${e}_{vu}^P$|⁠. Equation (6) uses a normalized exponential function to normalize the importance |${e}_{vu}^P$| of meta-path instances |$P(v,u)$| satisfying |$u\in{N}_v^P$|⁠. Then, a single vector of meta-path instances with |$v$| as the starting node is weighted and summed. The formula is as follows:
$$\begin{equation} {h}_v^P=\sigma \left(\sum \limits_{u\in{N}_v^P}{\alpha}_{vu}^P\cdot{h}_{P\left(v,u\right)}\right) \end{equation}$$
(7)
where |$\sigma (\cdot )$| is the activation function and |${h}_v^P$| represents the feature vector of node |$v$| based on meta-path |$P$|. To further improve the expressive power of the graph attention layer, we use the multi-head attention mechanism to improve the calculation of |${h}_v^P$|. K independent attention vectors are used in Formula (5), and each attention vector yields its own set of weights through Formula (6). Because the attention vectors are initialized differently, the resulting weights also differ. The K sets of weights are input into Formula (7) to obtain K vectors |${h}_v^P$|, which are then concatenated to obtain the final vector |${h}_v^P\in{\mathbb{R}}^{nK}$|. The improved formula is as follows:
$$\begin{equation} {h}_v^P=\underset{k=1}{\overset{K}{\Vert }}\sigma \left(\sum \limits_{u\in{N}_v^P}{\left[{\alpha}_{vu}^P\right]}_k\cdot{h}_{P\left(v,u\right)}\right) \end{equation}$$
(8)
where |${[{\alpha}_{vu}^P]}_k$| represents the normalized importance of |$P(v,u)$| under the kth attention mechanism. The introduction of a multihead attention mechanism here helps stabilize the learning process and reduce the high variance introduced by the heterogeneity of the graph.

Taking disease nodes as an example, there are three meta-paths, DGD, DMD and DMMD, whose starting node type is D (Disease). We first find the meta-path instances of meta-path DGD with disease |$d\in{V}_D$| as the starting node. The vector information of each meta-path instance is calculated using Equation (4). These vectors are weighted and summed K times according to Equations (5)–(7). Then, these K results are processed using Equation (8) to obtain the vector representation |${h}_d^{\mathrm{DGD}}$| of disease |$d$| based on meta-path DGD. Next, the same operation is performed on meta-paths DMD and DMMD. Finally, three vector representations of disease |$d$| can be obtained, denoted as |$\{{h}_d^{\mathrm{DGD}},{h}_d^{\mathrm{DMD}},{h}_d^{\mathrm{DMMD}}\}$|⁠.

In the same way, there are three meta-paths MM, MDM and MDGDM whose starting node type is |$M$|(miRNA) and one meta-path GDMDG whose starting node type is |$G$|(Gene). Therefore, any miRNA |$m\in{V}_M$| based on meta-paths MM, MDM and MDGDM can obtain three vector representations |$\{{h}_m^{\mathrm{MM}},{h}_m^{\mathrm{MDM}},{h}_m^{\mathrm{MDGDM}}\}$|⁠. For any gene |$g\in{V}_G$| based on meta-path GDMDG, a vector can be obtained to represent |$\{{h}_g^{\mathrm{GDMDG}}\}$|⁠.
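
The intra-meta-path aggregation of Equations (5)–(8) can be sketched as below. This is a minimal illustration, not the released implementation: the LeakyReLU slope of 0.2, the use of tanh for the activation |$\sigma$|, the random inputs and the function names are all assumptions.

```python
import numpy as np


def attention_weights(h_v, h_inst, a, slope=0.2):
    """Normalized weights over the instances that start at v (Eqs. 5 and 6)."""
    # Concatenate h'_v with every encoded instance: shape (num_instances, 2n).
    pairs = np.hstack([np.tile(h_v, (len(h_inst), 1)), h_inst])
    e = pairs @ a
    e = np.where(e > 0, e, slope * e)  # LeakyReLU (slope is an assumption)
    w = np.exp(e - e.max())            # numerically stable softmax
    return w / w.sum()


def intra_metapath(h_v, h_inst, attn, act=np.tanh):
    """Multi-head weighted sum over instances, heads concatenated (Eqs. 7 and 8)."""
    heads = [act(attention_weights(h_v, h_inst, a) @ h_inst) for a in attn]
    return np.concatenate(heads)


rng = np.random.default_rng(1)
n, K, num_instances = 3, 2, 4
h_v = rng.normal(size=n)                     # hidden vector of the starting node
h_inst = rng.normal(size=(num_instances, n))  # encoded instances from Eq. (4)
attn = rng.normal(size=(K, 2 * n))            # one attention vector per head
alpha = attention_weights(h_v, h_inst, attn[0])
h_v_P = intra_metapath(h_v, h_inst, attn)
```

The output has dimension nK because each head produces an n-dimensional vector and the K heads are concatenated rather than averaged.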

Aggregating information carried by different meta-paths
After obtaining multiple vector representations of nodes, we need to merge them into one feature vector. Taking nodes of miRNA as an example, the feature vector of meta-path |${P}_i$| is first initialized by averaging the feature vectors |${h}_m^{P_i}$| of all miRNA |$m\in{V}_M$| nodes based on meta-path |${P}_i$|⁠. The calculation formula is as follows:
$$\begin{equation} {s}_{P_i}=\frac{1}{\left|{V}_M\right|}\sum \limits_{m\in{V}_M}\tanh \left({W}_M^{\prime}\cdot{h}_m^{P_i}+{b}_M\right) \end{equation}$$
(9)
where |${W}_M^{\prime}\in{\mathbb{R}}^{d_M\times nK}$| and |${b}_M\in{\mathbb{R}}^{d_M}$| are the learnable parameters of miRNA-type nodes. There are three meta-paths represented by |${P}_i$|⁠, namely, MM, MDM and MDGDM. Then, the weight of each meta-path is calculated and normalized by the initial feature vector of meta-path |${P}_i$| using the attention mechanism. The formula is as follows:
$$\begin{equation} {e}_{P_i}={q}_M^T\cdot{s}_{P_i} \end{equation}$$
(10)
$$\begin{equation} {\beta}_{P_i}=\frac{\exp \left({e}_{P_i}\right)}{\sum_{P\in{\varPsi}_M}\exp \left({e}_P\right)} \end{equation}$$
(11)
where |${q}_M\in{\mathbb{R}}^{d_M}$| is the parameterized attention vector of miRNA-type nodes. |${\varPsi}_M=\{\mathrm{MM},\mathrm{MDM},\mathrm{MDGDM}\}$| represents the set of optional meta-paths for miRNA nodes, that is, the set of meta-paths starting with miRNA nodes. The relative importance of meta-path |${P}_i$| to miRNA-type nodes is denoted by |${\beta}_{P_i}$|⁠. For miRNA node |$m\in{V}_M$|⁠, a weighted summation is performed on the vectors obtained based on different meta-paths. The calculation formula is as follows:
$$\begin{equation} {h}_m^{\varPsi_M}=\sum \limits_{P\in{\varPsi}_M}{\beta}_P\cdot{h}_m^P \end{equation}$$
(12)
Finally, an additional linear transformation with a nonlinear function is used to project the node feature vector of miRNA into a vector space with the desired output size. The formula is as follows:
$$\begin{equation} {h}_m=\sigma \left({W}_o\cdot{h}_m^{\varPsi_M}\right) \end{equation}$$
(13)
where |${W}_o\in{\mathbb{R}}^{d_o\times nK}$| is the weight matrix of miRNA nodes.

The same operation is performed for disease and gene nodes to obtain a vector representation of each node in the network for further training of the subsequent model.
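
A minimal NumPy sketch of Equations (9)–(13) for miRNA nodes is given below. The dimensions, the random parameters, the choice of tanh for the final activation and the function name `inter_metapath` are assumptions made only for illustration.

```python
import numpy as np


def inter_metapath(h_by_path, W_prime, b, q, W_o, act=np.tanh):
    """Fuse per-meta-path node vectors with meta-path level attention (Eqs. 9-13)."""
    names = list(h_by_path)
    # Eq. (9): summary vector of each meta-path, averaged over all nodes of the type.
    s = [np.tanh(h_by_path[p] @ W_prime.T + b).mean(axis=0) for p in names]
    # Eq. (10): unnormalized importance of each meta-path.
    e = np.array([q @ s_p for s_p in s])
    # Eq. (11): softmax over the meta-paths of this node type.
    beta = np.exp(e - e.max())
    beta = beta / beta.sum()
    # Eq. (12): weighted sum of the per-meta-path vectors for every node.
    fused = sum(w * h_by_path[p] for w, p in zip(beta, names))
    # Eq. (13): projection to the output dimension.
    return act(fused @ W_o.T), dict(zip(names, beta))


rng = np.random.default_rng(0)
nK, d_M, d_o, num_mirnas = 8, 5, 4, 6  # hypothetical sizes
h_by_path = {p: rng.normal(size=(num_mirnas, nK)) for p in ("MM", "MDM", "MDGDM")}
h_m, beta = inter_metapath(
    h_by_path,
    W_prime=rng.normal(size=(d_M, nK)),
    b=np.zeros(d_M),
    q=rng.normal(size=d_M),
    W_o=rng.normal(size=(d_o, nK)),
)
```

Note that the weights are shared by all nodes of the same type: one weight per meta-path is computed from the averaged summaries, not one weight per node.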

Calculating miRNA–disease association scores

All miRNA–disease pairs with known associations were regarded as positive samples, and the set of all miRNA–disease pairs with unknown associations was regarded as a negative sample set. In the validation set and test set, the number of negative samples is the same as that of positive samples, and the negative samples are randomly drawn from the negative sample set. To reduce the influence of negative sampling on the prediction results, the negative samples used in training are drawn uniformly on the fly during the training process instead of being fixed in advance. After obtaining positive and negative samples, the model weights are optimized by minimizing the loss function in the following formula:
$$\begin{equation} L=-\sum \limits_{\left(m,d\right)\in \varOmega}\log \sigma \left({h}_d^T\cdot{h}_m\right)-\sum \limits_{\left({m}^{\prime },{d}^{\prime}\right)\in{\varOmega}^{-}}\log \sigma \left(-{h}_{d^{\prime}}^T\cdot{h}_{m^{\prime }}\right) \end{equation}$$
(14)
where |$\Omega$| is a set of miRNA–disease pairs with known associations, and |${\Omega}^{-}$| is a set of negative samples sampled from all miRNA–disease pairs with unknown associations. To prevent the model from overfitting, the early stopping method is used to stop training when the model performs best.
After obtaining the miRNA node embedding vector |${h}_m$| and the disease embedding vector |${h}_d$| generated by the training model, the association probability between miRNA |$m$| and disease |$d$| is calculated using the following formula:
$$\begin{equation} {p}_{md}=\sigma \left({h}_d^T\cdot{h}_m\right) \end{equation}$$
(15)

Finally, the miRNA–disease pairs were sorted according to the calculated association probability |${p}_{md}$|⁠. The greater the association probability between miRNA |$m$| and disease |$d$|⁠, the higher the likelihood that they are related.
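
Equations (14) and (15) and the final ranking step can be sketched as follows. The two-dimensional embeddings and the function names are hypothetical, and the sketch only evaluates the loss; it does not perform the gradient-based optimization.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def association_scores(H_m, H_d):
    """p_md = sigmoid(h_d^T h_m) for every miRNA-disease pair (Eq. 15)."""
    return sigmoid(H_m @ H_d.T)


def loss(H_m, H_d, pos_pairs, neg_pairs):
    """Log-likelihood loss over positive and sampled negative pairs (Eq. 14)."""
    def dots(pairs):
        m_idx, d_idx = np.array(pairs).T
        return np.sum(H_m[m_idx] * H_d[d_idx], axis=1)
    return (-np.sum(np.log(sigmoid(dots(pos_pairs))))
            - np.sum(np.log(sigmoid(-dots(neg_pairs)))))


def rank_mirnas(scores, d):
    """miRNA indices sorted by decreasing association probability with disease d."""
    return np.argsort(-scores[:, d])


# Hypothetical embeddings: two miRNAs and one disease.
H_m = np.array([[1.0, 0.0], [0.0, 1.0]])
H_d = np.array([[2.0, 0.0]])
scores = association_scores(H_m, H_d)
value = loss(H_m, H_d, pos_pairs=[(0, 0)], neg_pairs=[(1, 0)])
order = rank_mirnas(scores, 0)
```

Here the first miRNA has inner product 2 with the disease and the second has inner product 0, so the first is ranked higher and the second receives a probability of exactly 0.5.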

Results

Performance comparison with other algorithms

Based on known miRNA–disease associations (see Section ‘miRNA–disease associations’), we used 5-fold cross-validation [29] to compare our MDPBMP with the other three miRNA–disease association prediction models WBNPMD [20], BNPMDA [18] and PBMDA [17]. The AUC values of MDPBMP, WBNPMD, BNPMDA and PBMDA are shown in Figure 2 and are 0.92140, 0.88745, 0.85261 and 0.89857, respectively. Our MDPBMP obtained the highest AUC value. WBNPMD and BNPMDA achieved lower AUC values. The reasons may be that they both predict miRNA–disease associations through resource allocation and transfer. Excessive reliance on the similarity matrix in the prediction process will affect the prediction performance of the model. Compared with these two models, MDPBMP and PBMDA use 0.5 as the threshold to screen the similarity of miRNAs, which improves the reliability of the similarity value. However, PBMDA only uses the number and length of paths between nodes to evaluate the associated probability and ignores the node information inside the path and the information carried by the node itself. Our MDPBMP makes up for these two deficiencies by constructing feature vectors for nodes and aggregating information from all nodes on each meta-path instance and obtains the highest AUC value.

Figure 2

ROC curve obtained by four prediction models in 5-fold cross-validation.
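
For readers who want to reproduce this kind of comparison, the AUC can be computed directly from prediction scores as the probability that a randomly chosen positive pair is scored above a randomly chosen negative pair, with ties counted as one half. The sketch below uses this rank-based definition with hypothetical scores; it is not the evaluation script used for Figure 2.

```python
import numpy as np


def auc(scores_pos, scores_neg):
    """Rank-based AUC: fraction of (positive, negative) pairs ordered correctly."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    wins = (pos > neg).sum() + 0.5 * (pos == neg).sum()
    return wins / (pos.size * neg.size)


# Hypothetical scores: three of the four positive-negative pairs are ordered correctly.
example_auc = auc([0.9, 0.8], [0.1, 0.85])  # -> 0.75
```

This pairwise form needs memory proportional to the number of positive-negative pairs, so it is suited to illustration rather than to large test sets.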

Robustness verification

To verify the robustness of our MDPBMP model, we used two different training datasets, HMDD v2.0 and HMDD v3.2. We downloaded 5430 known experimentally verified miRNA–disease associations from HMDD v2.0 [25], including 495 miRNAs and 383 diseases. The disease–gene associations associated with these 383 diseases were selected from DisGeNET [26]. After screening for the two conditions of disease–gene association score >0.1 and evidence index >0.5, 5937 disease–gene associations were obtained, of which 3790 genes were involved. The miRNA functional similarity matrix was obtained from the work of Wang et al. [30]. After deleting the value of similarity score <0.5, 10 049 miRNA–miRNA associations were obtained. The information of the two training datasets is shown in Table 1.

Table 1

Comparison of the number of items in training datasets HMDD v2.0 and HMDD v3.2

Dataset | miRNA | Disease | Gene | miRNA–disease associations | Disease–gene associations | miRNA–miRNA associations
HMDD v2.0 | 495 | 383 | 3790 | 5430 | 5937 | 10 049
HMDD v3.2 | 1206 | 894 | 6022 | 18 732 | 16 958 | 35 032

During the experiment, the known miRNA–disease associations were divided into training, validation and test sets at a ratio of 8:1:1. For the validation and test sets, negative samples equal in number to the positive samples are randomly selected from unknown miRNA–disease associations. The negative samples in each training process are uniformly sampled from the remaining unknown associations. Due to the small sample size of the test set, the resulting curve is less smooth. Figures 3 and 4 show the ROC and PR curves obtained by the MDPBMP model using these two sets of data, respectively. The AUC and area under precision and recall curve (AUPR) values obtained by the MDPBMP model using HMDD v2.0 data are 0.92941 and 0.92807, respectively. The AUC and AUPR values obtained using HMDD v3.2 data are 0.93778 and 0.94096, respectively. Compared with 5-fold cross-validation, this verification method has the same number of positive and negative samples in the true label, so it can obtain a higher AUC value. Regardless of which set of data is used, the MDPBMP model shows good predictive performance. When more complete data and richer information are provided, the predictive performance of the MDPBMP model improves.
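
The splitting and negative-sampling protocol described above can be sketched with the standard library as follows. The toy association list, the seeds and the function names are hypothetical, and rejection sampling is one simple way to draw uniformly from the unknown pairs; the released code may proceed differently.

```python
import random


def split_positives(pairs, seed=0):
    """Shuffle the known associations and split them 8:1:1."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_train = int(0.8 * len(pairs))
    n_val = int(0.1 * len(pairs))
    return pairs[:n_train], pairs[n_train:n_train + n_val], pairs[n_train + n_val:]


def sample_negatives(known, n_mirna, n_disease, k, rng, exclude=()):
    """Draw k distinct unknown (miRNA, disease) pairs uniformly by rejection sampling.

    Assumes k does not exceed the number of available unknown pairs.
    """
    blocked = set(known) | set(exclude)
    out = set()
    while len(out) < k:
        pair = (rng.randrange(n_mirna), rng.randrange(n_disease))
        if pair not in blocked:
            out.add(pair)
    return sorted(out)


# Toy data: 20 known associations among 20 miRNAs and 5 diseases.
known = [(i, i % 5) for i in range(20)]
train_pos, val_pos, test_pos = split_positives(known)
rng = random.Random(1)
val_neg = sample_negatives(known, 20, 5, len(val_pos), rng)
test_neg = sample_negatives(known, 20, 5, len(test_pos), rng, exclude=val_neg)
# Training negatives would be redrawn in every epoch rather than fixed once.
train_neg = sample_negatives(known, 20, 5, len(train_pos), rng, exclude=val_neg + test_neg)
```

Excluding the validation and test negatives when drawing training negatives mirrors the statement that training negatives come from the remaining unknown associations.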

Figure 3

The ROC curve obtained when the MDPBMP model uses two sets of data.

Figure 4

The PR curve obtained when the MDPBMP model uses two sets of data.

Figure 5 further shows the comparison of multiple indicators of the two sets of results, including accuracy, precision, recall and F1 scores. As shown in the figure, the four index values obtained by the MDPBMP model using HMDD v2.0 data are 0.85727, 0.84892, 0.86924 and 0.85896, and the four index values obtained using HMDD v3.2 data are 0.87053, 0.88556, 0.85104 and 0.86796, respectively. Except for the recall value, the prediction results using the HMDD v3.2 data are greater than those using the HMDD v2.0 data on the other three indicators. Once again, it is verified that when more miRNA–disease associations are provided, the prediction results of the model are more accurate.
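
As a quick consistency check on the reported indicators, the F1 score is the harmonic mean of precision and recall, so it can be recomputed from the two values given above. The short sketch below does this for both datasets; the dictionary layout is only for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision, recall and F1 values as reported in the text.
reported = {
    "HMDD v2.0": {"precision": 0.84892, "recall": 0.86924, "f1": 0.85896},
    "HMDD v3.2": {"precision": 0.88556, "recall": 0.85104, "f1": 0.86796},
}

recomputed = {
    name: round(f1_score(r["precision"], r["recall"]), 5)
    for name, r in reported.items()
}
```

Both recomputed values agree with the reported F1 scores to five decimal places.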

Figure 5

Comparison of multiple indicators of forecast results. Green and blue represent the results based on HMDD v2.0 and HMDD v3.2, respectively.

Analysis of parameters

We evaluate the sensitivity of the model to the parameters in this section. First, the number of heads K in the multi-head attention mechanism is tested. Figure 6 shows the AUC and AUPR values obtained by the model when K is 1, 4, 8 and 16. As shown in the figure, the AUC and AUPR values follow the same trend as K changes. The model obtains the best results when K is equal to 8. When K is 1, only a single attention head is used, which is equivalent to not introducing the multi-head mechanism. In this case, the performance of the model is lower than that obtained with the other K values. This shows that the use of a multi-head attention mechanism can more reasonably distribute the weights of various meta-paths and their instances. When K is 16, the performance of the model decreases due to the introduction of too many parameters. Therefore, we ultimately take K equal to 8 as the number of heads in the multi-head attention mechanism.

Figure 6

The effect of the value of K in the multi-head attention mechanism on model performance. The four colors from left to right indicate that the values of K are 1, 4, 8 and 16.

Second, we also explored the impact of the hidden feature vector dimension n and the output feature vector dimension |$d_o$| on model performance. Since it is a hidden vector, the default hidden vector dimension n is less than or equal to the output feature vector dimension |$d_o$|. Considering that the node type with the fewest nodes in the network is disease, which contains 894 disease nodes, four candidate values of 16, 32, 64 and 128 are selected for the hidden feature vector dimension n, and four candidate values of 32, 64, 128 and 256 are selected for the output feature vector dimension |$d_o$|. When a certain value is selected for the output feature vector dimension |$d_o$|, the optional range of the hidden vector dimension n is all candidate dimensions less than or equal to |$d_o$|.

Figures 7 and 8 show the AUC and AUPR values of the prediction results of the MDPBMP model when the hidden vector dimension n and the output feature vector dimension |$d_o$| take different values, respectively. The abscissa represents the hidden vector dimension n, and the depth represents the output feature vector dimension |$d_o$|. The trends of the AUC and AUPR values remained consistent as the vector dimension size was changed. Keeping the hidden vector dimension n constant, as the output feature vector dimension |$d_o$| increases from 32 to 128, the performance of the model is essentially the best when |$d_o$| is twice n. However, when |$d_o$| is 256 and n is 64 or 128, the model prediction performance decreases significantly due to the limitation of the number of nodes. Keeping the output feature vector dimension |$d_o$| constant, the prediction performance of the model when the hidden vector dimension n is equal to |$d_o$| is essentially lower than that when n is less than |$d_o$|, which indicates that it is reasonable that the default hidden vector dimension n is less than or equal to the output feature vector dimension |$d_o$|. Therefore, n and |$d_o$| are set to 64 and 128, respectively.
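
The search space described above, in which the hidden dimension never exceeds the output dimension, can be enumerated with a short sketch. The variable names are chosen for illustration only.

```python
# Candidate values for the hidden dimension n and the output dimension d_o.
n_candidates = [16, 32, 64, 128]
d_o_candidates = [32, 64, 128, 256]

# Keep only combinations where the hidden dimension does not exceed the output dimension.
grid = [(n, d_o) for d_o in d_o_candidates for n in n_candidates if n <= d_o]

chosen = (64, 128)  # the setting adopted in the text
```

Under this constraint there are 13 admissible combinations out of the 16 possible pairs, and the adopted setting follows the observation that an output dimension of twice the hidden dimension tends to perform best.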

Figure 7

The effect of vector dimension on the AUC value.

Figure 8

The effect of vector dimension on the AUPR value.

Case study

Four common diseases, lung, esophageal, colon and breast neoplasms, were selected for verification. Note that for breast neoplasms, we deleted the edges between breast neoplasms and all miRNAs in the training data, so that the accuracy of the predictions for breast neoplasms measures the model’s predictive performance for new diseases. The prediction results were verified against the databases dbDEMC 2.0 [31], miRCancer [32] and miR2Disease [33]. We use dbDEMC 2.0 as the main verification database. dbDEMC is an integrated database and the largest database of differentially expressed miRNAs in human cancers. If a predicted association does not exist in the dbDEMC 2.0 database, it is then checked in the miRCancer and miR2Disease databases. Consequently, the Evidence column of Tables 2–5 lists only one supporting database per prediction, even when several databases support it. When none of the three databases could verify a prediction, we manually searched the relevant literature for validation. In the Evidence columns of Tables 2–5, ‘literature’ indicates that published studies support the prediction, and ‘null’ indicates that we found no relevant evidence.
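The verification order described above amounts to a priority lookup. The sketch below illustrates that logic under stated assumptions: `find_evidence` is a hypothetical helper, and the small sets are toy stand-ins populated with a few pairs taken from the Evidence columns of Tables 2–5; the real databases would have to be downloaded and parsed separately.

```python
def find_evidence(mirna, disease, databases, literature=frozenset()):
    """Return the first source that supports (mirna, disease), else 'null'.

    `databases` is an ordered list of (name, set_of_pairs); the order encodes
    priority, so only the first matching database is reported.
    """
    pair = (mirna, disease)
    for name, known_pairs in databases:
        if pair in known_pairs:
            return name
    if pair in literature:
        return "literature"
    return "null"


# Toy stand-ins, NOT the real database contents.
DATABASES = [
    ("dbDEMC", {("hsa-mir-429", "lung neoplasms")}),
    ("miRCancer", {("hsa-mir-320", "colon neoplasms")}),
    ("miR2Disease", {("hsa-mir-196a-2", "breast neoplasms")}),
]
LITERATURE = {("hsa-let-7", "colon neoplasms")}

print(find_evidence("hsa-mir-429", "lung neoplasms", DATABASES, LITERATURE))
print(find_evidence("hsa-mir-208b", "lung neoplasms", DATABASES, LITERATURE))
```

Because the lookup stops at the first match, a prediction supported by several databases is still labeled with only the highest-priority one, which matches the convention used for the Evidence column.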

Table 2

Top 50 miRNAs associated with lung neoplasms in the prediction results

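| Rank | miRNA | Evidence | Rank | miRNA | Evidence |
|---|---|---|---|---|---|
| 1 | hsa-mir-429 | dbDEMC | 26 | hsa-mir-452 | dbDEMC |
| 2 | hsa-mir-15b | dbDEMC | 27 | hsa-mir-133a-2 | dbDEMC |
| 3 | hsa-mir-16-2 | dbDEMC | 28 | hsa-mir-455 | dbDEMC |
| 4 | hsa-mir-23b | dbDEMC | 29 | hsa-mir-181 | dbDEMC |
| 5 | hsa-mir-92a-2 | dbDEMC | 30 | hsa-mir-502 | dbDEMC |
| 6 | hsa-mir-425 | dbDEMC | 31 | hsa-mir-708 | dbDEMC |
| 7 | hsa-mir-218-1 | dbDEMC | 32 | hsa-mir-199a-2 | dbDEMC |
| 8 | hsa-mir-16-1 | dbDEMC | 33 | hsa-mir-20b | dbDEMC |
| 9 | hsa-mir-424 | dbDEMC | 34 | hsa-mir-509 | dbDEMC |
| 10 | hsa-mir-483 | dbDEMC | 35 | hsa-mir-422a | dbDEMC |
| 11 | hsa-mir-99b | dbDEMC | 36 | hsa-mir-194-1 | dbDEMC |
| 12 | hsa-mir-181b | dbDEMC | 37 | hsa-mir-491 | dbDEMC |
| 13 | hsa-mir-106b | dbDEMC | 38 | hsa-mir-449b | dbDEMC |
| 14 | hsa-mir-125b-2 | dbDEMC | 39 | hsa-mir-151a | dbDEMC |
| 15 | hsa-mir-204 | dbDEMC | 40 | hsa-mir-128 | dbDEMC |
| 16 | hsa-mir-320a | dbDEMC | 41 | hsa-mir-28 | dbDEMC |
| 17 | hsa-mir-92b | dbDEMC | 42 | hsa-mir-378a | dbDEMC |
| 18 | hsa-mir-30 | dbDEMC | 43 | hsa-mir-488 | dbDEMC |
| 19 | hsa-mir-219 | dbDEMC | 44 | hsa-mir-133 | dbDEMC |
| 20 | hsa-mir-302b | dbDEMC | 45 | hsa-mir-370 | dbDEMC |
| 21 | hsa-mir-302a | dbDEMC | 46 | hsa-mir-433 | dbDEMC |
| 22 | hsa-mir-193b | dbDEMC | 47 | hsa-mir-208b | null |
| 23 | hsa-mir-409 | dbDEMC | 48 | hsa-mir-383 | dbDEMC |
| 24 | hsa-mir-940 | dbDEMC | 49 | hsa-mir-190a | dbDEMC |
| 25 | hsa-mir-363 | dbDEMC | 50 | hsa-mir-33 | dbDEMC |
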
Table 3

Top 50 miRNAs associated with esophageal neoplasms in the prediction results

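| Rank | miRNA | Evidence | Rank | miRNA | Evidence |
|---|---|---|---|---|---|
| 1 | hsa-mir-221 | dbDEMC | 26 | hsa-mir-372 | dbDEMC |
| 2 | hsa-mir-29b | dbDEMC | 27 | hsa-mir-16-1 | dbDEMC |
| 3 | hsa-mir-133a | dbDEMC | 28 | hsa-mir-378 | dbDEMC |
| 4 | hsa-mir-9 | dbDEMC | 29 | hsa-mir-135a-2 | null |
| 5 | hsa-mir-15b | dbDEMC | 30 | hsa-mir-335 | dbDEMC |
| 6 | hsa-mir-429 | dbDEMC | 31 | hsa-mir-200b | literature |
| 7 | hsa-mir-222 | dbDEMC | 32 | hsa-mir-483 | dbDEMC |
| 8 | hsa-mir-218 | dbDEMC | 33 | hsa-mir-26a | dbDEMC |
| 9 | hsa-mir-138 | dbDEMC | 34 | hsa-mir-196a | dbDEMC |
| 10 | hsa-mir-206 | dbDEMC | 35 | hsa-mir-23a | dbDEMC |
| 11 | hsa-mir-16-2 | dbDEMC | 36 | hsa-mir-199a | dbDEMC |
| 12 | hsa-mir-129 | dbDEMC | 37 | hsa-mir-494 | dbDEMC |
| 13 | hsa-mir-24 | dbDEMC | 38 | hsa-mir-17 | dbDEMC |
| 14 | hsa-mir-29 | dbDEMC | 39 | hsa-mir-181b | dbDEMC |
| 15 | hsa-mir-320 | dbDEMC | 40 | hsa-mir-30d | dbDEMC |
| 16 | hsa-mir-106a | dbDEMC | 41 | hsa-mir-124 | dbDEMC |
| 17 | hsa-let-7 | literature | 42 | hsa-mir-142 | dbDEMC |
| 18 | hsa-mir-23b | dbDEMC | 43 | hsa-mir-1-2 | literature |
| 19 | hsa-mir-191 | dbDEMC | 44 | hsa-mir-29b-1 | null |
| 20 | hsa-mir-218-1 | literature | 45 | hsa-mir-29a | dbDEMC |
| 21 | hsa-mir-127 | dbDEMC | 46 | hsa-mir-381 | dbDEMC |
| 22 | hsa-mir-424 | dbDEMC | 47 | hsa-mir-181c | dbDEMC |
| 23 | hsa-mir-195 | dbDEMC | 48 | hsa-let-7i | dbDEMC |
| 24 | hsa-mir-16 | dbDEMC | 49 | hsa-mir-125b-2 | dbDEMC |
| 25 | hsa-mir-30c | dbDEMC | 50 | hsa-mir-125a | dbDEMC |
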
Table 4

Top 50 miRNAs associated with colon neoplasms in the prediction results

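| Rank | miRNA | Evidence | Rank | miRNA | Evidence |
|---|---|---|---|---|---|
| 1 | hsa-mir-183 | dbDEMC | 26 | hsa-mir-182 | dbDEMC |
| 2 | hsa-mir-34c | dbDEMC | 27 | hsa-mir-146b | dbDEMC |
| 3 | hsa-mir-214 | dbDEMC | 28 | hsa-mir-124-1 | literature |
| 4 | hsa-mir-9 | dbDEMC | 29 | hsa-mir-144 | dbDEMC |
| 5 | hsa-mir-206 | dbDEMC | 30 | hsa-mir-100 | dbDEMC |
| 6 | hsa-mir-129 | dbDEMC | 31 | hsa-mir-503 | dbDEMC |
| 7 | hsa-mir-320 | miRCancer | 32 | hsa-mir-184 | dbDEMC |
| 8 | hsa-mir-29 | literature | 33 | hsa-mir-135a | dbDEMC |
| 9 | hsa-let-7 | literature | 34 | hsa-mir-124-3 | literature |
| 10 | hsa-mir-92a-2 | dbDEMC | 35 | hsa-mir-320a | dbDEMC |
| 11 | hsa-mir-425 | dbDEMC | 36 | hsa-mir-340 | dbDEMC |
| 12 | hsa-mir-16 | dbDEMC | 37 | hsa-mir-134 | dbDEMC |
| 13 | hsa-mir-34b | dbDEMC | 38 | hsa-mir-200 | literature |
| 14 | hsa-mir-99a | dbDEMC | 39 | hsa-mir-196a-2 | dbDEMC |
| 15 | hsa-mir-372 | dbDEMC | 40 | hsa-mir-29c | dbDEMC |
| 16 | hsa-mir-135a-2 | null | 41 | hsa-mir-30e | dbDEMC |
| 17 | hsa-mir-26a | dbDEMC | 42 | hsa-mir-92b | dbDEMC |
| 18 | hsa-mir-99b | dbDEMC | 43 | hsa-mir-7 | dbDEMC |
| 19 | hsa-mir-199a | dbDEMC | 44 | hsa-mir-30 | literature |
| 20 | hsa-mir-494 | dbDEMC | 45 | hsa-mir-451a | dbDEMC |
| 21 | hsa-mir-124 | dbDEMC | 46 | hsa-mir-139 | dbDEMC |
| 22 | hsa-mir-1-2 | dbDEMC | 47 | hsa-mir-122 | dbDEMC |
| 23 | hsa-mir-381 | dbDEMC | 48 | hsa-mir-149 | dbDEMC |
| 24 | hsa-mir-181c | dbDEMC | 49 | hsa-mir-29b-2 | dbDEMC |
| 25 | hsa-mir-497 | dbDEMC | 50 | hsa-mir-26a-1 | dbDEMC |
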
Table 5

The top 50 miRNAs associated with new disease breast neoplasms in the prediction results

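| Rank | miRNA | Evidence | Rank | miRNA | Evidence |
|---|---|---|---|---|---|
| 1 | hsa-mir-93 | dbDEMC | 26 | hsa-mir-98 | dbDEMC |
| 2 | hsa-mir-126 | dbDEMC | 27 | hsa-mir-199a | dbDEMC |
| 3 | hsa-mir-155 | dbDEMC | 28 | hsa-mir-15b | dbDEMC |
| 4 | hsa-mir-92a | dbDEMC | 29 | hsa-mir-21 | dbDEMC |
| 5 | hsa-mir-99a | dbDEMC | 30 | hsa-mir-362 | dbDEMC |
| 6 | hsa-mir-145 | dbDEMC | 31 | hsa-mir-33b | dbDEMC |
| 7 | hsa-mir-200c | dbDEMC | 32 | hsa-mir-486 | dbDEMC |
| 8 | hsa-mir-370 | dbDEMC | 33 | hsa-let-7c | dbDEMC |
| 9 | hsa-mir-139 | dbDEMC | 34 | hsa-mir-27b | dbDEMC |
| 10 | hsa-mir-26b | dbDEMC | 35 | hsa-mir-34c | dbDEMC |
| 11 | hsa-mir-25 | dbDEMC | 36 | hsa-mir-92a-1 | dbDEMC |
| 12 | hsa-mir-181a | dbDEMC | 37 | hsa-mir-26a | dbDEMC |
| 13 | hsa-mir-10b | dbDEMC | 38 | hsa-mir-361 | dbDEMC |
| 14 | hsa-mir-320 | dbDEMC | 39 | hsa-mir-34a | dbDEMC |
| 15 | hsa-mir-196a-2 | miR2Disease | 40 | hsa-mir-92-1 | HMDD v3.2 |
| 16 | hsa-mir-29a | dbDEMC | 41 | hsa-mir-375 | dbDEMC |
| 17 | hsa-mir-150 | dbDEMC | 42 | hsa-mir-517a | dbDEMC |
| 18 | hsa-mir-146 | dbDEMC | 43 | hsa-mir-146b | miR2Disease |
| 19 | hsa-mir-423 | miRCancer | 44 | hsa-mir-23b | dbDEMC |
| 20 | hsa-mir-17 | dbDEMC | 45 | hsa-mir-181c | dbDEMC |
| 21 | hsa-mir-320a | dbDEMC | 46 | hsa-mir-143 | dbDEMC |
| 22 | hsa-mir-451 | dbDEMC | 47 | hsa-mir-452 | dbDEMC |
| 23 | hsa-mir-15a | dbDEMC | 48 | hsa-mir-886 | dbDEMC |
| 24 | hsa-mir-146a | dbDEMC | 49 | hsa-mir-33a | dbDEMC |
| 25 | hsa-mir-31 | dbDEMC | 50 | hsa-mir-9-3 | miR2Disease |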

Lung neoplasms

Lung neoplasms have become the leading cause of death for patients with malignant tumors in China. In the past three decades, the registered mortality rate of lung neoplasms in China has increased by 464.84% [34]. The five-year survival rate of patients with lung neoplasms is much lower than that of many other major cancers [35]. Table 2 shows the top 50 miRNAs related to lung neoplasms in the prediction results of the MDPBMP model. Forty-nine miRNAs could be verified in the dbDEMC database. Although no association of hsa-mir-208b with lung neoplasms was found in the three databases, the miRCancer database provides evidence of an association between lung neoplasms and hsa-mir-208a. Under the miRNA naming rules, highly homologous miRNAs are distinguished by a lowercase letter added after the number in the miRNA name [36]. Thus hsa-mir-208a, which is associated with lung neoplasms, is highly homologous to hsa-mir-208b. In addition, a literature search showed that Liu et al. [37] proposed that matrine can reduce the proliferation of lung neoplasm cells by inducing apoptosis and changing miRNA expression profiles, including downregulating the expression level of hsa-mir-208b. Therefore, it is reasonable to infer that hsa-mir-208b is related to lung neoplasms.

Esophageal neoplasms

In the United States, 4–10 per 100 000 people die from esophageal neoplasms each year [38]. According to reports, even with advanced treatment, the overall 5-year survival rate of patients is only ~20% [39]. Therefore, improving the understanding of the biological mechanism of esophageal cancer is essential for disease diagnosis and prevention. Table 3 shows the top 50 miRNAs associated with esophageal neoplasms in the prediction results of the MDPBMP model. A total of 44 of the top 50 miRNAs were verified directly in the databases. Each of the remaining six miRNAs has closely related miRNAs that are associated with esophageal neoplasms. The 17th-ranked miRNA, hsa-let-7, whose name predates the miRNA naming rules, is now mostly used to denote the let-7 family. Many members of the let-7 family are known to be associated with esophageal neoplasms, such as hsa-let-7b, hsa-let-7c, hsa-let-7i and hsa-let-7fd. For the miRNAs ranked 20th, 29th, 43rd and 44th (hsa-mir-218-1, hsa-mir-135a-2, hsa-mir-1-2 and hsa-mir-29b-1), the dbDEMC database provides associations of hsa-mir-218, hsa-mir-135a, hsa-mir-1 and hsa-mir-29b with esophageal neoplasms. Under the miRNA naming rules, the suffixes ‘-1’ and ‘-2’ are added to the names of miRNAs that are transcribed and processed from DNA sequences on different chromosomes but have the same mature sequence [40]. Thus, although the database does not list these four miRNAs directly, miRNAs that share their mature sequences are associated with esophageal neoplasms, so they are very likely associated with esophageal neoplasms as well. In addition, for the 31st-ranked miRNA, hsa-mir-200b, the dbDEMC database provides associations of hsa-mir-200a and hsa-mir-200c with esophageal neoplasms. Since these are highly homologous miRNAs, it is reasonable to speculate that hsa-mir-200b is also associated with esophageal neoplasms.

In addition to hsa-mir-29b-1 and hsa-mir-135a-2, we also verified the association of the remaining four miRNAs with esophageal neoplasms by searching the literature. Xia et al. [41] performed a meta-analysis by searching the PubMed, EMBASE and ISI Web of Science databases. They suggest that low expression of let-7 predicts poor prognosis for many cancer patients, including esophageal cancer. Jiang et al. [42] suggested that pri-miR-218, a primary transcript of hsa-mir-218-1, has an impact on the risk and prognosis of patients with esophageal squamous cell carcinoma, which accounts for >90% of esophageal neoplasms [43]. Zhang et al. [44] demonstrated through biological experiments that hsa-mir-200b induces esophageal squamous cell carcinoma cell cycle arrest and inhibits cell growth. Esophageal neoplasms include both adenocarcinoma and squamous cell carcinoma. Zhao et al. [10] found that hsa-mir-1-2 expression levels were reduced in patients with esophageal adenocarcinoma. In sum, a total of 48 of the top 50 miRNAs associated with esophageal neoplasms predicted by MDPBMP have been validated.
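The naming-rule arguments used in these case studies (a trailing lowercase letter marks a highly homologous miRNA, a numeric suffix such as ‘-1’ or ‘-2’ marks a different locus with the same mature sequence) can be checked mechanically. The sketch below is a simplified illustration, not a complete parser of miRNA nomenclature: the regular expression and the helper names `split_name`, `same_mature_sequence` and `same_family` are assumptions introduced here, and names outside the pattern raise an error.

```python
import re

# family number, optional homolog letter, optional locus suffix, optional arm
_NAME = re.compile(
    r"^(?P<family>hsa-(?:mir|let)-\d+)"
    r"(?P<letter>[a-z]?)"
    r"(?:-(?P<locus>\d+))?"
    r"(?:-(?P<arm>[35]p))?$"
)


def split_name(name):
    """Split a precursor name into (family, homolog letter, locus suffix)."""
    match = _NAME.match(name.lower())
    if match is None:
        raise ValueError(f"unrecognised miRNA name: {name}")
    return match.group("family"), match.group("letter"), match.group("locus")


def same_mature_sequence(a, b):
    """True if the names differ at most in the locus suffix ('-1', '-2', ...)."""
    return split_name(a)[:2] == split_name(b)[:2]


def same_family(a, b):
    """True if the names share the family number, e.g. 200a, 200b and 200c."""
    return split_name(a)[0] == split_name(b)[0]


print(same_mature_sequence("hsa-mir-218-1", "hsa-mir-218"))  # True
print(same_family("hsa-mir-200b", "hsa-mir-200a"))           # True
```

A match from either helper only suggests where to look for supporting evidence; it does not by itself establish an association.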

Colon neoplasms

Colon neoplasms are malignant cancers. They are the third most common cancer and one of the most serious diseases in the world, accounting for ~10% of all cancer cases [45]. According to reports, almost half of colon neoplasm patients die of metastatic disease within 5 years of diagnosis [46]. However, patients with early colon neoplasms show only subtle symptoms, which makes the cancer difficult to detect at an early stage. Table 4 shows the top 50 miRNAs associated with colon neoplasms as predicted by the MDPBMP model. As shown in the table, 42 of the top 50 miRNAs were verified in the dbDEMC database. Each of the remaining eight miRNAs has closely related miRNAs that are associated with colon neoplasms. For hsa-mir-320, although the dbDEMC database does not directly list its association with colon neoplasms, it lists the association of hsa-mir-320a with colon neoplasms. Similarly, the dbDEMC database provides associations of hsa-mir-29a, hsa-mir-29b and hsa-mir-29c with colon neoplasms but not that of hsa-mir-29. The same situation applies to hsa-mir-200 and hsa-mir-30. For the miRNAs ranked 28th and 34th (hsa-mir-124-1 and hsa-mir-124-3), the dbDEMC database provides three related miRNAs associated with colon neoplasms, namely, hsa-mir-124a, hsa-mir-124-3p and hsa-mir-124-5p. For the 16th-ranked miRNA, hsa-mir-135a-2, the dbDEMC database contains known associations of hsa-mir-135a with colon neoplasms. The 9th-ranked hsa-let-7 is similar to the case in esophageal neoplasms: the dbDEMC database contains associations of many let-7 family members with colon neoplasms, such as hsa-let-7a, hsa-let-7d, hsa-let-7e, hsa-let-7i, hsa-let-7f and hsa-let-7g.

In addition to hsa-mir-135a-2, we also verified the associations of the remaining six miRNAs with colon neoplasms by searching the literature. Jiang et al. [47] proposed miR-29 as a tumor promoter to mediate epithelial-mesenchymal transition (EMT) and promote metastasis in colon neoplasms. Xicola et al. [48] showed that the let-7 miRNA-binding region of TGFBR1 is related to colon neoplasms with a strong ability to repair genetic mismatches. Zhou et al. [49] searched for relevant research up to October 2018 in PubMed, EMBASE, Web of Science and Cochrane Library. The survey results show that miR-124 has anti-neoplastic effects in a variety of tumors, such as lung neoplasms and colorectal neoplasms. The three members of the miR-124 family, hsa-mir-124-1, hsa-mir-124-2 and hsa-mir-124-3, cooperate with each other and produce synergistic effects. Their methylation in patients indicates a poor prognosis. Taniguchi et al. [50] experimentally confirmed that in clinical colorectal neoplasm (CRC) and adenoma (CRA) patient samples, the expression level of miR-124 is reduced and that miR-124 induces apoptosis and autophagic survival of human colon neoplasm cells. Wellner et al. [51] proved that the EMT-activator ZEB1 inhibited the expression of hsa-mir-200. At the same time, it not only promotes the spread of tumor cells but is also necessary for the tumor initiation ability of pancreatic neoplasms and colorectal neoplasm cells. Thus, targeting the ZEB1-miR-200 feedback loop may be the basis for the treatment of pancreatic and colorectal neoplasms. Tang et al. [52] showed that hsa-mir-30 targets LIN28B and LIN28B-stabilized IRS1 to promote the growth of colorectal neoplasm cells. In sum, a total of 49 of the top 50 miRNAs associated with colon neoplasms predicted by MDPBMP were verified.

Breast neoplasms

Breast neoplasms are very common cancers, most of which occur in women [53]. According to statistics, the global incidence of breast neoplasms is increasing, and it is estimated that 3.2 million breast neoplasm patients will be diagnosed annually by 2050 [54]. At the current level of medical care, the only way to improve the cure rate and reduce the mortality rate is to detect and treat breast neoplasms at an early stage [55]. To verify the predictive ability of the MDPBMP model for new diseases, we selected breast neoplasms as a new disease for testing. In the data preprocessing stage, all associations between breast neoplasms and miRNAs downloaded from the HMDD v3.2 database were removed, making breast neoplasms an isolated disease node, that is, a disease with no known associated miRNA. The predictive ability of the model for new diseases was then assessed by the accuracy of its predictions for breast neoplasms. The top 50 miRNAs predicted by the MDPBMP model for the new disease breast neoplasms are given in Table 5. Because all known associations of breast neoplasms with miRNAs were removed in the data preprocessing stage, the associations provided by HMDD v3.2 were also used to verify the prediction results. As shown in the table, all of the top 50 miRNAs were verified in the databases, which shows that the MDPBMP model is suitable for new disease prediction and can achieve good prediction results. Although no miRNAs were associated with breast neoplasms in the heterogeneous network, the MDPBMP model uses the meta-path ‘disease–gene–disease’, based on the assumption that diseases associated with the same gene tend to be similar. Using genes as mediators to train feature vectors for new diseases reduces the effect of the missing miRNA associations and improves the accuracy of the model in predicting related miRNAs for new diseases.
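The new-disease protocol described above can be sketched as a simple hold-out step followed by a top-k check. This is an illustrative sketch under stated assumptions: `hold_out_disease` and `top_k_hits` are hypothetical helper names, associations are assumed to be (miRNA, disease) pairs, and the identifiers in the example are placeholders rather than real HMDD v3.2 records.

```python
def hold_out_disease(associations, disease):
    """Remove every (miRNA, disease) pair of one disease from the training set.

    Returns the reduced training list and the set of held-out miRNAs, which
    can later serve as one source of verification.
    """
    train = [(m, d) for m, d in associations if d != disease]
    held_out = {m for m, d in associations if d == disease}
    return train, held_out


def top_k_hits(ranked_mirnas, verified, k=50):
    """Count how many of the k highest-ranked miRNAs appear in `verified`."""
    return sum(1 for m in ranked_mirnas[:k] if m in verified)


# Placeholder identifiers, not real database entries.
associations = [
    ("mirna_a", "breast neoplasms"),
    ("mirna_b", "breast neoplasms"),
    ("mirna_c", "lung neoplasms"),
]
train, held_out = hold_out_disease(associations, "breast neoplasms")
print(train)             # [('mirna_c', 'lung neoplasms')]
print(sorted(held_out))  # ['mirna_a', 'mirna_b']
```

After the hold-out step the disease node keeps only its gene links, so any miRNA ranking produced for it must come from paths that pass through genes or other diseases.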

Conclusion

With the deepening of ncRNA research, researchers have discovered that miRNAs in ncRNAs can participate in and regulate posttranscriptional gene expression. Since currently approved drugs can only target a small number of proteins in the human body [56–60], whether miRNAs can be used as biomarkers for the diagnosis, treatment and prognosis of human diseases has become a new research topic. At this stage, a large amount of miRNA-related data has been generated, including some experimentally verified miRNA–disease associations. Studies have been conducted using these data to predict more miRNA–disease associations by computational models. The existing prediction models based on meta-paths do not consider the characteristic information of miRNAs and disease nodes, ignore the information carried by the intermediate nodes of meta-paths and lack sufficient types of meta-paths.

To address these problems, we propose an improved meta-path-based miRNA–disease association prediction method. Compared with the previous algorithms, the MDPBMP model introduces disease–gene associations, which not only provide richer biological information but also increase the types of nodes in the heterogeneous information network. As the types of nodes increase, more kinds of meta-paths can be selected, not only limited to ‘disease–disease–miRNA’ and ‘disease–miRNA–miRNA’. In addition, we create feature vectors for all nodes and update the feature vector of the starting node by fusing the feature information of each node on the meta-path instance. This approach not only extracts the characteristic information of the miRNA and disease nodes themselves but also effectively captures the information carried by the intermediate nodes on the meta-path. The introduction of disease–gene associations enriches the biological information of the nodes, and the selection of multiple meta-paths helps to better extract structural features in the network, thereby improving the accuracy of association predictions.
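The two aggregation levels summarized above (nodes within a meta-path instance, then instances across meta-paths) can be illustrated with a toy example. The sketch below uses plain mean pooling at every level, which is an assumption made for brevity and is not necessarily the aggregator used by MDPBMP; the function names, node identifiers and two-dimensional feature vectors are likewise hypothetical.

```python
def mean_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def encode_instance(instance, features):
    """Fuse the features of all nodes on one meta-path instance,
    including the intermediate nodes."""
    return mean_vectors([features[node] for node in instance])


def update_start_node(start, instances_by_metapath, features):
    """Aggregate within each meta-path, then across meta-paths."""
    per_path = []
    for instances in instances_by_metapath.values():
        if instances:
            encoded = [encode_instance(inst, features) for inst in instances]
            per_path.append(mean_vectors(encoded))
    # A node without any instance keeps its initial feature vector.
    return mean_vectors(per_path) if per_path else list(features[start])


# Toy features for two diseases, one gene and one miRNA.
features = {"d1": [0.0, 1.0], "d2": [0.0, 0.0], "g1": [1.0, 1.0], "m1": [1.0, 0.0]}
instances = {
    "disease-gene-disease": [("d1", "g1", "d2")],
    "disease-miRNA-disease": [("d1", "m1", "d2")],
}
print(update_start_node("d1", instances, features))
```

In this toy case the gene g1 contributes to the updated vector of d1 even though d1 has only one miRNA link, which is the role the text assigns to intermediate nodes.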

The MDPBMP model still has certain shortcomings, which can be improved in follow-up work. First, the initial feature vectors of miRNA, disease and gene nodes could be replaced with biologically meaningful ones, for example, vectors derived from miRNA and gene sequence information [61, 62] and disease MeSH descriptors. Second, feature vectors could be constructed for the different types of links to distinguish the contributions of the three types of associations: miRNA–miRNA, miRNA–disease and disease–gene. The information carried by a meta-path instance could then be represented by node information together with link information.

Key Points

• Compared with the existing meta-path-based prediction model, MDPBMP creates feature vectors for all nodes, and updates the feature vector of the starting node by fusing the feature information of each node on the meta-path instance, which can not only extract the characteristic information of miRNAs and disease nodes, but also effectively capture the information carried by the intermediate nodes on the meta-path.

• MDPBMP introduces disease–gene association data. Compared with association networks that contain only miRNAs and diseases, the resulting network offers more types of meta-paths to choose from, which helps to better extract the structural features of the network and to predict new disease–miRNA associations.

• Our model achieves good results in real-world case studies. Among the top 50 predicted miRNAs for lung neoplasms, esophageal neoplasms, colon neoplasms and breast neoplasms, 49, 48, 49 and 50 have been verified.

Acknowledgements

Thanks to all those who maintain excellent databases and to all experimentalists who enabled this work by making their data publicly available.

Funding

National Natural Science Foundation of China (grant nos. 62072353 and 62132015).

Liang Yu is a professor in the College of Computer Science and Technology at Xidian University. Her research interests include bioinformatics, data mining and machine learning.

Yujia Zheng is a master's student in the College of Computer Science and Technology at Xidian University. She is interested in machine learning.

Lin Gao is a professor in the College of Computer Science and Technology at Xidian University. Her research interests include bioinformatics, data mining and machine learning.

References

1. Navarro E, Mallén A, Cruzado JM, et al. Unveiling ncRNA regulatory axes in atherosclerosis progression. Clin Transl Med 2020;9(1):5.

2. Laffont B, Rayner KJ. MicroRNAs in the pathobiology and therapy of atherosclerosis. Can J Cardiol 2017;33(3):313–24.

3. Fasoulakis Z, Daskalakis G, Diakosavvas M, et al. MicroRNAs determining carcinogenesis by regulating oncogenes and tumor suppressor genes during cell cycle. MicroRNA (Shariqah, United Arab Emirates) 2020;9(2):82–92.

4. Ambros V. The functions of animal microRNAs. Nature 2004;431(7006):350–5.

5. Brunetti O, Russo A, Scarpa A, et al. Micro-RNA in pancreatic adenocarcinoma: predictive/prognostic biomarkers or therapeutic targets? Oncotarget 2015;6(27):23323–41.

6. Chen D, Yang X, Liu M, et al. Roles of miRNA dysregulation in the pathogenesis of multiple myeloma. Cancer Gene Ther 2021;28(12):1256–68.

7. Hajieghrari B, Farrokhi N, Goliaei B, et al. In silico identification of conserved MiRNAs from Physcomitrella patens ESTs and their target characterization. Curr Bioinform 2018;14(1):33–42.

8. Han W, Lu D, Wang C, et al. Identification of key mRNAs, miRNAs, and mRNA-miRNA network involved in papillary thyroid carcinoma. Curr Bioinform 2021;16(1):146–53.

9. Khan A, Zahra A, Mumtaz S, et al. Integrated in-silico analysis to study the role of microRNAs in the detection of chronic kidney diseases. Curr Bioinform 2020;15(2):144–54.

10. Zhao M, Wang J, Yuan M, et al. Multivariate gene expression-based survival predictor model in esophageal adenocarcinoma. Thorac Cancer 2020;11(10):2896–908.

11. Xu G, Li X, Yang D, et al. Bioinformatics study of RNA interference on the effect of HIF-1 alpha on apelin expression in nasopharyngeal carcinoma cells. Curr Bioinform 2019;14(5):386–90.

12. Zhao Z, Zhang C, Li M, et al. Integrative analysis of miRNA-mediated competing endogenous RNA network reveals the lncRNAs-mRNAs interaction in glioblastoma stem cell differentiation. Curr Bioinform 2020;15(10):1187–96.

13. Fu L, Peng Q. A deep ensemble model to predict miRNA-disease association. Sci Rep 2017;7(1):14482.

14. Chen X, Huang L, Xie D, et al. EGBMMDA: extreme gradient boosting machine for MiRNA-disease association prediction. Cell Death Dis 2018;9(1):3.

15. Lv Z, Zhang J, Ding H, et al. RF-PseU: a random forest predictor for RNA pseudouridine sites. Front Bioeng Biotechnol 2020;8:134.

16. Xuan P, Han K, Guo Y, et al. Prediction of potential disease-associated microRNAs based on random walk. Bioinformatics 2015;31(11):1805–15.

17. You Z-H, Huang ZA, Zhu Z, et al. PBMDA: a novel and effective path-based computational model for miRNA-disease association prediction. PLoS Comput Biol 2017;13(3):e1005455.

18. Chen X, Xie D, Wang L, et al. BNPMDA: bipartite network projection for MiRNA-disease association prediction. Bioinformatics 2018;34(18):3178–86.

19. Chen X, Jiang ZC, Xie D, et al. A novel computational model based on super-disease and miRNA for potential miRNA-disease association prediction. Mol Biosyst 2017;13(6):1202–12.

20. Xie G, Fan Z, Sun Y, et al. WBNPMD: weighted bipartite network projection for microRNA-disease association prediction. J Transl Med 2019;17(1):322.

21. Chen X, Yan G-Y. Semi-supervised learning for potential human microRNA-disease associations inference. Sci Rep 2014;4:5501.

22. Peng L, Peng M, Liao B, et al. Improved low-rank matrix recovery method for predicting miRNA-disease association. Sci Rep 2017;7(1):6007.

23. Fu X, et al. MAGNN: metapath aggregated graph neural network for heterogeneous graph embedding. In: Proceedings of The Web Conference 2020. Taipei, Taiwan: Association for Computing Machinery, 2020;2331–41.

24. Huang Z, Shi J, Gao Y, et al. HMDD v3.0: a database for experimentally supported human microRNA-disease associations. Nucleic Acids Res 2019;47(D1):D1013–7.

25. Li Y, Qiu C, Tu J, et al. HMDD v2.0: a database for experimentally supported human microRNA and disease associations. Nucleic Acids Res 2014;42(D1):D1070–4.

26. Piñero J, Ramírez-Anguita JM, Saüch-Pitarch J, et al. The DisGeNET knowledge platform for disease genomics: 2019 update. Nucleic Acids Res 2020;48(D1):D845–55.

27. Gong JT, Chen Y, Pu F, et al. Understanding membrane protein drug targets in computational perspective. Curr Drug Targets 2019;20(5):551–64.

28. Dao FY, Lv H, Wang F, et al. Identify origin of replication in Saccharomyces cerevisiae using two-step feature selection technique. Bioinformatics 2019;35(12):2075–83.

29. Yang H, Luo Y, Ren X, et al. Risk prediction of diabetes: big data mining with fusion of multifarious physical examination indicators. Inf Fusion 2021;75:140–9.

30. Wang D, Wang J, Lu M, et al. Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases. Bioinformatics 2010;26(13):1644–50.

31. Yang Z, Ren F, Liu C, et al. dbDEMC: a database of differentially expressed miRNAs in human cancers. BMC Genomics 2010;11(S4):S5.

32. Xie B, Ding Q, Han H, et al. miRCancer: a microRNA-cancer association database constructed by text mining on literature. Bioinformatics 2013;29(5):638–44.

33. Jiang Q, Wang Y, Hao Y, et al. miR2Disease: a manually curated database for microRNA deregulation in human disease. Nucleic Acids Res 2009;37:D98–104.

34. She J, Yang P, Hong Q, et al. Lung cancer in China: challenges and interventions. Chest 2013;143(4):1117–26.

35. Wang J, Zhao YC, Lu YD, et al. Integrated bioinformatics analyses identify dysregulated miRNAs in lung cancer. Eur Rev Med Pharmacol Sci 2014;18(16):2270–4.

36. Desvignes T, Batzel P, Berezikov E, et al. miRNA nomenclature: a view incorporating genetic origins, biosynthetic pathways, and sequence variants. Trends Genet 2015;31(11):613–26.

37. Liu Y-Q, Li Y, Qin J, et al. Matrine reduces proliferation of human lung cancer cells by inducing apoptosis and changing miRNA expression profiles. Asian Pac J Cancer Prev 2014;15(5):2169–77.

38. Ashktorab H, Nouri Z, Nouraie M, et al. Esophageal carcinoma in African Americans: a five-decade experience. Dig Dis Sci 2011;56(12):3577–82.

39. Milano F, Krishnadath KK. Novel therapeutic strategies for treating esophageal adenocarcinoma: the potential of dendritic cell immunotherapy and combinatorial regimens. Hum Immunol 2008;69(10):614–24.

40. Budak H, Bulut R, Kantar M, et al. MicroRNA nomenclature and the need for a revised naming prescription. Brief Funct Genomics 2016;15(1):65–71.

41. Xia Y, Zhu Y, Zhou X, et al. Low expression of let-7 predicts poor prognosis in patients with multiple cancers: a meta-analysis. Tumor Biol 2014;35(6):5143–8.

42. Jiang L, Wang C, Sun C, et al. The impact of pri-miR-218 rs11134527 on the risk and prognosis of patients with esophageal squamous cell carcinoma. Int J Clin Exp Pathol 2014;7(9):6206–12.

43. Zhang X, He C, He C, et al. Nuclear PKM2 expression predicts poor prognosis in patients with esophageal squamous cell carcinoma. Pathol Res Pract 2013;209(8):510–5.

44. Zhang H-F, Alshareef A, Wu C, et al. miR-200b induces cell cycle arrest and represses cell growth in esophageal squamous cell carcinoma. Carcinogenesis 2016;37(9):858–69.

45. Phipps AI, Lindor NM, Jenkins MA, et al. Colon and rectal cancer survival by tumor location and microsatellite instability: the colon cancer family registry. Dis Colon Rectum 2013;56(8):937–44.

46. Hardingham JE, Hewett PJ, Sage RE, et al. Molecular detection of blood-borne epithelial cells in colorectal cancer patients and in patients with benign bowel disease. Int J Cancer 2000;89(1):8–13.

47. Jiang H, Zhang G, Wu JH, et al. Diverse roles of miR-29 in cancer (review). Oncol Rep 2014;31(4):1509–16.

48. Xicola RM, Bontu S, Doyle BJ, et al. Association of a let-7 miRNA binding region of TGFBR1 with hereditary mismatch repair proficient colorectal cancer (MSS HNPCC). Carcinogenesis 2016;37(8):751–8.

49. Zhou Z, Lv J, Wang J, et al. Role of MicroRNA-124 as a prognostic factor in multiple neoplasms: a meta-analysis. Dis Markers 2019;2019:1654780.

50. Taniguchi K, Sugito N, Kumazaki M, et al. MicroRNA-124 inhibits cancer cell growth through PTB1/PKM1/PKM2 feedback cascade in colorectal cancer. Cancer Lett 2015;363(1):17–27.

51. Wellner U, Schubert J, Burk UC, et al. The EMT-activator ZEB1 promotes tumorigenicity by repressing stemness-inhibiting microRNAs. Nat Cell Biol 2009;11(12):1487–U236.

52. Tang M, Zhou J, You L, et al. LIN28B/IRS1 axis is targeted by miR-30a-5p and promotes tumor growth in colorectal cancer. J Cell Biochem 2020;121(8–9):3720–9.

53. Kelsey JL, Horn-Ross PL. Breast cancer: magnitude of the problem and descriptive epidemiology. Epidemiol Rev 1993;15(1):7–16.

54. Tao ZQ, Shi A, Lu C, et al. Breast cancer: epidemiology and etiology. Cell Biochem Biophys 2015;72(2):333–8.

55. Milosevic M, Jankovic D, Milenkovic A, et al. Early diagnosis and detection of breast cancer. Technol Health Care 2018;26(4):729–59.

56. Lv Z, Cui F, Zou Q, et al. Anticancer peptides prediction with deep representation learning features. Brief Bioinform 2021;22(5):bbab008.

57. Lv Z, Wang P, Zou Q, et al. Identification of sub-Golgi protein localization by use of deep representation learning features. Bioinformatics 2020;36(24):5600.

58. Liu J, Lian X, Liu F, et al. Identification of novel key targets and candidate drugs in oral squamous cell carcinoma. Curr Bioinform 2020;15(4):328–37.

59. Munir A, Malik SI, Malik KA. Proteome mining for the identification of putative drug targets for human pathogen Clostridium tetani. Curr Bioinform 2019;14(6):532–40.

60. Zhuang J, Dai S, Zhang L, et al. Identifying breast cancer-induced gene perturbations and its application in guiding drug repurposing. Curr Bioinform 2020;15(9):1075–89.

61. Bugnon LA, Yones C, Milone DH, et al. Genome-wide discovery of pre-miRNAs: comparison of recent approaches based on machine learning. Brief Bioinform 2021;22(3):bbaa184.

62. Yousef M, Parveen A, Kumar A. Computational methods for predicting mature microRNAs. Methods Mol Biol (Clifton, NJ) 2022;2257:175–85.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)