Abstract

Motivation

Long non-coding RNAs (lncRNAs) are important regulators in wide variety of biological processes, which are linked to many diseases. Compared to protein-coding genes (PCGs), the association between diseases and lncRNAs is still not well studied. Thus, inferring disease-associated lncRNAs on a genome-wide scale has become imperative.

Results

In this study, we propose a machine learning-based method, DislncRF, which infers disease-associated lncRNAs on a genome-wide scale based on tissue expression profiles. DislncRF uses random forest models trained on expression profiles of known disease-associated PCGs across human tissues to extract general patterns between expression profiles and diseases. These models are then applied to score associations between lncRNAs and diseases. DislncRF was benchmarked against a gold standard dataset and compared to other methods. The results show that DislncRF yields promising performance and outperforms the existing methods. The utility of DislncRF is further substantiated on two diseases in which we find that top scoring candidates are supported by literature or independent datasets.

Availability and implementation

https://github.com/xypan1232/DislncRF

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Long non-coding RNAs (lncRNAs) play crucial roles in many biological processes and are involved in a variety of diseases (Chen et al., 2013; Mirza et al., 2015). Currently, the role of protein-coding genes in diseases is well investigated and collected in public databases, e.g. DISEASES (Pletscher-Frankild, 2015), DisGeNET (Pinero et al., 2015) and OUGene (Pan and Shen, 2016). However, the vast majority of genetic susceptibility loci related to diseases is located in non-coding DNA regions, intergenic and intronic for PCGs (Hindorff et al., 2009), a substantial fraction of which contain lncRNA genes (Esteller, 2011). Still, most of the associations between diseases and lncRNAs are unknown. A computational approach is thus needed to help identify disease-associated lncRNAs and to provide a comprehensive overview thereof.

Recent studies have shown that genetic disorders usually manifest themselves only in a few tissues (Kitsak et al., 2016; Lage et al., 2008; Magger et al., 2012; Winter et al., 2004). In addition, disease-causing mutations in human PCGs often lead to tissue-specific phenotypes, indicating associations between diseases and tissues (Blokzijl et al., 2016; Bornigen et al., 2013; Magger et al., 2012). Combining text-mined disease–tissue associations from biomedical literature with a tissue expression atlas revealed that the average expression of disease-associated genes is higher in the disease-related tissues than in other tissues (Lage et al., 2008).

Several studies have used tissue expression data to infer lncRNA–disease associations. Guilt-by-association is a widely used strategy that scores candidate genes based on their similarity in expression to known disease-associated genes (Chen et al., 2016a, 2017). For example, the LRLSLDA method infers disease-associated lncRNAs by assuming that lncRNAs involved in the same disease have similar expression patterns (Chen and Yan, 2013). The KATZLDA and FMLNCSIM methods extend this approach by also taking into account functional similarity metrics between the lncRNAs (Chen, 2015a; Chen et al., 2016b). A common limitation of all three methods is that they use only disease associations for lncRNAs thus ignore the well studies PCGs, which are often co-expressed with lncRNAs in diseases (Tsoi et al., 2015). In contrast, LnCaNet integrates experimentally verified associations between cancer types and PCGs with co-expression associations between PCGs and lncRNAs (Liu and Zhao, 2016). However, lncRNA–disease associations are scored based on correlated expression with individual disease-associated PCGs, and the method does not cover other diseases than cancers.

With the objective to reduce the number of false positives and improve the prediction accuracy, some approaches integrate more information. GeneTIER (Antanaviciute et al., 2015) prioritizes candidate disease genes using disease–tissues associations. Similarly, Tissue Specific Expression Analysis (TSEA) identifies genes enriched in disease-associated tissues. The method first defines sets of tissue-enriched genes, and then identifies significant consistency between tissue-enriched genes and disease-associated genes (Wells et al., 2015). Both methods require extra information in advance, such as disease-tissue or gene-tissue associations. Other studies prioritize disease-associated genes using functional interaction networks, where it is assumed that similar diseases may be caused by functionally associated genes. Guan et al., used tissue expression data to construct tissue-specific functional network to identify disease-associated genes instead of using global functional network (Guan et al., 2012). Similarly, NetWAS combines tissue-specific interaction networks and genome-wide association study (GWAS) to infer disease-gene associations (Greene et al., 2015). Tissue-specific network can improve the quality of predictions and reduce the noise compared with only using GWAS. In addition to the methods listed above, several studies have used machine learning to infer genes associated with a specific disease of interest. Recently Cogill and Wang (2016) applied support vector machines (SVM) (Vapnik, 1998) to infer genes associated to autism spectrum disorders (ASD) using expression profiles in healthy brain as features.

In this study, we propose a computational method DislncRF, which takes advantage of available RNA-seq expression profiles across multiple healthy tissues and well-studied disease-associated PCGs to infer disease-associated lncRNAs. With DislncRF, we train multiple balanced Random Forest (RF) models (Breiman, 2001) to learn expression patterns from PCGs involved and not involved in a disease, and apply the learned models to infer disease-associated lncRNAs. In addition, DislncRF is also able to automatically identify disease-tissue associations.

2 Materials and methods

We created a training set consisting of tissue expression profiles from RNA-seq data for PCGs involved and not involved in a specific disease. For each RNA-seq dataset and each disease, we trained RF models to predict PCGs associated to that specific disease from those not associated to that specific disease. Subsequently, these RF models are applied to predict if lncRNAs are associated to that specific disease or not, based on the given RNA-seq dataset.

2.1 Data source

We collect tissue RNA-seq expression data, disease-PCG associations and disease–lncRNA associations as follows:

2.1.1 Tissue RNA-seq data

We use four sources of RNA-seq expression datasets with variable number of tissues to train and evaluate DislncRF:

  1. The Genotype-Tissue Expression project, GTEx: we directly use the processed expression profiles (GTEx_Analysis_v6p_RNA-seq_RNA-SeQCv1.1.8) from GTEx portal (GTEx Consortium 2013; GTEx Consortium 2015), which contains expression data in 53 human tissues.

  2. The human body map 2.0 project, E-MTAB-513: It contains the raw RNA-seq data for 19 samples across 16 human tissues (Derrien et al., 2012).

  3. Gene evolution in tetrapods, GSE43520: This dataset consists of raw RNA-seq data for 11 human samples across four human tissues (Brawand et al., 2011)

  4. LncRNAs evolution in mammalians, GSE30352: It consists of raw RNA-seq data from 21 samples across six human tissues (Necsulea et al., 2014)

2.1.2 Disease–PCG associations

For disease-PCG associations, we download the integrated associations from DISEASES database (Pletscher-Frankild, 2015), including knowledge, experiments and text mining channels. This database gives confidence scores to the associations according to the supporting evidence. We only use associations with confidence scores greater or equal to 2. Likewise, we require diseases to have at least 50 associated PCGs with expression profiles to avoid the problem with too few data for machine learning model training. In total, we obtained sufficiently many disease-associated genes for 237 diseases. Supplementary Table S1 shows the number of diseases with different cutoffs for confidence and number of associated PCGs.

For validation of predicting disease-PCG associations, we split the dataset constructed from disease–PCG pairs into two parts: the data for 20% of the PCGs is randomly kept as the independent test set consisting of disease-PCG pairs, and the remaining 80% is used as the training set. We evaluate the classifier performance on PCG expressions for each disease of the 237 diseases.

2.1.3 Disease-lncRNA associations

We collected human experimentally verified lncRNA-disease associations from LncRNADisease (Chen et al., 2013) and Lnc2Cancer (Ning et al., 2016). After mapping the associated diseases into the disease ontology terms and gene names into Ensembl gene identifiers, we build a gold standard set with 735 unique disease-lncRNA associations.

Here, we use the experimentally verified lncRNA–disease associations only to evaluate the performance of DislncRF. As negative examples, we randomly selected the same number of lncRNAs not associated with the disease as we have lncRNAs associated with it, thus resulting in a dataset that is balanced not only overall but also for each disease. Since no disease–lncRNA associations were used for training the models, neither as positive nor negative examples, this dataset is completely independent of the training set.

2.2 Data preprocessing

We downloaded the RNA-seq reads for the three datasets E-MTAB-513, GSE43520 and GSE30352 and processed them using the latest version of the TISSUES database (Palasca et al., 2018). The pipeline uses STAR version 2.5.0b (Dobin et al., 2013) to map the raw reads from all datasets to the reference genome with 19 732 PCGs and 13 336 lncRNAs used by GTEx, and quantifies the expression levels of genes using Cufflinks (Trapnell et al., 2010). After obtaining the quantified expression levels, we calculate the median expression value x across the samples for each tissue, which is the same strategy used in GTEx. Finally, we log-transform the values, using log2(1+x) to avoid problems with genes for which zero expression was observed in a given tissue.

To remove the impact of different scales of expression levels, we normalized them to tissue expression specificity, which is the fraction of (log-transformed) expression of one gene in one tissue relative to the sum of its expression in all tissues (Tsoi et al., 2015):
Tis=eisjeij,
(1)
where s is the tissue index and i is the gene index. Figure 1 illustrates the expression level and tissue specificity of two genes associated to inflammatory bowel disease. The two genes have different scales of expression levels but have similar scale of tissue specificity level.
Fig. 1.

In E-MTAB-513 (human body map 2.0 dataset) (Derrien et al., 2012), the expression level and tissue specificity of two inflammatory bowel disease associated genes MADCAM1 and TAC1. (A) The expression values of gene MADCAM1; (B) The tissue specificity value of gene MADCAM1; (C) The expression values of gene TAC1; (D) The tissue specificity value of gene TAC1

2.3 Random forest classification

A random forest is an ensemble of multiple decision trees (Breiman, 2001), in which each tree is built from bootstrapping samples and randomly selected feature subset of original features. RFs can be used for classification, as well as for feature importance analysis. During the training process, out-of-bag (OOB) error is calculated for individual feature before and after randomly permuting the values of that feature, the importance score is averaging the above two OOB error difference over all trees.

For selecting the parameters of each RF model, we applied GridSearchCV (Pedregosa et al., 2011) to optimize the parameters by 3-fold cross-validation scheme over a parameter grid (min_samples_leaf: [1, 2, 3], max_features: [’auto’, ’sqrt’, ’log2’], n_estimators: [5, 10, 20, 50]). The variables are defined as follows: min_samples_leaf is the minimum number of samples in leaf node, max_features is the number of features for splitting a node, and n_estimators is the number of trees.

2.4 DislncRF pipeline

The DislncRF pipeline, while intended to predict disease-associated lncRNAs, consists of RF models trained on PCGs involved/not involved in a disease. This leads to a class imbalance problem, since the number of genes implicated in a disease is much fewer than the number of genes not implicated in this disease (e.g. 50 vs. 19 732 for Autistic disorder). To avoid classifiers becoming biased to the negative class while making use of there being more such training examples, we constructed five balanced RFs, each trained on the same disease-associated PCGs but different subsets of PCGs not involved in disease. The workflow of DislncRF is shown in Figure 2. For each disease X among the total 237 diseases:

Fig. 2.

The flowchart of DislncRF. For each disease X, we firstly extract N disease X associated protein coding genes (PCGs) from DISEASES database, and randomly sampling 5N negative PCGs not involved in disease X under three criterion described in text. Then we train five random forest models on the balanced dataset with the same positive set but different negative set. Finally, the five trained RFs are used for inferring disease X associated lncRNAs

  1. Construct the positive training set with N positive examples of PCGs involved in disease X. Randomly select five sets of each N negative examples, resulting in five datasets each with the same positive but different negative examples, using the following criteria: (i) The PCGs should be associated with other diseases than X, to prevent that the positive examples are biased towards well studied PCGs compared to the negative examples. (ii) The PCGs must have no evidence of association with disease X, also not with a DISEASES confidence score below the threshold used to define positive examples. PCGs with a confidence score below 2 for disease X are thus neither used as positive nor as negative examples.

  2. Train five RF classifiers, one for each dataset, to discriminate positive example PCGs from negative examples from their tissue specificity profile and sum of log-transformed expression values across tissues.

  3. Apply the trained RF models for lncRNAs to obtain probability scores of an lncRNA to be associated to disease X. The final score is the average of the probabilities from the five trained RF models.

In the end, after obtaining the association scores between the 237 diseases and 13 336 lncRNAs, we evaluate the predicted disease-associated lncRNAs using the gold standard set.

2.5 Baseline methods

To verify the advantage of DislncRF over existing approaches, we compared its prediction performance with that of the coding-non-coding co-expression (CNC) method (Liao et al., 2011; Liu and Zhao, 2016) under guilt-by-association framework. For each disease X and lncRNA, CNC proceeds as follows: (i) Extract PCGs associated with the disease. (ii) Calculate Pearson’s correlation coefficients (PCC) between PCGs associated with X and the lncRNA. (iii) Keep only those co-expression pairs, which have absolute PCC > 0.3 and P-value < 0.01 (similar to Liu and Zhao, 2016). (iv) Assign the mean value of up to K largest PCC value, where K is the predefined number of largest PCCs with this lncRNA, as the predicted score for the lncRNA with disease X.

As a second baseline method, we used the following simple K nearest neighbor (KNN) approach, where K is the predefined number of neighbors: (i) Calculate the pairwise PCCs between all lncRNAs and all PCGs. (ii) For each lncRNA, obtain the K nearest PCGs according to PCC and count how many of them are associated with each disease X. This count is used as the score between the lncRNA and disease X. We evaluate KNN and calculate the fraction of correct predicted associations for each raw score (Junge et al., 2017).

2.6 Validation of tissue importance

A merit of the RF algorithm is that it can also learn the importance score of input features. We made use of this to analyze the importance of each tissue for prediction of each disease. To validate that the features extracted by the RF models are consistent with biological knowledge of the disease, we compared them to manually curated and text-mined disease–tissue associations (Binder et al., 2014; Pafilis et al., 2013). To quantify the agreement between the RFs and the two sets of disease–tissue associations, we calculated Pearson correlation coefficients between the tissue importance scores and the confidence scores for the disease–tissue associations.

3 Results

In this study, we report the performance of DislncRF on PCGs and lncRNAs, and compare it with two baseline methods to demonstrate the advantages of DislncRF. Lastly, two case studies are investigated for predicted disease–lncRNA associations from DislncRF.

3.1 Validation of DislncRF for PCGs

For each disease, we optimized parameters for the five RFs using GridSearchCV (see the Supplementary Material). As shown in Table 1, DislncRF achieves high performance on the independent test set for prediction of the disease-associated PCGs on the four RNA-seq datasets (E-MTAB-513, GSE43520, GSE30352 and GTEx). Supplementary Figure S1 shows the box plots of individual measurements of the 237 diseases. For GTEx, the average performance overall diseases is a Matthews correlation coefficient (MCC) of 0.676, which is slightly better than what is obtained for the three other datasets. This is presumably because the GTEx dataset is more complete, covering 53 tissues, as opposed to at most 16 tissues in the other datasets. It is thus likely include additional tissues of relevance for the studied diseases. Conversely, we get the lowest MCC performance on GSE43520, most likely because this dataset only covers four tissues, and many disease-associated tissues are not included. Hence, many informative signals from the associated tissues cannot be captured for the disease-associated genes. We see that DislncRF yield high sensitivity and low precision on these four datasets due to that our test data has five times as many negative examples than positive ones, even though high accuracy and area under the ROC curve (AUC) can be obtained on the balanced test set.

Table 1.

The average performance of the disease-associated gene classification using PCG expression profiles on the test set consisting of disease-PCG pairs

DatasetSensitivitySpecificityAUCMCCPrecisionAUPRC
GTEx0.9020.8750.9530.6760.6120.905
E-MTAB-5130.8990.8770.9550.6750.6120.905
GSE435200.8870.8160.9310.5770.5010.862
GSE303520.8910.8320.9390.6040.5300.878
DatasetSensitivitySpecificityAUCMCCPrecisionAUPRC
GTEx0.9020.8750.9530.6760.6120.905
E-MTAB-5130.8990.8770.9550.6750.6120.905
GSE435200.8870.8160.9310.5770.5010.862
GSE303520.8910.8320.9390.6040.5300.878

Note: In this study, we trained one model for each of 237 diseases. AUC is the area under ROC curve and AUPRC is the area under Precision-Recall Curve.

Table 1.

The average performance of the disease-associated gene classification using PCG expression profiles on the test set consisting of disease-PCG pairs

DatasetSensitivitySpecificityAUCMCCPrecisionAUPRC
GTEx0.9020.8750.9530.6760.6120.905
E-MTAB-5130.8990.8770.9550.6750.6120.905
GSE435200.8870.8160.9310.5770.5010.862
GSE303520.8910.8320.9390.6040.5300.878
DatasetSensitivitySpecificityAUCMCCPrecisionAUPRC
GTEx0.9020.8750.9530.6760.6120.905
E-MTAB-5130.8990.8770.9550.6750.6120.905
GSE435200.8870.8160.9310.5770.5010.862
GSE303520.8910.8320.9390.6040.5300.878

Note: In this study, we trained one model for each of 237 diseases. AUC is the area under ROC curve and AUPRC is the area under Precision-Recall Curve.

3.2 Predicting disease-associated lncRNAs

After training models on PCG profiles, we used the trained models to predict disease-associated lncRNAs. As shown in Figure 3A, DislncRF yields the AUC of 0.687, 0.649, 0.592 and 0.718 on E-MTAB-513, GSE43520, GSE30352 and GTEx, respectively. In addition, DislncRF yields the Area under Precision-Recall Curve (AUPRC) of 0.669, 0.642, 0.575 and 0.687, respectively. We also report other performance measurements in Table 2.

Fig. 3.

ROC comparison for inferring disease-associated lncRNAs on E-MTAB-513, GSE43520, GSE30352 and GTEx using DislncRF and CNC. (A) The performance of DislncRF; (B) The performance of CNC

Table 2.

The performance of the DislncRF for predicting disease-lncRNA associations, which is benchmarked against the independent disease-lncRNA test set (Section 2.1.3)

DatasetSensitivitySpecificityAUCMCCPrecisionAUPRC
GTEx0.5270.7650.7180.3010.6920.687
E-MTAB-5130.4760.7590.6870.2450.6640.669
GSE435200.5150.6810.6490.1990.6170.642
GSE303520.5030.6110.5920.1150.5640.575
DatasetSensitivitySpecificityAUCMCCPrecisionAUPRC
GTEx0.5270.7650.7180.3010.6920.687
E-MTAB-5130.4760.7590.6870.2450.6640.669
GSE435200.5150.6810.6490.1990.6170.642
GSE303520.5030.6110.5920.1150.5640.575
Table 2.

The performance of the DislncRF for predicting disease-lncRNA associations, which is benchmarked against the independent disease-lncRNA test set (Section 2.1.3)

DatasetSensitivitySpecificityAUCMCCPrecisionAUPRC
GTEx0.5270.7650.7180.3010.6920.687
E-MTAB-5130.4760.7590.6870.2450.6640.669
GSE435200.5150.6810.6490.1990.6170.642
GSE303520.5030.6110.5920.1150.5640.575
DatasetSensitivitySpecificityAUCMCCPrecisionAUPRC
GTEx0.5270.7650.7180.3010.6920.687
E-MTAB-5130.4760.7590.6870.2450.6640.669
GSE435200.5150.6810.6490.1990.6170.642
GSE303520.5030.6110.5920.1150.5640.575

Considering that there are more verified lncRNAs for cancer diseases than non-cancer diseases in our gold standard set, we further separated the in total 237 diseases into two subgroups, namely cancer (44) and non-cancer diseases (193). We performed the same analyses for both subgroups as was done for all diseases. We also constructed the Receiver Operating Characteristic (ROC) curve (Supplementary Fig. S2). The results indicates that DislncRF obtain slightly higher performance for cancer type than non-cancer diseases. The reason might be that the PCGs and lncRNAs are more studied in cancer and the cancer is a more homogeneous collection of tissues than the rest, which provides better training (PCGs) and test (lncRNAs) set.

DislncRF performs better on PCGs than on lncRNAs, which is not surprising as the RF models are trained on PCGs. This is not surprising as the models are trained on PCGs, which in many ways have differences to lncRNAs. In particular, PCGs are typically higher expressed than lncRNAs and consequently are observed to be expressed in more tissues using any given cutoff on expression level (Supplementary Fig. S3).

While some performance measures depend on the balance between the number of positive and negative examples others do not. In Tables 1 and 2, the first three measure are independent of this balance, whereas the last three all decrease as the fraction of negative examples increases. Since the number of false positive predictions at a given score cutoff is directly proportional to the number of negative examples, the Precision will scale as 1/(1+x) where x is the ratio of negative to positive examples. The AUPRC metric will scale identically since recall does not depend on the ratio. In case of MCC, the relationship is a bit more complex but can also be explicitly modelled.

3.3 Comparing with baseline methods CNC and KNN

To put the performance of DislncRF into perspective, we compare its prediction performance with baseline method CNC, which is evaluated as being exactly done for DislncRF. As can be seen in Figure 3B, the performance of CNC is close to random (roughly AUC = 0.54) in all four studied datasets when K =3. We also tested K =1 and K =5 for CNC, the K value has no big impact on the performance (Supplementary Fig. S4). The results indicate that the PCC-based significantly co-expressed disease-associated PCGs are not enough to infer high-confidence disease-associated lncRNAs, and it easily suffers to noises. However, machine learning-based DislncRF use a bunch of positive and negative training data to learn expression patterns, which are more robust to noises. CNC has higher runtime than DislncRF, especially for prediction step, which requires calculating Pearson correlation coefficients between all disease-associated PCGs and candidate lncRNAs.

To evaluate KNN with K =50, we, for each lncRNA, checked the number of disease-associated PCGs belonging to the 50 nearest PCGs by their PCC values. Using a threshold of more than one nearest neighbor to predict the disease association result in no true positive predictions and using the minimum threshold one neighbor result in 6 true positives, 497 false positives and a precision of 1.2%, which is less than the prior of 2.4%. This shows that KNN performs no better than random for this problem, showing that it is not usable as a baseline.

We observe that the supervised strategy of DislncRF outperforms the unsupervised strategy of the two coexpression-based methods, which might not be a surprise since the supervised framework can take much more features than only co-expression into account.

3.4 Linking tissues to diseases

We also analyzed the tissue importance based on the RF feature selection. The RFs rank the importance for individual tissues in the detection of disease-associated genes. As shown in Figure 4, DislncRF identifies the most important tissues as being lung for non-small cell lung carcinoma and kidney for kidney disease. The results agree with the curated and text mining tissue–disease associations and obviously make biological sense. In addition, we calculated the PCC between importance scores from the RF model and the associated scores from curated and text mining for 237 diseases. When averaging over the diseases we obtain PCCs of 0.313, 0.759, 0.440 and 0.348 on E-MTAB-513, GSE43520, GSE30352 and GTEx, respectively. Contrary to the baselines CNC and KNN, the RF-based DislncRF is often able to point to the tissue that is important for a given disease. It shows the RF models are biologically meaningful, which gives more reason to believe their predictions.

Fig. 4.

Tissue importance for disease-associated gene detection are from random forest feature analysis, association scores of text mining and curated knowledge are extracted from DISEASES database. (A) The tissue importance for non-small cell lung carcinoma; (B) The tissue importance for Kidney disease

3.5 Case studies

To demonstrate the applicability of DislncRF to predict disease–lncRNA associations genome-wide, we present case studies of lncRNAs associated to prostate cancer and inflammatory bowel disease, for which some of disease associations can be confirmed from recent published studies (Mirza et al., 2015; Ning et al., 2016).

3.5.1 Prostate cancer

We extracted experimentally verified prostate cancer associated lncRNAs from Lnc2Cancer database (Ning et al., 2016). In total, 27 are verified lncRNAs associated with prostate cancer. To demonstrate the prediction accuracy for prostate cancer, we pool the predicted scores of prostate cancer from DislncRF for the 27 verified lncRNAs and 27 other randomly selected lncRNAs from the remaining 13 309 lncRNAs. As shown in Figure 5, DislncRF yields the AUCs of 0.750, 0.620, 0.695 and 0.875 on E-MTAB-513, GSE43520, GSE30352 and GTEx, respectively.

Fig. 5.

The prediction performance of DislncRF for prostate cancer

For GTEx, we additionally checked the top-20 lncRNAs that were predicted to be associated with prostate cancer but not annotated as such in the Lnc2Cancer database. Of the top-20 lncRNAs predicted by DislincRF to be associated with prostate cancer, MIG205HG has the most compelling literature support. This lncRNA is produced from the same host gene as miR-205-5p, a miRNA that is differentially expressed in prostate cancer (Verdoodt et al., 2013) and associated with prostate cancer risk and progression (Luu et al., 2017). MIR205HG has also been shown to deplete miR-590-3p (Di et al., 2018), which is implicated in prostate cancer (Sun et al., 2017). Very recently, MIG205HG was also found to be differentially regulated in a reanalysis of two published prostate cancer transcriptomics studies (Ye et al., 2018). A second candidate, RP11-7K24, has similarly been shown to be differentially expressed in prostate cancer (Han et al., 2017), but did not have any additional supporting evidence. For most of the remaining candidates, we were unable to find any literature at all, which only emphasizes the need for new tools to study them.

3.5.2 Inflammatory bowel disease (IBD)

In a recent study (Mirza et al., 2015), we performed microarray expression profile for IBD including Crohn disease (CD) and ulcerative colitis (UC). To detect dysregulated lncRNAs in IBD, differential expression analysis was carried out using the LIMMA package (Smyth, 2004) to identify up- and down-regulated lncRNAs based on inflamed UC vs Control and inflamed CD vs control. Here, we use this dataset for further testing of the DislncRF IBD predictions.

From (Mirza et al., 2015), we extracted significantly up- and down-regulated lncRNAs associated with IBD using two criteria: False Discovery Rate (FDR) <0.01 and an absolute fold change (FC) >2. In total, we got 123 differentially expressed lncRNAs. Of these, 16, 2, 1 and 12 lncRNAs belong to the top 100 candidate IBD lncRNAs predicted for E-MTAB-513, GSE43520, GSE30352 and GTEx, respectively. The corresponding Venn diagram is shown in Figure 6.

Fig. 6.

Venn diagram for the 123 significantly up- and down-regulated lncRNAs associated with IBD and the top-100 predicted candidates from DislncRF on E-MTAB-513, GSE43520, GTEx. There are 28 overlapped lncRNAs between the top 100 lncRNAs of E-MTAB-513 and GTEx. Of these, six are significantly differentially expressed lncRNAs that are associated to IBD. Both E-MTAB-513 and GTEx have only two shared lncRNAs with GSE43520

Furthermore, we obtained 39 IBD loci associated to differentially expressed lncRNAs from Mirza et al., 2015. We evaluated the top 100 lncRNA candidates overlapping with these (Table 3). For GTEx dataset, we can find five verified lncRNAs in top 100 candidates.

Table 3.

The number of 39 IBD loci associated differentially expressed lncRNAs in predicted top 100 candidates among 13 336 lncRNAs. - mean no candidates

DatasetlncRNAs being in top 100Rank
E-MTAB-513SLC12A5-AS1, PSMB8-AS1,RP11-94L15, IHCP514th, 24th, 32th, 62th
GSE43520
GSE30352SMIM25, AL357060, RAJ00963227th, 37th, 43th
GTExPSMB8-AS1, HCP5, AC245128, RP11-94L15, RP11-290F2012th, 27th, 32th, 41th, 43th
DatasetlncRNAs being in top 100Rank
E-MTAB-513SLC12A5-AS1, PSMB8-AS1,RP11-94L15, IHCP514th, 24th, 32th, 62th
GSE43520
GSE30352SMIM25, AL357060, RAJ00963227th, 37th, 43th
GTExPSMB8-AS1, HCP5, AC245128, RP11-94L15, RP11-290F2012th, 27th, 32th, 41th, 43th
Table 3.

The number of 39 IBD loci associated differentially expressed lncRNAs in predicted top 100 candidates among 13 336 lncRNAs. - mean no candidates

DatasetlncRNAs being in top 100Rank
E-MTAB-513SLC12A5-AS1, PSMB8-AS1,RP11-94L15, IHCP514th, 24th, 32th, 62th
GSE43520
GSE30352SMIM25, AL357060, RAJ00963227th, 37th, 43th
GTExPSMB8-AS1, HCP5, AC245128, RP11-94L15, RP11-290F2012th, 27th, 32th, 41th, 43th
DatasetlncRNAs being in top 100Rank
E-MTAB-513SLC12A5-AS1, PSMB8-AS1,RP11-94L15, IHCP514th, 24th, 32th, 62th
GSE43520
GSE30352SMIM25, AL357060, RAJ00963227th, 37th, 43th
GTExPSMB8-AS1, HCP5, AC245128, RP11-94L15, RP11-290F2012th, 27th, 32th, 41th, 43th

We also investigated the ROC curve of the 123 differentially expressed lncRNAs (DELs) and the 39 loci associated lncRNAs (LALs) for IBD. We first obtained the predicted scores between IBD and all lncRNAs using DislncRF. For each lncRNA in the 123 DELs and 39 LALs, we directly extracted the predicted IBD scores, and randomly selected the same number of unique negative lncRNAs not in the 123 DELs and 39 LALs, respectively. Then we pool them to evaluate the performance (Figure 7). The results demonstrate that DislncRF performs the best in GTEx for DELs and LALs, with an AUC of 0.748 and 0.753, respectively. The analysis also indicates that more tissues can provide more informative signals for the IBD-associated gene detections, even if some of them are not considered relevant to IBD.

Fig. 7.

The ROC curve for (A) the 123 differentially expressed lncRNAs and (B) the 39 loci associated lncRNAs of IBD

4 Discussion

We considered the disease-associated gene detection as a supervised learning problem, which requires high-quality training data. We extracted high-confidence disease–PCG associations from the DISEASES database as the positive set. Since the DISEASES database is based on current knowledge, there are bound to be true disease–PCG associations, which have not yet been discovered. Therefore, when benchmarking against a database like DISEASES, even a perfect method would make correct, novel predictions that get counted as false positives. Still, a better method will obviously perform better on the given benchmark data. We can thus conclude that DislncRF outperforms existing methods.

There are several promising avenues to explore and possibilities to improve DislncRF. (i) More biological information should be integrated, such as GO information, SNPs and genomic context. For example, disease-associated lncRNAs are generally related to neighbor disease-associated PCGs on genome (Kumar et al., 2013). This is useful information, which could be integrated into DislncRF. One simple strategy is that the predicted scores of DislncRF are scaled up (multiply a factor greater than 1) for lncRNAs in upstream and downstream of disease-associated PCGs, otherwise scale down (multiply a factor smaller than 1). (ii) We can construct machine learning models on a training set consisting of PCGs profiles, and then apply them for the test set with lncRNAs profiles. This requires that the training and test sets are taken from the same data distribution, while not overlap in the same or have near identical examples, as this will lead to overfitting. However, as indicated in Supplementary Figure S3, there are some differences in tissue distributions between PCGs and lncRNAs, which leads to a covariate shift (Shimodaira, 2000) between training and test set. However, this is at least partially addressed by how we normalize the expression profiles.

Other future work includes extending the classification scheme from training a binary classifier for each disease to developing a multi-class framework, taking into account the full contingency matrix of how often genes associated with one disease are predicted for another disease. The overall performance of such a framework can be evaluated through a multi-class correlation coefficient (Gorodkin, 2004) and will allow for more detailed assessment of which diseases hard to distinguish from one another.

5 Conclusion

In this study, we presented DislncRF, a method for prediction of disease-associated lncRNAs using genome-wide tissue expression profiles and disease-associated PCGs. In contrast to case–control studies, which require disease-specific datasets to detect differentially expressed genes for diseases, DislncRF is based on training of multiple balanced random forest models on generic tissue expression profiles from disease-associated PCGs. This enabled us to learn expression patterns for disease-associated genes, which were then applied to infer disease-associated lncRNAs. DislncRF yielded promising performance and performed considerably better than baseline methods based on correlation coefficients, such as the coding-non-coding co-expression based method CNC. Analyzing the feature importance of the RF models showed that DislincRF consistently puts most emphasis on the tissues known to be important for each disease. The source code of DislncRF is freely available at https://github.com/xypan1232/DislncRF and http://rth.dk/resources/DislncRF.

Acknowledgements

We thank Alberto Santos Delgado for comments on an earlier version of this manuscript.

Funding

This work was supported by PhD fellowship from University of Copenhagen, the Innovation Fund Denmark (0603-00320B), the Novo Nordisk Foundation (NNF14CC0001) and the Danish Center for Scientific Computing (DCSC, DeIC).

Conflict of Interest: none declared.

References

Antanaviciute
 
A.
 et al.  (
2015
)
GeneTIER: prioritization of candidate disease genes using tissue-specific gene expression profiles
.
Bioinformatics
,
31
,
2728
2735
.

Binder
 
J.X.
 et al.  (
2014
)
COMPARTMENTS: unification and visualization of protein subcellular localization evidence
.
Database (Oxford), 2014
,
bau012
.

Blokzijl
 
F.
 et al.  (
2016
)
Tissue-specific mutation accumulation in human adult stem cells during life
.
Nature
,
538
,
260
264
.

Bornigen
 
D.
 et al.  (
2013
)
Concordance of gene expression in human protein complexes reveals tissue specificity and pathology
.
Nucleic Acids Res
.,
41
,
e171
.

Brawand
 
D.
 et al.  (
2011
)
The evolution of gene expression levels in mammalian organs
.
Nature
,
478
,
343
348
.

Breiman
 
L.
(
2001
)
Random forests
.
Mach. Learn
.,
45
,
5
32
.

Chen
 
G.
 et al.  (
2013
)
LncRNADisease: a database for long-non-coding RNA-associated diseases
.
Nucleic Acids Res
.,
41
,
D983
D986
.

Chen
 
X.
(
2015
)
KATZLDA: kATZ measure for the lncRNA-disease association prediction
.
Sci. Rep
.,
5
,
16840.

Chen
 
X.
 et al.  (
2016a
)
IRWRLDA: improved random walk with restart for lncRNA-disease association prediction
.
Oncotarget
,
7
,
57919
57931
.

Chen
 
X.
 et al.  (
2016b
)
FMLNCSIM: fuzzy measure-based lncRNA functional similarity calculation model
.
Oncotarget
,
7
,
45948
45958
.

Chen
 
X.
 et al.  (
2017
)
Long non-coding RNAs and complex diseases: from experimental results to computational models
.
Brief Bioinform
,
18
,
558
576
.

Chen
 
X.
,
Yan
G.Y.
(
2013
)
Novel human lncRNA-disease association inference based on lncRNA expression profiles
.
Bioinformatics
,
29
,
2617
2624
.

Cogill
 
S.
,
Wang
L.
(
2016
)
Support vector machine model of developmental brain gene expression data for prioritization of autism risk gene candidates
.
Bioinformatics
,
32
,
3611
3618
.

Derrien
 
T.
 et al.  (
2012
)
The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression
.
Genome Res
.,
22
,
1775
1789
.

Di
 
A.S.
 et al.  (
2018
)
Long non-coding MIR205HG depletes Hsa-miR-590-3p leading to unrestrained proliferation in head and neck squamous cell carcinoma
.
Theranostics
,
8
,
1850
1868
.

Dobin
 
A.
 et al.  (
2013
)
STAR: ultrafast universal RNA-seq aligner
.
Bioinformatics
,
29
,
15
21
.

Esteller
 
M.
(
2011
)
Non-coding RNAs in human disease
.
Nat. Rev. Genet
.,
12
,
861
874
.

Gorodkin
 
J.
(
2004
)
Comparing two K-category assignments by a K-category correlation coefficient
.
Comput. Biol. Chem
.,
28
,
367
374
.

Greene
 
C.S.
 et al.  (
2015
)
Understanding multicellular function and disease with human tissue-specific networks
.
Nat. Genet
.,
47
,
569
576
.

GTEx Consortium. (

2013
)
The Genotype-Tissue Expression (GTEx) project
.
Nat. Genet
.,
45
,
580
585
.

GTEx Consortium. (

2015
)
Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans
.
Science
,
348
,
648
660
.

Guan
 
Y.
 et al.  (
2012
)
Tissue-specific functional networks for prioritizing phenotype and disease genes
.
PLoS Comput. Biol
.,
8
,
e1002694.

Han
 
J.
 et al.  (
2017
)
LncRNAs2Pathways: identifying the pathways influenced by a set of lncRNAs of interest based on a global network propagation method
.
Sci. Rep
.,
7
,
46566
.

Hindorff
 
L.A.
 et al.  (
2009
)
Potential etiologic and functional implications of genome-wide association loci for human diseases and traits
.
Proc. Natl. Acad. Sci. USA
,
106
,
9362
9367
.

Junge
 
A.
 et al.  (
2017
)
RAIN: RNA-protein association and interaction networks
.
Database (Oxford)
,
2017
, baw167.

Kitsak
 
M.
 et al.  (
2016
)
Tissue specificity of human disease module
.
Sci. Rep
.,
6
,
35241.

Kumar
 
V.
 et al.  (
2013
)
Human disease-associated genetic variation impacts large intergenic non-coding RNA expression
.
PLoS Genet
.,
9
,
e1003201.

Lage
 
K.
 et al.  (
2008
)
A large-scale analysis of tissue-specific pathology and gene expression of human disease genes and complexes
.
Proc. Natl. Acad. Sci. USA
,
105
,
20870
20875
.

Liao
 
Q.
 et al.  (
2011
)
Large-scale prediction of long non-coding RNA functions in a coding–non-coding gene co-expression network
.
Nucleic Acids Res
.,
39
,
3864
3878
.

Liu
 
Y.
,
Zhao
M.
(
2016
)
lnCaNet: pan-cancer co-expression network for human lncRNA and cancer genes
.
Bioinformatics
,
32
,
1595
1597
.

Luu
 
H.N.
 et al.  (
2017
)
miRNAs associated with prostate cancer risk and progression
.
BMC Urol
.,
17
,
18.

Magger
 
O.
 et al.  (
2012
)
Enhancing the prioritization of disease-causing genes through tissue specific protein interaction networks
.
PLoS Comput. Biol
.,
8
,
e1002690.

Mirza
 
A.H.
 et al.  (
2015
)
Transcriptomic landscape of lncRNAs in inflammatory bowel disease
.
Genome Med
.,
7
,
39.

Necsulea
 
A.
 et al.  (
2014
)
The evolution of lncRNA repertoires and expression patterns in tetrapods
.
Nature
,
505
,
635
640
.

Ning
 
S.
 et al.  (
2016
)
Lnc2Cancer: a manually curated database of experimentally supported lncRNAs associated with various human cancers
.
Nucleic Acids Res
.,
44
,
D980
D985
.

Pafilis
 
E.
 et al.  (
2013
)
The SPECIES and ORGANISMS resources for fast and accurate identification of taxonomic names in text
.
PLoS One
,
8
,
e65390.

Palasca
 
O.
 et al.  (
2018
)
TISSUES 2.0: an integrative web resource on mammalian tissue expression
.
Database (Oxford)
,
2018
, bay003.

Pan
 
X.
,
Shen
H.B.
(
2016
)
OUGENE: a disease associated over-expressed and under-expressed gene database
.
Sci. Bull
.,
61
,
752
754
.

Pedregosa
 
F.
 et al.  (
2011
)
Scikit-learn: machine learning in Python
.
J. Mach. Learn. Res
.,
12
,
2825
2830
.

Pinero
 
J.
 et al.  (
2015
)
DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes
.
Database (Oxford), 2015
,
bav028
.

Pletscher-Frankild
 
S.
 et al.  (
2015
)
DISEASES: text mining and data integration of disease-gene associations
.
Methods
,
74
,
83
89
.

Shimodaira
 
H.
(
2000
)
Improving predictive inference under covariate shift by weighting the log-likelihood function
.
J. Stat. Plann. Infer
.,
90
,
227
244
.

Smyth
 
G.K.
(
2004
)
Linear models and empirical bayes methods for assessing differential expression in microarray experiments
.
Stat. Appl. Genet. Mol. Biol
.,
3
, Article3.

Sun
 
Z.Q.
 et al.  (
2017
)
MiR-590-3p promotes proliferation and metastasis of colorectal cancer via Hippo pathway
.
Oncotarget
,
8
,
58061
58071
.

Trapnell
 
C.
 et al.  (
2010
)
Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation
.
Nat. Biotechnol
.,
28
,
511
515
.

Tsoi
 
L.C.
 et al.  (
2015
)
Analysis of long non-coding RNAs highlights tissue-specific expression patterns and epigenetic profiles in normal and psoriatic skin
.
Genome Biol
.,
16
,
24.

Vapnik
 
V.N.
(
1998
)
Statistical Learning Theory
. 1st edn.
Wiley
,
New York
.

Verdoodt
 
B.
 et al.  (
2013
)
MicroRNA-205, a novel regulator of the anti-apoptotic protein Bcl2, is downregulated in prostate cancer
.
Int. J. Oncol
.,
43
,
307
314
.

Wells
 
A.
 et al.  (
2015
)
The anatomical distribution of genetic associations
.
Nucleic Acids Res
.,
43
,
10804
10820
.

Winter
 
E.E.
 et al.  (
2004
)
Elevated rates of protein secretion, evolution, and disease among tissue-specific genes
.
Genome Res
.,
14
,
54
61
.

Ye
 
Y.
,
Li
S.L.
 et al.  (
2018
)
Construction and analysis of mRNA, miRNA, lncRNA and TF regulatory networks reveal the key genes associated with prostate cancer
.
PLoS One
,
13
,
e0198055
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: John Hancock
John Hancock
Associate Editor
Search for other works by this author on:

Supplementary data