RBPTD: a database of cancer-related RNA-binding proteins in humans

Abstract
RNA-binding proteins (RBPs) play important roles in regulating the expression of genes involved in human physiological and pathological processes, especially in cancers. Many RBPs have been found to be dysregulated in cancers; however, no tool has been available to integrate high-throughput data from different dimensions in order to systematically identify cancer-related RBPs and to explore the causes of their abnormality and their potential functions. Therefore, we developed a database named RBPTD to identify cancer-related RBPs in humans and systematically explore their functions and abnormalities by integrating different types of data, including gene expression profiles, prognosis data and DNA copy number variation (CNV), among 28 cancers. We found a total of 454 significantly differentially expressed RBPs, 1970 RBPs with significant prognostic value, and 53 dysregulated RBPs correlated with CNV abnormality. The functions of 26 cancer-related RBPs were explored by analysing high-throughput RNA sequencing data obtained by crosslinking immunoprecipitation, and the functions of the remaining RBPs were predicted by calculating their correlation coefficients with other genes. Finally, we developed RBPTD so that users can explore the functions and abnormalities of cancer-related RBPs and improve our understanding of their roles in tumorigenesis. Database URL: http://www.rbptd.com


Introduction
RNA-binding proteins (RBPs) are proteins that bind RNA to form complexes that regulate RNA expression and exert their functions in physiological and pathological processes (1,2). RBPs can regulate gene expression in different ways. For instance, they can regulate gene expression at the post-transcriptional level by affecting RNA stability and transport (3-8). In addition, they regulate gene expression at the translational level by directing ribosomes and thereby affecting the rate of protein synthesis (9). Therefore, normally functioning RBPs play important roles in human cellular processes.

Figure 1. Workflow of RBPTD development. Expression profiles, prognostic information and copy number variation (CNV) data were downloaded from The Cancer Genome Atlas (TCGA). To find the cancer-related RBPs, we used the DESeq2 package in the R software environment to analyse differentially expressed genes (DEGs), and the survival package in R was used for survival analysis. To reveal the causes of abnormality of dysregulated RBPs, the genomic identification of significant targets in cancer (GISTIC) tool was used for CNV analysis. We also predicted the functions of RBPs using CLIP-Seq experimental data and co-expression analysis. Finally, we developed a database using the obtained RNA-binding protein (RBP) results.
RBPs are closely associated with a variety of human diseases, especially cancers (5,6,10-14). Numerous studies have shown that dysregulated RBPs play essential roles in tumorigenesis (15,16). For example, the RBP U2AF1 affects pre-mRNA splicing of a large number of known oncogenic drivers to promote tumorigenesis (17). Therefore, it is essential to integrate the expression profiles of RBPs to systematically investigate their functions in cancers.
Previous studies have revealed that DNA copy number variation (CNV) has a significant impact on gene expression regulation and is involved in tumorigenesis (18,19). Therefore, identifying CNV-related dysregulation of RBPs may reveal why some RBPs are dysregulated. Since RBPs mainly exert their functions by binding transcripts, high-throughput experimental methods, such as crosslinking immunoprecipitation sequencing (CLIP-Seq), can be used to identify their target sites in gene transcripts. Thus, it is important to integrate multiple types of data to investigate the potential functions of RBPs and the causes of their abnormality.
To systematically investigate the functions of RBPs in cancers, we first identified cancer-related RBPs, including significantly differentially expressed RBPs and those with significant prognostic value, by analysing their gene expression profiles and prognostic information. We then explored their potential functions by analysing CLIP-Seq data for 26 RBPs and by calculating the correlation of expression between the remaining RBPs and other genes. Finally, we investigated CNV-related dysregulated RBPs and developed a database to manage and present the results for reference.

Methods
The analytical workflow of this study consists of four sections: identification of cancer-related RBPs, CNV analysis, function prediction and database construction (Figure 1).

Identification of cancer-related RBPs
RBPs were identified by experimental methods, including CLIP-Seq, and by computational prediction methods (20), following previously established methods (6,21,22). Gene expression profiles of 28 types of cancer were downloaded from The Cancer Genome Atlas (TCGA). Significantly differentially expressed RBPs were identified by comparing their expression levels in cancer tissues with those in adjacent normal tissues using DESeq2 (ver. 3.7) with the default settings (23,24). The raw P values were then adjusted to the false discovery rate (FDR) using the Benjamini-Hochberg procedure. Significant differential expression was assessed at a level of FDR ≤ 0.05. To identify RBPs with significant prognostic value, patients with each cancer type were divided into two groups according to the mean expression level of each RBP. The Kaplan-Meier method was then applied to determine differences between the survival curves of the two groups using the survival package (ver. 2.38) (25) in the R software environment. Significance was determined at a level of P ≤ 0.05.
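The two generic steps in this paragraph, the Benjamini-Hochberg adjustment and the split of patients at the mean expression level, can be illustrated with a minimal sketch in plain Python. This is not the R code used in the study (which relied on DESeq2 and the survival package). The function names, gene labels such as `RBP_A` and patient identifiers are hypothetical, and the survival comparison itself is not shown.

```python
from statistics import mean

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def split_by_mean(expression):
    """Label each patient 'high' or 'low' relative to the cohort mean."""
    cutoff = mean(expression.values())
    return {pid: ("high" if value > cutoff else "low")
            for pid, value in expression.items()}

# Toy inputs (hypothetical, not data from the study).
raw_p = {"RBP_A": 0.001, "RBP_B": 0.02, "RBP_C": 0.03, "RBP_D": 0.8}
fdr = dict(zip(raw_p, benjamini_hochberg(list(raw_p.values()))))
significant = sorted(gene for gene, q in fdr.items() if q <= 0.05)

groups = split_by_mean({"P1": 2.0, "P2": 8.0, "P3": 5.5, "P4": 1.5})
```

In this toy example the adjustment keeps three of the four hypothetical genes at FDR ≤ 0.05, and the resulting `groups` labels are the kind of input a Kaplan-Meier comparison would take.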

CNV analysis
To identify CNV-related dysregulated RBPs among 28 types of cancer, the copy number segment data of each cancer patient were downloaded from TCGA. These data were then analysed using the genomic identification of significant targets in cancer (GISTIC) tool (ver. 2.0) (26) with the human hg38 reference genome. According to the G-score calculated by GISTIC, significantly amplified or deleted regions were identified for each patient. The FDR was then calculated for aberrant regions, and regions with FDR ≤ 0.05 were identified (26-28).
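As an illustration of how regions passing the FDR cutoff might be linked to RBP genes, the sketch below filters regions at FDR ≤ 0.05 and reports the genes whose coordinates overlap them. The text does not state how regions were matched to RBPs, so the coordinate-overlap rule is an assumption, and the `Region` type, the gene labels and all coordinates are hypothetical.

```python
from typing import NamedTuple

class Region(NamedTuple):
    chrom: str
    start: int
    end: int
    kind: str       # "amp" (amplified) or "del" (deleted)
    q_value: float  # FDR reported for the aberrant region

def overlaps(gene, region):
    """True if a (chrom, start, end) gene interval intersects the region."""
    chrom, start, end = gene
    return chrom == region.chrom and start <= region.end and region.start <= end

def cnv_related(genes, regions, fdr_cutoff=0.05):
    """Map each gene to the kinds of significant regions it overlaps.

    Assumption: a gene counts as CNV-related when it overlaps a region
    with FDR <= fdr_cutoff; the source does not spell out this rule.
    """
    kept = [r for r in regions if r.q_value <= fdr_cutoff]
    hits = {}
    for name, coords in genes.items():
        kinds = sorted({r.kind for r in kept if overlaps(coords, r)})
        if kinds:
            hits[name] = kinds
    return hits

# Toy regions and gene coordinates (hypothetical).
regions = [
    Region("chr1", 1000, 5000, "amp", 0.01),
    Region("chr1", 8000, 9000, "del", 0.20),  # fails the FDR cutoff
    Region("chr2", 100, 900, "del", 0.04),
]
genes = {
    "RBP_A": ("chr1", 4500, 6000),
    "RBP_B": ("chr1", 8200, 8600),
    "RBP_C": ("chr2", 50, 150),
    "RBP_D": ("chr3", 10, 20),
}
hits = cnv_related(genes, regions)
```

Intersecting such a result with the list of differentially expressed RBPs would give one possible definition of CNV-related dysregulated RBPs.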

Function prediction
RBPs predominantly exert their functions by binding RNA and affecting its stability or function. CLIP-Seq data obtained from StarBase (ver. 3.0) (29,30) were used to identify RBP binding sites in gene transcripts, and CLIP-Seq data were found for 26 cancer-related RBPs. In addition, Pearson correlation coefficients between RBPs and other genes were calculated using the Hmisc package in R to predict their potential targets. Putative regulatory pairs with coefficients ≥0.4 and P ≤ 0.05 were retained.

Numbers of significant RNA-binding proteins (RBPs), categorised by survival (SUR), differential expression (DE) and copy number variants (CNV). The Boolean operator (&) indicates RBPs common among multiple categories.
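The co-expression filter described in this section (coefficient ≥0.4 and P ≤ 0.05) can be sketched in plain Python as follows. This is an illustration rather than the Hmisc-based analysis itself: the P value here is a normal approximation based on the Fisher z-transformation, which is an assumption and may differ slightly from the value an R package reports, and all gene labels and expression values are hypothetical.

```python
from math import atanh, sqrt
from statistics import NormalDist

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def approx_p_value(r, n):
    """Two-sided P value via the Fisher z-transformation (approximation, n > 3)."""
    if abs(r) >= 1.0:
        return 0.0
    z = atanh(r) * sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def putative_pairs(rbp_expr, gene_expr, r_cutoff=0.4, p_cutoff=0.05):
    """Keep RBP-gene pairs with r >= r_cutoff and P <= p_cutoff."""
    pairs = []
    for rbp, x in rbp_expr.items():
        for gene, y in gene_expr.items():
            if gene == rbp:
                continue  # skip self-correlation
            r = pearson_r(x, y)
            if r >= r_cutoff and approx_p_value(r, len(x)) <= p_cutoff:
                pairs.append((rbp, gene, round(r, 3)))
    return pairs

# Toy expression vectors across ten samples (hypothetical).
rbp_expr = {"RBP_A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
gene_expr = {
    "GENE_X": [2, 4, 5, 4, 5, 7, 8, 9, 10, 12],  # rises with RBP_A
    "GENE_Y": [5, 3, 6, 2, 7, 4, 6, 3, 5, 4],    # no clear trend
}
pairs = putative_pairs(rbp_expr, gene_expr)
```

Only the hypothetical `GENE_X` survives both cutoffs here. As the thresholds are applied to the signed coefficient, negatively correlated genes are not retained under this reading of the criterion.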

Database construction
The RBPTD database was constructed using Apache (ver. 2.0), PHP (ver. 7) and MySQL (ver. 5.3). We used Vue CLI and BootstrapVue as the framework for building the user interface. The gene expression boxplots and Kaplan-Meier survival curves are rendered with the ECharts charting library.

Results

Identification of cancer-related RBPs
Our comparison of gene expression profiles between cancer tissues and adjacent normal tissues across 28 cancer types identified 454 significantly differentially expressed RBPs (|log2 fold change| ≥ 1 and FDR ≤ 0.05). Among these, 236 RBPs were cancer type-specific, and 218 were differentially expressed in more than one cancer type (Figure 2A). Among the significantly differentially expressed RBPs found in multiple cancer types, CPEB1 was differentially expressed in nine types of cancer; previous studies have reported its ability to promote tumour migration in breast (31), liver (32) and endometrial cancers (33). Our analysis of the prognostic value of RBPs in 26 types of cancer identified 1970 RBPs with significant prognostic value; of these, 218 (approximately 11%) were found in more than one type of cancer (Figure 2B).
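The distinction between cancer type-specific RBPs and RBPs that recur across cancer types amounts to counting, for each RBP, the number of cancer types in which it is significant. A minimal sketch follows; the gene labels and their assignment to cancer types are hypothetical toy data, not results from the database.

```python
from collections import Counter

def classify_by_recurrence(de_by_cancer):
    """Split RBPs into cancer type-specific and multi-cancer sets.

    de_by_cancer maps a cancer type to the RBPs significant in that type.
    """
    counts = Counter(rbp for rbps in de_by_cancer.values() for rbp in set(rbps))
    specific = sorted(rbp for rbp, c in counts.items() if c == 1)
    shared = sorted(rbp for rbp, c in counts.items() if c > 1)
    return specific, shared, counts

# Toy input (hypothetical assignments).
de_by_cancer = {
    "BRCA": {"RBP_A", "RBP_B"},
    "LIHC": {"RBP_A", "RBP_C"},
    "UCEC": {"RBP_A"},
}
specific, shared, counts = classify_by_recurrence(de_by_cancer)
```

The `counts` mapping also gives the per-RBP tally of cancer types, which is the kind of figure quoted for CPEB1 above.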

Causes and functions of dysregulated RBPs
To explore the potential functions of RBPs, we identified their target sites in gene transcripts using CLIP-Seq data. We found a total of 97 105 binding sites among 26 RBPs in the experimental data. Because the amount of available CLIP-Seq data was limited, we predicted the genes regulated by the remaining RBPs by calculating their correlation coefficients with other genes. In total, we found 26 030 969 RBP-gene regulatory pairs, involving 1924 RBPs and 19 564 putative target genes (Table 1).

Database interface
We constructed a website to search and display our integrated data. The website mainly consists of four sections. The search interface can be used to search for RBPs in one or more cancer types with various filter options, such as gene expression fold change, prognostic value and change in CNV (Figure 4A). The web interface also allows users to search the database by RBP name or Ensembl identifier (Figure 4B).
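The filtered search described above can be pictured as a parameterised query over a results table. The sketch below uses Python's built-in sqlite3 module as a stand-in for the MySQL back end. The table name `rbp_result`, its columns, the `ENSG_DEMO_*` identifiers and all values are hypothetical, and the actual RBPTD schema may differ.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory table with hypothetical per-cancer RBP results."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE rbp_result (rbp TEXT, ensembl_id TEXT, cancer TEXT, "
        "log2_fc REAL, fdr REAL, surv_p REAL, cnv TEXT)"
    )
    rows = [
        ("RBP_A", "ENSG_DEMO_1", "BRCA", 2.1, 0.001, 0.01, "amp"),
        ("RBP_A", "ENSG_DEMO_1", "LIHC", 0.4, 0.30, 0.02, None),
        ("RBP_B", "ENSG_DEMO_2", "BRCA", -1.6, 0.02, 0.20, "del"),
        ("RBP_C", "ENSG_DEMO_3", "LIHC", -1.2, 0.03, 0.04, "del"),
    ]
    conn.executemany("INSERT INTO rbp_result VALUES (?,?,?,?,?,?,?)", rows)
    return conn

def search(conn, cancers, min_abs_log2_fc=1.0, max_surv_p=0.05, cnv=None):
    """Filter by cancer type, fold change, prognostic P value and CNV status."""
    placeholders = ",".join("?" for _ in cancers)
    sql = (
        f"SELECT rbp, cancer FROM rbp_result WHERE cancer IN ({placeholders}) "
        "AND ABS(log2_fc) >= ? AND surv_p <= ?"
    )
    params = [*cancers, min_abs_log2_fc, max_surv_p]
    if cnv is not None:
        sql += " AND cnv = ?"
        params.append(cnv)
    sql += " ORDER BY rbp, cancer"
    return conn.execute(sql, params).fetchall()

def lookup(conn, term):
    """Find records by RBP name or by Ensembl identifier."""
    sql = ("SELECT rbp, cancer FROM rbp_result "
           "WHERE rbp = ? OR ensembl_id = ? ORDER BY rbp, cancer")
    return conn.execute(sql, (term, term)).fetchall()

conn = build_demo_db()
filtered = search(conn, ["BRCA", "LIHC"])
deleted_only = search(conn, ["BRCA", "LIHC"], cnv="del")
by_id = lookup(conn, "ENSG_DEMO_2")
```

Binding user input through placeholders rather than string concatenation keeps such a search safe against SQL injection, which matters for any public web interface.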

Discussion
RBPTD is a database for the systematic investigation of the functions of cancer-related RBPs and the causes of their abnormality. The database recognises 1991 cancer-related RBPs, including 454 significantly differentially expressed RBPs and 1970 RBPs with significant prognostic value, among 28 types of cancer. To explore their functions, target sites of 26 RBPs were identified in this study using CLIP-Seq data. In addition, we found 26 039 691 significantly related RBP-gene regulatory pairs for the remaining RBPs by calculating their Pearson correlation coefficients. Finally, we found 45 CNV-related dysregulated RBPs.
This is the first study to integrate multiple types of high-throughput data from different dimensions to systematically investigate cancer-related RBPs. Previous studies mainly focused on specific aspects of RBPs, such as the identification of dysregulated RBPs and SNP-related RBPs (15,16,34). In our study, we predicted cancer-related RBPs by analysing differential gene expression, prognosis and CNV. In addition, we not only explored the functions of RBPs by analysing CLIP-Seq data and calculating their correlation with other genes, but also investigated the causes of abnormality for the dysregulated RBPs.
There are several limitations to the current study. Some types of cancer lack control datasets; therefore, significantly differentially expressed genes could not be identified for these types of cancer. In addition, the limited amount of CLIP-Seq data restricted the function prediction of RBPs. Function predictions based on Pearson correlation may include false positives and are not supported by experimental data.
In this study, we developed RBPTD to systematically identify cancer-related RBPs, predict their potential functions in cancers and explore the causes of abnormality among significantly differentially expressed RBPs. We expect that our database will promote further understanding of the roles of RBPs in tumorigenesis.