CCRDB: a cancer circRNAs-related database and its application in hepatocellular carcinoma-related circRNAs

Abstract Circular RNAs (circRNAs) are widely expressed in human cells and tissues. They form covalently closed loops through exon circularization, have stable expression patterns and play important regulatory roles in physiological and pathological processes. A comprehensive disease-related knowledge base for in-depth analysis of circRNAs is still lacking. In this paper, a cancer circRNAs-related database (CCRDB) was established. The initial circRNA data in CCRDB were obtained by sequencing 10 samples from 5 patients with hepatocellular carcinoma (HCC), in which a total of 11 501 circRNAs were found; the database can easily be expanded by collecting and analyzing external data sources such as circBASE (1). Using CCRDB, we further studied the relationships between circRNAs and HCC and found that the circRNAs hsa_circ_0002130, hsa_circ_0084615, hsa_circ_0001445, hsa_circ_0001727 and hsa_circ_0001361, with the corresponding genes C3 (2, 3), ASPH (4), SMARCA5 (5), ZKSCAN1 (6) and FNDC3B (7), respectively, might be potential biomarker targets for HCC. Furthermore, our experiment identified new circRNAs at the chromosome sites chr12:23998917|24048958 and chr16:72090429|72093087, with the corresponding genes SOX5 (8) and HP (9), respectively, which might also be potential biomarker targets for HCC. These results indicate that CCRDB can effectively reveal the relationships between circRNAs and HCC. As the first circRNAs database to provide analysis and comparison functions, it is of great significance for researchers who wish to further study the rules of circRNAs, to understand the roles of circRNAs in disease and to find target genes for therapeutic approaches.


Introduction
Circular RNA (circRNA) is a type of noncoding RNA that forms a covalently closed continuous loop, with the 3′ and 5′ ends joined together. This feature confers numerous properties on circRNAs, many of which have only recently been identified. Some circRNAs can act as microRNA sponges to block the function of microRNAs, thereby affecting gene regulation and expression; they are widely involved in life activities and play important regulatory roles in tumorigenesis and development (10)(11). For example, the circRNA CDR1as/ciRS-7 (CDR1 antisense, also known as circular RNA sponge for microRNA-7) inhibits microRNA-7, thereby increasing expression of the target genes of microRNA-7. The circRNA derived from the sex-determining region Y gene (Sry) has also been shown to act as a sponge for microRNA-138 (12). Liu et al. found that circRNA-CER regulates MMP13 expression by acting as a competitive endogenous RNA (ceRNA) (13). Other studies have shown that circRNAs are involved in the development of various diseases, including atherosclerosis, neurological diseases and cancer (14)(15). Guarnerio et al. (16) found that tumors carrying chromosomal translocations also contained circRNAs derived from the rearranged genomes, namely aberrant fusion circRNAs (f-circRNAs). They further confirmed that these circRNAs may be functionally relevant in promoting tumorigenesis, suggesting their diagnostic and therapeutic potential. Meanwhile, the development of high-throughput sequencing technology (17) has greatly expanded the scope of transcriptome research and provided a way to view circRNAs in different samples. However, the specific role of most circRNAs has not yet been identified.
At present, hepatocellular carcinoma (HCC) is one of the most common malignancies and the sixth largest cancer killer in the world. Most HCC is caused by chronic hepatitis B virus infection and subsequent cirrhosis (18). The finding that cellular circRNAs are stable in saliva (19) and exosomes (20) makes circRNA a promising biomarker for diagnosis. Similarly, some studies have shown that if the expression of microRNA-7 (miR-7) is up-regulated in HCC cells, the cell cycle may be arrested at the G1/S phase, thus inhibiting the proliferation of cancer cells (21). In recent years, Qin et al. found that hsa_circ_0005075 is a potential target for the diagnosis and treatment of HCC. Their results showed that this circRNA can successfully distinguish between tumors and normal samples (22,23). Li et al. found that circMTO1 could act as a sponge of the carcinogenic microRNA-9 to up-regulate the expression of p21 and significantly affect the proliferation of HCC cells. circMTO1 might therefore be used as a prognostic factor and therapeutic target for HCC (24). Huang et al. found that hsa_circRNA_100338 inhibited the expression of microRNA-141-3p and played an important role in regulating the metastatic potential of HCC cells, providing one of the first circRNA biomarkers for HCC clinical studies (25). Fu et al. showed that hsa_circ_0353570 was closely related to the clinicopathological characteristics of HCC patients and that a background of liver cirrhosis was related to a decrease in hsa_circ_0353570 (26). Chen found that HSAXCIRCY05996 interacted with microRNA-129-5p and regulated Notch1 mRNA expression by acting as a sponge of microRNA-129-5p. It has been reported that Notch plays an important role in the occurrence and metastasis of HCC (27)(28)(29).
In recent years, researchers have paid increasing attention to the study of circRNAs, and many circRNA-related databases have been published, such as circBase, circNet and the database for cancer-specific circRNAs (CSCD) (1)(30)(31)(32)(33)(34). Among them, circBase merges and unifies circRNA data sets from public references and provides evidence to support their expression in the genomic context (29). CircNet provides a common database of tissue-specific circRNA expression profiles and circRNA-miRNA gene regulatory networks, together with new methods and nomenclature to identify new circRNAs. Neither of them is specifically targeted at the comparison of disease-related RNAs, which makes it difficult to study the biological effects and regulatory mechanisms of disease-related information. CSCD is a comprehensive database of cancer-specific circRNAs that provides general information and regulatory property queries, but it does not provide new RNA discoveries, nor does it provide tools and methods for in-depth analysis.

Database sources
The CCRDB data sources include our experimental data and external data from the published literature. Several individuals were selected for the experiment. Five pairs of samples, HCC cells and normal tissue adjacent to the cancer, were screened for differentially expressed circRNAs. We divided them into two groups: group B was normal cells and group C was HCC cells.
Using the published circRNA database circBASE, we annotated the circRNAs in the samples according to their source region. In our experiment, a total of 11 501 circRNAs were found; they are listed in Table 1. Of these, 4989 circRNAs were not included in circBASE and are therefore new circRNAs found in our experiment. By sample, we found 5033, 2446, 3101, 1068 and 2249 circRNAs in normal cells (group B) and 3741, 3233, 2561, 2555 and 2209 circRNAs in cancer cells (group C). The CCRDB also collects external data sets from the existing circBASE database, in which thousands of circRNAs from published studies have recently been shown to be expressed in Homo sapiens cells (35)(36)(37)(38)(39)(40)(41)(42). This data set consists of basic circRNA information along with genomic coordinates, annotation, predicted miRNA seed matches and the junction reads of each sample. Other external data can easily be added to the CCRDB database. In total, the CCRDB includes 364 582 circRNAs from 62 human organ samples. Table 2 shows statistics of the CCRDB.
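The annotation step described above, in which detected circRNAs are checked against circBASE and the unmatched ones are treated as new, can be sketched as follows. This is a minimal illustration only: the function name, the dictionary layout of the circBASE index and the sample identifiers are assumptions made for the example, and the actual CCRDB pipeline may match records differently (for example, by source region rather than by exact coordinates).

```python
def split_known_and_novel(detected, circbase_index):
    """Partition detected circRNA coordinates into annotated and novel ones.

    detected: iterable of coordinate strings such as "chr1:100|200".
    circbase_index: dict mapping coordinate strings to circBASE IDs
                    (a hypothetical in-memory layout, not the circBASE format).
    """
    known = {}
    novel = []
    for coord in detected:
        key = coord.replace(" ", "")  # tolerate stray spaces in coordinates
        if key in circbase_index:
            known[key] = circbase_index[key]
        else:
            novel.append(key)
    return known, novel


# Illustrative data only; the coordinates and the ID below are made up.
example_index = {"chr1:100|200": "example_circbase_id"}
example_known, example_novel = split_known_and_novel(
    ["chr1:100|200", "chr2:300|400"], example_index
)
print(example_known)
print(example_novel)
```

An exact-match dictionary lookup keeps the sketch simple; a real annotation step would probably also need to handle differing genome builds and coordinate conventions.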

Database structure
In CCRDB, we mainly consider three aspects: circRNA information, annotation information and analysis information. The major information in the CCRDB is listed in Table 3.

Database construction
The main purpose of the CCRDB is to integrate and maintain a high-quality circRNA database and analysis platform in order to further discover the relationships between circRNAs and HCC. It is a comprehensive and fully functional circRNA resource library. Figure 1 illustrates the main structure of the CCRDB, which is based on a client/server architecture. The CCRDB contains a list of circRNAs, their functional annotations and analysis functions for the circRNAs.
In terms of data structure, the CCRDB is implemented with a relational database and a textual database, which allows it to accommodate heterogeneous data. The database implements functions such as data modeling and data extraction, conversion and loading. To eliminate differences between data samples from various sources, we label the data by circRNA ID and gene ID, which facilitates the implementation of subsequent analysis applications.
CircRNAs information fields (from Table 3):
#junction reads: The number of junction reads that support the head-to-tail connection of the circRNA
SM MS SMS: CircRNA reads alignment signal
#non junction reads: The number of reads aligned to the flanking regions of the head-to-tail junction of the circRNA
Junction reads ratio: A parameter that can be used to measure the reliability of the circRNA
CircRNA type: The circRNA type characterized by the region
Gene ID: The corresponding gene ID according to the location of the circRNA

Field name: Description
Group ID: Identifier of a comparison group of samples B and C
CircRNA ID: CircRNA identifier
CircBase ID: CircBase database identifier
Gene ID: The corresponding gene ID according to the location of the circRNA
B-Expression: The number of junction reads that support the circRNA head-to-tail connection in sample B
C-Expression: The number of junction reads that support the circRNA head-to-tail connection in sample C
B-TPM: Normalized expression (TPM) of sample B (when the corresponding circRNA is not detected in a sample, the value is set to 0.001)
C-TPM: Normalized expression (TPM) of sample C (when the corresponding circRNA is not detected in a sample, the value is set to 0.001)
Log2 Ratio (1C/1B): The log2 ratio obtained by comparing the junction reads of samples B and C
Up-Down-Regulation: Up- or down-regulation according to the normalized expression comparison from sample B to sample C
P-value: P-value
FDR: FDR for the P-value
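To make the comparison fields above concrete, the following sketch models one row of a comparison result. It is a hedged illustration, not the CCRDB schema: the class name, attribute names and the choice to compute the log2 ratio from the TPM values of sample C over sample B are assumptions; only the 0.001 substitute value for undetected circRNAs is taken from the field descriptions.

```python
import math
from dataclasses import dataclass

TPM_FLOOR = 0.001  # substitute value when a circRNA is not detected in a sample


@dataclass
class ComparisonRow:
    """One circRNA in one comparison group (hypothetical layout)."""
    circrna_id: str
    gene_id: str
    b_tpm: float  # normalized expression in sample B (normal)
    c_tpm: float  # normalized expression in sample C (HCC)

    def log2_ratio(self) -> float:
        # Assumption: ratio of C over B, with undetected values floored.
        b = self.b_tpm if self.b_tpm > 0 else TPM_FLOOR
        c = self.c_tpm if self.c_tpm > 0 else TPM_FLOOR
        return math.log2(c / b)

    def regulation(self) -> str:
        ratio = self.log2_ratio()
        if ratio > 0:
            return "UP"
        if ratio < 0:
            return "DOWN"
        return "NONE"


# Illustrative values only.
row = ComparisonRow("chrN:1|2", "GENE", b_tpm=2.0, c_tpm=8.0)
print(row.log2_ratio(), row.regulation())
```

Flooring undetected values avoids division by zero and an undefined logarithm, at the cost of producing very large ratios for circRNAs that appear in only one sample.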

Usage
As a comprehensive and interactive database, CCRDB provides the following main functions: search, analysis application, download and upload. Users can browse circRNAs by selecting the sample name, circRNA ID (for example, Chr X: 891303|892653, representing the donor and acceptor sites of each circRNA), circBASE ID, gene ID and other criteria to obtain more intuitive information (Figure 2A). The information returned includes sample type, circRNA ID, circBase ID, gene ID, sample source, etc. When any circRNA ID is clicked, the chromosome location and the start and end sites of the circRNA are displayed in the upper right corner of the home page. The detailed view also shows the number of junction reads that support the head-to-tail connection of the circRNA, the reads alignment signal, the number of reads aligned to the flanking regions at the ends of the circRNA, the junction reads ratio (a parameter used to measure the reliability of the circRNA) and the type of the circRNA (Figure 2B).
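The circRNA ID format mentioned above (chromosome, then the two splice-site coordinates separated by a vertical bar) can be parsed as in the sketch below. The type name, function name and regular expression are assumptions introduced for illustration; the CCRDB may store or validate its identifiers differently.

```python
import re
from typing import NamedTuple


class CircRNALocus(NamedTuple):
    chromosome: str
    start: int
    end: int


# Accepts forms such as "Chr X: 891303|892653" or "chr16:72090429|72093087".
_ID_PATTERN = re.compile(r"^chr\s*(\w+)\s*:\s*(\d+)\s*\|\s*(\d+)$", re.IGNORECASE)


def parse_circrna_id(circrna_id: str) -> CircRNALocus:
    """Split a coordinate-style circRNA ID into chromosome, start and end."""
    match = _ID_PATTERN.match(circrna_id.strip())
    if match is None:
        raise ValueError(f"unrecognized circRNA ID: {circrna_id!r}")
    chrom, start, end = match.groups()
    return CircRNALocus("chr" + chrom, int(start), int(end))


print(parse_circrna_id("Chr X: 891303|892653"))
print(parse_circrna_id("chr16:72090429|72093087"))
```

Normalizing the chromosome name to a lowercase "chr" prefix makes IDs from different sources comparable as plain tuples.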
We provide a comparative analysis platform that offers data analysis functions on circRNA data imported from different samples and different organs. The two groups of circRNA data being compared can come from different sources, which makes the platform flexible and suitable for various comparative analyses.
The number of junction reads that support the head-to-tail connection of a circRNA is used as the comparison criterion to measure the strength of the circRNA signal. For the circRNAs in the samples selected by the user, the relevant results are returned as tabular data or as up- and down-regulation information. The upload function can be used to import the data to be analyzed, whose semaphore is based on junction reads. In the analysis application, the user selects the circRNA data of the comparison group and carries out a pairwise comparison by choosing the result conditions (FDR and log2Ratio) (Figure 2B), which yields the differences within the selected comparison group. The result of a circRNA comparison can be presented as a table or a graph (Figure 2C and D) for further analytical studies. After several comparison groups have been compared, their conclusions can be integrated to obtain more interesting results. Through comparative analysis, we can obtain the differences common to the circRNAs of many samples, such as circRNA signal intensity and regulatory direction. CircRNAs are classified by new expression patterns, and new circRNAs are found and named.
Provide the first comprehensive cancer-specific circRNAs database.
Provide new circRNAs discovery and analysis tools to search for candidate target genes.

Comparisons with other databases
We compared the CCRDB with other circRNA databases, such as circBase (1), CSCD (32) and CircNET (34), as listed in Table 4. The CCRDB can achieve the following functions: (i) discover new circRNAs by sequencing normal and pathological cells from the same tissue of the same person, which avoids background effects of genetic differences among different people, (ii) provide a platform for circRNA differential analysis and (iii) link to and extend with external data sources, such as circBase, GO and PubMed, to display a comprehensive network of RNA discovery and regulation.
In general, the CCRDB provides users with interactive tools, a concise home page interface and a search engine that allow convenient and flexible queries by sequence, gene and genome location. Taken together, the CCRDB can serve as an integrated circRNA resource that provides not only valuable relationships between circRNAs and diseases but also a new analysis tool to mine more knowledge from the data.

Results
After the establishment of the new database, we further studied the circRNAs and the relationship between circRNAs and HCC and found some interesting results.

Analysis method
We set up comparison groups for analysis. Two sequenced circRNA samples form a comparison group. They can come from the same person (organ), or they can be chosen from the samples of different persons (organs). One selection method is to take both samples from the circRNA sequencing data of the same person, which avoids background effects such as genetic differences among people. Using the circRNA comparative analysis application, we compare the circRNAs of a person's cancer cells with the circRNAs of the same person's adjacent normal cells. A semaphore must be chosen for the comparison group as the measure of comparative signal strength. The main principle of the circRNA comparative analysis application is to compare the signal expression of the samples, which is the number of junction reads that support the head-to-tail connection of each circRNA. This is the field named '#junction reads' in the circRNAs information listed in Table 3.
The P-value is calculated by a hypothesis test; in its formula, x and y are the expression values of a circRNA in the two samples. There are two major parameters, FDR and |log2Ratio|. |log2Ratio| is the absolute value of the log2 ratio of the semaphores of the two samples being compared. FDR is the false discovery rate of the P-value. Usually |log2Ratio| is set to be greater than or equal to 1, and FDR to be less than 0.001. These two parameters can be set according to actual needs.
Figure 5. (a) Common significant differences in circRNAs and the corresponding genes. (b) CircRNAs, with their corresponding genes, that are newly found in this experiment.
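The selection of significant differences by FDR and |log2Ratio| described in this section can be sketched as follows. The text does not state how the FDR is derived from the P-values, so the Benjamini-Hochberg procedure used here is an assumption, as are the function names and the dictionary layout of the rows; the default thresholds (|log2Ratio| of at least 1, FDR below 0.001) follow the values given in the text.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_ids(rows, min_abs_log2=1.0, max_fdr=0.001):
    """Keep circRNAs with |log2Ratio| >= min_abs_log2 and FDR < max_fdr.

    rows: list of dicts with keys "circrna_id", "log2_ratio" and "p_value"
          (a hypothetical layout for this sketch).
    """
    fdrs = benjamini_hochberg([row["p_value"] for row in rows])
    return [
        row["circrna_id"]
        for row, fdr in zip(rows, fdrs)
        if abs(row["log2_ratio"]) >= min_abs_log2 and fdr < max_fdr
    ]


# Illustrative values only.
example_rows = [
    {"circrna_id": "a", "log2_ratio": 2.0, "p_value": 1e-6},
    {"circrna_id": "b", "log2_ratio": 0.5, "p_value": 1e-6},
    {"circrna_id": "c", "log2_ratio": -3.0, "p_value": 0.5},
]
print(significant_ids(example_rows))
```

Both thresholds are exposed as parameters because the text notes that they can be set according to actual needs.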

HCC cells show distinctly different circRNAs from normal cells
Using the comparative analysis application, we selected samples from the same person (organ) as each comparison group, in which sample B was normal cells and sample C was hepatoma cells. (Comparison groups can also be chosen in other ways.) We labeled the groups 1B&1C, 2B&2C, . . . 5B&5C, respectively. The circRNAs expressed in the same organ (liver) of the different persons were identified. Across the five comparison groups, the numbers of circRNA differences found between samples B and C were 6808, 4652, 4365, 3102 and 3534, respectively, and the numbers of significant differences were 111, 44, 21, 47 and 25, respectively. These differences and significant differences are analyzed in Table 5. The significant differences were obtained by setting the FDR and |log2Ratio| parameters. The result for the expression level of 1B vs 1C is shown in Figure 3.
We then put all comparison groups together and compared the significant differences of the same category across all groups. For each circRNA, we counted the number of comparison groups in which its difference has the same regulatory direction.
All the significant differences between cancer cells and their adjacent normal cells of the same person were analyzed. Figure 4 shows the count of the comparison groups in which their circRNAs have common significant differences and the same regulation directions in all selected comparison groups of the experimental samples.
Across the comparison groups of the five persons, 31 circRNAs showed significant differences with the same regulatory direction in two or more comparison groups, including 20 circRNAs with circBASE IDs and 11 without circBASE IDs, which are newly found.
Three circRNAs showed significant differences in the same direction of regulation in all five comparison groups (5/5, 100%). Five circRNAs showed significant differences in the same direction of regulation in four comparison groups (4/5, 80%). Five circRNAs showed significant differences in the same direction of regulation in three comparison groups (3/5, 60%).
The changes of circRNAs from normal cells to diseased cells in different comparison groups were generally consistent with the same regulatory directions (UP or DOWN). This helps us to find the corresponding regulatory or target genes from the significant variation of circRNAs, as shown in Figure 5a and b.

Highly probable carcinomatous circRNAs
The circRNAs that showed the same significant differences and the same regulation directions in many comparison groups (the comparison groups count) in our analysis application seem to be strongly related to the disease. Corresponding candidate regulatory genes or target genes can be found from these circRNAs, as shown in Figure 5.
We found that the hsa_circ_0002130-related gene C3 showed significant differences in five of five comparison groups (5/5) and was down-regulated in our experimental samples. According to published reports, the gene C3, which inhibits cancer in HCC, was found to be a biomarker candidate for distinguishing early HCC from cirrhosis. hsa_circ_0001445 (related gene SMARCA5, 4/5 found in the experiment), hsa_circ_0001727 (related gene ZKSCAN1, 4/5 found in the experiment), chr12:23998917|24048958 (related gene SOX5, 5/5 found in the experiment) and chr16:72090429|72093087 (related gene HP, 4/5 found in the experiment) were down-regulated, which was consistent with the results of related papers. hsa_circ_0084615 (related gene ASPH, 4/5 found in the experiment) and hsa_circ_0001361 (related gene FNDC3B, 3/5 found in the experiment) were up-regulated, which was also consistent with the results of related papers. Details are shown in Table 6.

Summary and future directions
We sequenced the circRNAs of hepatocytes and constructed a new database, CCRDB. Using the CCRDB and its analysis tools, we further studied circRNAs and the relationship between circRNAs and HCC. The database is of great significance for researchers who wish to further analyze the rules of circRNAs, to understand the roles of circRNAs in disease and to search for target genes for therapeutic approaches. Researchers can easily add circRNA sequencing data from other organs to this database and use the comparative analysis tools to facilitate the discovery of new knowledge.
The future direction of development is to mine more circRNA data from the literature and from experiments in order to compile a more comprehensive database and to offer a variety of analytical functions, including verification of analysis results and intelligent tools based on artificial intelligence technology.