SPONGEdb: a pan-cancer resource for competing endogenous RNA interactions

Abstract
MicroRNAs (miRNAs) are post-transcriptional regulators involved in many biological processes and human diseases, including cancer. The majority of transcripts compete over a limited pool of miRNAs, giving rise to a complex network of competing endogenous RNA (ceRNA) interactions. Currently, gene-regulatory networks focus mostly on transcription factor-mediated regulation, and dedicated efforts for charting ceRNA regulatory networks are scarce. Recently, it became possible to infer ceRNA interactions genome-wide from matched gene and miRNA expression data. Here, we inferred ceRNA regulatory networks for 22 cancer types and a pan-cancer ceRNA network based on data from The Cancer Genome Atlas. To make these networks accessible to the biomedical community, we present SPONGEdb, a database offering a user-friendly web interface to browse and visualize ceRNA interactions and an application programming interface accessible by accompanying R and Python packages. SPONGEdb allows researchers to identify potent ceRNA regulators via network centrality measures and to assess their potential as cancer biomarkers through survival, cancer hallmark and gene set enrichment analysis. In summary, SPONGEdb is a feature-rich web resource supporting the community in studying ceRNA regulation within and across cancer types.

1 Architecture of the Database

The database uses the InnoDB storage engine of MySQL Community Server (GPL), version 8.0.16, and runs on a Linux server. It contains information from TCGA (cancer information and pan-cancer analysis), ENCODE (additional gene information) and miRBase (additional miRNA information). The database is normalized to third normal form to prevent redundancies and inconsistencies. The main table is the dataset table. One dataset can have several runs with different parameters, which are specified in the run table. Fig. 1 shows the general architecture of the database.

Figure 1: The database architecture. The main table is the dataset table, in which a cancer type is specified. For one cancer type, more than one run with different SPONGE parameters can be defined, resulting in different ceRNA networks. The gene and miRNA tables list all genes and miRNAs found in the ceRNA networks, together with additional information about them.

Fig. 2 shows the database in detail. The dataset table contains information about the cancer types used in the analysis. The run table defines the parameters of a specific run of the SPONGE tool. One dataset can have various runs, which can differ in their parameters and in the targeted databases; the latter information is stored in the target databases table. Different parameters can lead to different ceRNA interaction networks as a result of the SPONGE tool. The ceRNA networks are saved in the interactions genegene table. Each gene-gene interaction has a P-value, an mscor and a correlation, and single ceRNA interactions can be selected by these characteristics. The network analysis table describes characteristics of the ceRNA interaction network, such as the normalized betweenness or the degree of each node. Furthermore, each ceRNA interaction can be annotated with the miRNAs contributing to it; this data is stored in interacting mirnas. Additional miRNA information is stored in the mirna table. The genes table contains additional information about the genes used in the ceRNA networks.

The input data for SPONGE are paired gene and miRNA expression values. These are stored in the expression data gene and expression data mirna tables. The survival rate table contains the survival rates for each gene in the different cancer types. It is supported by the survival pvalue table, which holds p-values from log-rank tests based on the raw survival analysis data. The patient information table contains information about the patients used in the survival analysis.

The occurence mirna table speeds up data access for the website by counting the occurrences of a miRNA within a run. The gene count table stores the total number of interactions one gene has in one run, as well as the number of significant interactions with a p-value below 0.05. Internal search indices speed up the database and therefore reduce the waiting time for requests, which matters especially given the size of the interactions genegene table. The tables gene ontology, wikipathways and hallmarks contain the respective data for links to, or information from, external resources.
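
The relationship between datasets, runs and gene-gene interactions described above can be sketched with a small in-memory example. This is only an illustration: it uses Python's built-in sqlite3 module instead of MySQL, the table and column names are hypothetical simplifications rather than the actual SPONGEdb schema, and the inserted rows are made-up placeholders. It shows how interactions might be selected by p-value and how the two numbers kept in the gene count table could be derived.

```python
import sqlite3

# Hypothetical, simplified schema: names are illustrative and are NOT
# taken from the actual SPONGEdb schema. SQLite stands in for MySQL.
SCHEMA = """
CREATE TABLE dataset (
    dataset_id INTEGER PRIMARY KEY,
    disease_name TEXT NOT NULL
);
CREATE TABLE run (
    run_id INTEGER PRIMARY KEY,
    dataset_id INTEGER NOT NULL REFERENCES dataset(dataset_id)
);
CREATE TABLE interactions_genegene (
    run_id INTEGER NOT NULL REFERENCES run(run_id),
    gene_a TEXT NOT NULL,
    gene_b TEXT NOT NULL,
    p_value REAL NOT NULL,
    mscor REAL NOT NULL,
    correlation REAL NOT NULL
);
-- Index on the large interaction table to speed up lookups per run.
CREATE INDEX idx_interactions_run_p
    ON interactions_genegene (run_id, p_value);
"""


def build_example_db():
    """Create an in-memory database filled with made-up placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO dataset VALUES (1, 'example cancer type')")
    conn.execute("INSERT INTO run VALUES (1, 1)")
    rows = [
        (1, "geneA", "geneB", 0.01, 0.20, 0.60),
        (1, "geneA", "geneC", 0.20, 0.05, 0.30),
        (1, "geneB", "geneC", 0.03, 0.10, 0.50),
    ]
    conn.executemany(
        "INSERT INTO interactions_genegene VALUES (?, ?, ?, ?, ?, ?)", rows
    )
    return conn


def significant_interactions(conn, run_id, p_cutoff=0.05):
    """Return interactions of one run with a p-value below the cutoff."""
    query = """
        SELECT gene_a, gene_b, p_value, mscor
        FROM interactions_genegene
        WHERE run_id = ? AND p_value < ?
        ORDER BY p_value
    """
    return conn.execute(query, (run_id, p_cutoff)).fetchall()


def gene_counts(conn, run_id, gene, p_cutoff=0.05):
    """Return (all interactions, significant interactions) for one gene."""
    query = """
        SELECT COUNT(*), COALESCE(SUM(p_value < ?), 0)
        FROM interactions_genegene
        WHERE run_id = ? AND (gene_a = ? OR gene_b = ?)
    """
    return conn.execute(query, (p_cutoff, run_id, gene, gene)).fetchone()
```

Precomputing the result of gene_counts per gene and run, as the gene count table does, avoids scanning the large interaction table for every website request.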

Architecture of the API in Detail
The API was built with Flask version 1.1.1 on Python 3.7. The Representational State Transfer Application Programming Interface (REST API) uses the HTTP methods GET, PUT, POST and DELETE. GET retrieves a resource. PUT changes the state of or updates a resource, which can be an object, file or block. POST creates a resource and DELETE removes it. A RESTful system consists of a client, which requests resources, and a server, which holds them. Several architectural constraints were considered, such as a 'Uniform Interface' (UI): each resource is uniquely identifiable through a single URL and is manipulated only through the underlying methods of the network protocol, such as GET. Moreover, all client-server operations are stateless, and any state management that is required takes place on the client. For the server we used a Flask application because it is flexible and minimalistic without losing power, and because URL routing is uncomplicated and easily extensible, which is a great advantage for maintenance when more endpoints, and thus more functionality, are added to the web resource. The static file server contains the data of all cancer types produced by SPONGE and a combined file with the significant interacting miRNAs added. The static file server can be accessed via the web page. The API is split into five head points, each of which groups several endpoints for accessing the data in the database. For the detailed architecture, see the documentation at https://exbio.wzw.tum.de/sponge-api/ui/.
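
The semantics of the four HTTP methods and the stateless, URL-addressed design described above can be illustrated with a small conceptual sketch. It deliberately does not use Flask and is not the SPONGEdb implementation: the store, the URLs and the status codes chosen here are assumptions made for illustration only.

```python
def make_store():
    """Hypothetical resource collection: each resource is keyed by its URL."""
    return {"/dataset/1": {"disease_name": "example cancer type"}}


def handle(store, method, url, body=None):
    """Apply one request to the store and return (status, payload).

    The function keeps no memory of earlier calls: everything needed to
    serve the request is contained in the request itself (statelessness).
    """
    if method == "GET":  # retrieve a resource
        if url in store:
            return 200, store[url]
        return 404, None
    if method == "POST":  # create a resource
        if url in store:
            return 409, None
        store[url] = body
        return 201, body
    if method == "PUT":  # update an existing resource
        if url not in store:
            return 404, None
        store[url] = body
        return 200, body
    if method == "DELETE":  # remove a resource
        if url in store:
            del store[url]
            return 204, None
        return 404, None
    return 405, None  # method not allowed
```

In a Flask application the same idea is expressed by binding a URL rule and a list of allowed methods to a view function, so adding an endpoint amounts to adding one more route.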
3 General use of the R and Python Package

R package: install.packages("spongeWebR")
Python package: pip install spongeWebPy

Both packages have the same functionality and the same function names, including parameter settings. The main difference is how list parameters such as ensg number, gene symbol, mimat number, hs number or sample ID are passed: the Python package uses lists, ["x"], and the R package uses vectors, c("x"). In the following we describe only the Python syntax; the R syntax is equivalent.
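
To make the role of list parameters more concrete, the following sketch shows one way a client could turn Python lists into a URL query string. It is not code from spongeWebPy: the helper names as_list and build_query are hypothetical, and joining list values with commas is an assumed convention, not a documented property of the SPONGEdb API.

```python
from urllib.parse import urlencode


def as_list(value):
    """Wrap a single value in a list; pass lists and tuples through."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def build_query(**params):
    """Build a query string; list values are joined with commas.

    The comma-separated encoding is an assumption for illustration.
    """
    pairs = []
    for name, value in params.items():
        items = as_list(value)
        if items:  # skip parameters that were not set
            pairs.append((name, ",".join(str(item) for item in items)))
    # Keep commas readable instead of percent-encoding them.
    return urlencode(pairs, safe=",")
```

Accepting both a single value and a list keeps calls short for the common one-gene case while still allowing batch queries.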
To start further analysis with SPONGE data, it is important to get an overview of the available disease types. This can be retrieved with: get_datasetInformation().
To retrieve all parameters of the SPONGE method that were used, in order to re-create published results for the cancer type of interest, use the following function: get(disease_name = "cancertype").
Another way to get an overview of the results is to search for a specific gene and see in which ceRNA interaction network the gene of interest contributes most: get_geneCount(gene_symbol = ["genesymbol"]).
The database also contains raw expression values and survival analysis data, which can be used, for example, to draw Kaplan-Meier plots (KMPs). This information can be accessed with package functions. To retrieve expression data use:

This file contains a combination of interacting miRNAs.zip and interactionNetwork.zip with the following headers: GeneA, GeneB, df, cor, pcor, mscor, p.val, p.adj, miRNA. Only the significant miRNAs are listed, as a comma-separated list.
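
A file with the headers listed above could be read roughly as follows. This is a sketch under assumptions: the tab delimiter is a guess (the source does not state the column separator), the function name read_interactions is hypothetical, and the two example rows are made-up placeholders rather than real SPONGE results.

```python
import csv
import io

# Made-up placeholder content; the tab delimiter is an assumption.
EXAMPLE_FILE = (
    "GeneA\tGeneB\tdf\tcor\tpcor\tmscor\tp.val\tp.adj\tmiRNA\n"
    "geneA\tgeneB\t2\t0.60\t0.40\t0.20\t0.001\t0.01\tmirna-1,mirna-2\n"
    "geneA\tgeneC\t1\t0.30\t0.25\t0.05\t0.100\t0.20\tmirna-3\n"
)


def read_interactions(handle, p_adj_cutoff=0.05):
    """Parse interaction rows and keep those with p.adj below the cutoff."""
    reader = csv.DictReader(handle, delimiter="\t")
    interactions = []
    for row in reader:
        if float(row["p.adj"]) >= p_adj_cutoff:
            continue
        interactions.append({
            "gene_a": row["GeneA"],
            "gene_b": row["GeneB"],
            "mscor": float(row["mscor"]),
            "p_adj": float(row["p.adj"]),
            # The miRNA column holds a comma-separated list.
            "mirnas": [m.strip() for m in row["miRNA"].split(",") if m.strip()],
        })
    return interactions


result = read_interactions(io.StringIO(EXAMPLE_FILE))
```

For a real download, io.StringIO would be replaced by an opened file handle.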

Website Details
The general design was made with the help of the Bootstrap framework, which provided basic structures such as the footer and the header, as well as many other predefined CSS classes. Additional JS libraries were installed to achieve different functionalities on the website. jQuery offers a multitude of JS functions and is a dependency of various other JS libraries. The DataTables library offers all operations needed to make big data tables easily manageable, such as filter and search operations. The Sigma js package provided the functionality for the networks. The Sigma js network visualizes gene-gene interactions; edges and nodes can be searched and coloured, and the created network can be downloaded. Furthermore, Force Atlas 2 was implemented to automatically group nodes in the network. We also used the graphing library Plotly.js to visualize additional information such as general database statistics or more detailed information about the genes, such as expression heatmaps and survival analysis. The website is designed as a single-page application (SPA), which means it is composed of one page. This approach has several advantages: only single components must be reloaded instead of all components of the website, and the website is easy to deploy and to version. The website consists of the following pages: Tutorial, Home, Browse, Info and Download.

... LGG", "BRCA", "CESC", "COAD", "ESCA", "HNSC", "CCSK", "KIRP", "LIHC", "LUAD", "LUSC", "OV", "PAAD", "PCPG", "PRAD", "SARC", "STAD", "TGCT", "THYM", "THCA", "UCEC")]
trace2 = go.Scatter(x=Xv, y=Yv, mode="markers+text", name="Markers and Text", text=["<b>PPP1R12B</b>", "<b>ABCA9</b>", "<b>DLC1</b>", "<b>TCF4</b>", "<b>LTBP2</b>", "<b>ARHGAP20</b>", "<b>ADGRA2</b>", "<b>LAMA2</b>", "<b>PLEKHH2</b>", "<b>FAT4</ ...

10 Analysis of ceRNA candidates with experimental evidence

Example code to produce the Tay et al. heatmap:

library(ggplot2)
library(spongeWeb)
library(plyr)
library(tidyr)
library(egg)
# get network measures for all genes for all cancer types
gene_symbols <- c("PTEN", "PTENP1", "VCAN", "CD34", "CNOT6L", "VAPA", "ZEB2", "RB1")
cancer_types <- sort ...

... LGG", "BRCA", "CESC", "COAD", "ESCA", "HNSC", "CCSK", "KIRP", "LIHC", "LUAD", "LUSC", "OV", "PANCAN", "PAAD", "PCPG", "PRAD", "SARC", "STAD", "TGCT", "THYM", "THCA", " ...
ax3.set_xlabel("")
ax3.set_ylabel("value", fontsize=15)
ax3.tick_params(axis="x", which="major", labelsize=15)
ax3.tick_params(axis="y", which="major", labelsize=13)
ax3.set_title("node degree", fontsize=17)
# add mean and median for all genes in database
all_mean = np.mean(network_measures["node_degree"])
all_median = np.median(network_measures["node_degree"])
ax3.hlines(y=all_mean, xmin=0, xmax=ax3.get_xlim()[1], linestyles="solid", color="red", label="mean")
ax3.hlines(y=all_median, xmin=0, xmax=ax3.get_xlim()[1], linestyles="dashed", color="red", label="median")
# add legend
handles, labels = ax1.get_legend_handles_labels()
fig.legend(handles, labels, loc='lower center')
# show figure
plt.show()