SomamiR 2.0: a database of cancer somatic mutations altering microRNA–ceRNA interactions

SomamiR 2.0 (http://compbio.uthsc.edu/SomamiR) is a database of cancer somatic mutations in microRNAs (miRNA) and their target sites that potentially alter the interactions between miRNAs and competing endogenous RNAs (ceRNA) including mRNAs, circular RNAs (circRNA) and long noncoding RNAs (lncRNA). Here, we describe the recent major updates to the SomamiR database. We expanded the scope of the database by including somatic mutations that impact the interactions between miRNAs and two classes of non-coding RNAs, circRNAs and lncRNAs. Recently, a large number of miRNA target sites have been discovered by newly emerged high-throughput technologies for mapping the miRNA interactome. We have mapped 388 247 somatic mutations to the experimentally identified miRNA target sites. The updated database also includes a list of somatic mutations in the miRNA seed regions, which contain the most important guiding information for miRNA target recognition. A recently developed webserver, miR2GO, was integrated with the database to provide a seamless pipeline for assessing functional impacts of somatic mutations in miRNA seed regions. Data and functions from multiple sources including biological pathways and genome-wide association studies were updated and integrated with SomamiR 2.0 to make it a better platform for functional analysis of somatic mutations altering miRNA–ceRNA interactions.

The information guiding miRNA target recognition is mainly encoded in the seed regions. Mutations that alter the seed region of an miRNA may have large functional effects because they may disrupt the interactions between the miRNA and many of its original targets and may create interactions with new targets (1,11,31–33). We recently developed a web server, miR2GO, to assess the functional impacts of mutations in miRNA seed regions (34). The updated SomamiR database uses miR2GO as a tool for functional analysis of miRNA seed mutations.
When we created the SomamiR database in 2012, only a small number of miRNA target sites had been identified experimentally and no somatic mutations had been mapped to those sites. Two recent advances have since made it possible to map hundreds of thousands of somatic mutations to experimentally identified miRNA target sites on various ceRNAs. The first is that a large number of miRNA target sites have been identified by newly emerging high-throughput technologies such as PAR-CLIP (photoactivatable-ribonucleoside-enhanced crosslinking and immunoprecipitation) (35,36), HITS-CLIP (high-throughput sequencing of RNA isolated by crosslinking immunoprecipitation) (37–39) and CLASH (crosslinking, ligation and sequencing of hybrids) (40–42). The second is the very rapid growth in the number of somatic mutations discovered by whole-genome sequencing of many types of cancers.
In summary, the large amount of newly available data on miRNA and ceRNA sequences, the miRNA interactome and cancer genome sequences has enabled us to perform a major update of the SomamiR database and make it a more useful and complete resource for analyzing the functional impacts of cancer somatic mutations in miRNAs and their target sites.

Somatic mutations in miRNA sequences
Genomic locations of pre-miRNAs and mature miRNAs were downloaded from miRBase (51,52). We found 987 miRNAs containing 2423 somatic mutations in 50 different types of cancers. Among these somatic mutations, 644 were in mature miRNAs and 1779 were in pre-miRNAs. The somatic mutations mapped to miRNA seed regions are expected to be the most consequential because seed complementarity is found in most functional miRNA–target interactions (53). We found 181 somatic mutations in miRNA seeds. We created links that automatically send these somatic mutations as queries to the miR2GO web server, so that users can easily perform functional analysis of the seed mutations (Figure 1).
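The classification of a mutation as falling in the seed, the rest of the mature miRNA, or the remaining pre-miRNA can be sketched as follows. This is a minimal illustration and not the pipeline used to build the database: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and the seed is assumed to span nucleotides 2–8 from the 5′ end of the mature miRNA, a common convention that the text above does not specify.

```python
def seed_interval(mature_start, mature_end, strand, seed_first=2, seed_last=8):
    """Return the 1-based inclusive genomic (start, end) of the seed.

    Assumption: the seed covers miRNA positions seed_first..seed_last,
    counted from the 5' end of the mature miRNA.
    """
    if strand == "+":
        return mature_start + seed_first - 1, mature_start + seed_last - 1
    if strand == "-":
        # On the minus strand the 5' end sits at the highest coordinate.
        return mature_end - seed_last + 1, mature_end - seed_first + 1
    raise ValueError(f"unknown strand: {strand!r}")


def classify_mutation(pos, pre, mature, strand):
    """Assign a genomic position to the most specific miRNA region.

    pre and mature are (start, end) tuples; categories are mutually exclusive.
    """
    seed_start, seed_end = seed_interval(mature[0], mature[1], strand)
    if seed_start <= pos <= seed_end:
        return "seed"
    if mature[0] <= pos <= mature[1]:
        return "mature"
    if pre[0] <= pos <= pre[1]:
        return "pre-miRNA"
    return "outside"
```

Here a seed mutation is reported only as "seed", so counts of mature-miRNA mutations that include seed mutations would need the two categories added together.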

Somatic mutations in experimentally identified miRNA target sites: CLASH
We downloaded the miRNA target sites identified by CLASH experiments from starBase (54). The genomic coordinates of the target sites were mapped to human genome build hg38 by applying the liftOver utility from the UCSC Genome Browser (55,56). Somatic mutations in the genomic locations of the target sites were then collected from the COSMIC database (57). We found 31 863 somatic mutations located in the miRNA target sites on 4553 mRNAs and 9 lncRNAs.
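Collecting the mutations that fall inside target sites amounts to a point-in-interval lookup per chromosome. The sketch below shows one simple way to do this with sorted start coordinates and binary search; it is illustrative only, the function names are hypothetical, and it assumes that both target sites and mutation positions have already been converted to 1-based inclusive hg38 coordinates.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_index(sites):
    """Index target sites by chromosome.

    sites: iterable of (chrom, start, end, site_id), 1-based inclusive.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, site_id in sites:
        by_chrom[chrom].append((start, end, site_id))
    index = {}
    for chrom, items in by_chrom.items():
        items.sort()
        starts = [start for start, _, _ in items]
        # The longest site bounds how far left a containing site can start.
        max_len = max(end - start + 1 for start, end, _ in items)
        index[chrom] = (starts, items, max_len)
    return index


def sites_hit(index, chrom, pos):
    """Return the ids of all target sites that contain position pos."""
    if chrom not in index:
        return []
    starts, items, max_len = index[chrom]
    lo = bisect_left(starts, pos - max_len + 1)
    hi = bisect_right(starts, pos)
    return [site_id for start, end, site_id in items[lo:hi] if start <= pos <= end]
```

Because miRNA target sites are short, bounding the search window by the longest site keeps each lookup cheap; for intervals of very uneven length an interval tree would be a more suitable structure.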

Somatic mutations in experimentally identified miRNA target sites: PAR-CLIP and HITS-CLIP
We downloaded the miRNA target sites identified in 21 PAR-CLIP and 13 HITS-CLIP experiments from starBase (54). We then used liftOver to convert the genomic coordinates of the target sites to the current human genome build (hg38). We then searched the genomic locations of the target sites for somatic mutations from the COSMIC database (57). Somatic mutations were found in the miRNA target sites on each of the three types of ceRNAs (  (59). After applying liftOver, the transcript locations were then compared against the target site locations to identify the target sites on circRNAs.

Somatic mutations in predicted miRNA target sites
Somatic mutations in predicted miRNA target sites were also identified on the three classes of ceRNAs (Table 2):
1. Target sites on mRNAs: The genomic coordinates of the 3′ UTRs for all the RefSeq genes were downloaded from the UCSC Table Browser (55). Somatic mutations in the 3′ UTRs were then collected from COSMIC by comparing the genomic locations. Mature miRNA sequences were downloaded from miRBase (release 21) and 3′ UTR sequences were downloaded from the UCSC Table Browser. For each somatic mutation in a 3′ UTR, a reference and a mutated sequence were scanned for perfect sequence complementarity with the six classes of miRNA seeds described by Ellwanger et al. (60). The TargetScan context+ score (53) and the PITA score (61) were provided for assessing the impacts of somatic mutations on miRNA binding.

2. Target sites on lncRNAs: We downloaded all the lncRNA transcript sequences in FASTA format from LNCipedia (58). Transcript locations in the human genome (hg38) were then compared against COSMIC data to identify somatic mutations in lncRNA transcripts. We then determined the effects of somatic mutations on lncRNA–miRNA target sites by applying the six seed matches (60) and TargetScan to the lncRNA sequences.
3. Target sites on circRNAs: circRNA sequences were downloaded from circBase (59). Somatic mutations were compared against the transcript locations of circRNAs in human genome build hg38. The alterations of target site binding caused by somatic mutations were determined by using the six seed matches (60) and TargetScan.
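The comparison of a reference and a mutated sequence for perfect seed complementarity can be illustrated with a short sketch. It is not the authors' implementation: the function names are hypothetical, the default seed windows (miRNA positions 2–7 and 2–8) are illustrative and do not reproduce the six seed classes of reference 60, and the mutation is assumed to be a substitution so that positions in the two sequences are directly comparable.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def to_rna(seq):
    """Normalize a DNA or RNA string to upper-case RNA."""
    return seq.upper().replace("T", "U")


def seed_site(mirna, first, last):
    """Target-side motif (5'->3') that pairs perfectly with miRNA positions first..last."""
    seed = to_rna(mirna)[first - 1:last]
    return seed.translate(COMPLEMENT)[::-1]


def find_sites(target, mirna, windows):
    """Return a set of (first, last, offset) for every perfect seed match in target."""
    target = to_rna(target)
    hits = set()
    for first, last in windows:
        motif = seed_site(mirna, first, last)
        offset = target.find(motif)
        while offset != -1:
            hits.add((first, last, offset))
            offset = target.find(motif, offset + 1)
    return hits


def mutation_effect(ref_seq, alt_seq, mirna, windows=((2, 7), (2, 8))):
    """Report seed matches lost and gained when ref_seq is replaced by alt_seq."""
    ref_hits = find_sites(ref_seq, mirna, windows)
    alt_hits = find_sites(alt_seq, mirna, windows)
    return {
        "disrupted": sorted(ref_hits - alt_hits),
        "created": sorted(alt_hits - ref_hits),
    }
```

For example, with the let-7a sequence UGAGGUAGUAGGUUGUAUAGUU, the motif returned for positions 2–8 is CUACCUC, and a substitution inside that motif in the target is reported under "disrupted". Scores such as TargetScan context+ or PITA are computed by their own tools and are not approximated here.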

Biological pathways impacted by somatic mutations in miRNA target sites
The KEGG pathways (62) were downloaded from the 'keggPathway' table of the UCSC Table Browser. We found 20 020 genes in 199 pathways that contain somatic mutations in their miRNA target sites. The KEGG API is used to display the biological pathways. The genes with somatic mutations in miRNA target sites are highlighted in the pathways (Figure 1).

Genes associated with cancer risk that contain miRNA related somatic mutations
There has been rapid growth in genome-wide association studies (GWAS) and candidate gene association studies (CGAS) in the past few years. Newly available GWAS results were processed for the update and 2500 new gene–phenotype associations were added to the database. High-scoring markers associated with cancer phenotypes were collected from the UCSC Table Browser (55), the NHGRI GWAS Catalog (63) and the Cancer GAMAdb (64).

DATABASE ACCESS AND USAGE
The content of the SomamiR database is accessible through its browse and search interfaces. Six browsing options are provided on the database homepage. Users can browse the somatic mutations in miRNA sequences, the somatic mutations in the miRNA target sites identified from the CLASH, PAR-CLIP and HITS-CLIP experiments, the somatic mutations in predicted miRNA target sites and the KEGG biological pathways (62) in which genes with somatic mutations in their miRNA target sites are highlighted. Users can also browse the genes associated with cancer phenotypes in genome-wide association studies and candidate gene association studies.  Table S1). The entire database contents are downloadable as spreadsheet files from the download link on the database homepage. Detailed information about the SomamiR database is available on the help page of the database.

DISCUSSION
At the time we developed the first version of the SomamiR database, only a few thousand somatic mutations had been identified in the predicted miRNA–mRNA binding sites. The rapid drop in sequencing costs has led to very fast growth in the number of somatic mutations reported by cancer genome sequencing projects in the past few years. Moreover, the emergence of high-throughput miRNA interactome mapping technologies such as CLASH, PAR-CLIP and HITS-CLIP enabled large-scale experimental identification of miRNA target sites and thereby provided a source of highly reliable miRNA targets for the updated SomamiR database. We expect that the number of somatic mutations mapped to experimentally identified miRNA target sites will continue to increase rapidly in future releases of the SomamiR database.
Our knowledge of miRNA regulatory mechanisms has been greatly expanded by recent findings on ceRNA networks and crosstalk. Somatic mutations can disrupt and alter the ceRNA crosstalk and thereby contribute to the pathogenesis of cancers. It is very likely that the full scope and depth of miRNA regulation have yet to be discovered. This will provide the opportunity to further expand the scope of the SomamiR database by including the somatic mutations impacting new types of interactions involving miRNAs.
Nucleic Acids Research, 2016, Vol. 44, Database issue D1009
Both cancer genomics data and miRNA interactome data are growing very rapidly. It is becoming increasingly important to automate data processing and database updates. We developed a semi-automatic data curation pipeline for updating the database contents, which will make it easier to keep the database up to date.

SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.