AREsite2: an enhanced database for the comprehensive investigation of AU/GU/U-rich elements

AREsite2 represents an update for AREsite, an on-line resource for the investigation of AU-rich elements (ARE) in human and mouse mRNA 3′UTR sequences. The new updated and enhanced version allows detailed investigation of AU, GU and U-rich elements (ARE, GRE, URE) in the transcriptome of Homo sapiens, Mus musculus, Danio rerio, Caenorhabditis elegans and Drosophila melanogaster. It contains information on genomic location, genic context, RNA secondary structure context and conservation of annotated motifs. Improvements include annotation of motifs not only in 3′UTRs but in the whole gene body including introns, additional genomes, and locally stable secondary structures from genome wide scans. Furthermore, we include data from CLIP-Seq experiments in order to highlight motifs with validated protein interaction. Additionally, we provide a REST interface for experienced users to interact with the database in a semi-automated manner. The database is publicly available at: http://rna.tbi.univie.ac.at/AREsite


INTRODUCTION
AU-rich elements (AREs) and GU- or U-rich elements (G/UREs) are sequence motifs found in many coding and non-coding RNAs. Upon interaction with RNA-binding proteins (RBPs) they can influence the half-life of RNA molecules. This interaction can induce RNA stabilization or destabilization, mediated by mechanisms that depend on the RBP and the genic motif context, but are otherwise not fully understood. The most prominent example is an important gene expression regulating mechanism known as AU-rich element mediated decay (AMD) (1).
However, AMD is not the only RNA stability regulating process that depends on successful RNA-RBP interaction. RBPs also interact with GU-rich elements (GREs) and U-rich elements (UREs), which have likewise been shown to modulate mRNA half-life (2–5).
So far, mostly protein coding genes have been shown to be regulated by these mechanisms, and only 3′UTR binding was shown to regulate mRNA half-life (6). Only recently was CLIP-Seq (7) introduced as a new method to identify RBP binding sites in a high-throughput manner. CLIP-Seq experiments identified many novel binding sites for RBPs involved in RNA regulation (see e.g. (4,8–10)), showing significant binding of RBPs in genic regions like introns or 5′UTRs, with unknown regulatory function. Furthermore, experiments show that binding sites often contain only partial matches with previously annotated motifs, such that a more relaxed view of motif preferences has become necessary. Therefore, the research community faces novel challenges regarding the investigation of RNA-RBP interplay beyond current paradigms. In silico methods play an important role in the identification of (novel) binding sites and the prediction of their regulatory role. Established databases like ARED (11), GRED (5), AURA (12) or the old AREsite (13) provide the user with information on motif location, accessibility and more, but are not designed to cope with more recent findings and high-throughput requests. AREsite focuses solely on 3′UTRs of protein coding genes, while ARED and GRED are very restricted regarding motifs. More than 40 citations and 45 000 visitors underline the need for new comprehensive bioinformatical resources in this re-

IMPROVEMENTS
AREsite2 accounts for recent developments by extending its analysis approach to the whole gene body, instead of restricting it to 3′UTRs or introns. The choice of region of interest remains with the user.
Furthermore, by applying more relaxed motif pattern definitions than e.g. ARED for annotation, we aim at a high coverage of experimentally validated and candidate binding sites relevant for interaction, dynamics and mechanisms of RNA-RBP interaction.
Experimentally validated binding sites are a solid basis for the detailed investigation of RNA elements that interact with proteins. To improve our annotation of motifs in this new release, we include binding sites from CLIPdb (14) pre-processed datasets for the prominent RBPs ELAVL1 (HuR), Zfp36 (TTP) and HNRNPD1 (Auf1) where available. Additionally, we will integrate new binding sites from experimental data when they become available, as we did for example with data from Mukherjee et al. (10).
AREsite was to our knowledge the first database including the local structuredness of ARE motif sites in terms of opening energies and accessibilities. As RNA secondary structure proves important for successful RNA-RBP interactions, we integrated RNAplfold (15) derived accessibilities also in this new release. To further improve this feature, AREsite2 incorporates stable secondary structures in overlap with annotated motifs from genome wide scans with RNALfoldZ (16,17). Z-score filtered locally stable RNA secondary structures were predicted for all included genomes and visualization is embedded using forna (18).
The comprehensive manual literature search of version 1 was automated by interaction with PUBMED via the ENTREZ API.
Information retrieval for the experienced user with the need for semi-automatic requests is now possible via a REST interface. Table 1 provides a short comparison of supported features and changes between AREsite in versions 1 and 2.
Furthermore the backend was changed to a relational database system, allowing dumps of the whole database to be retrieved by the user and easing maintenance and updates of the database with new experimental results, annotations and species.

Genomes and annotation
The following genomes were used for annotation of motifs and secondary structure prediction H. Gene and transcript annotation for all genomes was retrieved from ENSEMBL (19) version 79 via the ENSEMBL Perl API. AREsite2 contains A/G/URE annotations for ∼60 000 genes in H. sapiens, ∼43 000 genes in M. musculus, ∼35 000 genes in D. melanogaster, ∼17 000 genes in D. rerio and ∼47 000 genes in C. elegans, multiplying the information content compared to version 1.

Motifs
While the previous release of AREsite includes only motifs ranging from the ARE core motif ATTTA to its extended 13-mer version WWWWATTTAWWWW, recent experiments (4,8–10) have shown that this is not enough to cover the broad variation of RBP target motifs. With this new release we cover a far broader spectrum of AU/G/U-rich motifs. Because we no longer focus on 3′UTR regions only, but include the whole gene body as well as non-protein coding genes, the database has grown significantly in size. However, this vast increase in annotated motifs also means that more motifs without (known) regulatory function are now included in the database. To cope with that and improve the gain of knowledge, we decided to integrate experimentally validated target sites of TTP, HuR and Auf1, the most prominent RBPs involved in mRNA half-life regulation, and highlight them for the end user. To that purpose we used Bedtools (20) and extracted intersections of annotated mo-
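As a rough illustration of how IUPAC-coded motifs such as ATTTA or WWWWATTTAWWWW can be located in a sequence and compared against validated binding sites, consider the following Python sketch. It is not the AREsite2 annotation pipeline: the function names, the toy sequence and the binding-site interval are hypothetical, and the overlap test is a plain-Python stand-in for what an interval tool such as Bedtools provides.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes (DNA alphabet).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "S": "[CG]", "R": "[AG]", "Y": "[CT]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}

def motif_to_regex(motif):
    """Compile an IUPAC motif into a regex that also reports overlapping hits."""
    body = "".join(IUPAC[base] for base in motif.upper())
    # A lookahead consumes no characters, so overlapping matches are found.
    return re.compile(f"(?=({body}))")

def find_motifs(sequence, motif):
    """Return 0-based, half-open (start, end) intervals of all motif hits."""
    pattern = motif_to_regex(motif)
    seq = sequence.upper().replace("U", "T")  # accept RNA or DNA input
    return [(m.start(), m.start() + len(motif)) for m in pattern.finditer(seq)]

def overlaps(hit, site):
    """True if two half-open intervals share at least one position."""
    return hit[0] < site[1] and site[0] < hit[1]

seq = "GCATTTATTTAGC"            # toy sequence, not taken from the database
hits = find_motifs(seq, "ATTTA")
clip_sites = [(8, 12)]           # hypothetical validated binding site
flagged = [(h, any(overlaps(h, s) for s in clip_sites)) for h in hits]
```

Half-open intervals are used here because they match the coordinate convention of the BED format, in which the annotated motifs are offered for download.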

Structural context
The secondary structure of an RNA molecule influences the binding probability of RBPs. Most RBPs are, for example, known to prefer single-stranded RNA molecules for interaction. Thus, we applied RNAplfold to predict the probabilities of being unpaired for stretches ±20 nt around annotated motifs. As in version 1 of AREsite, results of this analysis are rendered as downloadable SVGs and help to check the accessibility of motifs of interest for RBPs. Furthermore, we integrated the results of genome wide RNALfoldZ screens for locally stable RNA secondary structures. Overlaps of annotated motifs with Z-score filtered stable structures were predicted for all included genomes and are part of the output. If overlaps are found, the user can investigate the structure via a linkout to forna (18).
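The relation between an unpaired probability and the corresponding opening energy can be sketched in a few lines of Python. This is an illustration only and does not call RNAplfold: the helper names are hypothetical, the window is simply clamped to the sequence ends, and the conversion uses the standard thermodynamic relation ΔG_open = −RT ln(p_unpaired) at an assumed temperature of 37 °C.

```python
import math

GAS_CONSTANT = 1.98717e-3  # kcal / (mol * K)

def flanked_window(start, end, seq_len, flank=20):
    """Extend a 0-based, half-open motif interval by `flank` nt on each side,
    clamped to the sequence boundaries."""
    return max(0, start - flank), min(seq_len, end + flank)

def opening_energy(p_unpaired, temperature_c=37.0):
    """Free energy (kcal/mol) needed to make a stretch unpaired: -RT ln(p)."""
    if not 0.0 < p_unpaired <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    rt = GAS_CONSTANT * (temperature_c + 273.15)
    return -rt * math.log(p_unpaired)

# Hypothetical motif at positions 100..105 of a 1000 nt transcript.
window = flanked_window(100, 105, 1000)
energy = opening_energy(0.5)  # a stretch that is unpaired half of the time
```

A probability close to 1 gives an opening energy close to 0 kcal/mol, that is, an accessible motif; small probabilities translate into a high energetic cost for making the site available to an RBP.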

Conservation
Information on the conservation of the region of interest is provided at two levels. For the gene of interest, we plot ENSEMBL (19) GERP (21) conservation scores for the whole gene body where available. For annotated motifs, we additionally provide multiple sequence alignments, retrieved from ENSEMBL genomic alignments where available, to visualize conservation on a per motif scale.

Literature
The ENTREZ API makes it possible to programmatically fetch publications from PUBMED for a given search string. This allows us to retrieve publications for each gene of interest in the context of A/U/GRE motifs and binding proteins, respectively. The main advantage of automatically retrieved publications is that we stay up-to-date with PUBMED. For convenience and transparency, the user can follow the link to PubMed, which contains the search string used, to query PUBMED manually.
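A minimal sketch of how such a query could be assembled against the NCBI E-utilities `esearch` endpoint is given below. The search string is a hypothetical example and not the one used by AREsite2 (that one is visible in the PubMed link of the results page); the function name and defaults are likewise assumptions. The sketch only builds the request URL and performs no network access.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_search_url(gene, context_terms, retmax=20):
    """Build an esearch URL for a gene in the context of motif or RBP terms."""
    # Hypothetical search string: gene name AND any of the context terms.
    term = f"{gene} AND ({' OR '.join(context_terms)})"
    params = {
        "db": "pubmed",       # search the PubMed database
        "term": term,
        "retmax": retmax,     # maximum number of returned PubMed IDs
        "sort": "pub_date",   # newest publications first
        "retmode": "json",
    }
    return f"{ESEARCH}?{urlencode(params)}"

url = pubmed_search_url("CXCL2", ["AU-rich element", "ELAVL1"])
```

Fetching the resulting URL returns a list of PubMed identifiers, which can then be resolved to full records with further E-utilities calls.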

Statistics
At http://rna.tbi.univie.ac.at/AREsite/statistics, we provide an interface for the user to request the number of genes containing at least one motif of interest in their gene body. The generated bar plot illustrates how many genes contain the selected motif in either intronic or exonic parts of 3′UTR, 5′UTR, CDS and total.
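The kind of summary behind such a bar plot can be illustrated with a short Python sketch. The record layout (gene identifier, region label), the region names and the toy data are assumptions made for this example and do not reflect the actual database schema.

```python
from collections import defaultdict

def genes_per_region(hits):
    """Count distinct genes with at least one motif hit in each region.

    `hits` is an iterable of (gene_id, region) pairs, one per annotated motif.
    """
    genes = defaultdict(set)
    for gene, region in hits:
        genes[region].add(gene)   # sets ignore repeated hits in one gene
        genes["total"].add(gene)  # a gene counts once, whatever the region
    return {region: len(ids) for region, ids in genes.items()}

# Toy motif hits with hypothetical gene identifiers and region labels.
toy_hits = [
    ("geneA", "3UTR_exonic"),
    ("geneA", "3UTR_exonic"),
    ("geneA", "CDS_exonic"),
    ("geneB", "3UTR_exonic"),
    ("geneC", "5UTR_intronic"),
]
counts = genes_per_region(toy_hits)
```

Because a gene can carry motifs in several regions, the total is generally smaller than the sum over the individual regions.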

RESULTS
This section explains example output from AREsite2 for the gene Cxcl2 in Homo sapiens. If a search for the motifs ATTTA, WWTTTWW, GTTTG, TTTGTTT and AWTAAA is started, database entries are provided for the user as SVG plots and HTML5 tables. For visualization we use the R (R Core Team (2015)) package Gviz. The output begins with information on the genomic location of the searched gene. Figure 1A presents the ideogram of hg38 chromosome 4 with the position of Cxcl2 highlighted. Figure 1B visualizes the gene body and known transcripts of Cxcl2 as annotated by ENSEMBL. Annotated motifs are highlighted in Figure 1C and colored accordingly if overlapping experimental data was available (see section Motifs). All of these figures contain a link to the ENSEMBL genome browser, where selected motifs are made available as custom tracks. ENSEMBL (19) GERP (21) conservation scores for the whole gene body are visualized in Figure 1D where available. The search for more sequence patterns and parsing of the whole gene body leads to an increase in predicted motifs. Table 2 shows a comparison of genes per genome containing at least one core ARE (AUUUA), GRE (GUUUG) and URE (UUUUU). To cope with these massive numbers and help users to filter potentially interesting candidates, we provide the second part of the results section. The first table (Figure 2A) provides information on the genomic and genic location of an annotated motif, as well as experimental evidence for RBP interaction, if available. Accessibility of motifs, or their occupation by overlapping stable secondary structures, can be seen in the next table (Figure 2B). Detailed conservation information for each motif can be derived as a multiple sequence alignment from the third table (Figure 2C). The concluding table provides the results of the literature search, sorted by newest publications (Figure 2D). All tables are searchable, and content can be downloaded by the user.

CONCLUSIONS AND PERSPECTIVES
AREsite2 presents a major update to AREsite, including three additional genomes and a large number of newly annotated motifs. Furthermore, the new backend allows for easier integration of more genomes, other motifs, experimental and structure data. We provide the whole database as a MySQL dump and all annotated motifs in bed, bed12 and gtf format for download. The RESTful service makes it easy for advanced users to retrieve information in a semi-automatic manner without the need to download any of these files. An example script for that purpose is included in the supplementary data; the most recent version can be downloaded directly from the website. We aim to integrate more experimental data as soon as they become available, either through CLIPdb, or directly from source if feasible.

AVAILABILITY
The database is publicly available at: http://rna.tbi.univie.ac.at/AREsite. An example script for interaction with the REST interface, a database dump and motif annotation as bed, bed12 and gtf files are available at: http://rna.tbi.univie.ac.at/AREsite/bulk.