POSTAR2: deciphering the post-transcriptional regulatory logics

Abstract Post-transcriptional regulation of RNAs is critical to a diverse range of cellular processes. The volume of functional genomic data focusing on post-transcriptional regulatory logics has continued to grow in recent years. In the current database version, POSTAR2 (http://lulab.life.tsinghua.edu.cn/postar), we included the following new features and data: updated ∼500 CLIP-seq datasets (∼1200 CLIP-seq datasets in total) from six species, including human, mouse, fly, worm, Arabidopsis and yeast; added a new module ‘Translatome’, which is derived from Ribo-seq datasets and contains ∼36 million open reading frames (ORFs) in the genomes from the six species; updated and unified post-transcriptional regulation and variation data. Finally, we improved the web interfaces for searching and visualizing protein–RNA interactions with multi-layer information. We also merged our CLIPdb database into POSTAR2. POSTAR2 will help researchers investigate the post-transcriptional regulatory logics coordinated by RNA-binding proteins and the translational landscape of cellular RNAs.


INTRODUCTION
RNA-binding proteins (RBPs) control every aspect of post-transcriptional regulatory logics, including maturation, localization, degradation, modification, editing and translation of cellular RNAs (1)(2)(3). Several high-throughput sequencing technologies exist for determining RBP-binding sites and translational dynamics in vivo, most notably ultraviolet crosslinking followed by immunoprecipitation and sequencing (CLIP-seq) (4,5) and ribosome profiling (Ribo-seq) (6). In recent years, CLIP-seq and Ribo-seq have been widely used to decipher the post-transcriptional regulatory logics coordinated by RBPs and the translational landscape of cellular RNAs in various species.
CLIP-seq studies have identified RBP-binding sites in a broad set of cell and tissue types from various species (7,8). In addition, large numbers of gene expression profiles, RNA modification sites, RNA editing sites and disease-associated variants have been identified, thanks to large-scale genomics studies and the development of bioinformatics algorithms. The regulatory mechanisms by which RBP-binding sites underlie diseases and phenotypes can be revealed by combining information from RBP binding, other post-transcriptional regulatory events and genomic variation. Ribo-seq is a powerful technology for measuring translation efficiency by mapping ribosome-binding positions across the transcriptome at sub-codon resolution (6,9). Previous studies have shown that translation efficiency and translational dynamics can be regulated by RBP binding (2,10,11). However, integrating these large-scale datasets to explore the coupling between post-transcriptional and translational regulation remains a great challenge.
Here, we developed POSTAR2 by systematically identifying RBP-binding sites derived from more CLIP-seq datasets, and predicting open reading frames (ORFs) using larger-scale Ribo-seq datasets from six species, including human, mouse, fly, worm, Arabidopsis and yeast. POSTAR2 provides an updated interactive user interface for searching and visualizing RNA-protein interactions and ORFs from various tissue types, cell lines, developmental stages and conditions. Moreover, by integrating microRNA (miRNA)-binding sites, RNA modification sites, RNA editing sites, single nucleotide polymorphisms (SNPs), genome-wide association study (GWAS) variants and cancer somatic mutations, POSTAR2 can be used to explore the potential associations between RBP-binding sites and these data. POSTAR2 significantly expands data collection to more species, and could be useful for investigating the post-transcriptional regulatory logics.

Collection of CLIP-seq datasets
POSTAR was developed to house and distribute RBP-binding sites from human and mouse (12). To expand and update our database, we manually collected newly published CLIP-seq data from the Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA) databases (13). At present, POSTAR2 contains a large set of RBP-binding sites derived from CLIP-seq datasets and covers six species, including human, mouse, worm, fly, Arabidopsis and yeast (Figure 1 and Table 1). We first obtained the processed datasets in human and mouse from POSTAR (12), and the processed datasets in worm and yeast from CLIPdb (7). In addition, we collected 298 new datasets of the six species from recent publications. We also updated 332 eCLIP-seq datasets released by the ENCODE consortium (14,15). In total, POSTAR2 contains 1160 CLIP-seq datasets, which cover 284 RBPs from six species (Figure 2A). To our knowledge, this is the largest collection of RBP-binding sites identified from various CLIP-seq technologies, including HITS-CLIP, PAR-CLIP, iCLIP, eCLIP and PIP-seq (Supplementary File S1 and Supplementary File S2).

Identification of RBP-binding sites
For the newly collected CLIP-seq datasets, we used the uniform preprocessing pipeline from CLIPdb (7) to preprocess the raw data. Briefly, we first trimmed the adaptor sequences from the raw reads using the FASTX-Toolkit package (http://hannonlab.cshl.edu/fastx_toolkit). We retained only reads with a quality score above 20 in 80% of their nucleotides. Reads shorter than 13 nt after adaptor trimming were discarded. Finally, we collapsed identical reads to minimize polymerase chain reaction duplicates.
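The filtering and collapsing steps above can be sketched in a few lines of Python. This is only an illustration of the logic, not the CLIPdb pipeline itself, which relies on FASTX-Toolkit. The function names and the input layout (adaptor-trimmed reads given as sequence and Phred-score pairs) are assumptions made for the example, and "above 20" is read here as a score of at least 20.

```python
from collections import Counter

MIN_QUALITY = 20   # Phred threshold stated in the text (treated as >= 20; an assumption)
MIN_PERCENT = 80   # percentage of bases that must reach MIN_QUALITY
MIN_LENGTH = 13    # reads shorter than this after trimming are discarded


def passes_quality(qualities, min_quality=MIN_QUALITY, min_percent=MIN_PERCENT):
    """Return True if at least min_percent of the bases reach min_quality."""
    if not qualities:
        return False
    good = sum(1 for q in qualities if q >= min_quality)
    # Integer comparison avoids floating-point edge cases at exactly 80%.
    return good * 100 >= min_percent * len(qualities)


def preprocess(reads):
    """Filter and collapse adaptor-trimmed reads.

    reads: iterable of (sequence, qualities) pairs (hypothetical layout).
    Returns a dict mapping each retained unique sequence to its copy number,
    so that identical reads (likely PCR duplicates) are counted once.
    """
    counts = Counter()
    for sequence, qualities in reads:
        if len(sequence) < MIN_LENGTH:
            continue
        if not passes_quality(qualities):
            continue
        counts[sequence] += 1
    return dict(counts)
```

Keeping the copy number of each collapsed sequence is a design choice of this sketch; it allows the duplication level to be inspected later without re-reading the raw data.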
After preprocessing, the retained reads were aligned to their respective genomes using Bowtie (16) and NovoAlign (http://www.novocraft.com). Notably, to keep the genomic coordinates of the binding sites consistent between the newly collected data and the data already available in POSTAR, we used the same genome versions as in POSTAR for read alignment, i.e. human (hg19) and mouse (mm10), together with the genomes for four additional species, i.e. worm (ws235), yeast (R64-1-1), fly (dmel-r6.18) and Arabidopsis (TAIR10). We then used both CLIP technology-specific and nonspecific tools to identify binding sites for each dataset. Briefly, we used Piranha (17) to identify binding sites for HITS-CLIP, PAR-CLIP and iCLIP datasets with the parameters -b 20 -d ZeroTruncatedNegativeBinomial -p 0.01. We also applied CLIP technology-specific tools for binding site identification with default parameters: PARalyzer (18) for PAR-CLIP datasets, CIMS (19) for HITS-CLIP datasets and CITS (a module in the CIMS software) (19,20) for iCLIP datasets. The binding site coordinates from HITS-CLIP, PAR-CLIP, iCLIP and PIP-seq, which are based on human genome hg19, were converted to hg38 using the UCSC liftOver tool. As for eCLIP, the hg38-based binding sites were directly downloaded from the ENCODE data portal (https://www.encodeproject.org/, November 2017). Finally, we identified millions of RBP-binding sites, and visualized the RBP-RNA interaction network in human (Figure 2B).
We collected RNA-seq datasets from the 12 human cell/tissue types and 10 mouse cell/tissue types that are used in the CLIP experiments (Supplementary File S3), and mapped the reads using TopHat (29), followed by estimating the expression level of the genes using Cufflinks (30). For the 30 developmental stages from fly, 35 developmental stages from worm, 4 tissue types from Arabidopsis and 3 conditions (wild-type, glucose starvation and nitrogen starvation) for yeast, we obtained the gene expression data from the Expression Atlas (31) and our previous paper (32). We prepared and intersected miRNA-binding sites, RNA modification sites, RNA editing sites, SNPs and disease-associated variants with RBP-binding sites according to the same computational pipeline used in POSTAR (12). The coordinates of these genomic regions for human build hg19 were also converted to hg38 using the UCSC liftOver tool.
We used the same strategy as in POSTAR (12) to predict sequence motifs and structural preferences of RBP-binding sites. Briefly, the binding sites from each CLIP-seq sample were separated into independent training and testing sets. Then, we used MEME (33) and HOMER (34) to identify and report up to five sequence motifs in the training set. Next, we calculated the enrichment of the initially detected motifs in the testing set using FIMO (35) and selected the three most enriched sequence motifs. The sequence motifs were visualized using WebLogo (36). To predict structural preferences of RBP-binding sites, the binding sites from each CLIP-seq sample were extended to at least 60 nt in length. We then used RNAcontext (37) to detect local structural motifs. The structural annotation used in RNAcontext included paired (P), hairpin loop (L), bulge/internal/multi-loop (M) and unstructured (U). In addition, we used RNApromo (38) to predict structural elements that are enriched within the RBP-binding sites (P-value <0.05).
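To make the intersection step concrete, the following Python sketch reports which single-nucleotide features (for example SNPs or editing sites) fall inside which RBP-binding sites. It is a simplified stand-in for the POSTAR pipeline, not a reproduction of it: the function name and tuple layouts are hypothetical, coordinates are assumed to be 0-based and half-open as in BED files, and strand is ignored for brevity.

```python
from collections import defaultdict


def intersect_with_binding_sites(binding_sites, features):
    """Pair each single-nucleotide feature with the binding sites that contain it.

    binding_sites: list of (chrom, start, end, rbp), 0-based half-open intervals.
    features: list of (chrom, pos, label), 0-based positions.
    Returns a list of (label, rbp) pairs, in the order the features were given.
    """
    # Group binding sites by chromosome so each feature is only compared
    # against sites on its own chromosome.
    sites_by_chrom = defaultdict(list)
    for chrom, start, end, rbp in binding_sites:
        sites_by_chrom[chrom].append((start, end, rbp))

    hits = []
    for chrom, pos, label in features:
        for start, end, rbp in sites_by_chrom.get(chrom, []):
            if start <= pos < end:  # half-open: the end coordinate is excluded
                hits.append((label, rbp))
    return hits
```

The nested loop is quadratic per chromosome, which is acceptable for a small example; at the scale of millions of sites an interval index or a dedicated tool would be the more practical choice.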
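The text does not specify how binding sites were extended to at least 60 nt. One plausible reading, sketched below in Python, pads short sites symmetrically around their centre and shifts the window if it would start before position 0. The function name, the symmetric padding and the half-open coordinates are all assumptions of this example; clamping at the chromosome end is left out.

```python
def extend_site(start, end, min_length=60):
    """Extend a 0-based half-open interval to at least min_length nt.

    Sites that are already long enough are returned unchanged. Shorter
    sites are padded on both sides (the extra base goes to the right when
    the padding is odd). Symmetric padding is an assumption, not a detail
    taken from the source.
    """
    length = end - start
    if length >= min_length:
        return start, end
    missing = min_length - length
    left = missing // 2
    right = missing - left
    new_start = start - left
    new_end = end + right
    if new_start < 0:
        # Shift the window right so it starts at 0 but keeps its length.
        new_end -= new_start
        new_start = 0
    return new_start, new_end
```

For example, a 20-nt site at positions 100 to 120 would become the 60-nt window from 80 to 140.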

Ribo-seq datasets collection and ORF identification
We collected 171 Ribo-seq datasets as well as matched RNA-seq datasets from the six species from the GEO and SRA databases (13) for translation efficiency (TE) calculation (Figure 2D; Supplementary File S4 and Supplementary File S5). For each Ribo-seq dataset, we took the reads overlapping annotated start codons and calculated the distance from the 5′ end of each read to the first nucleotide of the annotated start codon to infer the position of the peptidyl site (P-site) for each read length. Thereafter, we applied this offset to assign P-site positions to all reads of the same length and generated a P-site signal track for all transcripts based on the inferred P-site positions of the mapped reads. For each species, ORFs were predicted by scanning the transcript sequences: any AUG start codon paired with the nearest in-frame stop codon (UAA, UAG or UGA) was defined as an ORF. ORFs shorter than 300 nt were defined as small ORFs (sORFs). All predicted ORFs were further categorized into subtypes according to their position relative to the annotated ORFs (aORFs) (Figure 2E). In total, we identified ∼36 million ORFs among the six species, and the numbers of ORFs differed among categories and among the six species (Figure 2F). To identify translated ORFs across different tissue types, cell lines, developmental stages and conditions, we used several computational tools, including RiboWave (39), RiboTaper (40), ORFscore (41) and RibORF (42), to detect the pattern of 3-nt periodicity within each ORF, as well as the uneven distribution of reads among reading frames during translation. Default parameters were used for these tools.
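The two rule-based steps in this paragraph, ORF scanning and P-site offset inference, can be illustrated with a short Python sketch. It is a minimal example under stated assumptions rather than the code used to build POSTAR2: the function names are hypothetical, coordinates are 0-based on the transcript, ORF length is taken to include the stop codon, only reads covering the first nucleotide of the start codon are used, and the most frequent distance per read length is taken as the offset.

```python
from collections import Counter, defaultdict

STOP_CODONS = {"UAA", "UAG", "UGA"}
SORF_MAX_NT = 300  # ORFs shorter than this are labelled small ORFs (sORFs)


def find_orfs(transcript):
    """Pair every AUG with its nearest in-frame stop codon.

    Returns (start, end, is_sorf) tuples with 0-based half-open coordinates;
    end includes the stop codon (an assumption of this sketch).
    """
    seq = transcript.upper().replace("T", "U")
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "AUG":
            continue
        # Walk codon by codon in the frame set by this AUG.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                end = pos + 3
                orfs.append((start, end, end - start < SORF_MAX_NT))
                break
    return orfs


def infer_psite_offsets(reads, start_codon_pos):
    """Infer one P-site offset per read length.

    reads: iterable of (five_prime_pos, length) in transcript coordinates.
    For reads covering the first nucleotide of the annotated start codon,
    tally the distance from the read 5' end to that nucleotide and return
    the most frequent distance for each read length.
    """
    tallies = defaultdict(Counter)
    for five_prime, length in reads:
        if five_prime <= start_codon_pos < five_prime + length:
            tallies[length][start_codon_pos - five_prime] += 1
    return {length: counts.most_common(1)[0][0] for length, counts in tallies.items()}
```

Applying the inferred offset to every read of a given length (P-site position = 5′ position + offset) would then yield the P-site signal track described above.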

Translation efficiency and translation density calculation
Translation efficiency (TE) measures the rate at which messenger RNA is translated into protein, and can be estimated as the ratio between the RPKM values of Ribo-seq and RNA-seq (6). We calculated TE for different tissue types, cell lines, developmental stages and conditions. We used either the original Ribo-seq signal (raw data) or the periodic footprint denoised by RiboWave (39) (denoised data) as the estimate of Ribo-seq signal strength.
Translation density is determined by normalizing the abundance of Ribo-seq reads along an ORF by the length of the ORF, giving an estimate of the translation intensity of that ORF. We calculated translation density using both raw data (original Ribo-seq signal) and denoised data (RiboWave-derived footprint) as input, and present the results from both.
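As a worked illustration of the ratio described above, the Python sketch below computes RPKM with its standard definition (reads per kilobase per million mapped reads) and then TE as Ribo-seq RPKM divided by RNA-seq RPKM. The function names are hypothetical, and returning None when the RNA-seq RPKM is zero is a choice made for this example only.

```python
def rpkm(read_count, length_nt, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    if length_nt <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    return read_count * 1e9 / (length_nt * total_mapped_reads)


def translation_efficiency(ribo_rpkm, rna_rpkm):
    """TE as the ratio of Ribo-seq RPKM to RNA-seq RPKM.

    Returns None when the transcript has no RNA-seq signal, because the
    ratio is undefined in that case (a convention of this sketch).
    """
    if rna_rpkm <= 0:
        return None
    return ribo_rpkm / rna_rpkm
```

For instance, 500 footprints on a 1000-nt region in a library of 10 million mapped reads give an RPKM of 50; with an RNA-seq RPKM of 10 for the same region, the TE would be 5.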
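A minimal Python sketch of this normalization is shown below. It assumes the Ribo-seq signal is available as a per-nucleotide list along the transcript (either raw or denoised) and that the ORF is given with 0-based half-open coordinates; both the function name and this data layout are assumptions of the example.

```python
def translation_density(signal_track, orf_start, orf_end):
    """Sum of Ribo-seq signal within an ORF divided by the ORF length.

    signal_track: per-nucleotide signal along the transcript (raw counts or
    a denoised footprint). orf_start and orf_end are 0-based, half-open.
    """
    length = orf_end - orf_start
    if length <= 0:
        raise ValueError("ORF must have positive length")
    return sum(signal_track[orf_start:orf_end]) / length
```

Running the same function on the raw track and on the denoised track gives the two density values that are reported side by side.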

Database architecture
All data in POSTAR2 were processed and stored in a MySQL database (version 5.6.39). The client-side user interface was implemented using HTML5 and JavaScript libraries, including jQuery (http://jquery.com) and Bootstrap (http://getbootstrap.com). The server side was implemented using PHP scripts (version 5.6.39) and JavaScript. Plots of query results in POSTAR2 were generated with the plotly.js library (https://plot.ly) and Highcharts (https://www.highcharts.com). Tables of query results were produced with the DataTables JavaScript library (https://www.datatables.net), which allows users to search and sort results. Visualization was implemented using the UCSC Genome Browser. We have tested the website in several popular browsers, including Google Chrome, Safari, Internet Explorer and Firefox.

Web interface
POSTAR2 provides a user-friendly interface for searching and visualizing protein-RNA interactions with multi-layer information on post-transcriptional regulation, disease-associated variation and the translation landscape of RNAs. POSTAR2 contains three modules (Figure 1B): (i) the 'RBP' module; (ii) the 'RNA' module, consisting of several sub-modules including 'Binding sites', 'Crosstalk', 'Variation' and 'Disease'; and (iii) the 'Translatome' module. We briefly introduce each module below.
The 'RBP' module provides various annotations for the RBPs, including RNA recognition domains, RBP ontology, sequence motifs and structural preferences, as well as all the binding sites for the query RBP and enriched GO terms for the target genes (Figure 1C, lower-left panel).
As for the 'RNA' module (Figure 1C, upper panel), the 'Binding sites' sub-module provides all of the RBP-binding sites of the target gene, regardless of the CLIP-seq technology or peak calling method used. Furthermore, table and network views present the interactions between RBPs and target genes. We also collected multiple annotations for the target gene, including genomic location, associated diseases and expression patterns across different cell lines, tissue types, developmental stages or conditions. In addition, we defined 'RBP-binding hotspots' by counting the number of binding proteins in each 20-nt bin along the RNA precursor, which gives users an overview of the regions of each RNA precursor that are most heavily bound by RBPs. The 'Crosstalk' sub-module provides the interactions between RBP-binding sites and other post-transcriptional regulatory events, including miRNA targets, RNA modification and RNA editing (Figure 1B). Because RBPs participate in various steps and play vital roles in most post-transcriptional regulation processes, users can investigate potential crosstalk among these regulatory events in this module. To understand how various genomic variants affect RBP binding and cooperate to orchestrate post-transcriptional regulation, the 'Variation' sub-module and the 'Disease' sub-module integrate SNVs and disease-associated SNVs to provide insights into the causal SNVs underlying regulatory mechanisms and human diseases (Figure 1B).
In addition to the above two modules, we also built a new module, 'Translatome', for characterizing the translation landscape of RNAs (Figure 1C, lower-right panel). Users can choose a species (human, mouse, fly, worm, Arabidopsis or yeast) and enter a gene name to search. POSTAR2 returns a summary frame and three tables. The summary frame contains a histogram showing the number of ORFs in different categories and a heat map showing the density of each ORF across various samples. The three tables present aORFs, extended/truncated ORFs and other ORFs, respectively, and each ORF is labeled with the transcript ID, the relative reading frame of the ORF, the translation start site and the termination site. Users can also sort ORFs by length in these tables to screen for sORFs, which are shorter than 300 nt. Moreover, each ORF ID links to more details about the translation pattern of that ORF, including translation efficiency, translation density and the identified translated region of the ORF. A column diagram allows users to compare the translation state of the ORF across different tissue types, cell lines, developmental stages or conditions. In addition, users can select conditions of interest to simultaneously visualize the signal tracks of each ORF along its transcript.
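The hotspot definition above (number of binding proteins per 20-nt bin along the RNA precursor) can be sketched in Python as follows. This is an illustrative reading rather than the POSTAR2 implementation: the function name and input layout are hypothetical, coordinates are 0-based half-open, a binding site is counted in every bin it overlaps, and each RBP is counted at most once per bin.

```python
BIN_SIZE = 20  # bin width in nucleotides, as stated in the text


def rbp_hotspots(gene_start, gene_end, binding_sites, bin_size=BIN_SIZE):
    """Count distinct RBPs bound in each bin along a precursor RNA.

    binding_sites: iterable of (start, end, rbp) in the same coordinate
    system as gene_start and gene_end (0-based, half-open).
    Returns a list with one count per bin, ordered from gene_start.
    """
    n_bins = -(-(gene_end - gene_start) // bin_size)  # ceiling division
    rbps_per_bin = [set() for _ in range(n_bins)]
    for start, end, rbp in binding_sites:
        # Clip the site to the gene boundaries before binning.
        start = max(start, gene_start)
        end = min(end, gene_end)
        if start >= end:
            continue
        first_bin = (start - gene_start) // bin_size
        last_bin = (end - 1 - gene_start) // bin_size
        for index in range(first_bin, last_bin + 1):
            rbps_per_bin[index].add(rbp)
    return [len(rbps) for rbps in rbps_per_bin]
```

Using a set per bin means that several overlapping peaks of the same RBP, for example from different CLIP-seq technologies or peak callers, do not inflate the count for that bin.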

Example applications
We designed a user-friendly interface that provides a platform for connecting protein-RNA interactions with multi-layer information on post-transcriptional regulation and disease-associated variants, as well as the translation landscape of RNAs. Here, we use ADAM17 as an example application to demonstrate how to explore the potential regulatory mechanisms underlying human diseases.
ADAM17 encodes a membrane-bound protease, and a previous study demonstrated its role in tumorigenesis and invasiveness, especially in breast cancer (43). Using TCGA expression data (44), we observed overexpression of ADAM17 in most tumor samples compared with normal tissues. However, ADAM17 expression at the protein level and the potential regulatory mechanism remain unexplored. When we queried 'ADAM17' in the 'Translatome' module, POSTAR2 returned a histogram showing the numbers of categorized ORFs of ADAM17. Users can click on the ORF IDs for more details. The translation efficiency estimates and signal tracks reveal up-regulation at the translational level in tumor samples compared with normal samples. For instance, both raw data and denoised data showed up-regulated translation efficiency in tumor tissue compared with paired normal tissue of brain and kidney (Figure 3A). To help users understand the potential mechanisms that contribute to the overexpression of ADAM17 at the transcriptional and translational levels, POSTAR2 sheds light on the role of RBPs in its regulation. In the 'RNA' module, the many RBP-binding sites identified by different CLIP-seq technologies, the interaction network and the RBP-binding hotspots show that numerous RBPs are involved in the regulation of ADAM17 (Figure 3B). Among these RBPs, some, such as EIF3B, EIF3G and EIF4A3, are components of eukaryotic translation factor complexes, which suggests that the interactions of these RBPs may participate in the translational regulation of ADAM17. In addition, RBPs such as FUS, TARDBP and ELAVL1 may contribute to RNA stability, which could result in aberrant expression levels of RNAs or proteins. Finally, the output of the 'Disease' sub-module shows that many cancer mutations are located in the RBP-binding regions of ADAM17, especially in kidney tumors and brain tumors.

DISCUSSION AND FUTURE DIRECTIONS
POSTAR2 aims to decipher the post-transcriptional regulatory logics by integrating large-scale high-throughput sequencing datasets and other public resources. To our knowledge, POSTAR2 hosts the largest collection (∼40 million) of RBP-binding sites identified from CLIP-seq experiments, and enables the exploration of RNA-protein interactions together with other post-transcriptional regulatory events and genomic variations. Moreover, Ribo-seq data were incorporated and analyzed to reveal the translational dynamics of RNAs. POSTAR2 enables integrated navigation of RBP-binding sites with multi-layer information on post-transcriptional regulation, phenotypes and diseases, as well as the translational landscapes of RNAs.
In comparison with the previous version, POSTAR, POSTAR2 has the following novel features and improvements: (i) POSTAR2 integrates more CLIP-seq datasets from human and mouse. (ii) POSTAR2 includes CLIP-seq datasets from more species, including fly, worm, Arabidopsis and yeast. In total, we added and updated ∼500 CLIP-seq datasets in POSTAR2. (iii) POSTAR2 has a new module, 'Translatome', which provides ∼36 million ORFs in the genomes from the six species. (iv) POSTAR2 annotates the RBP-binding sites with updated functional data resources. For example, we updated ∼1 million RNA modification sites and RNA editing sites curated from other databases and publications (45)(46)(47); updated and added ∼20 million SNPs from the genomes of the six species (48), as well as the latest mutation-calling results for TCGA samples (49). Finally, POSTAR2 provides an updated interactive interface to facilitate the investigation and exploration of RNA-protein interactions and the translational landscape.
D210 Nucleic Acids Research, 2019, Vol. 47, Database issue
With advances in high-throughput sequencing, CLIP-seq and Ribo-seq technologies will be applied to more cell and tissue types in more species, and more functional genomics datasets will be generated. We will continue to integrate incoming data and improve the web interface for navigation and visualization. We will maintain and keep updating POSTAR2 to ensure it remains a valuable resource for the research community.

DATA AVAILABILITY
POSTAR2 is freely available at http://lulab.life.tsinghua.edu.cn/postar. The datasets in POSTAR2 can be downloaded and used in accordance with the GNU Public License and the licenses of their primary data sources.