APASdb: a database describing alternative poly(A) sites and selection of heterogeneous cleavage sites downstream of poly(A) signals

Increasing amounts of genes have been shown to utilize alternative polyadenylation (APA) 3′-processing sites depending on the cell and tissue type and/or physiological and pathological conditions at the time of processing, and the construction of genome-wide database regarding APA is urgently needed for better understanding poly(A) site selection and APA-directed gene expression regulation for a given biology. Here we present a web-accessible database, named APASdb (http://mosas.sysu.edu.cn/utr), which can visualize the precise map and usage quantification of different APA isoforms for all genes. The datasets are deeply profiled by the sequencing alternative polyadenylation sites (SAPAS) method capable of high-throughput sequencing 3′-ends of polyadenylated transcripts. Thus, APASdb details all the heterogeneous cleavage sites downstream of poly(A) signals, and maintains near complete coverage for APA sites, much better than the previous databases using conventional methods. Furthermore, APASdb provides the quantification of a given APA variant among transcripts with different APA sites by computing their corresponding normalized-reads, making our database more useful. In addition, APASdb supports URL-based retrieval, browsing and display of exon-intron structure, poly(A) signals, poly(A) sites location and usage reads, and 3′-untranslated regions (3′-UTRs). Currently, APASdb involves APA in various biological processes and diseases in human, mouse and zebrafish.


INTRODUCTION
Polyadenylation processing of pre-mRNAs is an essential step in the generation of mature mRNAs, and involves the coupling of site-specific cleavage and addition of a polyadenylated tail at the 3 -end of nascent mRNAs (1,2). Polyadenylation not only impacts the stability and export of mature mRNAs from the nucleus (3)(4)(5), but also contributes to the translation initiation and efficiency in the cytoplasm (6,7). The use of alternative polyadenylation (APA) cleavage sites for different biological processes and diseases allows a single gene to encode multiple mRNA transcripts variable in length, particularly in the 3 -untranslated regions . The cellular APA mechanism may change the final protein sequence due to cleavage sites located in the intron or internal exons. Tandem APA-sites in the last exon of mRNAs can be alternatively used in polyadenylation in much higher frequency to generate different tandem 3 -UTRs in length (8)(9)(10). Recently, tandem 3 -UTRs have been revealed to play increasingly important roles in regulating gene expression networks because they enable the loss and gain of cis-regulatory elements in 3 -UTRs of nascent mRNAs in different biological conditions and diseases, notably the microRNA seed sites and other binding sites of transcriptional factors (9,(11)(12)(13). Thus, a database describing the precise map and usage quantification of different APA sites on a genome-wide scale for all genes is urgently needed for better understanding the APA-directed regulation of gene expression for a given biological process.
Several APA-related existing databases, such as polyA DB2 and polyCdb, use transcript-genome alignments and expressed sequence tags (ESTs) to identify and D60 Nucleic Acids Research, 2015, Vol. 43, Database issue characterize putative 3 -processing sites (14)(15)(16). Thus, they are more restrictive in data scale and 3 -processing site identification due to the limited accumulation of cDNAs and ESTs, especially the lack of poly(A) sites located in introns and internal exons that may change the protein sequences. Another, UTRdb, is based on the careful parsing of EMBL/GenBank records, and focuses on sequences but not on 3 -processing sites and the corresponding poly(A) signals on a genome-wide scale (17). The previous polyA-seq data available as UCSC tracks, including the recent APADB, utilize the next-generation sequencing (NGS) technology to identify 3 -ends, but they contains limited APA-datasets derived from human tissues and lack for datasets on human diseases, especially the datasets on embryogenesis of zebrafish model organism (18,19).
Here we present a new web-accessible database of APA sites on a genome-wide scale, named APASdb, based on the APA datasets deeply profiled by sequencing alternative polyadenylation sites (SAPAS) method reported previously (8,9,20). The NGS-coupled SAPAS method is as accurate as the existing RNA sequencing (RNA-seq) approaches for digital gene expression. SAPAS is capable of high-throughput sequencing and quantifying the 3 -ends of polyadenylated transcripts, and further identifying the location and usage of different APA sites located in introns and internal exons as well as another kind of tandem APA sites in the last exon. Therefore, APASdb has near perfect coverage for poly(A) sites of a whole-genome, and details all the heterogeneous cleavage sites downstream of each poly(A) signal (21,22), making it much better than the previous databases generated only using alignments of limited cDNAs and ESTs to a genome to identify the putative 3 -processing sites. Moreover, APASdb enables the usage quantification of different poly(A) sites on a genomewide scale by computing their corresponding reads after normalization. As a web-accessible database, APASdb provides convenient URL-based retrieval, browsing and presentation of several types of information on line, including exon-intron structure, poly(A) signal types and positions, poly(A) sites locations and usage reads, and 3 -UTR regions. Also, the APA data can be displayed in their genomic context via a popular genome browser (Gbrowse) (23)(24)(25). Currently, APASdb contains kinds of APA datasets from three model organisms, human (H. sapiens), mouse (M. musculus) and zebrafish (D. rerio). They detail APA sites in different types of cells, tissues, organs and embryos, as well as tissues in a variety of physiological and pathological conditions. However, as more datasets are generated with the SAPAS method, APASdb is expected to be increasingly valuable for researchers to study polyadenylation mechanisms and APA-mediated gene regulation and other biological functions.

SAPAS library preparation
Several sequencing libraries (Supplementary Notes), were prepared as described previously (8). Briefly, as indicated ( Figure 1A), total RNA was respectively extracted from different samples by TRIzol reagent (Invitrogen), and about 10 g of total RNA was randomly fragmented by heating. An anchored oligo d(T) primer and a 5 -template switching linker tagged with Illumina adaptors were used in template switch reverse transcription by SuperScript II reverse transcriptase (Invitrogen). Two mutations in the poly(A) were introduced by polymerase chain reaction (PCR) amplification with a determined number of cycles to ensure that the double strand cDNA remain in the exponential phase of amplification. The PCR products with a size of 250-400 bp were recovered by polyacrylamide gelelectrophoresis gelexcision and quantified by a Qubit 2.0 Fluoromete. The average size was determined by Agilent 2100 bioanalyzer. As followed, a quality control was carried out by plasmid recombinant and Sanger sequencing. The recovery was ligated to pGEM-T vector and transformed into Escherichia coli DH5a competent cells. Plasmid DNA was extracted and sequenced by ABI 3730 DNA Analyzer. Each end of the insert should be illumina sequence primer. The insert with long poly(A) stretch should be <5%, and most of the inserts should be mapped to the corresponding genome.

Pipeline and raw sequences
We designed a computational pipeline ( Figure 1B, further details in Supplementary Methods), to accurately map and quantify usage of different poly(A) sites on a genome scale, profiled by the SAPAS method. In summary, we first filtered Illumina-sequenced SAPAS reads to discard the reads with unrecognizable linker sequence, and trimmed to remove the linker and the 'T's that just followed the linker until a not-'T' was met. If the length of a trimmed read was <25 nt, we discarded the read too. We then aligned all qualified reads to the corresponding genome using Bowtie software, version 0.12.5 (26). For internal priming filtering, we used the uniquely mapped reads by detecting the downstream genomic sequence 1 to 20 of cleavage sites as previously (8), that is, the read was regarded as an internal priming candidate if this 20-nt region contained more than 12 'A's or one of the following patterns: 5 -AAAAAAAA-3 and 5 -GAAAA+GAAA+G-3 , where '+' means 'or more'. We defined cleavage sites by iteratively clustering the reads, locating tailing ends within 24 nt from each other and which were also aligned to the same strand of a chromosome. Cleavage clusters with two or more normalized reads were taken as poly(A) sites, and we searched for the corresponding poly(A) signals within the upstream sequence 1 to 50 nt from each poly(A) site. Using the gene structure and annotation from bioinformatics sites such as Ensembl and UCSC (27,28), we annotated the poly(A) sites and corresponding poly(A) signals.
Based on this pipeline, we processed SAPAS raw reads of samples from three model organisms, zebrafish (D. rerio), mouse (M. musculus) and human (H. sapiens), to generate poly(A) site datasets. These raw reads of samples (see details in Supplementary Notes), invole in zebrafish embryos in various development stages from 0 h post fertilization (hpf) to 5 day post fertilization (dpf), mouse thymic development from 15.5 days post fertilization (dpf) to 90 days post parturition (dpp), human normal 22 tissues (brain, lung, thyroid, spleen, stomach, kidney, cervix, heart, lymph node, placenta, uterus, bladder, breast, prostate, liver, pancreas, small intestine, thymus, adipose, skeletal Muscle, ovary and testicle), human breast cancer and normal cells, human carcinomatous and normal tissues of intestinum rectum, as well as the nasal polyps and nasal uncinate process mucosa of chronic rhinosinusitis patients with nasal polyps.

Database and website design
Based on the APA-site datasets deeply profiled by the SAPAS strategy, the APASdb website is developed with open source technologies. The datasets of samples involved in related cell and tissue types or specific physiological and pathological conditions, were adopted and integrated into a larger searching dataset (Table 1), to facilitate the query and comparison display of poly(A) sites of genes in our website. For example, the searching dataset named Zv9 embryonic development, integrating eight subsets across all the major developmental stages, describes the dynamic usage of APA-sites in zebrafish embryogenesis. Various types of information regarding locations and normalized usage reads of poly(A) sites, poly(A) signal types and positions, 3 -UTR regions and exon-intron structure of genes in the APASdb, are stored in a relational database using MySQL. Web-based HTML interactive interfaces combined with JAVA, PERl and PHP scripts provide access to the database. GD modules of PHP and Bioperl modules are used for dynamic and graphical representation (29).

Datasets in APASdb
As listed (Table 1), currently APASdb contains APAsite datasets involved in various cell and tissue types, or physiological and pathological conditions in three important models, human, mouse and zebrafish. The statistics and analysis of APA sites in these searching datasets seem to indicate that, APASdb not only keeps near perfect coverage of poly(A) sites but also contains much more novel poly(A) sites in the above-mentioned species (Supplementary Table S1, Supplementary Figure S1-1 to S1-6). Here, we only take the searching dataset (named Zv9 embryonic development) for detail discussion. This dataset consists of eight sample-subsets across all the major developmental stages in zebrafish embryogenesis, including 59 294 885 reads (obtained after mapping and filtering), and nearly 90% of them were mapped to annotated 3 -UTRs or 1-kb downstream regions (Supplementary Figure S1-1, left). Of the 108 290 poly(A) sites across all stages with five or more normalized reads, 12% were mapped to the known Ensembl transcription termination sites (TTS), and the remaining 88% were unreported previously, especially 23.5% and 14.6% were mapped to the 3 -UTR and 1-kb downstream from the Ensembl canonical genes, respectively (Supplementary Figure S1-1, right). The authenticity of the novel poly(A) sites was also checked by 3 RACE in our previous report (9), of 30 novel poly(A) sites (five sites for each of the location categories shown in Supplementary Figure S1-1, excluding Ensembl TTS), 28 (93%) were validated. Also, the previous comparison of our data with another RNA-seq dataset (9,30), showed that most (>70%) intergenic poly(A) sites were located within 5-kb downstream from RNA-seq reads, indicating the authenticity of these novel poly(A) sites. APASdb is intended to broaden the known poly(A) site coverage with growing numbers of APA-site datasets generated with the SAPAS method. These APA data are available in our APASdb website, and summarized in graphical representations for quantification and comparison of APA sites used in different biological processes and/or diseases.

General organization and access of APASdb
The general organization of the APASdb website is presented ( Figure 1D), and the APA datasets are available from our APASdb web server addressed at http://mosas. sysu.edu.cn/utr. Data can be quickly queried and presented by user's keywords in the web-query interface, where gene annotations were loaded as well as the gene-related poly(A) sites and poly(A) signals are highlighted in the corresponding genome sequence, including the heterogeneous cleavage sites clusters (indicated in Figure 1C). The interface dynamically creates a graphic to track them together with the corresponding exon-intron structure of transcript variants of the searched gene, not only detailing the location and quantification of the heterogeneous cleavage sites downstream of each poly(A) signal, but also adding the usage quantification of APA sites and the expression pattern of corresponding gene in various cells, tissues and organs, or in different biological situations and diseases. The APA datasets were integrated to Gbrowse database, so as to provide an interactive and graphical view of APA sites associated with genomes, genes, transcripts and transcript annotations on a genome-wide scale.

Searching APASdb
Our 'APAS Search' feature is designed to search the interesting datasets in APASdb and present APA information of a user's genes of interest according to the searching tips. Currently, Ensembl id (for D. rerio) and UCSC id (for M. musculus and H. sapiens) are allowed for precise query, also fuzzy query by using keywords such as gene name, symbol and simple description is permitted. Clicking the button labeled 'Example' yields an example keyword suited for searching a selected dataset (Figure 2A), and the subsets and detail descriptions on the searched dataset can be unfolded by clicking the corresponding '+'labeled icon ( Figure 2B). Fuzzy search, using a fuzzy keyword of 'chemokine', may lead to a media page to list all the matched APA sites-contained genes in a dynamic table ( Figure 2C), so as to facilitate the selective view of their corresponding APA information in a linked detail page (described next), but searching by using a precise keyword can give users quick access to the detailed page to view various types of information on APA sites and the related graphics.    The transcripts with the longest 3 -UTR first appear between 24 to 48 hpf, resulting from the usage of pA:31043625 in downstream region of 1 kb (Figure 3, up). In addition, summing the normalized reads of all the 3 -ends of polyadenylated transcripts appeared draws the curve diagram to show the expression of cxcl12a across all the major stages of zebrafish embryogenesis. It demonstrates that the expression of cxcl12a increases first and gets to the maximum value at 12 hpf, then decreases quickly. After 48 hpf, the cxcl12a expression keeps increasing slowly (Figure 3, down).  Figure  S5 to S12).

Dynamic and graphical browsing of APA sites
Based on the integration of APASdb and Gbrowse database, via a genome browser (Gbrowse), APASdb websites provides dynamic browsing of APA sites associated with genomes, genes, transcripts and annotations on a genome-wide scale ('APAS Browse' feature). The selective view in three layers, such as overview, region and details, are provided for browsing APA sites in several reference genomes, including human (GRch37/hg19), mouse (NCBI37/mm9) and zebrafish (Zv9/danRer7). One or more APA datasets from these species can be quickly loaded and graphically browsed online, by clicking the 'Update image' button after selecting the corresponding checkboxes of datasets in 'Tracks' panel (Supplementary Figure S13, bottom). This not only enables an overall picture of APA sites in certain cells, tissues and organs, or in a variety of physiological and pathological conditions, but also offers a more direct way to compare the usage of APA sites and further find the general and specific poly(A) sites. Here, we give an example for APA sites of G protein-coupled recep-  Figure S13, middle).

DISCUSSION
We present a comprehensive database of APA sites in human, mouse and zebrafish based upon a developed NGSdependent 3 -end sequencing strategy, namely, SAPAS. Thus, in a sense, we provide additional experimental support for poly(A) sites in the polyA DB2 and transcript termination database of UCSC and Ensembl. It seems that APASdb not only contains much more novel poly(A) sites, but also has near perfect coverage for APA sites of genes throughout human, mouse and zebrafish (Supplementary  Table S1, Supplementary Figure S1-1 to S1-6). At present, our APASdb are focused on the dataset generated with the SAPAS strategy, broadening the poly(A) site coverage with growing numbers of APA datasets. Also, the publicly available poly(A) data generated by the other NGS-based protocols, such as polyA-seq (19) and massive analysis of cDNA ends (MACE) (18,31), will be selected and added into our website to extend the usefulness of APASdb in the future. Actually, APASdb has more comprehensive APA datasets for human. It not only has subsets (total 11) involved in human diseases, but also keeps full subsets involved in human 22 normal tissues, much more than the polyA-seq data (only five tissues) and MACE data (only seven tissues). These subsets simultaneously detail the location and usage of APA sites and facilitate the analysis of tissue-specific APA sites in human tissues. Especially, the disease-related subsets display and compare the APA sites between human normal and tumor cells or tissues, such as breast cancer, rectal cancer and eosinophilic chronic rhinosinusitis with nasal polyps, so as to promote the investigation of diseases-related APA site switching. Also, there are eight subsets integrated to indicate the changes of APA sites in mouse thymic development, helping identifying the APA-site switching involved in vertebrate thymopoiesis. For the zebrafish model often used in genetics, eight subsets across all the major stages of embryogenesis are adopted and combined to profile the dynamic usage of APA sites and contribute to the understanding of tandem 3 -UTR regulation in the control of vertebrate embryogenesis. We also frequently observed that multiple cleavage sites downstream of a poly(A) signal were only a few nucleotides apart, an interesting phenomenon usually called heterogeneity (21,22,32). APASdb details the heterogeneous cleavage sites for all genes in a genome-wide fashion and compares the variation of heterogeneous cleavage sites clustered to a poly(A) site in various cells and tissues, or in a variety of physiological and pathological conditions. This may help studying the mechanism of polyadenylation, in particular, the selection of heterogeneous cleavage sites at a given time for a given 3 -end formation.
Overall, APASdb makes it possible to identify the condition-specific poly(A) sites, helpful to studying APAsite switching mechanism and function, especailly to looking for the loss and gain of miRNA binding-sites in the dynamic 3 -UTRs. Also, APASdb simultaneously presents the expression and position of APA sites and enables the identification of APA-site switching in association with many biological processes and diseases. As a user-friendly web database, APASdb will be an increasing valuable resource for the polyadenylation and 3 -UTR research community, especially for the studies on polyadenylation mechanisms and APA-mediated gene regulation, requiring identification of poly(A) sites and indication of their corresponding conditional usage in a large dataset.

SUPPLEMENTARY DATA
Supplementary Data is available at NAR Online.