GermlncRNA: a unique catalogue of long non-coding RNAs and associated regulations in male germ cell development

Spermatogenic failure is a major cause of male infertility, which affects millions of couples worldwide. Recent discovery of long non-coding RNAs (lncRNAs) as critical regulators in normal and disease development provides new clues for delineating the molecular regulation in male germ cell development. However, few functional lncRNAs have been characterized to date. A major limitation in studying lncRNA in male germ cell development is the absence of germ cell-specific lncRNA annotation. Current lncRNA annotations are assembled by transcriptome data from heterogeneous tissue sources; specific germ cell transcript information of various developmental stages is therefore under-represented, which may lead to biased prediction or fail to identity important germ cell-specific lncRNAs. GermlncRNA provides the first comprehensive web-based and open-access lncRNA catalogue for three key male germ cell stages, including type A spermatogonia, pachytene spermatocytes and round spermatids. This information has been developed by integrating male germ transcriptome resources derived from RNA-Seq, tiling microarray and GermSAGE. Characterizations on lncRNA-associated regulatory features, potential coding gene and microRNA targets are also provided. Search results from GermlncRNA can be exported to Galaxy for downstream analysis or downloaded locally. Taken together, GermlncRNA offers a new avenue to better understand the role of lncRNAs and associated targets during spermatogenesis. Database URL: http://germlncrna.cbiit.cuhk.edu.hk/


Introduction
Male infertility accounts for more than half of the diagnosed infertility cases worldwide (1,2). Though the unique cellular dynamics of germ cell development provides a representative model for understanding the fundamentals of developmental biology, our current understanding of the molecular mechanisms in male germ cell development remains largely elusive. This poses significant challenges on the effective development of therapeutic regimen and clinical management.
Spermatogenesis refers to the continuous multi-stage processes by which spermatogonial stem cells on the seminiferous tubular basement membrane proliferate and differentiate into subsequent cellular stages, including spermatogonia (Spga), spermatocytes (Spcy) and spermatids (Sptd), and finally to functional spermatozoa which are released into the seminiferous tubule lumen. Successful spermatogenesis relies on the precise transcriptional programs. To identify the regulatory networks involved in male germ cell development, we previously applied serial analysis of gene expression (SAGE) and developed GermSAGE (3) and GonadSAGE (4) databases. We identified a number of gene networks associated with stage-specific transcription factors (TFs) and promoter elements. Importantly, >45% transcripts were unannotated (3,(5)(6)(7)(8)(9), suggesting many novel transcripts and corresponding functions remain to be explored. Importantly, many of them were suggested to be non-coding RNAs (6,9).
Recently, long non-coding RNAs (lncRNAs) were widely identified as novel regulators in normal and disease development (10)(11)(12)(13)(14)(15)(16). Unlike small RNAs like piwi-interacting RNA (piRNA) or microRNA, the regulatory roles of lncRNAs are poorly defined. Recent studies demonstrated lncRNAs exert activating or inhibitory regulation through interaction with mRNA (17), DNA (18), microRNA (19), histone modifier (20), RNA-binding protein (21) and chromatin (22,23). Presently it is estimated that more than 40 000 unique lncRNAs are expressed in the mammalian cells (16). Recent studies of the role of lncRNAs in mammalian testis development and spermatogenesis suggested lncRNAs are dynamically regulated (24,25). Expression profiling analyses on primordial germ cell reprogramming and postnatal germ cell development have revealed that thousands of lncRNAs are significantly altered and correlated with nearby mRNA gene clusters (24). Comparison on neonatal and adult mouse testes has also demonstrated dynamic lncRNA expression and exhibited associations with epigenetic modifications and evolutionary conserved elements (26). Among the major male germ cell stages in spermatogenesis, type A spermatogonia shows the maximum number of lncRNA candidates (25). This is concordant with the expression pattern of mRNAs.
Though lncRNA research in male germ cell development presently exhibits momentum, only few functional lncRNAs in spermatogenesis such as Tsx, HongrES2, Mrhl and Spga-lncRNA have been reported (13). To systematically identify and predict functional lncRNAs, the knowledge of lncRNA annotation is a prerequisite. Although lncRNA annotations are publicly available in genomic databases like Ensembl and NONCODE (27,28), the transcripts are derived from expression data from major tissues and cell types. As the expression profile of lncRNAs was reported to be tissue-or cell-specific (29)(30)(31)(32). This partly explains why only few lncRNAs were identified in male germ cell development to date (13). Here we hypothesize that male germ cell-specific lncRNAs are under-represented in the current public annotations, leading to biased lncRNA predictions or failure to detect critical lncRNAs in specialized cell type during male germ development.
In this study, we report the construction of GermlncRNA, a specialized lncRNA database of male germ cell development by assembling male germ cell transcriptome evidence from three different expression platforms. In addition to conventional lncRNA annotations, we also include novel male germ cell-specific lncRNAs derived from de novo analysis of transcriptome data, and provide comprehensive regulatory and functional prediction information, which offers a new avenue to better understand on the role of lncRNAs and associated targets in spermatogenesis.

Database construction and structure
GermlncRNA is an open-access database developed on Linux using SQlite platform with a Personal Home Pagebased Yii framework to store, retrieve and exchange all acquired data.GermlncRNA was developed based on transcriptomic and regulatory data from our previous studies and public domain data. An overview of data organization in GermlncRNA is shown in a schema diagram (Supplementary Figure S1). The lncRNA annotations of Ensembl (release 74), RefSeq (NCBI37/mm9) and University of California Santa Cruz (UCSC) Genes (mm9) were obtained from the UCSC

lncRNA prediction in male germ cell development
To identify unique lncRNAs expressed in male germ cell development, we developed a bioinformatics pipeline known as hybrid transcriptome assembly (HTA, Supplementary Figure S2) to integrate the polyadenylated germ cell transcriptomes from GermSAGE (3) and tiling microarray (33) previously published by us, together with two recent RNA sequencing (RNA-Seq) datasets (34,35). In brief, raw reads from RNASeq studies SRP018124 and SRP010262 (34,35) were aligned to mm9 mouse genome and assembled by TopHat-Cufflinks with de novo approach (36), while tiling array data in cell intensity file format (GSE55318) were processed by Affymetrix Tiling Array Software using Affymetrix binary probe mapping genomic location files version 35 as a reference to identify transcribed fragments known as transfrags with bandwidth 75 bp 'One-side Upper' test (33). The genomic loci covered by both platforms were retained and were subsequently classified into annotated or unannotated categories. To generate a complete catalogue of annotated lncRNAs, annotations from five public databases, including Ensembl (28), RefSeq (37), UCSC Genes (38), NONCODE (27) and fRNAdb (39) were obtained in browser extensible data or gene transfer format files. The sequence of each annotated lncRNA was retrieved using Galaxy based on these annotations, and the lncRNA candidates with the identical sequence and genomic locus were regarded as the same transcript. The lncRNA transcripts not covered by any of the five reference public databases were identified as de novo putative lncRNAs. To reduce ambiguities on lncRNA prediction, all de novo putative lncRNAs overlapping coding genes were ignored. Only intergenic lncRNAs predicted to be non-coding by codingpotential assessment tool (40) were selected. These lncRNA candidates were referred to as 'novel' lncRNAs. Transcription evidence of both annotated and novel lncRNAs was further evaluated by searching for the presence of SAGE tags near the 3 0 -terminal from GermSAGE. To give a better picture on prediction quality, the lncRNAs were categorized into four classes according to the number (from 1 to 4) of expression evidence (Table 1). Each lncRNA candidate in GermlncRNA was assigned with a unique GermlncRNA identity (ID). In GermlncRNA database, the expression of lncRNAs was presented in fragments per kilobase of exon per million fragments mapped (FPKM) by RNASeq or tag counts (with tag sequence) by GermSAGE. We defined the RNA-seq thresholds for detectable and abundant expressions as 0.3 and 3 FPKM, respectively (41). SAGE tag of an lncRNA represents the 10 bp sequence downstream to the 3 0 most enzymatic restriction site CATG (42). The tag count represents the expression level of an lncRNA in the corresponding stage. Transcripts that showed more than one SAGE tag were considered expressed, whereas transcripts showed more than five tag counts were considered abundantly expressed. The normalized expression data from the whole testes of neonatal (6 day) and adult (8 week) mice were also included (GSE43442) (26) for comparison.

Association of regulatory features
Genomic data on regulatory features, including polyadenylation (PolyA), DNase hypersensitive sites, CAGE, histone modifications (testis) and conserved elements were all retrieved from the corresponding data tracks in UCSC Genome Browser (Table 2). To reveal the potential association of a regulatory feature with lncRNA, the number of the corresponding feature in the promoter, gene body or around 5 0 /3 0 -terminal region of lncRNAs was counted. The percentage of association with each feature for annotated and novel lncRNAs was calculated and presented in a bar chart, while the significance of difference between annotated and novel lncRNAs was assessed and compared by two-sided Chi-square test.

GermlncRNA function prediction
In GermlncRNA, the protein-coding transcripts next to novel lncRNAs in either direction were classified as cis-regulation category. Proximal coding gene neighbours of a novel lncRNA were identified from RefSeq gene annotation, and the distance between each novel lncRNA and its cis-acting target was calculated. Sequence homology between novel lncRNAs and coding mRNAs were predicted by basic local alignment search tool (BLAST) (43). lncRNAs that showed significant similarity to coding gene sequences on the other chromosomes (at least 85% within 200 bp) were classified as trans-acting. The gene ontologies of cis-and trans-associated gene were retrieved from The Database for Annotation, Visualization and Integrated Discovery (http://david.abcc.ncifcrf.gov/) (44,45).

MicroRNA target prediction
Potential microRNA (miRNA) targets on the novel lncRNAs were identified by Probability of Interaction by Target Accessibility (PITA) algorithm (http://genie.weizmann.ac.il/pubs/mir07/index.html) (46), which predicts the microRNA-target interaction by a free energy change (DDG) approach. In brief, full-length novel lncRNA sequences in fasta format were scanned for the presence of miRNA response elements (MREs). To identify significant miRNA interaction, the results were filtered by DDG cutoff of below À10. The representative miRNA targets for each novel lncRNA are listed in the search result table when 'predicted microRNA targets' is selected, and the complete details on miRNA candidates can be retrieved by clicking on the representative miRNAs.

Transcriptional regulation
To identify potential transcriptional regulation on the annotated lncRNAs, experimentally supported TFs binding information based on chromatin immunoprecipitationsequencing (ChIP-Seq) were retrieved from ChIPBase (47), with a focus on promoter interactions. Such information was then mapped to the annotated lncRNAs based on similar identifiers. The information was saved in 'ChIPBase' column.

Database overview
GermlncRNA database is the first comprehensive lncRNA catalogue on the key stages of male germ cell development, including type A spermatogonia (Spga), pachytene spermatocytes (Spcy) and round spermatids (Sptd). It also provides comprehensive information on regulatory signatures, expression and associated coding genes through cis-or trans-regulation. The data retrieved from GermlncRNA can be downloaded or further analysed by Galaxy ( Figure 1). GermlncRNA is divided into five sections, including home, data search panel, statistics, help and contact (Figure 2).

Home
The home section is the front page of GermlncRNA, where it provides basic background information about GermlncRNA and the database overview. This section is updated regularly to highlight any changes or new functions implemented in GermlncRNA.  Export to Galaxy tab. In addition to data retrieval, GermlncRNA also provide analysis option through Galaxy (48), a web tool that provides common genomic analyses in form of workflows. As no programming is required, it is highly accessible to scientists that do not have bioinformatics background. Details on Galaxy can be found here: https://usegalaxy.org/.

Statistics
Data statistics on lncRNA composition, distribution of exon number per lncRNA, size and chromosomes, lncRNA types, annotation sources and cis/trans-regulatory genes are illustrated in graphical format.

Help
The Help section contains the details on common questions related to GermlncRNA, data sources applied and supporting references.

Contact
Questions, suggestions or comments can be sent to our research team with the aid of online form in this section. We will address every comment promptly.

Annotated and novel lncRNA catalogues
GermlncRNA database contains a total of 110 476 annotated lncRNAs and 2790 novel intergenic lncRNAs (or simply novel lncRNA in this article) expressed in male germ cell development.  Figure 3). The size of novel lncRNAs ranged from 200 to 14 945bp, with an average of 800 bp. In total, 15.5% were located at the promoters of proteincoding genes. Because no supported mRNA evidence was reported previously, the novel lncRNAs may represent male germ cell specific lncRNAs. A total of 357 novel lncRNAs belonged to class four novel lncRNAs (Table 1), which were supported by all three transcriptome sources. A class four novel lncRNA example is shown ( Figure 4).

lncRNA expression analysis
In addition to genomic location, we also examined the expression pattern of lncRNAs. Based on FPKM measurement, 73.9% (2059/2786) of expressed novel lncRNAs and 42.6% (18409/43215) of expressed annotated lncRNAs were highly expressed at all three male germ cell stage (Table 1). Importantly, the percentage of highly expressed lncRNA candidates at every stage in expressed candidates for novel lncRNA group was higher than the annotated group at all the germ cell stages ]. In addition, we also detected a higher percentage of stage-specific lncRNA candidates in the novel group rather than in the annotated group at every stage, suggesting novel lncRNAs could preferentially be involved in stage-specific functions. Specific expression is defined as an lncRNA expression of at least 3 FPKM by RNASeq data and at least 3-fold higher than those at the other two stages (P < 0.05). Under this definition, 171 (6.1%), 343 (12.3%) and 735 (26.3%) of novel lncRNAs and 4005 (3.6%), 1631 (1.5%) and 4098 (3.7%) of annotated lncRNAs were identified to be specifically expressed in Spga, Spcy and Sptd, respectively. Stage-specific lncRNAs were indicated under the stage-specific expression column in GermlncRNA. However, we did not observe a large difference from the SAGE data. In total, 31.3% of annotated and 29.1% of novel lncRNAs were supported by at least one SAGE tag. At each stage, 4.7-4.9% of annotated and 3.9-4.3% of novel lncRNAs were abundant (tag count ! 5) ( Table 1). We did not identify any correlation between lncRNAs expression and the type of lncRNA.

Association with regulatory features
To elucidate the potential regulations associated with lncRNAs, the presence of regulatory features (Table 2, Figure 5) was examined, including (i) Transcription evidence: polyadenylation (PolyA) and CAGE signals for 3 0and 5 0 -terminals of a transcript, respectively; (ii) Chromatin-related features: DHS and five types of histone modifications, including H3K4me1, H3K4me3, H3K27ac, H3K27me3 and H3K36me3. These features are commonly associated with either promoter or enhancer (Table 2) and provide insights into possible regulation of lncRNA expression and (iii) Conserved elements: PhastCons mammals conserved elements (30-way multiz alignment) that mark evolutionary conserved regions among placental mammalian genomes predicted by phylogenetic hidden Markov model (phylo-HMM) (49).
To examine the difference and similarity between annotated and novel lncRNAs, we compared the feature association rates between them. For transcription features, a significantly higher percentage of annotated lncRNAs was found to accommodate PolyA (9.1 vs. 0.8%; P < 0.001 for testis, 18.8 vs. 1.9%; P < 0.001) for other tissues) or CAGE (70.0 vs. 6.6%; P < 0.001) signals when compared with novel lncRNAs ( Figure 5). For chromatin-related features, DHS was associated with roughly similar proportions of annotated and novel lncRNAs (33.1 vs. 36.1%; P ¼ 0.001), indicating that they are comparably accessible by DNA-binding proteins ( Figure 5). The associations of two enhancer-associated features, H3K4me1 (88.1 vs. 38.4%; P < 0.001) and H3K36me3 (33.7 vs. 7.9%; P < 0.001), were significantly higher for annotated lncRNAs (Figure 5), indicating a smaller proportion of novel lncRNAs were enhancer-associated. The annotated lncRNAs also showed a higher association for active promoter features, H3K4me3 (32.2 vs. 25.3%; P < 0.001) and Figure 3. Example of two annotated lncRNAs with similar loci, GlncRNA0062198 and GlncRNA0062199, as viewed in UCSC Genome Browser. As shown, the two lncRNAs have annotated loci of 3 0 -terminals differ by 2 bp and those of 5 0 -terminal by 104 bp. The former is annotated by NONCODE and fRNAdb, while the latter is annotated by fRNAdb only. Both of the lncRNAs were supported by Spga-specific SAGE, RNASeq and tiling array evidence, indicating a strong expression specifically in Spga. They are also covered by probes in lncRNA microarray and showed a stronger expression in neonatal testis than in adult testis. Furthermore, the expression is also supported by signals in PolyA sequencing data. They are located in the promoter region of a bidirectional protein-coding gene prolyl endopeptidase (Prep), whose expression is also Spga-specific, suggesting possible regulation in cis.

Function prediction of novel lncRNAs
One of the major challenges in lncRNA study is function prediction. While the mode of regulation for lncRNAs is less defined, it was hypothesized that lncRNAs regulate cellular functions through interacting with coding genes (50,51). For example, recent studies have showed that some lncRNAs involved in regulating expression of neighbour genes in cis (52,53), while others were reported to regulate expression of coding gene targets on different chromosomes in trans (54,55). To delineate functional regulation of the novel lncRNAs, the association between novel lncRNAs and the potential coding gene targets were examined. In GermlncRNA, proteincoding genes that are close genomic neighbours to novel lncRNAs in either direction were classified as cis-acting, and those that have significant similarity (at least 85% within 200 bp) in primary sequence to novel lncRNAs were classified as trans-acting. We hypothesized the potential functions of cis-and trans-acting lncRNAs were related to the coding partners, and therefore functional prediction of the coding partners by gene ontology was provided to illustrate the molecular function, biological Other than coding gene interactions, non-coding RNA interactions with microRNAs have also been reported in post-transcriptional regulation (56,57). MicroRNAs are short RNAs that associate with RNA-induced silencing complex (RISC) to inhibit gene expression. The complex binds to MREs at 3 0 -UTR (untranslated region) of target mRNA. Due to sequence similarity, some lncRNAs may rescue target gene expression that share similar MREs through competitive miRNA/ RISC complex binding. In GermlncRNA, microRNAs targets of the novel lncRNAs were predicted by PITA algorithm (46) and listed in a table for microRNA targets. Each lncRNA was predicted to contain 388 miRNAs recognition elements (MREs) on average, with an average free energy change (DDG) of À13.5.
For example, GlncRNA11284d was predicted to be a potential cis-acting lncRNA with two coding gene neighbours Rad18 and Oxtr, which are 9.9 kb upstream and 120 kb downstream to the lncRNA, respectively. A total of 182 miRNAs across 202 MREs, with an average DDG of À12.4 were identified. Gene ontology analysis suggested these two targets are involved in spermatogenesis and sperm ejaculation processes, respectively. The findings were supported by recent studies that Rad18 is critical in maintaining genomic stability in germ cell (58); while Oxtr involves in cAMP regulation in vas deferens epithelial cells (59). Taken together, the results suggested this cis-acting lncRNA may involve in various regulatory networks in male germ cell development.

Discussion
lncRNAs have been demonstrated to be critical regulators in normal and disease development (10)(11)(12)(13)(14)(15). Compared with protein-coding genes, lncRNA expressions are highly tissue-or cell-specific (29,60,61). This implies that lncRNAs expressed in specialized or rare cell types will be missed by public lncRNA annotation resources if the data sources do not contain the corresponding transcriptome information. The incomplete lncRNAs dataset will compromise research findings. To this end, we developed GermlncRNA to provide the first public online resource to search, visualize, download and analyse mouse male germ cell-specific lncRNAs.
To define annotated lncRNAs, we applied five public database resources for annotation. NONCODE contains the most lncRNA entries (67 595), followed by Ensembl (36 229), fRNAdb (30 335), UCSC (10 041) and Refseq (1279). It is interesting to note that there are significant differences on lncRNA annotation among these resources (Supplementary Figure S4). The difference could be attributed to the assembly algorithm and transcriptome datasets applied. Therefore, the quality and quantity of lncRNAs predicted will be highly variable. Even with the comprehensive annotation coverage in our analysis, we expected many germ cell specific lncRNAs would be missed. This is because the public resources did not include transcriptomes of specific male germ cell stages in the annotation assembly pipeline. The male germ cell information is mainly derived from testis transcriptome data. Unique RNA transcripts from minority male germ cell populations will be underrepresented or undetected.
With the aid of bioinformatic prediction and male germ cell transcriptomic evidence, we were able to identify 2790 intergenic lncRNAs by HTA. Although we were also able to identify other novel gene associated lncRNAs, we did not include them in the current version due to potential ambiguity issues such as potential alternative splicing isoform of a coding gene. In any case, these novel germ cell specific lncRNAs are worthy of further follow-up, and those demonstrating stage-specific expression patterns may contribute to the stage-specific cellular functions, such as proliferation, transcriptional regulation of meiotic and early haploid stages of spermatogenesis. To allow a better understanding on these novel lncRNA candidates, we have included function prediction information by identifying the potential cis-and trans-acting coding gene targets. Although this approach may not represent all possible lncRNA regulation, we believe it is a sensible beginning approach. Some lncRNAs, such as Xist and Air, have been reported to regulate adjacent genes (52,53); while others like 1/2sbsRNAs and BACE1-AS, interact with target mRNAs of complementary sequence to promote or inhibit degradation (54,55). Recent reports also suggested lncRNA could function as a competing endogenous RNA (ceRNA), which reduces the inhibitory effects of miRNAs by direct binding, such as PTENP1 and LincRNA-RoR (19,57,62,63). We will keep implementing data updates and functions to reflect new lncRNA regulation. For example, we envision including human male germ cell transcriptomes, mouse epigenome and direct lncRNA-protein interaction data in the upcoming release. Information on the recent and upcoming updates can be found on the GermlncRNA website.
In conclusion, GermlncRNA provides a systematic, integrative, user-friendly and easy-to-use platform to investigate lncRNAs involved in male germ cell development. Users can query for and clearly retrieve lncRNA of their interest through a variety of search options. Therefore, we believe that GermlncRNA can substantially improve the understanding on lncRNA biology in spermatogenesis.