BGDB: a database of bivalent genes

A bivalent gene is a gene marked with both H3K4me3 and H3K27me3 epigenetic modifications in the same region, and is proposed to play a pivotal role in pluripotency in embryonic stem (ES) cells. Identifying these bivalent genes and understanding their functions are important for further research on lineage specification and embryo development. So far, a large amount of genome-wide histone modification data has been generated in mouse and human ES cells. These valuable data make it possible to identify bivalent genes, but no comprehensive data repositories or analysis tools are currently available for bivalent genes. In this work, we develop BGDB, the database of bivalent genes. The database contains 6897 bivalent genes in human and mouse ES cells, which were manually collected from the scientific literature. Each entry contains curated information, including genomic context, sequences, gene ontology and other relevant information. The web services of BGDB were implemented with PHP + MySQL + JavaScript and provide diverse query functions. Database URL: http://dailab.sysu.edu.cn/bgdb/


Introduction
Embryonic stem (ES) cells have the potential to differentiate into every tissue type of the body, and offer an important model for examining transitions of cellular identity in animals (1). It has been suggested that this potential is related to specific histone modifications or characteristic chromatin structure (2)(3)(4). Epigenetic regulation of gene expression is thought to be mediated partly by post-translational modifications of histones, which in turn establish different domains of active and inactive chromatin structures. The core histones have dozens of different modifications, including acetylation, methylation, phosphorylation and ubiquitylation. Histone H3 methylations of lysine 4 (K4) and lysine 27 (K27) have been shown to be associated with active and repressed states, respectively (5). These methylations are catalyzed by Trithorax- and Polycomb-group proteins and play key roles in lineage-specific developmental functions (6). Trithorax-associated H3K4 trimethylation (H3K4me3) positively regulates transcription by recruiting nucleosome remodeling enzymes and histone acetylases (7)(8)(9), whereas Polycomb-associated H3K27 trimethylation (H3K27me3) negatively regulates transcription by promoting a compact chromatin structure (10,11). The colocalization of these H3K4 and H3K27 histone methylations, termed 'bivalent domains', was found in ES cells by mapping the mouse genome (12,13). This modification pattern is observed in clusters of homeobox genes and other genes related to early embryonic development (12). The bivalent domains are proposed to silence key developmental genes in ES cells while keeping them poised for later activation, and the developmental genes marked by bivalent modifications are dubbed bivalent genes (14).
Whole-genome mapping found that H3K4me3 peaks were enriched in the region within 2 kb of the transcription start site (TSS) of RefSeq annotations, and H3K27me3 peaks were also enriched in a band centered around the TSS with a greater width; moreover, most H3K27me3 peaks localized on promoters that were already marked with H3K4me3, suggesting that bivalent modification of the same promoter is the rule in ES cells rather than an exception (15). Genome-wide analyses of H3K4me3 and H3K27me3 in human ES cells and mouse ES cells identified several thousand genes marked with both trimethylations (15)(16)(17)(18)(19)(20). These studies used diverse experimental approaches, such as hybridization, whole-genome microarrays (15), ChIP coupled with paired-end ditag sequencing (16) and single-molecule sequencing (18). Despite the different ES cell lines and varied experimental methods used in these studies, they show remarkable consistency in the genes marked with both H3K4me3 and H3K27me3. This high degree of consistency indicates that these data are reliable, especially for genes with bivalent domains identified by at least two independent experiments.
Since recent advances in high-throughput techniques such as genomic tiling microarrays and deep sequencing have uncovered a vast number of bivalent genes, there is an urgent need to collect the experimental data and provide an up-to-date, comprehensive resource for the community. Given these considerations, we have developed a novel database called 'Bivalent Genes Database' (BGDB) to store the sequences of bivalent genes and associated information from all studies published to date. In BGDB, we manually curated 3913 bivalent genes in human ES cells and 2984 genes in mouse ES cells (Table 1), including the primary references and other annotations of these genes. Furthermore, we found that 1604 genes have the same gene name in human and mouse ES cells (Table 1). Additionally, based on the gene ontology (GO) annotations, we analyzed the functional diversity and regulatory roles of bivalent genes. Taken together, BGDB is intended as an integrated resource for bivalent genes that provides valuable information not only to stem cell biologists but also to researchers generally interested in the regulation of gene expression.

Database construction and content
The primary motivation of BGDB is to collect and maintain a high-quality database of bivalent genes, which serves as an integrated, classified and well-annotated bivalent gene resource. The data generation flow of BGDB is briefly illustrated in Figure 1. The flow is composed of three primary components: data processing, integration of external databases and storage of structural and functional annotations in the database. To ensure the quality of BGDB   Table 2.
To curate bivalent gene data from the literature, we manually collected genes with bivalent domains and mapped the gene names to Entrez gene IDs. We then used the Entrez gene IDs in BGDB as the initial information to cross-link the same genes across different external databases. To avoid gene symbol ambiguity caused by gene synonyms, we obtained up-to-date official gene symbols from HGNC (21) and MGI (22) for human and mouse genes, respectively. To better understand the function and structure of these bivalent genes, we collected extensive functional information as follows: basic gene information such as gene name, sequence and summary from the Entrez gene database (23); gene product characteristics from GO (24); and gene-related protein information from UniProtKB (25).
In BGDB, we manually curated 6897 bivalent genes from the scientific literature in PubMed. Not surprisingly, many bivalent genes were experimentally identified in at least two independent articles. There are 3205 (46.5%) genes that are cross-validated in two distinct studies and 1165 (16.9%) genes in more than two studies (Figure 2). That about 63% of the genes are supported by at least two studies suggests that the database is reliable.
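The cross-validation tally described above can be sketched as follows. This is a minimal illustration, assuming each curated record is a pair of an Entrez gene ID and the PubMed ID of the study reporting it; the record layout, the function names and the toy IDs are hypothetical and are not taken from the BGDB implementation. The last lines only re-derive the reported percentages from the reported counts.

```python
from collections import Counter

# Hypothetical curated records: (Entrez gene ID, PubMed ID of the supporting study).
# All IDs below are placeholders chosen for illustration.
records = [
    (2868, 111), (2868, 222), (2868, 222),  # same study listed twice counts once
    (3198, 111),
    (5080, 111), (5080, 222), (5080, 333),
]

def support_counts(records):
    """Return the number of distinct studies supporting each gene."""
    studies = {}
    for gene_id, pmid in records:
        studies.setdefault(gene_id, set()).add(pmid)
    return {gene_id: len(pmids) for gene_id, pmids in studies.items()}

def support_histogram(counts):
    """Bucket genes into those supported by 1, 2 or more than 2 studies."""
    hist = Counter()
    for n in counts.values():
        hist["1" if n == 1 else "2" if n == 2 else ">2"] += 1
    return hist

counts = support_counts(records)
hist = support_histogram(counts)

# Counts reported in the text: 6897 genes in total, 3205 found in two studies,
# 1165 found in more than two studies.
total, two, more = 6897, 3205, 1165
pct_two = round(100 * two / total, 1)
pct_more = round(100 * more / total, 1)
pct_cross = round(100 * (two + more) / total, 1)
```

Keying the tally on Entrez gene IDs rather than gene symbols follows the curation step described earlier, where IDs are used to avoid synonym ambiguity.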
The annotations of each bivalent gene are described by the fields shown in Table 3. We built a MySQL relational database with two tables to store all the gene information. GO information, including GO ID, GO term and GO category, is stored in the 'Genes_go' table; the 'Entrez ID' field in 'Genes_go' is a foreign key that relates it to the 'Genes' table. To provide a fast BLAST sequence alignment service, we also set up a local BLAST database and integrated the local BLAST application into the web service. The web interface for searching and browsing was implemented in PHP and JavaScript.
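The two-table layout can be illustrated with a small sketch. It is not the BGDB schema: SQLite (through Python's standard sqlite3 module) stands in for MySQL so that the example is self-contained, and apart from the table names 'Genes' and 'Genes_go' and the Entrez ID link, every column name and every inserted value is a hypothetical placeholder.

```python
import sqlite3

def create_schema(conn):
    """Create a two-table layout with Genes_go referencing Genes by Entrez ID."""
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
    conn.executescript("""
        CREATE TABLE Genes (
            entrez_id INTEGER PRIMARY KEY,
            bgdb_id   TEXT UNIQUE NOT NULL,
            symbol    TEXT NOT NULL,
            organism  TEXT NOT NULL
        );
        CREATE TABLE Genes_go (
            entrez_id   INTEGER NOT NULL REFERENCES Genes(entrez_id),
            go_id       TEXT NOT NULL,
            go_term     TEXT NOT NULL,
            go_category TEXT NOT NULL
        );
    """)

def go_terms_for(conn, symbol):
    """Return (GO ID, GO term, GO category) rows for a gene symbol via a join."""
    cursor = conn.execute(
        "SELECT g.go_id, g.go_term, g.go_category "
        "FROM Genes_go AS g JOIN Genes AS s ON s.entrez_id = g.entrez_id "
        "WHERE s.symbol = ? ORDER BY g.go_id",
        (symbol,),
    )
    return cursor.fetchall()

conn = sqlite3.connect(":memory:")
create_schema(conn)
# Placeholder rows, not real BGDB entries.
conn.execute("INSERT INTO Genes VALUES (1, 'BGNO_EXAMPLE', 'GENE1', 'example organism')")
conn.execute(
    "INSERT INTO Genes_go VALUES (1, 'GO:0000000', 'example term', 'biological_process')"
)
```

With the foreign key in place, a GO row whose Entrez ID has no matching row in 'Genes' is rejected, which keeps the annotation table consistent with the gene table.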

Usage
To facilitate the use of the BGDB resource, we developed a user-friendly web interface for users to search and browse its content. The search page (http://dailab.sysu.edu.cn/bgdb/database.php) provides an interface for searching the BGDB database with several kinds of keywords, such as gene symbol, gene alias, reference sequence ID or UniProt ID. For example, if the keyword 'GRK4' is entered (Figure 3A), the query result is shown in a tabular format with the BGDB ID, gene symbol, gene full name, organism and gene alias (Figure 3B). Clicking the BGDB ID link (BGNO_002517) shows the detailed information for the gene GRK4 (Figure 3C). The gene information, including gene symbol, full name, summary and relevant references, is provided. The gene sequence, protein sequence, GO annotation, genomic location and some useful external links are also presented. All output columns are described in Table 3. Furthermore, the BGDB web interface provides three advanced options: (i) batch search, (ii) BLAST search and (iii) browse (Supplementary Figure S1). (i) Batch search: Using this function, users can query gene data for a batch of keywords at once, with the results shown on one screen (Supplementary Figure S1A). (ii) BLAST search: Users can use an online BLAST interface to input a sequence of interest in FASTA format and search it against all nucleotide or protein sequences in our database (Supplementary Figure S1B). (iii) Browse: Instead of searching for specific genes, users can list all entries of the BGDB database by organism name (Supplementary Figure S1C).
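The keyword and batch searches can be pictured with the sketch below. It is an in-memory illustration only, assuming that a keyword matches an entry when it equals the gene symbol, an alias, a reference sequence ID or a UniProt ID, ignoring case; the actual BGDB service runs on PHP and MySQL, and its matching rules are not specified here. Apart from the BGDB ID and symbol of GRK4, all entries, field names and identifiers are hypothetical.

```python
# Hypothetical entries; only "BGNO_002517" and "GRK4" come from the text.
ENTRIES = [
    {"bgdb_id": "BGNO_002517", "symbol": "GRK4",
     "aliases": ["ALIAS_A"], "refseq": ["REFSEQ_PLACEHOLDER"],
     "uniprot": ["UNIPROT_PLACEHOLDER"]},
    {"bgdb_id": "BGNO_EXAMPLE", "symbol": "GENE2",
     "aliases": ["ALIAS_B", "ALIAS_A"], "refseq": [], "uniprot": []},
]

def build_index(entries):
    """Map every searchable keyword (lower-cased) to the BGDB IDs it matches."""
    index = {}
    for entry in entries:
        keys = [entry["symbol"], *entry["aliases"], *entry["refseq"], *entry["uniprot"]]
        for key in keys:
            ids = index.setdefault(key.lower(), [])
            if entry["bgdb_id"] not in ids:
                ids.append(entry["bgdb_id"])
    return index

def batch_search(index, keywords):
    """Look up many keywords at once; unmatched keywords map to an empty list."""
    return {kw: index.get(kw.strip().lower(), []) for kw in keywords}

INDEX = build_index(ENTRIES)
RESULT = batch_search(INDEX, ["grk4", "ALIAS_A", "unknown"])
```

Returning an empty list for unmatched keywords lets a batch result keep one row per submitted keyword, which suits a single-screen result view.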
For advanced bioinformatics users, all search results with related annotations, including nucleotide and protein sequences, GO terms and literature, can be exported in Excel format. Additionally, users can download the whole BGDB database in MySQL format (Supplementary Figure S1D).

Discussion
Recent genome-wide analyses of H3K4me3 and H3K27me3 in human and mouse ES cells have revealed several thousand bivalent genes, but mapping chromatin modifications across the genome is only the first step toward understanding the mechanism of gene regulation in pluripotent stem cells. Because a database provides a high-quality benchmark for further experimental and computational designs, we focused on data collection and manually curated 6897 bivalent genes in this work. With this large amount of bivalent gene information, we had the opportunity to analyze the abundance and functional diversity of bivalent genes.
To gain insight into the functional distribution of GO terms, we conducted enrichment tests on the bivalent genes in BGDB. First, the GO annotations in GAF 2.0 file format were downloaded from UniProt-GOA (24,26); second, the columns of gene symbol, GO ID, GO term and GO category were extracted and stored in the database. Then, considering only the genes directly associated with each GO term, we mapped the GO terms to bivalent genes through the gene symbol column provided in the GO annotation. Using the human genome as background, we calculated overrepresented biological processes, molecular functions and cellular components among the bivalent genes of BGDB with the hypergeometric distribution (P < 0.001, calculated by Fisher's exact test). The five most enriched GO terms in each category are shown in Table 4. This analysis revealed several interesting results. For example, the four most overrepresented biological processes, namely anterior/posterior pattern specification, neuron differentiation, neuron migration and central nervous system development, indicate that bivalent genes are enriched for genes involved in system development and cell differentiation (Table 4), which is in accordance with the role of bivalent genes in ES cells. This enrichment result is consistent with a previously reported study (13). Also, the four most abundant cellular components, namely axon, dendrite, neuronal cell body and postsynaptic membrane, suggest that bivalent genes are enriched in neuronal compartments (Table 4). One possible explanation for this abundance is that the neuron is an important cell type during ES cell differentiation. In addition, the statistical analysis of molecular functions shows that bivalent genes modulate enzyme activity and protein interaction ability (Table 4). A similar conclusion can be drawn for the mouse bivalent genes in BGDB. Detailed information on the five most overrepresented GO terms of mouse bivalent genes is shown in Supplementary Table S2.
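The enrichment test can be sketched as follows. The sketch assumes a one-sided test in which the P-value is the upper tail of the hypergeometric distribution, which is what a one-sided Fisher's exact test on the corresponding 2 x 2 table computes; whether the original analysis was one-sided or corrected for multiple testing is not stated, so this is an assumption. The parser reads the gene symbol, GO ID and aspect from columns 3, 5 and 9 of a GAF 2.0 line; the function names, the toy gene symbols and the GO ID 'GO:TOY0001' are hypothetical.

```python
from math import comb

def parse_gaf(lines):
    """Yield (gene symbol, GO ID, aspect) from tab-separated GAF 2.0 lines."""
    for line in lines:
        if not line.strip() or line.startswith("!"):  # skip blanks and header comments
            continue
        cols = line.rstrip("\n").split("\t")
        yield cols[2], cols[4], cols[8]  # columns 3, 5 and 9 of the GAF format

def hypergeom_upper_tail(k, M, n, N):
    """P(X >= k) when N genes are drawn from M, of which n carry the GO term."""
    upper = min(n, N)
    favourable = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return favourable / comb(M, N)

def enrichment(annotations, background, study):
    """Return {GO ID: P-value} for terms annotated to genes in the background."""
    term_genes = {}
    for symbol, go_id, _aspect in annotations:
        if symbol in background:
            term_genes.setdefault(go_id, set()).add(symbol)
    return {
        go_id: hypergeom_upper_tail(len(genes & study), len(background),
                                    len(genes), len(study))
        for go_id, genes in term_genes.items()
    }

def toy_gaf_line(symbol, go_id, aspect):
    """Build a 17-column GAF-like line; all other columns are left as filler."""
    cols = ["DB", "ID", symbol, "", go_id, "REF", "EVD", "", aspect] + [""] * 8
    return "\t".join(cols)

# Toy data: 10 background genes, 5 of them "bivalent", one term on 4 genes.
background = {f"g{i}" for i in range(1, 11)}
study = {"g1", "g2", "g3", "g4", "g5"}
gaf = ["!gaf-version: 2.0"] + [toy_gaf_line(s, "GO:TOY0001", "P")
                               for s in ("g1", "g2", "g3", "g6")]
pvalues = enrichment(parse_gaf(gaf), background, study)
```

In the toy data, 3 of the 5 study genes carry the term while 4 of the 10 background genes do, so the P-value is (C(4,3)C(6,2) + C(4,4)C(6,1)) / C(10,5) = 66/252, far from significant, as expected for so small an example.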
Next, we calculated the distribution of bivalent genes across the chromosomes in human ES cells, and found that 10-25% of the protein-coding genes (23) on each chromosome are bivalent, except on the Y chromosome (Table 5). This distribution suggests that every chromosome may play a specific role related to pluripotency in ES cells. The Y chromosome is rich in junk DNA (27) and has only 56 protein-coding genes (28), which may be why just one bivalent gene was found on the Y chromosome. A similar result is obtained for the distribution of bivalent genes across the chromosomes in mouse ES cells (Supplementary Table S3).
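The per-chromosome proportion is a simple ratio, sketched below under the assumption that it is computed as the number of bivalent genes divided by the number of protein-coding genes on the same chromosome. Only the Y-chromosome figures (1 bivalent gene among 56 protein-coding genes) come from the text; the 'chrA' entry and the function name are hypothetical.

```python
def bivalent_fraction(bivalent, protein_coding):
    """Percentage of protein-coding genes on a chromosome that are bivalent."""
    if protein_coding <= 0:
        raise ValueError("protein_coding must be positive")
    return 100.0 * bivalent / protein_coding

# (bivalent genes, protein-coding genes); "chrA" is a made-up example.
counts = {"chrA": (150, 1000), "Y": (1, 56)}
percentages = {chrom: round(bivalent_fraction(b, n), 1)
               for chrom, (b, n) in counts.items()}

# Chromosomes falling outside the 10-25% range reported for the other chromosomes.
outside_range = [chrom for chrom, pct in percentages.items() if not 10 <= pct <= 25]
```

For the Y chromosome this gives about 1.8%, well below the range seen on the other chromosomes.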

Conclusion and Future Perspective
BGDB is the first attempt to establish a literature-based resource of bivalent genes by integrating genomic data, sequences, GO and other useful information. It is a valuable resource for better understanding the mechanism of gene expression regulation in pluripotent stem cells. Furthermore, the statistical analyses revealed functional diversity and enrichment of bivalent genes.
We will continue to maintain and update the database as new bivalent gene data are reported. Additionally, our next goal is to collect and curate genes in ES cells that are marked by H3K4me3 only, by H3K27me3 only, or by neither H3K4me3 nor H3K27me3. This will make BGDB a more comprehensive resource for further study of ES cell epigenetics.

Supplementary Data
Supplementary data are available at Database Online.