BambooGDB: a bamboo genome database with functional annotation and an analysis platform

Bamboo, one of the most important non-timber forest products and one of the fastest-growing plants in the world, represents the only major lineage of grasses that is native to forests. The recent release of the first high-quality draft genome sequence of moso bamboo (Phyllostachys edulis) provides new insights into bamboo genetics and evolution. To extend our understanding of the bamboo genome and to facilitate future studies that build on these achievements, we have developed BambooGDB, a bamboo genome database with functional annotation and an analysis platform. The de novo sequencing data, together with the full-length complementary DNA and RNA-seq data of moso bamboo, compose the main contents of this database. Based on these sequence data, a comprehensive functional annotation of the bamboo genome was made. In addition, an analytical platform composed of comparative genomic analysis, protein–protein interaction networks, pathway analysis and visualization of genomic data was constructed. As a discovery tool for identifying and understanding the biological mechanisms of bamboo, the platform can serve as a systematic framework for designing experiments for further validation. Moreover, diverse and powerful search tools and a convenient browser were incorporated to facilitate navigation of these data. As far as we know, this is the first genome database for bamboo. By integrating high-throughput sequencing data, a full functional annotation and several analysis modules, BambooGDB aims to provide researchers worldwide with a central genomic resource and an extensible analysis platform for the bamboo genome. BambooGDB is freely available at http://www.bamboogdb.org/. Database URL: http://www.bamboogdb.org


Introduction
As a tribe of flowering, evergreen perennial monocots classified in the subfamily Bambusoideae within the grass family Poaceae, which includes rice, maize, wheat and other cereals (1), bamboo is one of the most important non-timber forest resources in the world (2). Because of its high strength-to-weight ratio, comparable to that of natural wood, bamboo is a natural composite material that is useful for making construction material, paper pulp and furniture (3). Recent data indicate that approximately 2.5 billion people in the world depend on bamboo for their daily lives, and that the international trade volume of bamboo amounts to billions of US dollars per year (4). Moreover, bamboo grows widely in tropical and subtropical regions of Asia, Africa, northern Australia and Latin America, extending as far north as the southern United States and as far south as Patagonia. About 1000 species of woody bamboo are distributed all over the world, of which about 100 species are used commercially. Moso bamboo (Phyllostachys edulis) is one of the most important economic bamboo species, with many advantages such as fast growth rate, high yield, extensive use, short crucial period formation and strong regeneration capacity (5).
During the past several decades, many studies have been carried out on bamboo using biochemical, physiological, cytogenetic and genomic methods, mainly including chloroplast genome sequencing, identification of syntenic genes between bamboo and other grasses and phylogenetic analysis of Bambusoideae subspecies (6)(7)(8)(9)(10). Additionally, a series of studies on moso bamboo have been performed, including the first high-quality genome sequence by de novo sequencing (11), deep RNA sequencing (RNA-seq) of seven samples from different tissues (11), and the cloning and sequencing of 10 608 putative full-length complementary DNAs (cDNAs) (3). However, data from different studies are scattered across publications, and the lack of a systematic review and analytical platform for the currently available data and knowledge has remained a longstanding challenge for bamboo genetics and genomics.
Here, we report BambooGDB, a bamboo genome database with functional annotation and an analysis platform, based mainly on the de novo sequencing data of moso bamboo. The RNA-seq and full-length cDNA data were also included in BambooGDB to enrich the contents of the database. On the basis of these large-scale sequencing data, a comprehensive annotation of the bamboo genome was made, including basic annotation of bamboo genes, RNAs, proteins and heterozygous single nucleotide polymorphisms (SNPs), as well as functional annotations such as gene ontology, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, orthologs and protein-protein interactions (PPIs). In addition, an analytical platform composed of comparative genomic analysis, PPI network analysis, pathway analysis and visualization of genomic data was constructed to extend our understanding of the bamboo genome and to help researchers design experiments for further validation. Furthermore, to facilitate navigation of these data, diverse and powerful search tools and a convenient browser were incorporated in BambooGDB. By integrating high-throughput sequencing data, a full annotation and several analysis modules, BambooGDB was designed to be a central genomic resource and an extensible analysis platform for the bamboo genome that facilitates future studies and helps reveal the genomic features of bamboo and related plants.

Data Content
Three types of data are included in BambooGDB: (i) the high-quality genome sequence data of moso bamboo (11), (ii) the genome-wide full-length cDNA data of moso bamboo (3) and (iii) the deeply sequenced RNA-seq data for seven samples from different tissues of moso bamboo (11). All of these moso bamboo data were obtained mainly from the International Center for Bamboo and Rattan, which has long been dedicated to biological research on bamboo. In addition, to carry out comparative genomic analysis between bamboo and other plants, we collected the whole-genome data of two model plants and five bamboo-related species: Arabidopsis thaliana (12), Oryza sativa (13), Brachypodium distachyon (14), Panicum virgatum (sequenced by the US Department of Energy Joint Genome Institute), Sorghum bicolor (15), Setaria italica (sequenced by the US Department of Energy Joint Genome Institute) and Zea mays (16). Currently, there are >33 000 annotated bamboo genes in this database. The data content of BambooGDB is summarized in Table 1.

Genome Annotation and Data Analyses
Based on the large-scale sequencing data, annotation was conducted at multiple levels, including fundamental functional annotation such as motif, domain and structure analysis, comparative analysis among bamboo-related species, metabolic pathway network analysis and PPI network analysis (Figure 1). Table 1 shows the data statistics of BambooGDB as of 12 October 2013. The details are presented concisely in the following three parts:

Basic functional annotation
As a central part of BambooGDB, genomic functional annotation plays a fundamental role in genomic studies. To obtain comprehensive functional information, annotation and analysis were performed in the following five steps. First, gene function motifs and domains were predicted with InterProScan (Release 5 Candidate 6) software (17) against the InterPro database (18), which integrates predictive information about protein function from a number of partner resources and provides an overview of the function and domains of a protein. Several kinds of valuable classifications were therefore obtained in the InterProScan results, such as PRINTS (19), Pfam-A (20), Gene3D (21), PANTHER (22), InterPro (17) and Gene Ontology (23). Second, clusters of orthologous groups (COGs) were predicted by BLASTP (24) against the COG database (25) at NCBI with an E-value threshold of 1e-6. Third, based on a previous study (26), full-length cDNA sequences of moso bamboo were mapped to its genome using BLAT (27). Fourth, structural features of proteins were predicted with the Batch CD-Search web service at NCBI (28). Finally, the bamboo gene models were aligned to entries of sorghum, rice and maize from the KEGG database (29) by BLASTP with an E-value threshold of 1e-10 to find the best hit for each gene in the similar pathway.
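The best-hit step described above can be illustrated with a minimal sketch. It assumes BLASTP was run with the standard 12-column tabular output (`-outfmt 6`), that "best hit" means the highest bit score per query, and that the threshold is applied as `evalue <= max_evalue`. None of these details are stated in the text, and the function name and identifiers below are hypothetical.

```python
import csv
import io


def best_hits(blast_tsv, max_evalue=1e-10):
    """Return {query: (subject, evalue, bitscore)} with one best hit per query.

    Assumes BLAST tabular columns: qseqid, sseqid, pident, length, mismatch,
    gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    best = {}
    for row in csv.reader(io.StringIO(blast_tsv), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue  # hit is not significant enough (assumed inclusive threshold)
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical hits for two made-up bamboo gene models.
EXAMPLE = "\n".join([
    "geneA\tsbi:1\t80.0\t200\t40\t0\t1\t200\t1\t200\t1e-50\t300",
    "geneA\tosa:2\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-60\t350",
    "geneB\tzma:3\t35.0\t90\t58\t0\t1\t90\t1\t90\t1e-4\t45",
])

if __name__ == "__main__":
    print(best_hits(EXAMPLE))
```

In this example `geneB` has no hit below the threshold and is therefore left unannotated, and `geneA` keeps only its higher-scoring hit.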

Computational metabolic network and PPI
Computational network analysis is an efficient approach for investigating the topology of a metabolic network and the interactions of the related compounds. As important components of computational network analysis, a computational metabolic network and a PPI network for bamboo were analyzed and then implemented in BambooGDB.
As an important database and platform for the study of molecular interactions, the KEGG database provides a reference knowledge base for connecting genomes to biological functions. Therefore, on the basis of the KEGG database, proteins of moso bamboo were annotated with KEGG orthology (KO) terms using the best hit information. In addition, the graphical display used in the KEGG Automatic Annotation Server (30) helps users to understand the characteristics of metabolic pathways visually. In total, the computational metabolic network of moso bamboo contains 3946 proteins and 191 pathway maps.
Computational identification of the PPI network in moso bamboo can provide new insight into the cellular functions of proteins. The computational process is briefly described as follows. First, links between the protein sequences of moso bamboo and the UniProt database (release 201308) (31) were established by BLASTP comparison with the following criteria: (i) E < 1.0 × 10^-10, (ii) sequence identity >40% and (iii) aligned sequence length coverage >40%. In this study, the aligned sequence length coverage was strictly defined as the aligned sequence length of the query without gaps divided by the whole sequence length of the query. Then, using UniProt accession numbers as input, the PPIs of moso bamboo were computed on the protein interaction network analysis platform, which includes a database of unified PPI data from six manually curated public databases (32,33). The final PPI network in moso bamboo contains 2202 proteins with 34 169 interactions.
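The three linkage criteria can be written down as a small filter. This is only a sketch of the stated thresholds. The function names are hypothetical, and it assumes that percent identity is given on a 0 to 100 scale and that the aligned query string uses `-` for gaps.

```python
def ungapped_length(aligned_query):
    """Number of query residues in an alignment string, ignoring '-' gap characters."""
    return len(aligned_query.replace("-", ""))


def query_coverage(aligned_query, query_length):
    """Aligned query length without gaps divided by the full query length."""
    return ungapped_length(aligned_query) / query_length


def passes_linkage_criteria(evalue, percent_identity, aligned_query, query_length):
    """Apply the three criteria from the text; all comparisons are strict."""
    return (
        evalue < 1.0e-10
        and percent_identity > 40.0
        and query_coverage(aligned_query, query_length) > 0.40
    )


if __name__ == "__main__":
    # Hypothetical alignment: 5 aligned query residues out of a 10-residue query.
    print(passes_linkage_criteria(1e-20, 55.0, "MK-TA--L", 10))
```

Coverage here is computed on the query only, so a short query fully covered by a long UniProt entry still passes criterion (iii).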

Data Usage and Analysis Tools
Powerful search tools and a convenient browser
There are two basic ways for users to access data stored in BambooGDB (Figure 1): search and browse. Besides the simple keyword search, BambooGDB offers an advanced Boolean search that allows users to specify and combine query options by functional characteristics, such as COG accession, InterPro accession/description, Gene Ontology accession/term, EC number and pathway information. BambooGDB provides not only a powerful search engine but also a user-friendly interface to browse the various data and the connections between them. Moreover, valuable results from previous genomic analyses, such as RNA-seq and predicted microRNA data, are also displayed.
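How a Boolean search over functional characteristics might combine query options can be sketched as follows. This is an illustration of the idea only, not BambooGDB's implementation. The function name, the `all_of`/`any_of`/`none_of` parameters, the gene identifiers and the term `termX` are hypothetical.

```python
def boolean_search(annotations, all_of=(), any_of=(), none_of=()):
    """Return sorted gene IDs whose annotation terms satisfy the Boolean query.

    annotations: {gene_id: iterable of annotation terms (GO, EC, KO, ...)}
    all_of:  every listed term must be present (AND)
    any_of:  at least one listed term must be present (OR); ignored if empty
    none_of: no listed term may be present (NOT)
    """
    required, optional, excluded = set(all_of), set(any_of), set(none_of)
    hits = []
    for gene_id, terms in annotations.items():
        terms = set(terms)
        if not required <= terms:
            continue
        if optional and not (optional & terms):
            continue
        if excluded & terms:
            continue
        hits.append(gene_id)
    return sorted(hits)


# Made-up annotation table for three hypothetical genes.
GENES = {
    "gene1": {"GO:0004347", "EC:5.3.1.9"},
    "gene2": {"GO:0004347"},
    "gene3": {"EC:5.3.1.9", "termX"},
}

if __name__ == "__main__":
    print(boolean_search(GENES, all_of=["GO:0004347"], none_of=["EC:5.3.1.9"]))
```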

PPI analysis
The computational PPI data of moso bamboo were integrated in the PPI module of BambooGDB. Users can submit a protein name as a query, and an image and a table of the predicted interaction partners of the query protein are then generated. The image is produced by Cytoscape (37), an open source bioinformatics software platform for visualizing and analyzing networks. In addition, the PPI module allows users to input a group of proteins to explore the interactions among them.
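The two query modes of the PPI module (partners of one protein, and interactions within a group) correspond to simple lookups on an undirected interaction graph. The sketch below assumes the network is available as a list of protein pairs. All names are hypothetical, and it does not reproduce the Cytoscape rendering.

```python
from collections import defaultdict


def build_network(pairs):
    """Build an undirected adjacency map {protein: set of partners} from pairs."""
    network = defaultdict(set)
    for a, b in pairs:
        network[a].add(b)
        network[b].add(a)
    return network


def partners(network, protein):
    """Sorted interaction partners of a single query protein."""
    return sorted(network.get(protein, set()))


def interactions_among(network, proteins):
    """Sorted, de-duplicated interactions whose two endpoints are both in the group."""
    group = set(proteins)
    edges = {
        tuple(sorted((a, b)))
        for a in group
        for b in network.get(a, set())
        if b in group
    }
    return sorted(edges)


# Hypothetical interaction pairs.
PAIRS = [("P1", "P2"), ("P1", "P3"), ("P2", "P4")]

if __name__ == "__main__":
    net = build_network(PAIRS)
    print(partners(net, "P1"))
    print(interactions_among(net, ["P1", "P2", "P4"]))
```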

Comparative genomic analysis based on bamboo-related species
The predicted orthologs are shown in the 'orthologs groups' module of the gene card. In addition, the comparative genome search, available from the tools menu, finds orthologous genes that are present in one set of genomes and absent from another set among the bamboo-related species. Moreover, sequences from various species can be aligned with the BLAST tool incorporated in BambooGDB.
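The presence/absence query can be expressed with set operations. This sketch assumes that "present in one set of genomes" means present in every genome of that set and that "absent" means absent from every genome of the other set. The text does not specify either point, and the group identifiers below are made up.

```python
def groups_present_and_absent(group_to_species, present_in, absent_from):
    """Return sorted ortholog groups found in all of present_in and none of absent_from."""
    must_have, must_lack = set(present_in), set(absent_from)
    selected = []
    for group_id, species in group_to_species.items():
        species = set(species)
        if must_have <= species and not (must_lack & species):
            selected.append(group_id)
    return sorted(selected)


# Hypothetical ortholog groups and the species in which each has a member.
GROUPS = {
    "OG_a": {"bamboo", "rice", "sorghum"},
    "OG_b": {"bamboo", "rice", "arabidopsis"},
    "OG_c": {"rice"},
}

if __name__ == "__main__":
    print(groups_present_and_absent(GROUPS, ["bamboo", "rice"], ["arabidopsis"]))
```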

GBrowse
GBrowse is a combination of a database and interactive web pages for manipulating and visualizing annotations on genomes. Entries of different data types are marked in different colors in the browser. As an important and efficient visualization module, GBrowse (38) was incorporated in BambooGDB so that different types of features (gene, CDS, messenger RNA, full-length cDNA, heterozygous SNPs and RNA-seq) can be viewed simultaneously in the context of genomic regions. Users can also follow links from the browser to the detailed feature page of the corresponding entries.

Application
BambooGDB is a novel resource of functional annotation for the bamboo genome and an analytical platform that facilitates studies of bamboo genetics and genomics. Because the genome data of moso bamboo are integrated with data connections and analysis tools, researchers can start from a single gene or functional term of interest and acquire relatively comprehensive knowledge of functional annotation at different research levels, such as expression and functional regulation.
For basic research, BambooGDB provides a full functional annotation of moso bamboo. For example, as shown in Figure 2, a researcher interested in 'glucose-6-phosphate isomerase' may wish to find the genetic information for this protein in bamboo. First, we searched for 'glucose-6-phosphate isomerase' using the quick search on the home page, which links to the result page; there are a total of five results in BambooGDB. Second, clicking a locus of interest ('PH01000376G0610') opens its gene card, which includes a graphical view of the local genomic environment and three tabs with details. The first tab contains fundamental annotation information such as gene/protein name, length and location, expression profiles, heterozygous SNPs, ortholog group and best hits in Arabidopsis or rice. Based on the 'orthologs groups' information, we found that all of the aforementioned search results belong to the 'OG5_126980' group. Clicking 'OG5_126980' leads to further information on this group in the OrthoMCL DB database. Moreover, according to the comparative genomics information, the best hits for locus PH01000376G0610 are AT5g42740.1 and LOC_Os03g564860.2 in the genomes of Arabidopsis and rice, respectively. The second tab highlights the functional and structural features of the homologous glucose-6-phosphate isomerase in moso bamboo. For instance, the domain information is displayed graphically and shows that the sequence of PH01000376G0610 mainly matches the glucose-6-phosphate isomerase domain. Similarly, the GO term prediction assigns PH01000376G0610 to 'glucose-6-phosphate isomerase activity' (GO: 0004347) in the molecular function category.
The pathway part shows that PH01000376G0610 participates in four carbohydrate metabolism pathways (glycolysis/gluconeogenesis, pentose phosphate pathway, starch and sucrose metabolism, and amino sugar and nucleotide sugar metabolism) and lists the following information: KO (K01810), reaction description (glucose-6-phosphate isomerase) and EC number (EC: 5.3.1.9). The third tab presents the sequence information of locus PH01000376G0610, including the CDS sequence, the 1000-bp regions upstream and downstream of the locus and its protein sequence. BambooGDB also provides a 'retrieve sequences' module, which displays specific sequences in the genome of moso bamboo and makes them available for download. Moreover, the aforementioned sequence information can be conveniently copied with the clipboard function implemented in BambooGDB. Additionally, further information on each functional characteristic can be accessed via hyperlinks to the corresponding external databases.
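Retrieving the 1000-bp upstream and downstream regions of a locus amounts to slicing the scaffold sequence around the gene coordinates. The sketch below is a minimal illustration under several assumptions that are not taken from the text: coordinates are 1-based and inclusive, flanks are truncated at scaffold ends, and for a minus-strand gene the flanks are reverse-complemented so that "upstream" follows the gene's orientation. All names and the toy sequence are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet A, C, G, T, N."""
    return seq.translate(_COMPLEMENT)[::-1]


def flanking_regions(scaffold, start, end, strand="+", flank=1000):
    """Return (upstream, downstream) sequences relative to the gene's orientation.

    start and end are 1-based inclusive coordinates on the forward strand.
    """
    left = scaffold[max(0, start - 1 - flank):start - 1]  # bases before the gene
    right = scaffold[end:end + flank]                      # bases after the gene
    if strand == "+":
        return left, right
    # Assumed convention: on the minus strand the roles of the flanks swap.
    return reverse_complement(right), reverse_complement(left)


# Toy scaffold: 5-bp left flank, 9-bp "gene", 5-bp right flank.
SCAFFOLD = "TTGAC" + "ATGAAATAG" + "CCGTA"

if __name__ == "__main__":
    print(flanking_regions(SCAFFOLD, 6, 14, strand="+", flank=3))
    print(flanking_regions(SCAFFOLD, 6, 14, strand="-", flank=3))
```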

Discussion and Future Development
As the first genome database with functional annotation for bamboo, BambooGDB aims to serve not only as an integrated genomic resource dedicated to bamboo but also as a flexible computational platform for future genetic studies of bamboo. Now that the first de novo genome sequence of moso bamboo has been published, the number of genome resequencing, RNA-seq, rare variant, epigenetic and other omics studies of bamboo is expected to keep increasing, especially with the development of sequencing technologies in the next few years. Therefore, BambooGDB will be updated periodically to keep pace with the progress of omics research on bamboo. Meanwhile, manual curation of the literature on bamboo genetics will be carried out to meet the increasing research demands in addressing the genetic complexity of bamboo. The scope of BambooGDB will be expanded to integrate newly generated data. We hope that our continuous efforts will help to improve understanding of the bamboo genome.