Plant Omics Data Center: An Integrated Web Repository for Interspecies Gene Expression Networks with NLP-Based Curation

Comprehensive integration of large-scale omics resources such as genomes, transcriptomes and metabolomes will provide deeper insights into broader aspects of molecular biology. For better understanding of plant biology, we aim to construct a next-generation sequencing (NGS)-derived gene expression network (GEN) repository for a broad range of plant species. So far we have incorporated information about 745 high-quality mRNA sequencing (mRNA-Seq) samples from eight plant species (Arabidopsis thaliana, Oryza sativa, Solanum lycopersicum, Sorghum bicolor, Vitis vinifera, Solanum tuberosum, Medicago truncatula and Glycine max) from the public short read archive, digitally profiled the entire set of gene expression profiles, and drawn GENs by using correspondence analysis (CA) to take advantage of gene expression similarities. In order to understand the evolutionary significance of the GENs from multiple species, they were linked according to the orthology of each node (gene) among species. In addition to other gene expression information, functional annotation of the genes will facilitate biological comprehension. Currently we are improving the given gene annotations with natural language processing (NLP) techniques and manual curation. Here we introduce the current status of our analyses and the web database, PODC (Plant Omics Data Center; http://bioinf.mind.meiji.ac.jp/podc/), now open to the public, providing GENs, functional annotations and additional comprehensive omics resources.

*Corresponding author: E-mail, kyano@isc.meiji.ac.jp; Fax, +81-44-934-7046. (Received August 29, 2014; Accepted November 24, 2014)
Keywords: Correspondence analysis, Database, Gene expression network, Manual curation, Natural language processing (NLP), Omics.

Introduction
The plant sciences have a unique and distinctive position because of their relationship to human food, culture and civilization. In particular, because of the world population explosion and fossil fuel exhaustion, the plant sciences are thought to be critically related to the future of human culture in the context of food security, biofuel production and sustainability. Hence in this big data era, maintenance of more comprehensive research resources, particularly for pan-omics data repositories, is required (Obayashi and Yano 2014). To this end, we maintain the OryzaExpress (gene expression and annotation database for rice) (Hamada et al. 2011), TOMATOMICS (multiomics database for tomato) (Kobayashi et al. 2014) and other species-specific crop databases.
With the availability of next-generation sequencing (NGS), the plant sciences have not only retained their distinctive position but are also taking on growing importance. The progress of plant genomics has been particularly prominent in this century. Currently, the genome sequences of not only typical model plants such as Arabidopsis (Arabidopsis Genome Initiative 2000) and rice (International Rice Genome Sequencing Project 2005), but also non-model plants have been deciphered and published (Garcia-Mas et al. 2012, Chagne et al. 2014, Schmutz et al. 2014), and corresponding genome-related databases have been constructed (Ohyanagi et al. 2006, Tanaka et al. 2008, Bombarely et al. 2011, Goodstein et al. 2012, Lamesch et al. 2012, Sakai et al. 2013).
Among multilayer plant omics information, the transcriptome, which describes the total content and quantity of mRNA molecules, has been understood as an invaluable clue for predicting gene functions based on gene expression similarity or for disclosing the hidden molecular mechanisms behind the gene expression regulatory system, i.e. transcription factors, cis-regulatory elements and small RNAs. Indeed, large-scale transcriptome analyses and database construction have been conducted by taking advantage of microarray technologies (Hamada et al. 2011, Mutwil et al. 2011, Sato et al. 2013a, Sato et al. 2013b). In recent years, we have focused on the emerging technology of NGS, and have found in particular that mRNA sequencing (mRNA-Seq), an application focusing on the layer of the transcriptome, is tremendously useful. In the plant sciences, third parties have already been analyzing and accumulating mRNA-Seq information, and releasing it to the public domain (Li et al. 2013, Postnikova et al. 2013, Ramilowski et al. 2013, Van Moerkercke et al. 2013, Liu et al. 2014). While a few of the previously mentioned gene expression databases include some mRNA-Seq data sets (Mutwil et al. 2011), we now aim to analyze mRNA-Seq information comprehensively across a broad range of species, predict gene expression networks (GENs) using the expression profiles derived from the mRNA-Seq analysis outcomes, and establish them as a core resource of a pan-omics database. The GENs of multiple species should not be isolated from each other (Mutwil et al. 2011, Heyndrickx and Vandepoele 2012), so we are trying to connect them according to the orthologous relationships of compound genes, enabling the evolutionary comprehension of the total network. In addition, we are employing natural language processing (NLP) and manual curation as an advanced option with the aim of enhancing the quality of gene annotations.
Specifically, PubMed (http://www.ncbi.nlm.nih.gov/pubmed) sentences were interpreted and summarized with proprietary NLP tools, and the relationships between two protein identifiers or between a protein identifier and a phenomenon were extracted. The co-occurrence relationships were then manually curated to yield the final NLP outcome.
Our goal is to establish a pan-omics database, the Plant Omics Data Center (PODC; http://bioinf.mind.meiji.ac.jp/podc/), that includes core gene expression information. Here we introduce the current status of the PODC and discuss the future direction of this database.

GEN analysis
The GEN is an ideal technique for grasping similarities of expression profiles among genes simultaneously. By taking advantage of the correspondence analysis (CA) algorithm, we have developed a statistical method to analyze large-scale gene expression profiles to construct GENs (see the Materials and Methods). This method classifies genes according to similarities in gene expression profiles.
For construction of the PODC, we calculated similarities of gene expression profiles with mRNA-Seq expression analysis results (see the Materials and Methods) and the CA algorithm. According to a heuristic manual validation of network adequacy, currently we have defined the top 0.1% of gene pairs in expression similarities as being similarly expressed gene pairs (Arabidopsis thaliana, 622,462 pairs; Oryza sativa, 983,974 pairs; Solanum lycopersicum, 512,368 pairs; Sorghum bicolor, 763,018 pairs; Vitis vinifera, 1,442,892 pairs; Solanum tuberosum, 1,386,466 pairs; Medicago truncatula, 1,445,827 pairs; Glycine max, 3,837,387 pairs) and stored this information in the database. Currently the threshold (0.1%) for significant similarity is a fixed value in the system, but is planned to be a variable value.
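The thresholding step above can be illustrated with a minimal sketch. It assumes that similarity is measured as a Euclidean distance between per-gene co-ordinates (smaller distance meaning more similar expression); the function name `top_fraction_pairs`, the input layout and the toy data are hypothetical and are not taken from the PODC implementation.

```python
import itertools
import math

def top_fraction_pairs(coords, fraction=0.001):
    """Return the closest `fraction` of all gene pairs.

    coords: dict mapping a gene ID to a tuple of co-ordinates
            (hypothetical input; the PODC data layout is not described).
    fraction: share of all pairs to keep (0.001 corresponds to the top 0.1%).
    """
    pairs = []
    for (g1, c1), (g2, c2) in itertools.combinations(sorted(coords.items()), 2):
        pairs.append((math.dist(c1, c2), g1, g2))
    pairs.sort()  # smallest distance (most similar) first
    keep = max(1, int(len(pairs) * fraction))  # always keep at least one pair
    return [(g1, g2) for _, g1, g2 in pairs[:keep]]

# Toy example with four made-up genes and a 20% cut-off
toy = {"a": (0.0, 0.0), "b": (0.0, 0.1), "c": (5.0, 5.0), "d": (9.0, 9.0)}
similar = top_fraction_pairs(toy, fraction=0.2)
```

Exposing the cut-off as a parameter, as in this sketch, is one way the fixed 0.1% threshold could later be made variable.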

Orthology detection among multiple plant species
By means of the OrthoMCL procedure described in the Materials and Methods, 3,780,141 orthologous gene pairs among the eight species were detected, stored in the database and employed to connect interspecies GENs.

NLP and manual curation
So far we have focused on plant reproduction terminology, and gathered PubMed papers by keyword search (Table 1). A total of >28,000 papers were then subjected to NLP and manual curation (see the Materials and Methods). As a consequence, the number of relationships we obtained was 1,772 in A. thaliana, 92 in O. sativa, 119 in S. lycopersicum, two in S. bicolor, none in V. vinifera, 11 in S. tuberosum, one in M. truncatula and six in G. max. The NLP relationships are currently stored in the database as text, but will be graphically shown in the GEN viewer (see Database Functions and Web Interface) in the near future.

Database Functions and Web Interface
How to search the database content
On the home page of the PODC (http://bioinf.mind.meiji.ac.jp/podc/) (Fig. 1), three quick search functions are available: a keyword search for gene annotations including NLP relationships (Fig. 1, blue pane), a sequence homology search with the BLAST program (Fig. 1, green pane) and a GEN search using gene IDs (Fig. 1, red pane). For each function, an advanced search page is also implemented ( Fig

Gene detail information
The current version of the PODC provides the following data categories on the gene detail information page (

GEN viewer
Visualization of a GEN as a network graphic makes it easier to understand the relationships among multiple genes and the characteristics of gene clusters. The web interface for GENs was constructed with Cytoscape Web (http://cytoscapeweb.cytoscape.org/) (Lopes et al. 2010) (Fig. 4A), a graphic network visualization tool. In terms of network representations, each node indicates a gene, and each edge indicates a relationship (Fig. 4A). In the case of the PODC, each solid edge indicates a similarly expressed gene pair, and each dashed edge represents an orthologous or paralogous relationship (Fig. 4A, B). The colors of nodes and edges correspond to the eight plant species and to orthologous relationships. Our GEN viewer allows zooming in and out, panning, and moving nodes and edges with drag-and-drop functionality. About 1,000-2,000 nodes can be visualized simultaneously (depending on the client PC specification). A brief annotation of each gene pops up when the mouse cursor is moved over the node. Detailed information including the gene expression profile, orthologous genes and NLP annotations is shown by clicking or selecting particular nodes (Fig. 4C). Each gene in a GEN is accessible with a keyword search.
When searched genes (nodes) are selected, the node border color changes. GENs can be interactively expanded by every single path from a selected gene, or selected genes can be removed. The number of nodes for each species and number of edges for types of relationship within the GEN are shown (Fig. 4C). Information on functional annotations, sequences and expression profiles of genes within each GEN are downloadable. The GEN data are also downloadable in SIF (simple interaction format) or as an image (PNG format). The SIF file is portable to Cytoscape (Shannon et al. 2003).
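The one-step expansion and SIF export described above can be sketched as follows. This is an illustration only: the relationship labels `coexpressed` and `ortholog` and the potato gene name are hypothetical, since the labels actually written by the PODC are not stated here; the two similarly expressed A. thaliana pairs are those discussed in the CBC example. SIF itself is a plain-text format with one `source relation target` record per line.

```python
def expand_one_step(edges, selected):
    """Return the edges that touch any selected gene (a one-step expansion)."""
    return [e for e in edges if e[0] in selected or e[2] in selected]

def to_sif(edges):
    """Serialize (source, relation, target) triples as tab-delimited SIF lines."""
    return "\n".join(f"{source}\t{relation}\t{target}"
                     for source, relation, target in edges)

# Typed edges: labels and the potato gene name are assumptions for illustration
edges = [
    ("AT3G55800", "coexpressed", "AT3G54050"),
    ("AT3G12780", "coexpressed", "AT1G42970"),
    ("AT3G55800", "ortholog", "hypothetical_potato_gene"),
]
sif_text = to_sif(expand_one_step(edges, {"AT3G55800"}))
```

Keeping the relationship type in the middle column lets a desktop tool such as Cytoscape style similarly expressed and orthologous edges differently after import.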
To provide an example of the GEN, A. thaliana genes encoding enzymes functioning in the photosynthetic Calvin-Benson cycle (CBC) were obtained from the Plant Metabolic Network (http://www.plantcyc.org/; Chae et al. 2014) and used to draw GENs for the eight species. As expected, the resulting GENs demonstrated expression networks of the CBC genes in the species (Supplementary Fig. S1A). While the GENs varied across species, some relationships of similarly expressed genes were conserved among multiple species, such as that between a sedoheptulose-1,7-bisphosphatase (SBPase) gene and a fructose-1,6-bisphosphatase (FBPase) gene in A. thaliana, S. tuberosum and M. truncatula. In recent years, mRNA-Seq data have been accumulating faster than microarray data, and the sensitivity and accuracy of PODC GEN detection will improve as more sample variations are obtained. The A. thaliana GEN of the CBC was further evaluated by comparison with one drawn in another web tool, ATTED-II, which uses microarray data. GENs drawn in both web tools are summarized in Supplementary Fig. S1B. Again, an SBPase gene (AT3G55800) and an FBPase gene (AT3G54050, known as high cyclic electron flow 1) were found to be similarly expressed in ATTED-II as well as in the PODC. SBPase and FBPase are considered to be key steps in regulating carbon flow of the CBC (Tamoi et al. 2005, Liu et al. 2012), and their enzymatic activities are regulated by light conditions via thioredoxin (Michelet et al. 2013). Given that the gene expression similarity of SBPase and FBPase is conserved among species, we can hypothesize that co-ordinated fundamental regulation of gene expression of SBPase and FBPase is important as an understructure sustaining precise modulation of the CBC functions. A relationship between AT3G12780 (phosphoglycerate kinase 1) and AT1G42970 (glyceraldehyde-3-phosphate dehydrogenase B subunit) was also found in both tools.
Several similarly expressed gene pairs were found only in one of the two tools. There are many potential causes of such differences: different platforms (NGS and microarray), different sample sets and different methods to detect gene expression similarities (CA and Pearson's correlation coefficient). Because of this complexity, it is fairly difficult to identify the actual factor causing the differences. However, in terms of the expression similarity among ribulose-1,5-bisphosphate carboxylase/oxygenase small subunit (RbcS) genes (AT1G67090, AT5G38410, AT5G38420 and AT5G38430), the primary reason why the relationship is not found in ATTED-II but is found in the PODC is clear: probes on the microarray cannot separate the family genes because of the high identity in nucleotide sequence, but mRNA-Seq can. This exemplifies an advantage of employing mRNA-Seq data to construct GENs. In principle, mRNA-Seq can quantify the expression levels of all gene models separately, unless those sequences are 100% identical. Moreover, we believe that the future accumulation of mRNA-Seq samples will enhance the advantages of the PODC.

Conclusion and Future Direction
Here we introduced the PODC, a web repository for NGS transcriptomes and GENs with an interactive network viewer. Compared with existing GEN databases (Mutwil et al. 2011), the content depth of NGS mRNA-Seq data in our PODC appears to be unmatched. In addition, we are taking advantage of state-of-the-art NLP techniques for cost-effective accumulation of manually curated plant annotations. We believe that these multiple enrichments of data content make our database unique and invaluable in the plant sciences. We are still enhancing the data content and improving the web interface. As for future plans, we aim to add more plant species; not only model crops, but also minor and non-model plant species. We would also consider incorporating mRNA-Seq reads produced by non-Illumina platforms. In addition, we plan to add more NLP keywords for biotic/abiotic stresses and other critical plant biology terms. Moreover, we are implementing a prediction program for cis-regulatory elements (manuscript in preparation) that are strongly related to GENs in terms of hidden molecular mechanisms for control of gene expression.
We are mainly focusing on the transcriptome, but we plan to broaden the content of the database, i.e. to proteomes, metabolomes and phenomes. We believe that the GEN information in the PODC will become its core information, and make it easy to navigate throughout every plant omics layer.

Materials and Methods
Gene expression data from public data repositories

GEN analysis
We evaluated similarities in gene expression profiles of each gene by CA as described in our previous reports (Yano et al. 2006, Hamada et al. 2011). Conceptually CA summarizes a gene expression data matrix into a lower dimensional space. For each gene and sample, co-ordinates in the low-dimensional space are provided. With these co-ordinates, genes can be plotted in a three-dimensional space. Theoretically, genes with similar expression profiles are closely related. Therefore, the distance between genes in the low-dimensional space indicates similarity in gene expression profiles. The gene expression profiles determined by mRNA-Seq analysis were subjected to the CA procedure (Yano et al. 2006, Hamada et al. 2011). Then the deduced similarity relationships were inspected with the GUI software tool called CA Plot Viewer (http://bioinf.mind.meiji.ac.jp/lab/), and employed as gene expression similarities in PODC.
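To make the notion of distance concrete, the sketch below computes the chi-square distance between two gene expression profiles. In standard CA, Euclidean distances between gene co-ordinates in the full space equal these chi-square distances, and the low-dimensional plot approximates them. The sketch deliberately skips the dimensionality reduction, so it is a simplified stand-in and not the procedure of Yano et al. (2006); the function name and the toy matrix are hypothetical.

```python
import math

def chi_square_distance(matrix, i, k):
    """Chi-square distance between the expression profiles of rows i and k.

    matrix: list of rows (genes), each a list of non-negative expression
            values per sample; all row and column totals must be non-zero.
    """
    total = sum(sum(row) for row in matrix)
    n_cols = len(matrix[0])
    # Column masses: share of the grand total contributed by each sample
    col_mass = [sum(row[j] for row in matrix) / total for j in range(n_cols)]
    row_i, row_k = matrix[i], matrix[k]
    sum_i, sum_k = sum(row_i), sum(row_k)
    # Compare row profiles (values divided by the row total), weighting
    # each sample by the inverse of its column mass
    return math.sqrt(sum(
        (a / sum_i - b / sum_k) ** 2 / c
        for a, b, c in zip(row_i, row_k, col_mass)
    ))

# Toy matrix: genes 0 and 1 share a profile shape at different magnitudes
toy = [[10, 20, 30], [1, 2, 3], [30, 20, 10]]
d_same_shape = chi_square_distance(toy, 0, 1)
d_opposite = chi_square_distance(toy, 0, 2)
```

Because each row is divided by its own total, two genes with proportional expression across samples have a distance of zero regardless of their absolute expression levels.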

Orthology detection among multiple plant species
Orthologous gene pair detection among the eight plant species was performed by employing the OrthoMCL algorithm (http://orthomcl.org/orthomcl/) (Li et al. 2003) with default parameters. First, deduced protein sequences derived from all gene nucleotide sequences were quality controlled by a filter command in OrthoMCL (orthomclFilterFasta 10 20). Secondly, the cleaned protein sequences were concatenated into a single FASTA file, and employed to detect BLASTP (Altschul et al. 1997) similarities among the entire protein sequence set (blastall -p blastp -m 8 -F 'm S' -v 100000 -b 100000 -z 414453 -e 1e-5 -a 20). Then the OrthoMCL commands orthomclLoadBlast and orthomclDumpPairsFiles were run with a configuration (percentMatchCutoff=50, evalueExponentCutoff=-5) on the BLASTP results in order to find potential inparalogous, orthologous and co-orthologous pairs. Finally the MCL clusters were determined with an OrthoMCL command (mcl -abc -I 1.5).
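As a rough illustration of the input to the pair-finding step, the sketch below reads BLAST tabular output (the 12-column format produced by -m 8) and keeps hits at or below an E-value cut-off of 1e-5. It is not a reimplementation of OrthoMCL: the percent-match filter and the orthology logic are omitted, dropping self-hits is an assumption made for this example, and the function name and gene labels are hypothetical.

```python
def parse_blast_tab(lines, max_evalue=1e-5):
    """Parse 12-column BLAST tabular lines and keep significant non-self hits.

    Columns: query, subject, % identity, alignment length, mismatches,
    gap openings, q.start, q.end, s.start, s.end, E-value, bit score.
    Returns a list of (query, subject, percent_identity, evalue) tuples.
    """
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue = float(fields[10])
        if query != subject and evalue <= max_evalue:
            hits.append((query, subject, float(fields[2]), evalue))
    return hits

# Made-up records: one significant hit, one self-hit, one weak hit
example = [
    "geneA\tgeneB\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t300",
    "geneA\tgeneA\t100.0\t200\t0\t0\t1\t200\t1\t200\t0.0\t400",
    "geneA\tgeneC\t40.0\t50\t30\t0\t1\t50\t1\t50\t0.5\t20",
]
hits = parse_blast_tab(example)
```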

NLP and manual curation
Functional annotation strategies are mainly based on sequence similarity searches against functionally determined genes. However, more accurate functional annotation would be based on literature information with so-called manual curation. Manual curation requires the curators to have particular skills in interpreting the literature, and it is quite time consuming. The NLP technique is thought to be a breakthrough in this process. It has the potential to gather information faster than manual curation, but still has the technical problem regarding the accuracy of its results. Here we aim to combine NLP and manual curation, i.e. first we input a massive amount of literature information into the NLP program, then we validated the NLP results manually. With this strategy, we believe that higher quality functional annotations will be generated with a relatively small amount of manual effort.
As a rough idea, our NLP tools (MedScan and PathwayStudio, http://www.elsevier.com/online-tools/pathway-studio/about/pathway-studio-plant) (Novichkova et al. 2003, Yuryev et al. 2006) co-ordinately interpret and summarize PubMed sentences with a dictionary based on A. thaliana, and the outcome contains relationships between two protein identifiers or between a protein identifier and a phenomenon. Since the relationships are based on A. thaliana gene nomenclature, we have to convert the Arabidopsis gene IDs or gene symbols into those of the other seven plant species. To convert the IDs, orthologous relationships in UniProt (http://www.uniprot.org/), TAIR (http://www.arabidopsis.org/), RAP-DB (http://rapdb.dna.affrc.go.jp/), SGN (http://solgenomics.net/) and BioMart (http://www.biomart.org/) (Kasprzyk 2011) are manually employed. Simultaneously, the co-occurrence relationships are manually extracted and curated as the final NLP outcome.
More precisely, particular terms (Table 1) were first searched on PubMed (http://www.ncbi.nlm.nih.gov/pubmed), and the results were saved in XML format. Secondly, the results in XML files were processed by the MedScan program and each pair of related terms (protein, small molecule, complex, cell process, cell object, disease, functional class and treatment) in a PubMed sentence was automatically extracted. Then the extracted relationships were manually inspected and relationships concerning proteins were selected (by taking advantage of the MedScan filter function); simultaneously the orthologous relationships in UniProt (http://www.uniprot.org/), TAIR (http://www.arabidopsis.org/), RAP-DB (http://rapdb.dna.affrc.go.jp/), SGN (http://solgenomics.net/) and BioMart (http://www.biomart.org/) (Kasprzyk 2011) were manually employed to convert the IDs. Finally the selected relationships were subjected to PathwayStudio (by the MedScan Send to PathwayStudio function) in order to summarize the final list of NLP annotations.
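The ID conversion step was performed manually, but its logic can be sketched in code. The example below assumes a lookup table from A. thaliana gene IDs to orthologs per species; the table layout, the function names, the species key and all identifiers are hypothetical placeholders rather than real database records. Terms absent from the table are treated as non-gene terms (for example a phenomenon) and kept unchanged, and a relationship is dropped when one of its genes has no ortholog in the target species.

```python
import itertools

def map_term(term, ortholog_table, species):
    """Return ortholog IDs for a gene term, or the term itself if it is
    not a gene in the table (e.g. a phenomenon)."""
    if term in ortholog_table:
        return ortholog_table[term].get(species, [])
    return [term]

def convert_relationships(relationships, ortholog_table, species):
    """Rewrite (subject, relation, object) triples into the target species."""
    converted = []
    for subject, relation, obj in relationships:
        subjects = map_term(subject, ortholog_table, species)
        objects = map_term(obj, ortholog_table, species)
        for s, o in itertools.product(subjects, objects):
            converted.append((s, relation, o))
    return converted

# Placeholder identifiers, not real gene IDs
table = {
    "AT_EXAMPLE_1": {"rice": ["OS_EXAMPLE_1"]},
    "AT_EXAMPLE_2": {},  # a gene with no known ortholog in any species
}
relationships = [
    ("AT_EXAMPLE_1", "regulates", "pollen development"),
    ("AT_EXAMPLE_2", "binds", "AT_EXAMPLE_1"),
]
rice_relationships = convert_relationships(relationships, table, "rice")
```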

System architecture and software
The PODC was implemented on a UNIX server with CentOS version 5, Apache web server and MySQL Database server. PHP version 5 was employed as a server-side scripting language. JavaScript was adopted to implement client-side rich applications. As for JavaScript libraries, jQuery (http://jquery.com), jQuery UI (http://jqueryui.com), Bootstrap (http://getbootstrap.com), D3 (http://d3js.org) and Cytoscape Web (http://cytoscapeweb.cytoscape.org) were employed. Other conventional utilities for UNIX computing were appropriately installed on the server if necessary. All of the PODC resources are stored in the server and available through HTTP access.
A GUI software tool called CA Plot Viewer (http://bioinf.mind.meiji.ac.jp/lab/) was employed in the manual inspection step in GEN analysis.

Supplementary data
Supplementary data are available at PCP online.

Disclosures
The authors have no conflicts of interest to declare.