Bioinformatics Applications Note: Databases and Ontologies
SynaptomeDB: an Ontology-based Knowledgebase for Synaptic Genes

Motivation: The synapse is integral to the function of the brain and may be an important source of dysfunction underlying many neuropsychiatric disorders. Consequently, it is an excellent candidate for large-scale genomic and proteomic study. However, while the tools and databases available for the annotation of high-throughput DNA and protein data are generally robust, a comprehensive resource dedicated to the integration of information about the synapse is lacking.
Results: We present an integrated database, called SynaptomeDB, to retrieve and annotate genes comprising the synaptome. These genes encode components of the synapse including neurotransmitters and their receptors, adhesion/cytoskeletal proteins, scaffold proteins and membrane transporters. SynaptomeDB integrates various and complex data sources for synaptic genes and proteins.


INTRODUCTION
The synapse is fundamental to the structure and function of the brain through its role in connecting neurons into circuits (Chua et al., 2010). As a result, the synapse is an excellent target of large-scale study of neuropsychiatric disorders. Over the past decade, the number of identified synaptic proteins has increased dramatically, creating a need for a comprehensive resource to integrate information about synaptic genes and proteins (the 'synaptome') from multiple heterogeneous sources. These genes encode components of the synapse including neurotransmitters and their receptors, adhesion/cytoskeletal proteins, scaffold proteins, transporters and others (Wu et al., 2010). Here, we report on an integrated database, SynaptomeDB, which provides a detailed and experimentally verified annotation of all known synaptic proteins.
The development of SynaptomeDB has been motivated by a desire to have a database that can interoperate with different resources by simply storing the object keys (database identifiers), some important attributes such as symbol and name, and the relationships among the objects. The database does not actually store large objects, like sequence records, but built-in web services can retrieve the subset of objects of interest on demand from other sources such as EBI and NCBI.
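The key-only storage idea can be illustrated with a minimal sketch. It assumes that a stored accession is resolved on demand through the public NCBI E-utilities EFetch endpoint; the function names, the `stored_gene` row and the placeholder accession are hypothetical and are not taken from the SynaptomeDB code base.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Public NCBI E-utilities EFetch endpoint.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build an EFetch URL for one stored accession (no network access)."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return EFETCH_BASE + "?" + urlencode(params)


def fetch_record(accession, db="nuccore", rettype="fasta"):
    """Retrieve the full record on demand; requires network access."""
    with urlopen(build_efetch_url(accession, db, rettype)) as response:
        return response.read().decode("utf-8")


# Hypothetical local row: only the key and a few attributes are stored.
# The accession is a placeholder, not a real RefSeq identifier.
stored_gene = {"symbol": "SHANK3", "refseq_accession": "ACCESSION_PLACEHOLDER"}
url = build_efetch_url(stored_gene["refseq_accession"])
```

Under this design the local row stays small, and the sequence record itself is requested from the remote source only when a user asks for it.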

Data collection
The human synaptome protein list was compiled from a review of all peer-reviewed proteomics studies from 2004 to 2010, as well as from publicly available databases that included proteins in the post- and pre-synapse, the presynaptic active zone and the synaptic vesicle (Abul-Husn et al., 2009; Zhang et al., 2007). To date, more than 2200 published studies have reported data on post- or pre-synaptic genes and proteins, active zone and vesicles. Synaptome genes were annotated based on RefSeq (GRCh37/hg19) and UCSC (hg19) to identify human orthologs. We further annotated these genes by querying 42 databases covering all aspects of biology, including genes, proteins, pathways and other biological concepts. Both the annotation and curation processes are fully automated and can be executed regularly. The flowchart of the SynaptomeDB construction, along with sample queries and an explanation of the curation process, is shown in Supplementary Material 1.
SynaptomeDB is a gene-centered relational database. It relies primarily on existing database identifiers derived from community databases such as NCBI, GO (Ashburner et al., 2000), EBI (Goujon et al., 2010) and Ensembl (Flicek et al., 2011), as well as the known relationships among those identifiers based on NCBI RefSeq (Mudunuri et al., 2000). The relational design makes SynaptomeDB an extensible platform for integration with other environments such as variation analysis. Regular updates of the database will be performed weekly to incorporate new information. First, the annotation process will be performed on new genes. A simple update will then be executed to populate the other data in the database. The relational structure of the database allows updates to populate automatically in all related fields.
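The gene-centered layout described above can be sketched as follows. This is an illustration only: it uses Python's built-in `sqlite3` module instead of the MySQL server used for SynaptomeDB, and the table names (`gene`, `xref`), column names and placeholder accessions are assumptions rather than the actual schema, which is given in the Supplementary Material.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE gene (              -- one row per synaptome gene (hypothetical)
    gene_id INTEGER PRIMARY KEY,
    symbol  TEXT NOT NULL UNIQUE,
    name    TEXT
);
CREATE TABLE xref (              -- identifiers held in external databases
    gene_id   INTEGER NOT NULL REFERENCES gene(gene_id) ON DELETE CASCADE,
    source    TEXT NOT NULL,
    accession TEXT NOT NULL,
    PRIMARY KEY (gene_id, source, accession)
);
""")


def add_gene(conn, symbol, name, xrefs):
    """Insert a gene and its external identifiers; return the new gene_id."""
    cur = conn.execute(
        "INSERT INTO gene (symbol, name) VALUES (?, ?)", (symbol, name)
    )
    gene_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO xref (gene_id, source, accession) VALUES (?, ?, ?)",
        [(gene_id, source, accession) for source, accession in xrefs],
    )
    return gene_id


def xrefs_for(conn, symbol):
    """Return (source, accession) pairs for a gene symbol, sorted by source."""
    rows = conn.execute(
        "SELECT x.source, x.accession FROM xref AS x "
        "JOIN gene AS g ON g.gene_id = x.gene_id "
        "WHERE g.symbol = ? ORDER BY x.source",
        (symbol,),
    )
    return rows.fetchall()


# Placeholder accessions, not real identifiers.
add_gene(conn, "ANK3", "ankyrin 3",
         [("RefSeq", "NM_PLACEHOLDER"), ("Ensembl", "ENSG_PLACEHOLDER")])
ank3_xrefs = xrefs_for(conn, "ANK3")
```

Because every cross-reference points back to a single `gene` row, a change made to that row during an update is visible to all related records through the join, without rewriting the cross-reference table.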

Pathway enrichment
Over-representation analysis (Fury et al., 2006), detailed in Supplementary Material 2, was performed against a collection of gene sets curated in the Molecular Signatures Database (MSigDB) (Liberzon et al., 2011) to identify pathways that are enriched for synaptic genes, which can inform subsequent biological analyses. Here, the proportion of genes in a given pathway appearing on the SynaptomeDB list is compared with the proportion of genes not appearing on the list, and a hypergeometric test (Holmans, 2010) is performed to test for differences in these proportions. This analysis is also fully automated and can be updated as new genes and sets are identified.
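As an illustration of the test described above, the following minimal sketch computes a one-sided hypergeometric P-value for a single pathway. The exact procedure used for SynaptomeDB is the one given in Supplementary Material 2; the function name, the argument names and the toy numbers here are assumptions made for the example.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """Return P(X >= k) for a hypergeometric variable X.

    N: number of genes in the background set
    K: number of background genes on the synaptome list
    n: number of genes in the pathway
    k: number of pathway genes that are on the synaptome list
    """
    upper = min(K, n)          # largest overlap that is possible
    total = comb(N, n)         # ways to choose the pathway genes
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / total


# Toy numbers only: 10 background genes, 4 on the list,
# a pathway of 3 genes of which 2 are on the list.
p_value = hypergeom_upper_tail(k=2, N=10, K=4, n=3)
```

A small P-value indicates that the pathway contains more synaptome genes than expected by chance given its size. When many gene sets are tested, as with MSigDB, a correction for multiple testing would normally be applied; that step is not shown here.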

RESULTS
We assembled a list of genes (n = 1886) that encode all known proteins of the synapse. This comprises 575 genes encoding proteins in the presynaptic nerve terminal and active zone, 107 from the synaptic vesicles and 1755 from the postsynaptic density (there is some overlap between categories). The list includes strong candidates for a number of neuropsychiatric disorders, for example ANK3 for bipolar disorder (Ferreira et al., 2008), GRM7 for major depression (Shyn et al., 2011), PDE4B for schizophrenia (Kahler et al., 2010) and SHANK3 for autism (Gauthier et al., 2009). SynaptomeDB is a database with a web front-end application that integrates the various and complex data sources for these synaptic genes.

Database design and features
The database is created using MySQL 5.5. The parsers are written in Perl and BioPerl (Stajich et al., 2002). The Ensembl BioMart is also used to create some of the tables. A conceptual model of the database is shown in Supplementary Material 3. These tables describe fundamental information about a particular gene: name, description, associated accession numbers, chromosome location, function and comparative map information, among other variables. Information from Ensembl also occupies a significant part of the database. It is important to note that no extensive cleaning of the data is performed during the database creation and update process. As detailed in Supplementary Material 1, the major cleaning process involves character screening to make sure the data are compatible with HTML viewing as well as database queries. This allows automatic updates and eliminates some well-known problems created by data cleaning.
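A character-screening step of this kind might look like the sketch below. The actual screening rules are those in Supplementary Material 1, and the SynaptomeDB parsers are written in Perl; this Python version, including the function names `screen_text` and `to_html`, is an assumed illustration of the general idea only.

```python
import html
import unicodedata


def screen_text(value):
    """Remove control-type characters and collapse runs of whitespace.

    Characters in the Unicode 'C' categories (control, format, unassigned
    and similar) are dropped, except tab and newline, which are kept and
    then collapsed to single spaces together with other whitespace.
    """
    kept = "".join(
        ch for ch in value
        if ch in "\t\n" or unicodedata.category(ch)[0] != "C"
    )
    return " ".join(kept.split())


def to_html(value):
    """Screen a field, then escape it so it is safe to show in a web page."""
    return html.escape(screen_text(value), quote=True)


raw_description = "alpha\x00 subunit\t<1>"   # made-up field value
stored = screen_text(raw_description)
displayed = to_html(raw_description)
```

On the database side, passing values as bound parameters (as opposed to pasting them into the SQL string) keeps quotes and other special characters from breaking a query, so the field content itself does not need to be altered further.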

Web interface
SynaptomeDB provides a user-friendly web interface. Users can query SynaptomeDB using gene information such as names, gene IDs, synonyms and genomic regions. The output consists of a graphical representation of protein structure from PDB (Berman et al., 2000), protein-protein interactions from STRING (von Mering et al., 2003) and protein domain architecture from HPRD (Keshava et al., 2009). All information is hyperlinked to its original resources. SynaptomeDB allows the user to export multiple samples from different sample sets, in a desired order, to a number of common file formats including Excel, Word, CSV and XML. The web interface of SynaptomeDB provides a rich set of functions for searching the database. In general, search results are initially presented as summary statements of the individual gene records contained in SynaptomeDB, along with links to the gene detail pages that show all details of the gene records returned by the query. A simple text search function is also provided to enable maximum flexibility in searching all records. The advanced search page provides complex searching functions. General database statistics are shown on the home page and give a quick summary of the genes, as well as the last updates of the system.

FUTURE PLANS AND CONCLUSIONS
The database was constructed following guidelines described previously (Kirov et al., 2005). It can be used to answer complex queries, such as defining a set of candidate genes based on genome localization or specific function. The database provides a valuable resource to both experimental and bioinformatics groups by bringing together different sources of information and functional annotation in one place, and in a high-throughput fashion. A synaptome-based strategy for psychiatric genetic sequencing is valuable because there is evidence for synaptic proteins playing a role in psychiatric disorders (Glessner et al., 2010), and because these proteins represent the most 'druggable' targets for pursuit of novel therapies. Our application will further research in this area both in its current form and with planned modifications. These include navigation based on GO and on functional pathways and networks among the synaptome genes in the database, as well as the inclusion of, or links to, protein-protein interactions. We intend to extend SynaptomeDB to connect to other psychiatric resources such as SZGene (Allen et al., 2008) and AlzGene (Bertram et al., 2007), and also to integrate variants from several ongoing studies that include synaptome genes, including the 1000 Genomes Project.