PCOSBase: a manually curated database of polycystic ovarian syndrome.

Polycystic ovarian syndrome (PCOS) is one of the main causes of infertility and affects 5-20% of women of reproductive age. Despite the increased prevalence of PCOS, the mechanisms involved in its pathogenesis and pathophysiology remain unclear. The expansion of omics approaches for studying the mechanisms of PCOS has led to a vast number of proteins being linked to PCOS, which makes it a challenge to collate and deposit this deluge of data in one place. A knowledge-based repository named PCOSBase was developed to systematically store all proteins related to PCOS. These proteins were compiled from various online databases and published expression studies. Rigorous criteria were developed to identify those that were highly related to PCOS. They were manually curated and analysed to provide additional information on gene ontologies, pathways, domains, tissue localizations and diseases associated with PCOS. Other proteins that might interact with the PCOS-related proteins identified in this study were also included. Currently, 8185 PCOS-related proteins have been identified and assigned to 13 237 gene ontology vocabulary terms, 1004 pathways, 7936 domains, 29 disease classes, 1928 diseases, 91 tissues and 320 472 interactions. All publications related to PCOS are also indexed in PCOSBase. Data entries are searchable from the main page and from the Search, Browse and Datasets tabs. An advanced protein search is provided to search for specific proteins. To date, PCOSBase has the largest collection of PCOS-related proteins. PCOSBase aims to become a self-contained database that can be used to further understand PCOS pathogenesis and to support the identification of potential PCOS biomarkers. Database URL: http://pcosbase.org.


Introduction
Polycystic ovarian syndrome (PCOS) is an endocrine disorder that is characterized by a combination of two out of three features: ovulatory dysfunction, hyperandrogenism and the presence of polycystic ovaries (1). PCOS is difficult to diagnose because these features can lead to various phenotypic manifestations (2). Clinical findings have shown that women with PCOS have a higher risk of developing other complications such as endometrial cancer (3), diabetes (4), hypertension (5) and depression (6). These phenotypic manifestations and disease associations significantly hinder progress in deciphering the cause of PCOS (7).
Transcriptomics (8) and proteomics (9) have been used to identify gene and protein differences between women with and without PCOS, and the resulting data analysis could be used to elucidate the cause of PCOS. The number of published expression studies has increased significantly since 2003, and this has contributed to the vast amount of PCOS-related molecular data. Unfortunately, these molecular data are scattered across various general biological databases (GenBank and UniProt) and the literature, which makes it difficult to find all genes and proteins that are related to PCOS. This limitation led us to develop PCOSBase to house 8185 manually curated PCOS-related proteins. These proteins were filtered from 17 492 proteins identified from 30 expression studies and 9 databases. Bioinformatic analyses were performed on these proteins to characterize and classify them into specific datasets based on their molecular characteristics. PCOSBase also provides indexed publications related to PCOS. These features distinguish PCOSBase from the previously published PCOSKB (10), which contained 241 sequences as of July 2017. Detailed information on proteins and diseases related to PCOS can be found in PCOSBase, but it does not cover protein-drug associations such as those described in Open Targets (www.targetvalidation.org). Open Targets has listed 1119 proteins identified as drug targets for PCOS, and 73% of those can be found in PCOSBase (11). PCOS is the focus of this study because information on and understanding of its complex molecular mechanism are inadequate, while at the same time it is associated with many well-described diseases identified from clinical findings. For this reason, PCOSBase serves as a comprehensive, medically oriented repository that will be an excellent aid in providing and integrating accurate molecular information for an in-depth understanding of PCOS.

Data collection
The PCOS keywords used previously, together with additional keywords such as 'gene expression,' 'protein expression,' 'expression,' 'transcriptomics,' 'proteomics' or 'microarray,' were also used to search for relevant publications in PubMed (21), ArrayExpress (22), ScienceDirect and Scopus. Genes and proteins that were reported as significantly expressed in those publications were included as PCOS-related proteins. These publications were indexed and listed in PCOSBase.
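A keyword search of this kind can be expressed as a boolean query that combines a disease term group with an expression term group. The following is a minimal sketch only: the function name `build_query` is hypothetical, the disease terms are assumed for illustration (the exact PCOS keywords are not restated here), and the quoting style may need adapting to each search engine.

```python
def build_query(disease_terms, expression_terms):
    """Combine two keyword groups as (A1 OR A2 ...) AND (B1 OR B2 ...).

    Both arguments are lists of plain-text keywords. Each keyword is
    quoted so that multi-word phrases are searched as a unit.
    """
    def group(terms):
        return "(" + " OR ".join('"' + term + '"' for term in terms) + ")"

    return group(disease_terms) + " AND " + group(expression_terms)


if __name__ == "__main__":
    # Disease terms are assumptions for illustration only.
    disease = ["polycystic ovarian syndrome", "PCOS"]
    # Expression terms are taken from the keyword list in the text.
    expression = ["gene expression", "protein expression", "expression",
                  "transcriptomics", "proteomics", "microarray"]
    print(build_query(disease, expression))
```

Grouping the terms this way keeps the disease constraint mandatory while allowing any one of the expression keywords to match.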
All genes and proteins from disease-associated databases and published expression studies were compared against the NCBI Gene (23) and UniProt (24) databases to obtain their unique Gene IDs and UniProt IDs. Overlapping entries that were found in more than one database or study were merged.
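The merging of overlapping entries can be illustrated with a short sketch that groups records by their unique identifier and keeps track of every source that reported them. This is an assumed simplification, not the curation pipeline itself: the function name `merge_records`, the record layout and the identifiers in the example are all hypothetical.

```python
from collections import defaultdict


def merge_records(records):
    """Merge (identifier, source) pairs so each identifier appears once.

    Returns a dict mapping each identifier to a sorted list of the
    distinct sources (databases or studies) in which it was found.
    """
    sources_by_id = defaultdict(set)
    for identifier, source in records:
        sources_by_id[identifier].add(source)
    return {identifier: sorted(sources)
            for identifier, sources in sources_by_id.items()}


if __name__ == "__main__":
    # Placeholder identifiers and source names, for illustration only.
    records = [
        ("DEMO_P1", "database A"),
        ("DEMO_P1", "study 1"),
        ("DEMO_P2", "database A"),
        ("DEMO_P1", "database A"),  # exact duplicate, collapsed by the set
    ]
    for identifier, sources in merge_records(records).items():
        print(identifier, sources)
```

Keeping the list of sources alongside each merged identifier preserves the provenance of an entry even after duplicates are collapsed.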

Database organization and architecture
All collected data, including relevant information on PCOS-related proteins, functional annotation information and PCOS publications, were organized into 29 tables.
Twenty-eight of these tables are linked to one another; the PCOS publications table is the only one that is not linked (Figure 1).
PCOSBase was built as a relational database using MySQL Server 5.0.11. The web interfaces were designed using Laravel 5.4 (PHP web framework), HTML and JavaScript.
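To make the relational layout concrete, the sketch below builds a tiny four-table example with Python's built-in `sqlite3` module, used here purely as a stand-in for MySQL. The table names, column names and rows are hypothetical and do not reproduce the actual PCOSBase schema. It shows one pair of annotation tables joined through a link table, plus a standalone publications table that has no foreign keys.

```python
import sqlite3

# Hypothetical, heavily reduced schema: not the real PCOSBase tables.
SCHEMA = """
CREATE TABLE protein (
    uniprot_id TEXT PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE go_term (
    go_id TEXT PRIMARY KEY,
    label TEXT NOT NULL
);
CREATE TABLE protein_go (
    uniprot_id TEXT NOT NULL REFERENCES protein(uniprot_id),
    go_id      TEXT NOT NULL REFERENCES go_term(go_id),
    PRIMARY KEY (uniprot_id, go_id)
);
CREATE TABLE publication (
    pub_id INTEGER PRIMARY KEY,
    title  TEXT NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database filled with placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO protein VALUES (?, ?)",
                     [("DEMO_P1", "demo protein A"),
                      ("DEMO_P2", "demo protein B")])
    conn.execute("INSERT INTO go_term VALUES (?, ?)",
                 ("GO:DEMO1", "demo ontology term"))
    conn.executemany("INSERT INTO protein_go VALUES (?, ?)",
                     [("DEMO_P1", "GO:DEMO1"), ("DEMO_P2", "GO:DEMO1")])
    conn.execute("INSERT INTO publication (title) VALUES (?)",
                 ("demo PCOS publication",))
    conn.commit()
    return conn


def proteins_for_go(conn, go_id):
    """Return the names of proteins annotated with the given GO term."""
    rows = conn.execute(
        "SELECT p.name FROM protein AS p "
        "JOIN protein_go AS pg ON pg.uniprot_id = p.uniprot_id "
        "WHERE pg.go_id = ? ORDER BY p.name",
        (go_id,),
    ).fetchall()
    return [name for (name,) in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    print(proteins_for_go(conn, "GO:DEMO1"))
```

A link table such as `protein_go` is one common way to model many-to-many relationships between proteins and their annotations; whether PCOSBase uses this exact pattern is an assumption.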

Results and Discussion
Database summary
See Table 1.

Database interface and access
The PCOSBase interface contains six main menus, i.e. About, Search, Browse, Datasets, Network and Help, which help the user to navigate the respective pages easily. The Datasets tabs are placed in the header and appear on every page of PCOSBase, which allows users to quickly select and be redirected to their desired dataset page.
vi. The Network menu contains all networks constructed using the PCOS-related proteins, Interactions and PCOS-related diseases datasets. Currently, PCOSBase only provides several static PCOS networks. Figure 3 is one of the networks that can be found in this menu; it depicts the association of PCOS with other diseases.
vii. The Help menu provides the user manual of PCOSBase, the database schema and all the references that were used to retrieve the data. All terms, definitions and references that are used in PCOSBase are also provided on the Help page.
Each entry in PCOSBase provides a brief description. For example, if the user searches for or selects one of the proteins in PCOSBase, for instance 'androgen receptor,' the user is taken to the Description page of 'androgen receptor.' Seven tabs containing different information on 'androgen receptor' will appear. If the user clicks on one of the entries in the GO tab, the user is redirected to the description page of that GO term. The list of PCOS-related proteins that are associated with this ontology term also appears below the GO description. Likewise, the descriptions of pathways, domains, diseases, tissues, databases, resources and partners appear when the user clicks on those entries.

Conclusion and future perspective
In the next few years, the size of PCOS molecular data is expected to increase, especially with the application of new sequencing technologies such as next-generation sequencing in analysing PCOS samples. To ensure that PCOSBase is always up to date, all information in this database will be periodically updated. It is very important to consider comprehensive cataloguing of all types of data in any PCOS publication so as to ensure that they are accessible to PCOS researchers and clinicians for quick and easy reference. Ultimately, the genomic and molecular information in this database will serve as a reliable repository that can be used to search for potential PCOS biomarkers towards the development of improved diagnostics and treatment for PCOS.

Figure 3. PCOS-disease interaction network. This network was predicted based on protein-protein interactions (PPI), and 20 diseases were predicted to be highly associated with PCOS. The network demonstrates the complexity of PCOS-disease associations. The green node represents PCOS, and the size of each node denotes the number of proteins shared between PCOS and the respective associated disease, which indicates the degree of association.
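The node sizes described for Figure 3 depend on the number of proteins shared between PCOS and each disease. A minimal sketch of that count is given below as a plain set intersection. This is an assumed simplification that ignores the PPI-based prediction step; the function name `shared_protein_counts` and all identifiers in the example are hypothetical.

```python
def shared_protein_counts(pcos_proteins, disease_to_proteins):
    """Count the proteins each disease shares with the PCOS protein set.

    pcos_proteins: iterable of protein identifiers related to PCOS.
    disease_to_proteins: dict mapping a disease name to an iterable of
        protein identifiers associated with that disease.
    Returns a dict mapping each disease name to its shared-protein count.
    """
    pcos = set(pcos_proteins)
    return {disease: len(pcos & set(proteins))
            for disease, proteins in disease_to_proteins.items()}


if __name__ == "__main__":
    # Placeholder identifiers and disease names, for illustration only.
    pcos = ["P1", "P2", "P3", "P4"]
    diseases = {
        "disease X": ["P1", "P2", "P9"],
        "disease Y": ["P4"],
        "disease Z": ["P7", "P8"],
    }
    counts = shared_protein_counts(pcos, diseases)
    # Rank diseases by the number of shared proteins, largest first.
    for disease, count in sorted(counts.items(), key=lambda kv: -kv[1]):
        print(disease, count)
```

In a network drawing, these counts could be mapped to node sizes so that diseases sharing more proteins with PCOS appear larger.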