HNdb: an integrated database of gene and protein information on head and neck squamous cell carcinoma

The total amount of scientific literature has grown rapidly in recent years. Specifically, there are several million citations in the field of cancer. This makes it difficult, if not impossible, to manually retrieve relevant information on the mechanisms that govern tumor behavior or the neoplastic process. Furthermore, cancer is a complex disease or, more accurately, a set of diseases. The heterogeneity that permeates many tumors is particularly evident in head and neck (HN) cancer, one of the most common types of cancer worldwide. In this study, we present HNdb, a free database that aims to provide a unified and comprehensive resource of information on genes and proteins involved in HN squamous cell carcinoma, covering data on genomics, transcriptomics, proteomics, literature citations and cross-references to external databases. Different literature searches of MEDLINE abstracts were performed using specific Medical Subject Headings (MeSH terms) for oral, oropharyngeal, hypopharyngeal and laryngeal squamous cell carcinomas. A curated gene-to-publication assignment yielded a total of 1370 genes related to HN cancer. The diversity of results allowed the identification of novel and mostly unexplored gene associations, revealing, for example, that processes linked to response to steroid hormone stimulus are significantly enriched in genes related to HN carcinomas. Thus, our database expands the possibilities for gene network investigation, providing potential hypotheses to be tested. Database URL: http://www.gencapo.famerp.br/hndb


Introduction
The high-throughput 'omics' technologies (genomics, transcriptomics, proteomics and metabolomics) and advanced computational tools have led to a more thorough understanding of the neoplastic process as well as to the identification of potential biomarkers for cancer diagnosis and prognosis. These high-throughput technologies accumulate scientific data on an unprecedented scale. However, these data are dispersed across several databases, including, inter alia, The Cancer Genome Atlas (TCGA) (1), Gene Expression Omnibus (GEO) (2), ONCOMINE (3), the Human Protein Atlas (4) and the Human Metabolome Database (HMDB) (5-7). This decentralized structure poses substantial problems when attempting to draw conclusions or formulate new hypotheses.
PubMed (8), a freely available database developed and maintained by the US National Library of Medicine, is one of the most important web-based search tools for biomedical information retrieval. Currently, PubMed has over 3 million citations on cancer. Thus, it is extremely difficult to manually retrieve all relevant data, even after splitting the subarea of interest or using specific queries. In addition, literature searches on cancer are hampered by the fact that cancer is a complex disease. Cancer and cancer subtypes more closely resemble a set of diseases, each disease with different features and unknowns. Head and neck (HN) cancer is the sixth most common type of cancer worldwide, with about 600 000 new cases in 2012 (9) and a remarkable example of heterogeneous malignancy.
Similar to what is observed in many types of neoplasms, the challenge in searching the literature on HN cancer is particularly difficult due to its diversity, which involves diversity in histological type, anatomical location and primary risk factors. For instance, the anatomical sites affected by the disease and the primary risk factors can be used to divide head and neck squamous cell carcinomas (HNSCC) into at least three classes. Two of these classes involve human papillomavirus (HPV)-positive disease (mostly oropharyngeal with a favorable prognosis) and HPV-negative disease (with less favorable prognosis and a different molecular profile) (10). HPV-positive tumors are primarily wild-type TP53, whereas HPV-negative tumors present mutated TP53 and show high chromosome instability (10,11), which may sustain advantageous metabolic pathways, aid in escaping the inhibitory effects of suppressor signals (12) or promote oncogenic effects (13). A third class of HNSCC consists of nasopharyngeal tumors in which distinct etiological factors are known, including Epstein-Barr virus (EBV) infection (14).
Lymph node status and tumor size remain the most powerful prognostic factors for HNSCC. However, survival is frequently low. Only 40-50% of patients remain alive 5 years after diagnosis (10,15). This is likely because tumors in early stages frequently present few symptoms leading to a delay in diagnosis. Furthermore, therapy effectiveness is highly variable, even in early lesions or histologically similar cases.
The HNSCC molecular progression model suggests that some genetic alterations are present in benign hyperplasia, for instance the inactivation of the CDKN2A gene. According to this model, the clinical progression to dysplasia, in situ carcinoma, and, finally, invasive carcinoma is supported by the increased accumulation of molecular alterations (16). TP53 mutations, CCND1 gene amplification, EGFR activation/PTEN inactivation, and the deletion of different genome segments are some examples of the genetic alterations related to HNSCC progression, as stated by Leemans and collaborators (10). Such alterations promote the neoplastic phenotype defined by Hanahan and Weinberg (17), including increased cell proliferation, insensitivity to growth suppression factors, apoptosis resistance, sustained angiogenesis, energy metabolism alterations, immune attack avoidance and the acquisition of invasion and metastasis capability.
Investigations into HNSCC emphasize the importance of identifying the mechanisms and the molecular changes triggered during the malignant transformation that culminates in the neoplastic phenotype. New data on potential markers may shed light on tumor biology and, consequently, lead to the development of novel drugs. Literature mining is a fundamental starting point for this discovery process, but the recent exponential growth in biological data is well beyond the limit of a complete manual search in most cases. In turn, automated literature mining can help to find disease-related biomarkers and their interrelationships, and extract hidden information with tools able to efficiently target valuable research questions and generate testable hypotheses. During this process, the articles of interest are retrieved, the biological entities are identified in texts, and specific information, particularly relationships between biological entities, is extracted.
One of the challenges in automated approaches is the exact identification of genes, proteins or diseases, since they may be referred to by different names, share names and symbols, or even be described by nonstandard nomenclature in literature and databases (18). Another challenge is to identify consistent descriptions of gene products and their associated features, and supporting evidence for inferring such associations. To overcome these limitations, text-mining applications have incorporated tools to recognize specific keywords and to capture relevant sentences and ontologies. For example, relationships may be extracted by investigating entities that co-occur in the same report, title, abstract or even sentence, or by the so-called natural language processing (NLP) methods. NLP methods are based on the structure of sentences and on how the biological data are mentioned (19). However, this approach has both advantages and limitations, since it may give rise to erroneous relationships depending on the parameters used (20).
The controlled vocabularies of the Gene Ontology (GO) (21) project enable coupling of gene products to their associated biological processes, cellular components and molecular functions (22). However, the automatic identification of GO-literature association is less accurate than manual curation methods, such as the one using Medical Subject Headings (MeSH) (23) for indexing PubMed articles, a process performed by trained experts that potentially generates few false positive assignments. In addition, MeSH-literature associations may be linked to genes or diseases, facilitating the identification of previously unrevealed relationships between entities, such as protein-protein, drug-effect and protein-disease (24,25).
In this work, we developed an in-house methodology to conduct literature mining aiming to identify genes and gene products related to various aspects of HNSCC. A database (HNdb) was established to unify the information on these genes and proteins, covering data on genomics, transcriptomics, proteomics, literature citations and cross-references to external databases. The information was wrapped in a friendly web interface, which provides easy and rapid access to the HNSCC-related genes and to a vast number of biological data resources. The interface aims to facilitate the selection of candidates for validation assays and the identification of potential new markers, as exemplified in this study.

Data collection and literature mining
The workflow of our literature mining consisted of two initial automated stages and a separate manual step. In stage I, the studies were retrieved from PubMed database using a combination of MeSH terms and Boolean operators. Three literature searches based on different MeSH terms were run on 29 June 2015. In stage II, the articles selected in stage I were associated with genes using the gene2pubmed association file (26), which contains the gene identifiers (gene IDs) and the respective PubMed article identifiers (PMIDs). For this association, only human genes were accepted. The PMIDs thereby obtained were downloaded via PubMed and compiled, and publications assigned to MeSH terms for HN neoplasms were manually curated by two independent investigators. The details on the MeSH terms and on the literature search strategy are presented in Supplementary File 1 and an overview of the workflow is provided in Figure 1.
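The stage II association step can be illustrated with a minimal sketch. It assumes the usual tab-separated layout of the NCBI gene2pubmed file (taxonomy ID, gene ID, PMID, with a header line starting with '#'); the function name and the toy PMIDs in the usage note are hypothetical and are not taken from the HNdb pipeline.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set

HUMAN_TAX_ID = "9606"  # NCBI taxonomy identifier for Homo sapiens


def genes_for_pmids(gene2pubmed_lines: Iterable[str],
                    query_pmids: Set[str]) -> Dict[str, Set[str]]:
    """Map each human gene ID to the query PMIDs that are linked to it.

    Assumes three tab-separated columns per line: tax_id, GeneID, PubMed_ID.
    """
    gene_to_pmids = defaultdict(set)
    for line in gene2pubmed_lines:
        # Skip the header/comment line and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        tax_id, gene_id, pmid = line.rstrip("\n").split("\t")[:3]
        # Keep only human genes whose PMID was retrieved by the query.
        if tax_id == HUMAN_TAX_ID and pmid in query_pmids:
            gene_to_pmids[gene_id].add(pmid)
    return dict(gene_to_pmids)
```

For example, passing the opened gene2pubmed file and the set of PMIDs retrieved in stage I returns a dictionary keyed by gene ID; the resulting gene list would still require the manual curation step described above.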
Considering that our automated strategy may have missed relevant articles and genes, the only two databases, to our knowledge, that also focus on HNSCC were searched: the Head and Neck and Oral Cancer Database (HNOCDB) and Oral Cancer Gene Database (OrCGDB) (27,28). PMIDs/genes not detected by our approach but selected by these databases were included in our list after manual curation to confirm a positive involvement with the HNSCC sites of interest. Precision (specificity) and recall (sensitivity) values were calculated, respectively, as the proportion of retrieved genes that were relevant, and as the proportion of relevant genes that were retrieved [Precision = genes retrieved and relevant/total genes retrieved; Recall = genes retrieved and relevant/total genes relevant in collection]. To overcome the difficulty of predicting the total number of genes in PubMed that are relevant for our search, we used HNOCDB and OrCGDB data on the same query.
To establish a gene-to-HNSCC association, contingency tables were constructed using the curated set of articles addressing genes in HNSCC, and PMIDs and genes from all other neoplasms. Fisher's exact test was performed to evaluate association, and P values < 0.05 were considered statistically significant. The analyses were performed using SAS® 9.3 (SAS Institute Inc., Cary, NC, USA) for Windows. Genes were then ranked according to their level of association with HNSCC (from the most relevant to the least relevant, as defined by the number of publications addressing the gene in HNSCC) by a hypergeometric test (29) performed using Stirling's approximation for high factorial values (30). The method calculates the probability of k or more query-relevant publications for a gene A by chance, where S is the score for gene A, m the number of publications in the gene2pubmed association file, n the number of publications retrieved for the query and present in the gene2pubmed association file, j the number of publications that involve gene A, and k the number of query-relevant publications that involve gene A. The formula (ln = natural logarithm) is:
Due to the importance of identifying prognostic signatures for HNSCC as well as markers associated with disease progression, an independent search was performed using MeSH and non-MeSH terms related to 'Metastasis' and 'Prognosis/Outcome' against abstracts and titles of the manually curated PMID set of articles (Supplementary File 1).
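The hypergeometric ranking described earlier in this section can be sketched as follows, using the symbols m, n, j and k as defined there. This is an illustration under stated assumptions, not the authors' formula: the score is taken here to be S = -ln P(X >= k), the Stirling approximation includes the first correction term 1/(12n), and small factorials are computed exactly. All function names are hypothetical.

```python
import math


def ln_factorial(n: int, exact_below: int = 20) -> float:
    """ln(n!): exact for small n, Stirling's approximation otherwise."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n < exact_below:
        return math.log(math.factorial(n))
    # Stirling series with the first correction term.
    return (n * math.log(n) - n + 0.5 * math.log(2.0 * math.pi * n)
            + 1.0 / (12.0 * n))


def ln_choose(a: int, b: int) -> float:
    """ln of the binomial coefficient C(a, b), for 0 <= b <= a."""
    return ln_factorial(a) - ln_factorial(b) - ln_factorial(a - b)


def ln_hypergeom_tail(m: int, n: int, j: int, k: int) -> float:
    """ln P(X >= k): chance that gene A, with j of the m publications,
    has at least k of the n query-relevant publications."""
    lower = max(k, n + j - m, 0)
    upper = min(j, n)
    if lower > upper:
        return -math.inf  # the event is impossible
    denom = ln_choose(m, n)
    terms = [ln_choose(j, i) + ln_choose(m - j, n - i) - denom
             for i in range(lower, upper + 1)]
    # Log-sum-exp keeps very small probabilities from underflowing.
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def association_score(m: int, n: int, j: int, k: int) -> float:
    """Assumed score S = -ln P(X >= k); larger means stronger association."""
    return -ln_hypergeom_tail(m, n, j, k)
```

Working in log space is a design choice made for this sketch: with m in the millions the binomial coefficients overflow ordinary floating-point arithmetic, which is presumably why an approximation to large factorials is needed in the first place.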
The data collection workflow will be routinely updated twice per year to incorporate new PMIDs and genes.

Database frameworks and web interface
To integrate potential biomarkers involved in HNSCC with data from the available literature, we constructed a MySQL relational database system hosted on an Apache server running the Linux operating system. The front end of the web platform was developed using JavaScript, HTML and PHP, and the back end is supported by PHP and Perl. The platform provides users with the ability to search for and download information on the genes and proteins involved in HN cancer.
The home page presents the database objectives and provides tools for searching genes related to HNSCC, their expression pattern and chromosome location. External data were included in the database to facilitate access to the maximum amount of information on a particular gene or protein. For example, the genes selected by users are linked to PMIDs, metabolic pathways (31-34), associated ontologies (21), somatic mutations in HN cancer (35), genetic disorders (36) and microarray data. HNSCC microarray data were obtained from the GEO (2) and ONCOMINE (3) platforms at the time of manuscript preparation (GEO accession numbers GSE9844, GSE6631, GSE1722, GSE13601, GSE3524, GSE2379, GSE25099 and ONCOMINE dataset Ginos Head-Neck) (37-44) and may help users identify genes with similar expression patterns. Data on proteins, including interactions and drugs that target them (4,45-53), are also available.
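To make the idea of a relational gene-to-publication store concrete, the sketch below builds a tiny in-memory database. Everything in it is an assumption for illustration: the table and column names are hypothetical, SQLite (from the Python standard library) stands in for MySQL, and the PMIDs are placeholder integers. It does not reproduce the actual HNdb schema.

```python
import sqlite3

# Hypothetical three-table layout: genes, publications, and a link table.
SCHEMA = """
CREATE TABLE gene (
    gene_id    INTEGER PRIMARY KEY,
    symbol     TEXT NOT NULL,
    chromosome TEXT
);
CREATE TABLE publication (
    pmid INTEGER PRIMARY KEY
);
CREATE TABLE gene_publication (
    gene_id INTEGER NOT NULL REFERENCES gene(gene_id),
    pmid    INTEGER NOT NULL REFERENCES publication(pmid),
    PRIMARY KEY (gene_id, pmid)
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with two genes and placeholder PMIDs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO gene VALUES (?, ?, ?)",
                     [(7157, "TP53", "17"), (1956, "EGFR", "7")])
    conn.executemany("INSERT INTO publication VALUES (?)",
                     [(1,), (2,), (3,)])
    conn.executemany("INSERT INTO gene_publication VALUES (?, ?)",
                     [(7157, 1), (7157, 2), (1956, 3)])
    return conn


def publication_counts(conn: sqlite3.Connection):
    """Return (symbol, number of linked PMIDs), most cited gene first."""
    query = """
        SELECT g.symbol, COUNT(gp.pmid) AS n_pubs
        FROM gene AS g
        LEFT JOIN gene_publication AS gp ON gp.gene_id = g.gene_id
        GROUP BY g.gene_id
        ORDER BY n_pubs DESC, g.symbol
    """
    return conn.execute(query).fetchall()
```

A link table of this kind lets one gene carry many PMIDs and one PMID mention many genes, which is the shape of the gene2pubmed data the database is built from.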

GO and pathway analysis
The curated set of genes related to HNSCC was imported into DAVID, the Database for Annotation, Visualization and Integrated Discovery (54,55), and the genes were annotated for GO and pathways using the whole human genome as background. The one-tail Fisher Exact Probability Value was used for gene-enrichment analysis, and Bonferroni-corrected P values < 0.05 were considered significant. Ingenuity Pathway Analysis (IPA) software (Qiagen, Redwood City, CA, USA) was also used to identify relevant canonical pathways overrepresented in the set of HNSCC-related genes.
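The Bonferroni step applied to the enrichment P values can be sketched as below. This is a generic illustration of the correction, not a reproduction of DAVID's internal procedure; the function names and the example term labels are hypothetical.

```python
from typing import Dict, List


def bonferroni(p_values: Dict[str, float]) -> Dict[str, float]:
    """Multiply each raw P value by the number of tests, capped at 1."""
    n_tests = len(p_values)
    return {term: min(1.0, p * n_tests) for term, p in p_values.items()}


def significant_terms(p_values: Dict[str, float],
                      alpha: float = 0.05) -> List[str]:
    """Return the terms whose corrected P value is below alpha."""
    corrected = bonferroni(p_values)
    return sorted(term for term, p in corrected.items() if p < alpha)
```

With three hypothetical terms whose raw P values are 0.001, 0.02 and 0.5, only the first remains below 0.05 after correction, since 0.02 multiplied by three tests exceeds the threshold.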

Database querying
The database is freely available and can be searched at http://www.gencapo.famerp.br/hndb/ with three input forms. By typing the gene symbol, aliases, gene or protein name, accession number or ID into the search box, users can obtain information on whether a gene has already been related to HN cancer. Users can also retrieve all genes related to HN cancer at once and evaluate their expression in HN tumor samples and paired surgical margins, according to eight microarray studies (37-44) selected at the time of manuscript preparation and described in the 'Database frameworks and web interface' section. The search settings are configured to use the official gene symbols, ID numbers and aliases from the National Center for Biotechnology Information (NCBI) or Ensembl Project (56,57), as well as proteins (by accession number) from the Universal Protein Resource (UniProt) (58).
Users can also browse chromosome regions associated with HN cancer. The data returned by the queries can be downloaded as a spreadsheet or a text file. The results of a particular gene are displayed in a new page that provides the official gene name, gene IDs, aliases, chromosome location and gene expression pattern generated via microarray studies on tumor tissues as well as articles that support its involvement in HNSCC or report prognostic markers. As indicated above, the results also include gene ontologies, metabolic pathways and links to external databases on expression patterns in normal tissues, somatic mutations in cancer and gene-phenotype or disease associations. The protein page provides 3D structures and posttranslational modifications, metabolite and protein-protein interactions, expression patterns and drugs for targets of interest.

Results and discussion
In total, the 'Neoplasms by site' search resulted in 1 819 931 articles (published between 1928 and 2015). Two searches for 'Head and Neck Neoplasms' resulted in 38 862 and 41 086 articles (published between 1945 and 2015), respectively, which, after gene2pubmed association and exclusion of redundancy, generated a list of 1611 genes. Following manual curation, 421 genes not related to HNSCC were excluded and a list of 1190 genes was obtained. To this list, 180 of the 517 genes identified by the HNOCDB and OrCGDB databases but not detected by our approach were added after a thorough manual reevaluation, resulting in 1370 genes in total. Considering these data, the precision (specificity) of our automated approach was estimated at 74%, and recall (sensitivity) was estimated at 87%. Although these values are satisfactory, they still need to be improved, since not all the genes retrieved by the approach were considered relevant after manual curation. In addition, several relevant genes were missed, which indicates that the literature search in future versions of HNdb has to be expanded to include articles identified through digital libraries besides PubMed (e.g. Google Scholar, Web of Science and Scopus) (59-61), and that approaches for information extraction, such as NLP-based methods, should be added.
The analysis of contingency tables constructed using our PMID sets revealed that, although HNSCCs compared to all neoplasms (except HNSCC) show genes with differential citation frequency at the 0.05 level of significance, none of these genes are exclusively associated with HNSCC. In fact, the established HNSCC genes listed by Leemans and collaborators (10) (CCND1, CDKN2A, EGFR, MET, PIK3CA, PTEN, SMAD4, TP53) are also associated with several other tumors (62-69), and all are present in our list of HNSCC-related genes. These results highlight the need for extensive basic and clinical research focused on the unique characteristics of this group of carcinomas.
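The reported precision and recall can be checked against the gene counts given above. The sketch assumes that precision uses the 1611 automatically retrieved genes as denominator and that recall uses the 1370 genes of the final curated list (1190 retained plus 180 added); the text does not spell out these denominators, but they reproduce the stated 74% and 87%.

```python
from typing import Tuple


def precision_recall(retrieved: int, retrieved_relevant: int,
                     relevant_total: int) -> Tuple[float, float]:
    """Precision = relevant retrieved / retrieved;
    recall = relevant retrieved / all relevant."""
    precision = retrieved_relevant / retrieved
    recall = retrieved_relevant / relevant_total
    return precision, recall


# Counts taken from the text; their pairing is an assumption.
RETRIEVED = 1611            # genes after gene2pubmed association
RETRIEVED_RELEVANT = 1190   # genes kept after manual curation
ADDED_FROM_OTHER_DBS = 180  # relevant genes missed by the automated search

precision, recall = precision_recall(
    RETRIEVED, RETRIEVED_RELEVANT,
    RETRIEVED_RELEVANT + ADDED_FROM_OTHER_DBS)
print(f"precision = {precision:.1%}, recall = {recall:.1%}")
```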
One hundred forty-eight of 1370 genes were linked to at least five PMIDs and thus were classified as top HNSCC-related genes, with TP53 and EGFR being the first two genes of this list (Table 1). These scores for TP53 and EGFR were confirmed by the hypergeometric test (Supplementary Table 1), and indicate that they represent the most extensively studied ones and certainly exhibit relevant results. Regarding the 893 genes mentioned by only one article, many of them probably have not yet been completely exploited as potential markers and deserve further investigations.
The 1370 HNSCC-related genes showed a heterogeneous distribution along the chromosomes (Table 2) and, as expected, many of them were mapped to known HNSCC 'hot spots' such as 11q13 (70,71). However, several others were mapped to less frequently cited regions. Approximately 10% were mapped to chromosome 1, 7% to chromosome 11 and almost the same amount to chromosome 17, a distribution not correlated with the size in Mb of each chromosome.
To evaluate the performance of our literature mining approach, we compared our nonredundant list of 1190 genes with the top genes selected in the HNOCDB and OrCGDB (currently frozen) databases. Considering the same anatomical sites analyzed in the present work, HNOCDB extracted 133 genes in oral, 14 in tongue, 7 in hypopharyngeal, 3 in oropharyngeal and 60 in laryngeal cancers through text-mining. OrCGDB selected 374 genes involved in oral cancer by searching PubMed abstracts and MeSH terms. A total of 517 nonredundant genes were identified by these databases. After manual curation, 180 genes retrieved from HNOCDB and OrCGDB were added to our list of 1190 genes. In contrast with these databases, the present study performed three searches using MeSH terms and was more stringent by excluding articles that also analyzed non-HNSCC tumors. Our gene list (the largest of the three databases) is therefore more specific and more focused on the tumors of interest. In addition, the IPA showed that the top canonical pathway associated with our 1190 genes is the Molecular Mechanisms of Cancer (P = 6.64E-66, overlap 34.5%, 126/365), thus supporting their relevance in the neoplastic process. In contrast, this pathway was not associated with the OrCGDB and HNOCDB genes (n = 517), whose top-ranked pathways were Aryl Hydrocarbon Receptor Signaling (P = 1.42E-32, overlap 29.3%, 41/140), Bladder Cancer Signaling (P = 4.44E-32, overlap 39.1%, 34/87) and Hepatic Fibrosis/Hepatic Stellate Cell Activation (P = 5.49E-32, overlap 24.6%, 45/183). Furthermore, HNdb is the only database that uses specific MeSH terms to link genes to literature data on prognosis and outcome (Supplementary Table 2, also available on the gene results page), facilitating the identification of markers that are relevant to tumor biology and therapy response.
To investigate the biological meaning of the HNSCC-related genes, we performed GO and pathway analyses using DAVID tools. A total of 1329 DAVID identifiers were mapped from the list of 1370 genes, and similar annotation terms were clustered into groups, removing redundancy. More than 500 annotation clusters were obtained, 86 of them with enrichment scores > 5.0 and Bonferroni-corrected P < 0.05 (Supplementary Table 3). The results showed an overrepresentation of clusters related to tissue development and differentiation, response to stimulus, signal transduction, cell proliferation, cell migration, apoptosis, transcription and cell adhesion, which are biological processes relevant to cancer. In addition, the top five canonical pathways identified by the IPA for these 1370 genes were Molecular Mechanisms of Cancer (Figure 2A), Colorectal Cancer Metastasis Signaling, Role of Macrophages, Fibroblasts and Endothelial Cells in Rheumatoid Arthritis, Pancreatic Adenocarcinoma Signaling and IL-8 Signaling (P = 4.90E-71, 1.95E-58, 7.11E-56, 5.46E-53 and 2.25E-48, respectively), thus strongly supporting our strategy and the informative character of the set of genes.
Furthermore, the diversity of results compiled in our dataset allowed the identification of novel and mostly unexplored gene associations. For example, the DAVID analysis revealed that processes related to response to steroid hormone stimulus were significantly enriched in our list of genes (enrichment score = 37.99, Bonferroni-corrected P = 7.50E-31) and IPA showed beta-estradiol as one of the top upstream regulators (P value of overlap = 3.35E-163), ranking next to TGFB1, TNF and TP53 (Figure 2B). Few studies have explored the metabolic pathways involved in the response to steroid stimulus in HNSCC. Egloff and collaborators (72) observed that estrogen induces activation of members of the mitogen-activated protein kinase (MAPK) family in HNSCC cell lines. The authors also reported evidence that cross talk between the estrogen receptor and the epidermal growth factor receptor is present in HNSCC. In turn, Brooks and collaborators (73) found that increased levels of estrogen receptor β promote NOTCH1 expression and differentiation of HNSCC cells both in vitro and in vivo. Thus, we demonstrate that a database integrating multiple types of data greatly expands the possibilities for gene network investigation, providing potential associations to be tested.

Conclusions
Despite the development of tools to mine vast amounts of genomic data, to our knowledge, there is no initiative to curate and compile information from the literature regarding genes, proteins, metabolic pathways, diseases, prognosis/outcomes and drugs associated with HNSCC. HNdb is an effort toward this goal and is intended to be an integrated database with rapid and easy-to-use tools that facilitate literature and biological data mining, thereby promoting research and generating new insight into the development of useful markers for HN cancer.

Supplementary data
Supplementary data are available at Database Online.