Disease Ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data

The current version of the Human Disease Ontology (DO) (http://www.disease-ontology.org) database expands the utility of the ontology for the examination and comparison of genetic variation, phenotype, protein, drug and epitope data through the lens of human disease. DO is a biomedical resource of standardized common and rare disease concepts with stable identifiers organized by disease etiology. The content of DO has had 192 revisions since 2012, including the addition of 760 terms. Thirty-two percent of all terms now include definitions. DO has expanded the number and diversity of research communities and community members by 50+ during the past two years. These community members actively submit term requests, coordinate biomedical resource disease representation and provide expert curation guidance. Since the DO 2012 NAR paper, there have been hundreds of term requests and a steady increase in the number of DO listserv members, Twitter followers and DO website usage. DO is moving to a multi-editor model utilizing Protégé to curate DO in web ontology language. This will enable closer collaboration with the Human Phenotype Ontology, EBI's Ontology Working Group, Mouse Genome Informatics and the Monarch Initiative among others, and enhance DO's current asserted view and multiple inferred views through reasoning.


INTRODUCTION
Human disease data is a cornerstone of biomedical research for identifying drug targets, connecting genetic variations to phenotypes, understanding molecular pathways relevant to novel treatments and coupling clinical care and biomedical research (1,2). Consequently, across the multitude of biomedical resources there is a significant need for a standardized representation of human disease to map disease concepts across resources, to connect gene variation to phenotypes and drug targets and to support development of computational tools that will enable robust data analysis and integration (3,4). Defining a biomedical domain within the context of an ontology creates a rigorous knowledge backbone for the annotation of biomedical data through defined concepts connected by specified relations (5,6). Ontologies with their clearly-defined and well-structured descriptions are vital tools for the effective application of 'omic' information through computational approaches.
The Human Disease Ontology (Figure 1) (7) (DO, http://www.disease-ontology.org) is a community driven standards-based ontology that is focused on representing common and rare disease concepts captured across biomedical resources with the mission of providing a disease interface between data resources through ongoing support (term review and integration) of disease terminology needs. The DO project has had a significant impact on the development of biomedical resources, as evidenced by the body of 95 Google Scholar citations to DO's 2012 NAR paper (7).
The Human DO includes only concepts of disease and by design is meant to be a disease-focused scaffold for associating additional facts about disease. DO does not include progression (early, late, metastasis, stages) or manifestations (transient, acute, chronic) of disease as part of the definition of disease. DO does not intentionally include compound disease terms (those describing the combination of two disease terms) such as glaucoma associated with pupillary block; rather, these diseases are represented by two distinct disease terms. DO integrates disease concepts from ICD-9, the National Cancer Institute (NCI) Thesaurus (8), SNOMED-CT (10) and MeSH (https://www.nlm.nih.gov/mesh/MBrowser.html) extracted from the Unified Medical Language System (UMLS) (9) based on the UMLS Concept Unique Identifiers for each disease term. DO also includes disease terms extracted directly from Online Mendelian Inheritance in Man (OMIM) (11), the Experimental Factor Ontology (EFO, http://www.ebi.ac.uk/efo/) and Orphanet (12).

THE ENHANCED HUMAN DO
In this submission, we report on major enhancements to the DO database since 2012 including content growth, improved data structure, new areas of community based curation efforts, new community partners and a switch to web ontology language (OWL) based curation of DO. The DO website has been maintained with periodic data updates. Improvement of data content has been the primary focus of the past two years. These updates further expand the utility of DO for representing common and rare human diseases, assessment of genomic variants among human cancers, defining human disease within model organism databases, unifying disparate disease representations, connecting phenotypes to disease and connecting Big Data knowledge between pathway, protein and drug target databases.

DO content update
We report here the expanded and more highly curated content of the DO. The DO team has committed 192 revisions since the previous DO NAR report (7). The proportion of DO terms with textual definitions has increased from 22 to 32% (2087/6419).
During this time DO's curatorial efforts have focused on enriching the content of rare genetic diseases, cardiovascular diseases, neurodegenerative diseases, inherited metabolic disorders, diseases related to intellectual disabilities and cancer. The focus of these curation efforts has been driven by requests from the research community for new terms and by requests for review of groups of terms, review of areas of DO or sets of disease terms being utilized within a biomedical resource.

Rare genetic diseases in DO
Since the 2012 NAR paper, we have improved DO's representation of genetic disease and rare diseases in collaboration with OMIM, Orphanet (http://www.orpha.net) and the National Organization for Rare Disorders (https://www.rarediseases.org). These resources provide rich clinical descriptions of rare disease prevalence, inheritance and epidemiology for the set of ∼6800 rare diseases. To date, DO has not distinguished between rare and common disease terms. In future revisions, DO will integrate disease prevalence to augment DO's disease classification. Disease prevalence, defined as the proportion of a population that is affected, can then be utilized to designate rare diseases in DO and harmonize with the European (1 in 2000 affected persons) and USA (fewer than 200 000 affected individuals) definitions of rare.
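As a minimal sketch of how the two definitions of rare quoted above could be applied once prevalence is available, the following assumes hypothetical function names and treats the European boundary (exactly 1 in 2000) as inclusive; neither choice is stated in the text, and this is not DO's implementation.

```python
from fractions import Fraction

# Thresholds quoted in the text; the names are illustrative assumptions.
EU_MAX_PREVALENCE = Fraction(1, 2000)   # 1 in 2000 affected persons
US_MAX_AFFECTED = 200_000               # fewer than 200 000 affected individuals


def prevalence(affected: int, population: int) -> Fraction:
    """Proportion of a population that is affected (exact arithmetic)."""
    return Fraction(affected, population)


def is_rare_eu(prev: Fraction) -> bool:
    # Assumption: a prevalence of exactly 1 in 2000 still counts as rare.
    return prev <= EU_MAX_PREVALENCE


def is_rare_us(affected_in_us: int) -> bool:
    # The US definition is a head count rather than a proportion.
    return affected_in_us < US_MAX_AFFECTED
```

The European definition is a proportion while the USA definition is an absolute count, so the same disease can be rare under one and not the other.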

DO structure updates
We report here examples of the structural improvements in the third and fourth tier of DO. Since the previous DO NAR report (7), the subtypes of each body system disease have been reviewed and classified. For example, to update the classification of cardiovascular system disease terms, Harrison's Principles of Internal Medicine (13) was utilized to more specifically categorize disease terms into four types of cardiovascular disease: heart conduction disease, heart disease, pericardium disease and vascular disease. All DO terms under the cardiovascular system disease node were reviewed, their parentage was assessed and updated as needed and additional nodes were defined as subtype categories, for example atrioventricular block and sinoatrial node disease were added as subtypes of heart conduction disease.
The genetic disease parent node in DO has been restructured. Genetic diseases in DO are subtyped into chromosomal disease or monogenic disease. There are currently seven DO terms that do not fit into either of these two subtypes. These diseases are subtyped directly under genetic disease as their etiology involves causal mutations with an unknown pattern of inheritance, multi-genic mutations or multiple patterns of inheritance (e.g. Coffin-Siris syndrome).

Automated DO to OMIM mapping pipeline
In this report, we describe the results of an automated OMIM to DO mapping pipeline devised to identify candidate mappings. The pipeline was developed as part of the DISEASES (Disease-gene associations mined from literature, http://diseases.jensenlab.org/Search) project. A dictionary of DO terms and synonyms was matched to titles and acronyms from OMIM disease pages (14). Filtering rules were applied to achieve a one-to-one OMIM to DO mapping and to eliminate false matches between similarly named but distinct OMIM titles. The filtering rules included: removal of the possessive case (e.g. Alzheimer's), stripping of punctuation, quotes and parentheses, removal of various prefixes and postfixes and other ontology-specific words (e.g. finding, Ambiguous) and conversion of Roman numerals to Arabic numerals. Between November 2013 and April 2014 this collaboration produced 658 candidate OMIM to DO mappings. This pipeline has identified OMIM IDs (multiple phenotypes and inheritance variants) that represent a single disease and that had not previously been identified through DO's UMLS concept mapping or the DO team's curation efforts. For example, 16 additional OMIM IDs were identified, reviewed and integrated into DO for Alzheimer's disease. These OMIM IDs include both 'Phenotype description, molecular basis known' (OMIM symbol #) and 'Phenotype description, molecular basis unknown' (OMIM symbol %).
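The filtering rules listed above can be illustrated with a short sketch. This is not the DISEASES pipeline itself: the function names, the lowercasing step, the word list and the placeholder identifiers are assumptions made for illustration, and the prefix/postfix rule is omitted because the text does not enumerate it.

```python
import re

# Assumption: only the two example words named in the text are filtered.
ONTOLOGY_WORDS = {"finding", "ambiguous"}


def _to_roman(n: int) -> str:
    """Lowercase Roman numeral for small n (enough for disease subtypes)."""
    out = ""
    for value, symbol in ((10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")):
        while n >= value:
            out += symbol
            n -= value
    return out


ROMAN_TO_ARABIC = {_to_roman(n): str(n) for n in range(1, 40)}


def normalize(name: str) -> str:
    text = name.lower()                       # assumption: case-insensitive match
    text = re.sub(r"['’]s\b", "", text)       # possessive: alzheimer's -> alzheimer
    text = re.sub(r"[^\w\s]", " ", text)      # punctuation, quotes, parentheses
    tokens = [t for t in text.split() if t not in ONTOLOGY_WORDS]
    tokens = [ROMAN_TO_ARABIC.get(t, t) for t in tokens]   # ii -> 2
    return " ".join(tokens)


def candidate_mappings(do_names: dict, omim_titles: dict) -> dict:
    """Map each OMIM ID to a DO ID when its titles hit exactly one DO term."""
    index = {}
    for doid, names in do_names.items():
        for name in names:
            index.setdefault(normalize(name), set()).add(doid)
    candidates = {}
    for omim_id, titles in omim_titles.items():
        hits = set()
        for title in titles:
            hits |= index.get(normalize(title), set())
        if len(hits) == 1:                    # drop ambiguous matches
            candidates[omim_id] = next(iter(hits))
    return candidates
```

For instance, with the placeholder inputs `{"DOID:example-1": ["Alzheimer's disease"]}` and `{"OMIM:example-1": ["ALZHEIMER DISEASE"]}`, both names normalize to `alzheimer disease` and a single candidate mapping is returned. Several OMIM IDs may still map to the same DO term, as in the Alzheimer's disease case described above.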

Assessing disease vocabulary concept overlap
The DO continues to strive to be a resource for unifying disease concepts directly through DO terms and indirectly through DO's extensive cross-mappings. Ultimately, through collaborative development, DO aims to provide a complete set of disease term concepts. Evaluation to identify areas of DO to be augmented involves examining concept overlap. As a first step to examine the connectivity of disease vocabularies via cross-references, we carried out a cross-comparison analysis for seven disease vocabularies: the Human DO, National Drug File Reference Terminology (NDF-RT) disease terminology (15), NCI Thesaurus disease terminology, Orphanet (12) (ORDO), KEGG MEDICUS, CTD MEDIC and HPO-D. UMLS includes a meta-thesaurus that combines health and biomedical vocabularies using grouping and semantic integration of synonymous relationships. MeSH, OMIM and SNOMED-CT are among the UMLS source vocabularies; therefore, cross-references to disease concepts from MeSH, OMIM or SNOMED-CT were further (indirectly) mapped to a UMLS disease concept (when such a cross mapping exists within UMLS).
All MeSH, OMIM, SNOMED-CT and UMLS cross-reference links (directly provided by a disease vocabulary or indirectly derived as described above) were used in the comparison analysis. To justify our cross-mapping approach, we have also carried out overlap analysis either without direct links through MeSH, OMIM and SNOMED-CT, or without indirect links through UMLS (see Supplementary Tables S1 and S2). The overlap analysis of cross-references between disease vocabularies changed markedly for practical reasons. For example, using only UMLS for the overlap analysis excluded a number of disease-related OMIM cross-references that did not have a corresponding UMLS concept. Similarly, excluding any indirect UMLS links prevents NCI from having significant overlap with other disease vocabularies, as NCI does not provide any MeSH, OMIM or SNOMED-CT cross-references.
The sizes of the disease vocabularies cross-compared with DO are provided in Table 1 (see also Supplementary Table S3). The NDF-RT ontological framework organizes disease concepts into a hierarchical structure derived from the MeSH vocabulary, and it explicitly asserts different relations between drugs and diseases. NCI Thesaurus integrates different kinds of concept schemes and their interrelationships in a unified conceptual framework. NCI disease terms belong to six semantic types, including disease or syndrome, neoplastic process, pathologic function, mental or behavioral dysfunction, sign or symptom and abnormality. ORDO integrates Orphanet multi-hierarchical classifications of rare diseases with semantic interoperability. KEGG MEDICUS contains integrated information for drugs, diseases and other health-related concepts. CTD MEDIC has been employed to hierarchically annotate disease-associated toxicogenomic relationships and is derived from MeSH and OMIM. MEDIC disease terms were downloaded, and all of the primary/alternative MeSH and OMIM identifiers including the MEDIC Slim identifiers were used in the cross-comparison analysis. HPO-D provides genetic disease annotations for human phenotypic abnormalities using disease concepts in OMIM, Orphanet and DECIPHER databases.

Cross-comparison results and conclusions
The count of unique cross-references provided by each disease vocabulary is summarized in Table 2. UMLS links reported in this table are provided in both 'indirect' (i.e. derived via UMLS from MeSH, OMIM and SNOMED-CT links) and 'direct' (i.e. provided directly by the vocabulary) form along with the unique count of the union of these two links being 'combined', showing there can be substantial unique UMLS cross-references provided using the 'indirect' linking method but also strong overlap.
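The distinction between 'direct', 'indirect' and 'combined' UMLS links can be sketched as follows. The identifier prefixes, the function name and the lookup table are assumptions for illustration; the text does not specify how the mapping was implemented.

```python
# Assumption: cross-references are strings of the form "PREFIX:identifier".
UMLS_PREFIX = "UMLS_CUI"
SOURCE_PREFIXES = {"MESH", "OMIM", "SNOMEDCT"}   # UMLS source vocabularies


def umls_links(xrefs: set, to_umls: dict) -> tuple:
    """Return (direct, indirect, combined) sets of UMLS concept identifiers.

    xrefs:   cross-references provided by a vocabulary for its terms
    to_umls: hypothetical lookup from a MeSH/OMIM/SNOMED-CT identifier to
             the set of UMLS concepts it maps to (absent if no mapping)
    """
    direct = {x for x in xrefs if x.split(":", 1)[0] == UMLS_PREFIX}
    indirect = set()
    for x in xrefs:
        if x.split(":", 1)[0] in SOURCE_PREFIXES:
            indirect |= to_umls.get(x, set())
    return direct, indirect, direct | indirect
```

Because 'combined' is the union of the two sets, its unique count is at most the sum of the 'direct' and 'indirect' counts, and the shortfall measures how strongly the two overlap.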
It is worth noting that DO provides the largest number of unique cross-references (35 895), which contributes ∼64% of the total number of unique cross-references (Table 2). This emphasizes DO's role as a resource rich in cross-references and its usefulness as a disease-centric scaffold for data. On average, each DO term has more than four cross-references (35 895/8757). The large number of cross-references indicates the role of the DO in interoperability in the DO domain.

Mapping methodology
If any two disease terms from different disease vocabularies shared one or more cross-references, identity was assumed and they were treated as synonymous terms. Each vocabulary has its own emphasis on particular disease categories. For instance, ORDO and HPO-D are primarily composed of genetic diseases, and NCI Thesaurus has a specific focus on cancers. Not surprisingly, ORDO and HPO-D have a high degree of overlap with each other, and other disease vocabularies may have a low degree of overlap with them (Table 3).
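The shared cross-reference rule can be sketched as below, assuming each vocabulary is held as a mapping from term identifier to its set of cross-references (a representation chosen here for illustration, with hypothetical function names).

```python
def covered_terms(row_vocab: dict, col_vocab: dict) -> set:
    """Terms of row_vocab sharing at least one cross-reference with col_vocab."""
    col_xrefs = set().union(*col_vocab.values())
    return {term for term, xrefs in row_vocab.items() if xrefs & col_xrefs}


def percent_covered(row_vocab: dict, col_vocab: dict) -> float:
    """Percentage of row_vocab terms covered by col_vocab."""
    return 100.0 * len(covered_terms(row_vocab, col_vocab)) / len(row_vocab)


# Toy data with made-up identifiers.
vocab_a = {"a1": {"X:1"}, "a2": {"X:1", "X:2"}, "a3": {"X:9"}}
vocab_b = {"b1": {"X:1"}, "b2": {"X:3"}}

print(sorted(covered_terms(vocab_a, vocab_b)))   # ['a1', 'a2']
print(sorted(covered_terms(vocab_b, vocab_a)))   # ['b1']
```

The count is not symmetric: two terms of one vocabulary can share a cross-reference with a single term of the other, which is why an overlap matrix built this way reports different numbers above and below the diagonal.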
It is noteworthy that most disease vocabularies, except DO, have very low overlap with NCI Thesaurus. DO has more general coverage of disease categories: it has a relatively low degree of overlap with the rare diseases represented in ORDO and HPO-D, moderate overlap with CTD, NDF-RT and KEGG diseases and high overlap with NCI cancer diseases (Table 3).
The percent overlap between the seven different disease vocabularies is presented in Table 4. There are a number of observations apparent. KEGG has the fewest disease terms and, therefore, a limited ability to cover the disease concepts used by other terminologies; however, it does so rather effectively, covering 56% of CTD and 50% of ORDO. HPO-D extensively uses ORDO to annotate their disease-related terms, so it may not be a surprise that it covers 87% of the ORDO disease terms. For the same reason, it is not very surprising that HPO-D and ORDO both cover about the same amount of the KEGG disease concepts. NCI, with its emphasis on cancer, has limited overlap (at most 38%) with the other disease terminology concepts. NDF-RT's focus on druggable diseases appears to limit its coverage of known diseases without treatments, but it still covers more than half of CTD and KEGG disease concepts. Lastly, it is rather interesting to compare the cross-coverage of CTD and DO. Clearly, CTD covers druggable targets better than DO (as manifest by 99% coverage of NDF-RT) but has limited coverage of cancer disease concepts (as indicated by a 24% overlap with NCI).
Alternatively, DO has good coverage of cancer concepts (as indicated by 87% overlap with NCI) but has only moderate coverage of druggable/treatable diseases (as indicated by a 61% overlap with NDF-RT). This indicates areas of strength and improvement for DO to cover additional disease concepts (e.g. by identifying missing NDF-RT disease concepts).

Updates to the DO website
The current version of the DO website (7) (version 1.0) provides a comprehensive resource to perform full-text search- [...] (18), Bioconductor GeneAnswers (19)) and downloads of the DO Neo4j database on GitHub. The DO API allows for querying of terms for a specific DOID through a REST-style URL. For example, the URL for DOID:11725 is http://www.disease-ontology.org/term/DOID:11725/.
Overlap matrix legend: The number of identical terms between any two disease vocabularies indicates the degree of overlap between them. The off-diagonal numbers in the overlap matrix can be interpreted as the number of terms from the row disease resource covered by the column disease resource. For example, DO has common cross-references with 1492 of the 4799 ORDO terms, while ORDO has common cross-references with 1952 of the 8757 DO terms. The diagonal numbers in the overlap matrix indicate the total number of terms in each disease resource. (a) The total number of disease terms in each disease vocabulary.
Website usage legend: The increased number of visits (sessions), number of users and percent of returning visitors to the DO website over the past year and monthly averages between December 2011 and June 2014.
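The REST-style term URL can be built from a DOID as in the sketch below. It follows only the pattern of the DOID:11725 example given in the text; the function name and the identifier check are assumptions, and no request is sent, since the response format is not described here.

```python
import re

# Base path taken from the DOID:11725 example URL in the text.
TERM_BASE_URL = "http://www.disease-ontology.org/term/"


def do_term_url(doid: str) -> str:
    """Return the term URL for an identifier such as 'DOID:11725'."""
    # Assumption: identifiers are the prefix 'DOID:' followed by digits.
    if not re.fullmatch(r"DOID:\d+", doid):
        raise ValueError(f"not a DOID: {doid!r}")
    return f"{TERM_BASE_URL}{doid}/"
```

Calling `do_term_url("DOID:11725")` reproduces the example URL from the text.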

Community feedback and data submissions
The DO team receives community feedback on a continual basis through individual new term requests, requests for definition, synonym or term updates and requests for explanations of the DO curatorial process and curation decisions. Requests are received through multiple methods including DO's SVN term tracker (http://sourceforge.net/p/diseaseontology/feature-requests/), DO's Contact Us (http://www.disease-ontology.org/contact/), through DO's website feature 'Add an item to the term tracker' found at the bottom of each metadata page for each DO term, and direct emails to the DO PIs. The DO listserv (diseaseontologydiscussion) has received over 200 submissions since November 2011. The DO team has fielded 106 distinct postings through the DO website (DO disease-ontology.org requests) and 59 feature requests through the DO SVN site. Each review request involves examination of the current disease information in DO and examination of current literature and online expert resources (e.g. GeneReviews (20), Orphanet, OMIM, NIH Institutes and MayoClinic) to identify the most appropriate classification for each disease term. The DO team provides prompt replies to each user request at the conclusion of the curation.
The most frequent types of requests include: refining DO textual definitions, creating new DO classes (e.g. 25 new DO terms for the fission yeast database), adding DO terms for a set of diseases (e.g. dystonia diseases), fixing term names, changing term status (obsolete), adding comments to obsolete DO records, adding synonyms from publications to a DO term, removing term redundancy among the synonyms, adding apostrophe-free synonyms to disease terms, identifying typos in term names or definitions, adding OMIM IDs to specific DOIDs, updating term parentage or adding relations to definitions to further clarify etiology.

Large-scale community data submissions
We report here DO's collaborative efforts with biomedical resources over the past two years. DO provides individual curation efforts to coordinate, map and integrate the disease terms used by each biomedical resource. DO works directly with community members providing disease curation to support disease representation among the Model Organism Databases (FlyBase (Susan Tweedie), WormBase (Ranjana Kishore), PomBase (Antonia Lock)), within pathway (Reactome: Peter D'Eustachio) and epitope (Immune Epitope Database (IEDB), Bjorn Peters, IEDB Disease Finder, http://www.iedb.org/home.php) databases and to foster the development of gene-variant-phenotype resources (e.g. Gene Wiki (21), OMIM API) and cancer variant projects (e.g. The Jackson Laboratory for Genomic Medicine (http://www.jax.org/ct/), HIVE (22) (27) and the NIH Library of Integrated Network-based Cellular Signatures program (28)).
DO has provided disease mappings in the past year across a number of biomedical resources (Table 6). DO has become a disease knowledge resource for the further exploration of biomedical data, including measuring disease similarity based on functional associations between genes (29), serving as a disease data source for the building of biomedical databases, e.g. dcGO: an ontology database for protein domains (http://supfam.org/SUPERFAMILY/dcGO) (30), and defining disease-gene relationships, e.g. DGA: Disease and Gene annotations resource (31).

FUTURE DIRECTIONS
The DO team recognized the imperative need to provide definitions for all DO terms. Integration of textual definitions for all DO terms is a major curatorial effort of the DO team in the next year. DO is moving to a multi-editor curation model in the fall of 2014 in order to improve DO's interoperability, to enable integration of cross products within DO and to develop inferred DO hierarchies (genetic, clinical) in addition to DO's asserted etiology-based hierarchy. This work will involve: moving curation effort to Protégé (coordinated by Chris Mungall) to use OWL and reasoning; creating cross product terms for better interoperability with the OBO Foundry ontologies; and engaging community partners (EBI, MGI, HPO and Orphanet) and cardiovascular and metabolic disease clinicians to join DO as editors and data reviewers. An initial list of the types of data that will be added to DO, along with their associated relations, has been proposed and will undergo additional review within the DO group and among community ontology partners including Barry Smith. The data types, source ontologies and relations to be added to DO include: phenotypes (HPO/PATO: has phenotype), symptoms (SYMP: has symptom), age of onset (HPO), anatomical location (UBERON: located in), GO annotations (Gene Ontology: disregulated in), cells/tissues of origin (Cell ontology: derives from or has material basis in) and types of inheritance (has physical basis in). For example, the DO term inhalation anthrax is currently defined as 'is a anthrax disease'. The addition of UBERON (32) cross products will utilize the located in relation to connect the DO term to the UBERON terms lung (UBERON:0002048) and lymph node (UBERON:0000029).
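The proposed data types and the inhalation anthrax example can be written down as simple records to make the planned cross products concrete. This is only an illustrative data structure with hypothetical class and function names, not DO's OWL representation; where the text names no source ontology or relation for a data type, the field is left as None.

```python
from typing import NamedTuple, Optional


class PlannedDataType(NamedTuple):
    data_type: str
    source: Optional[str]      # source ontology as named in the text
    relation: Optional[str]    # relation as named in the text


PLANNED = [
    PlannedDataType("phenotypes", "HPO/PATO", "has phenotype"),
    PlannedDataType("symptoms", "SYMP", "has symptom"),
    PlannedDataType("age of onset", "HPO", None),
    PlannedDataType("anatomical location", "UBERON", "located in"),
    PlannedDataType("GO annotations", "Gene Ontology", "disregulated in"),
    PlannedDataType("cells/tissues of origin", "Cell ontology",
                    "derives from or has material basis in"),
    PlannedDataType("types of inheritance", None, "has physical basis in"),
]


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


# The inhalation anthrax example: the current 'is a' definition plus the
# two 'located in' links to UBERON described in the text.
ANTHRAX = [
    Triple("inhalation anthrax", "is a", "anthrax disease"),
    Triple("inhalation anthrax", "located in", "UBERON:0002048"),  # lung
    Triple("inhalation anthrax", "located in", "UBERON:0000029"),  # lymph node
]


def objects_of(triples: list, subject: str, relation: str) -> list:
    """Objects linked to a subject by a given relation, in input order."""
    return [t.obj for t in triples
            if t.subject == subject and t.relation == relation]
```

Querying `objects_of(ANTHRAX, "inhalation anthrax", "located in")` returns the two UBERON identifiers, which is the kind of lookup an anatomy-based inferred hierarchy would rely on.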

DO website: future directions
Additional development of the DO website is planned over the next year. Version 2.0 of the DO website will include bulk querying and API development, saving of queries and result datasets and better direct-link support to DO terms. An enhanced DO API will allow users to perform any action that they can do interactively on the website via the API. This will include searching using all fields provided, pulling down static images of visualized term relationships and pulling down single or many sets of term metadata. Expansion of the API will provide another robust resource to allow tech-savvy users to scrape or pull down large amounts of metadata for use in their own websites, web applications or bulk analysis.