PhenoMiner: from text to a database of phenotypes associated with OMIM diseases

Analysis of scientific and clinical phenotypes reported in the experimental literature has been curated manually to build high-quality databases such as the Online Mendelian Inheritance in Man (OMIM). However, the identification and harmonization of phenotype descriptions struggle with the diversity of human expressivity. We introduce a novel automated extraction approach called PhenoMiner that exploits full parsing and conceptual analysis. Apriori association mining is then used to identify relationships to human diseases. We applied PhenoMiner to the BMC open access collection and identified 13 636 phenotype candidates. We identified 28 155 phenotype-disorder hypotheses covering 4898 phenotypes and 1659 Mendelian disorders. Analysis showed: (i) the semantic distribution of the extracted terms against linked ontologies; (ii) a comparison of term overlap with the Human Phenotype Ontology (HP); (iii) moderate support for phenotype-disorder pairs in both OMIM and the literature; (iv) strong associations of phenotype-disorder pairs to known disease-gene pairs using PhenoDigm. The full list of PhenoMiner phenotypes (S1), phenotype-disorder associations (S2), association-filtered linked data (S3) and user database documentation (S5) are available as supplementary data and can be downloaded at http://github.com/nhcollier/PhenoMiner under a Creative Commons Attribution 4.0 license. Database URL: phenominer.mml.cam.ac.uk


PhenoMiner Web Search and REST Guide
Overview
Phenotypes play a key role in inferring the complex relationships between genes and human heritable diseases. Analysis of scientific and clinical phenotypes reported in the experimental literature has been curated manually to build high-quality databases such as the Online Mendelian Inheritance in Man (OMIM). However, the identification and semantic harmonisation of phenotype descriptions is a time-consuming process that struggles to come to grips with the diversity of human expressivity. High-throughput text mining, enhanced with automated conceptual analysis, now makes it possible to identify phenotype mentions and to predict associative relationships with diseases. We show the effectiveness of our approach by comparing the results against the manually curated gold standards in the Human Phenotype Ontology (HPO) and the phenotype-disorder relations in OMIM.
Following a series of experiments we have applied text/data mining to extract and filter a set of phenotype candidates and link these to associated concepts and literature references. We now wish to make these available as a database and shared portal. The data and experiments are being written up and made available through various means: as journal and conference publications, as a downloadable XML database (through GitHub at https://github.com/nhcollier/PhenoMiner and CERN's Zenodo at DOI: 10.5281/zenodo.12493), as literature annotations (via EMBL-EBI's External Links service) and as a standalone demonstration database portal and REST interface. The last of these is outlined in this document. The Web-GUI is available via: http://phenominer.mml.cam.ac.uk/index.html and the REST interface is available from: phenominer.mml.cam.ac.uk:8080/phenominer/phenotype/_search?q=
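As an illustration of how the REST interface might be called, the following Python sketch builds a search URL by appending a percent-encoded query to the endpoint given above. The http:// scheme, the function name build_search_url and the example query are assumptions made for illustration; the format of the response is not described in this guide, so the sketch stops at constructing the URL.

```python
from urllib.parse import quote

# Endpoint as given in this guide; the "http://" scheme is an assumption.
BASE_URL = "http://phenominer.mml.cam.ac.uk:8080/phenominer/phenotype/_search?q="


def build_search_url(query: str) -> str:
    """Return a search URL for a free-text phenotype query (hypothetical helper)."""
    # Percent-encode the query so spaces and punctuation are safe in a URL.
    return BASE_URL + quote(query)


# Example query chosen only for illustration.
url = build_search_url("cleft palate")
print(url)
```

The resulting URL can then be opened in a browser or passed to any HTTP client.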

Element: Term
The data in this required element describes one complete phenotype term. There is no effort at this stage to unify or encode synonyms so different forms (e.g. plurals) might appear as distinct terms.

Attributes for Term include:
ID: This is the surface form of the phenotype term as it appears in text.
KEY: This is a unique identifier within the S5 database.
EVIDENCE: This is an evidence code showing how the information in the term was curated, i.e. the level of evidence supporting the phenotype annotation. The codes are the same as those used in the Human Phenotype Ontology database for compatibility (see http://www.human-phenotype-ontology.org/contao/index.php/annotation-guide.html). At the moment this takes only one value, 'ITM', which stands for 'Inferred by Text Mining'. Other codes will include 'IEA' for 'Inferred from Electronic Annotation', 'PCS' for 'Published Clinical Study' and 'TAS' for 'Traceable Author Statement'.
DATE: The date on which the term annotation was created. The format is YYYY.MM.DD.
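The Term attributes could be read as in the minimal Python sketch below. It assumes that ID, KEY, EVIDENCE and DATE are serialised as XML attributes on a Term element and that the date uses a four-digit year; the sample fragment, its KEY value and its date are invented for illustration and are not taken from the database.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical fragment: the attribute values are invented for illustration.
SAMPLE = '<Term ID="short stature" KEY="T0001" EVIDENCE="ITM" DATE="2014.10.01"/>'

# Evidence codes named in this guide (only 'ITM' is currently in use).
KNOWN_EVIDENCE = {"ITM", "IEA", "PCS", "TAS"}


def read_term(xml_text: str) -> dict:
    """Parse one Term element and validate its evidence code and date."""
    elem = ET.fromstring(xml_text)
    evidence = elem.get("EVIDENCE")
    if evidence not in KNOWN_EVIDENCE:
        raise ValueError(f"unknown evidence code: {evidence!r}")
    # Assumed date layout: year.month.day with a four-digit year.
    created = datetime.strptime(elem.get("DATE"), "%Y.%m.%d").date()
    return {
        "id": elem.get("ID"),
        "key": elem.get("KEY"),
        "evidence": evidence,
        "date": created,
    }


term = read_term(SAMPLE)
print(term)
```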

Element: qualifierList
This data element is optional. In the future it will encode all observed qualifiers that fall within the PATO 'qualitative:intensity:intensity' subtree, e.g. 'mild', 'moderate', 'remittent', 'severe'. PATO stands for Phenotypic Attribute and Trait Ontology.

Element: Link
The data in this field represents a link to an external annotation about the term or part of the term. This is important for grounding the semantics of the term in widely used external vocabularies, to allow interoperability and reasoning.

Attributes for Link include:
text: This is the part of the term to which the annotation refers.
ID: This is the URI (Uniform Resource Identifier) of the external vocabulary entry.
evidence: This is the name of the agent who provided the link, e.g. 'NCBO Annotator' or 'Bio-LarK'.

Element: Tree
The data in the Tree element has been provided by parsing the term in its original context using the MCCJ parser (McClosky-Charniak-Johnson parser). The Tree element is a grammatical phrase structure tree with lexical and syntactic nodes (e.g. JJ stands for Adjective and CC stands for Conjunction).
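To show how such a phrase structure tree can be processed, the Python sketch below reads a bracketed tree into nested tuples and recovers the words at its leaves. It assumes that the Tree element holds a Penn Treebank style bracketed string, which this guide does not state explicitly; the sample tree is invented for illustration.

```python
def parse_tree(text: str):
    """Parse a bracketed tree such as "(NP (JJ short) (NNS fingers))".

    Returns nested (label, children) tuples; leaf words are plain strings.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return tree


def leaves(node) -> list:
    """Collect the words of a parsed tree in left-to-right order."""
    _, children = node
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


# Invented example tree; JJ = adjective, CC = conjunction, NNS = plural noun.
SAMPLE_TREE = "(NP (JJ short) (CC and) (JJ broad) (NNS fingers))"
tree = parse_tree(SAMPLE_TREE)
print(leaves(tree))
```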

Element: associatedDisorder
After discovering phenotype candidates we applied a filtering step to verify them through association with human disorders gathered from the Online Mendelian Inheritance in Man database. We applied the Apriori algorithm, as implemented in an R package, to identify disorder-phenotype rules. Association rule (AR) mining attempts to discover rules between frequently co-occurring items in a transaction data set. The set of OMIM disorders and their synonyms was obtained from MEDIC. PMIDs are used to label the transaction items and are found for each phenotype and disorder by querying the PMC E-utils RESTful Web Service. We applied Apriori using a set of parameters (support, confidence, minimum length, target) so that we retained only those association rules with a cardinality of 2, i.e. phenotype → disorder. The results for each phenotype are recorded in the associatedDisorder element.
Each associatedDisorder element consists of zero or more disorder elements describing the discovered OMIM association.

Attributes for associatedDisorder include:
source: This is the source of evidence about the association. At the moment this takes only the value 'apriori'.
min_supp: This is the value of minimum support used in the Apriori algorithm.
min_conf: This is the value of minimum confidence used in the Apriori algorithm.
df: This is the number of citations where the association between the phenotype and disorder could be found, i.e. the number of disorder elements contained in the associatedDisorder element.
Note that minlen and maxlen attributes were both set to 2 within Apriori but are not recorded in the XML data.
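The quantities behind these thresholds can be illustrated with a small Python sketch. It computes the standard support, confidence and lift of a single phenotype → disorder rule from transactions keyed by citation, and checks them against minimum support and confidence values. This is a toy re-implementation of the textbook definitions, not the R code used to build the database; the transaction labels, items and threshold values are invented for illustration.

```python
def rule_metrics(transactions: dict, phenotype: str, disorder: str):
    """Return (support, confidence, lift) for the rule phenotype -> disorder."""
    n = len(transactions)
    n_pheno = sum(1 for items in transactions.values() if phenotype in items)
    n_dis = sum(1 for items in transactions.values() if disorder in items)
    n_both = sum(
        1 for items in transactions.values()
        if phenotype in items and disorder in items
    )
    support = n_both / n                               # P(phenotype and disorder)
    confidence = n_both / n_pheno if n_pheno else 0.0  # P(disorder | phenotype)
    lift = confidence / (n_dis / n) if n_dis else 0.0  # confidence / P(disorder)
    return support, confidence, lift


def passes_thresholds(support, confidence, min_supp, min_conf) -> bool:
    """Keep a rule only if it reaches both minimum thresholds."""
    return support >= min_supp and confidence >= min_conf


# Toy transactions: each citation label maps to the items found in it.
TRANSACTIONS = {
    "PMID-A": {"short stature", "Noonan syndrome"},
    "PMID-B": {"short stature", "Noonan syndrome", "ptosis"},
    "PMID-C": {"short stature"},
    "PMID-D": {"ptosis"},
}

support, confidence, lift = rule_metrics(
    TRANSACTIONS, "short stature", "Noonan syndrome"
)
print(support, confidence, lift)
# Threshold values below are arbitrary examples, not those used for the database.
print(passes_thresholds(support, confidence, min_supp=0.25, min_conf=0.5))
```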

Element: disorder
Each disorder element consists of the name of the disorder and its OMIM identifier.
Attributes for disorder include:
supp: The level of support Apriori found for the phenotype-disorder association.
conf: The level of confidence Apriori found for the phenotype-disorder association.
lift: The level of lift Apriori found for the phenotype-disorder association.
pval: The p-value Apriori found for the phenotype-disorder association using a Fisher's exact test.
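For readers unfamiliar with Fisher's exact test, the Python sketch below computes a one-sided p-value from a 2x2 table of citation counts using the hypergeometric distribution. The layout of the table (citations mentioning both items, only the phenotype, only the disorder, or neither) and the choice of a one-sided test are assumptions made for illustration; this guide does not specify how the pval attribute was derived in detail.

```python
from math import comb


def fisher_exact_greater(both: int, pheno_only: int, dis_only: int, neither: int) -> float:
    """One-sided Fisher's exact test for over-representation of co-occurrence.

    Sums the hypergeometric probabilities of observing at least `both`
    co-occurrences, given fixed row and column totals.
    """
    with_pheno = both + pheno_only          # citations mentioning the phenotype
    with_dis = both + dis_only              # citations mentioning the disorder
    total = both + pheno_only + dis_only + neither
    denominator = comb(total, with_dis)
    k_max = min(with_pheno, with_dis)
    numerator = sum(
        comb(with_pheno, k) * comb(total - with_pheno, with_dis - k)
        for k in range(both, k_max + 1)
    )
    return numerator / denominator


# Invented counts: 3 citations mention both items, 3 mention neither.
p_value = fisher_exact_greater(both=3, pheno_only=0, dis_only=0, neither=3)
print(p_value)
```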

Element: name
The name element corresponds to an entry in the DiseaseName element in the Comparative Toxicogenomics Database (CTD) at http://ctdbase.org (Mount Desert Island Biological Laboratory).

Element: omim_id
The omim_id entry corresponds to the OMIM unique identifier for the disorder concept.

Element: fullTextList
This element contains zero or more links to literature citations where the phenotype term has been found through a fielded search of full text articles in the PubMed Central database. The maximum number of returned citations was bounded at 10,000. In practice the number of phenotype terms which reach this limit is quite small (<5%).
Attributes for fullTextList include:
source: The source of evidence for the full text citation. At the moment this takes only one value, 'eutils', i.e. the PubMed Central E-utilities Web interface (see http://www.ncbi.nlm.nih.gov/books/NBK25499/).
df: The number of documents returned by the source about the phenotype annotation.
retmax: The maximum number of documents to be returned by the source.
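A request of this kind could be formed as in the Python sketch below, which builds an E-utilities ESearch URL against PubMed Central with the retmax bound of 10,000 described above. The exact query syntax and search field used to build the database are not given in this guide, so the quoted-phrase term and the helper name build_esearch_url are assumptions made for illustration; no request is actually sent.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
# Upper bound on returned citations, as described in this guide.
RETMAX_LIMIT = 10000


def build_esearch_url(term: str, retmax: int = RETMAX_LIMIT) -> str:
    """Build a PubMed Central search URL for a phenotype term (hypothetical helper)."""
    params = {
        "db": "pmc",                         # search PubMed Central
        "term": f'"{term}"',                 # assumed: exact phrase, no field tag
        "retmax": min(retmax, RETMAX_LIMIT), # never exceed the stated bound
    }
    return ESEARCH + "?" + urlencode(params)


url = build_esearch_url("short stature", retmax=50000)
query = parse_qs(urlparse(url).query)
print(url)
print(query["retmax"])
```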

Element: ID
The ID contains the PubMed Identifier (http://www.nlm.nih.gov/bsd/disted/pubmedtutorial/020_830.html) of the literature citation where the phenotype term was found.