Formal axioms in biomedical ontologies improve analysis and interpretation of associated data

Abstract

Motivation: Over the past years, significant resources have been invested into formalizing biomedical ontologies. Formal axioms in ontologies have been developed and used to detect and ensure ontology consistency, find unsatisfiable classes, improve interoperability, guide ontology extension through the application of axiom-based design patterns and encode domain background knowledge. The domain knowledge of biomedical ontologies may also have the potential to provide background knowledge for machine learning and predictive modelling.

Results: We use ontology-based machine learning methods to evaluate the contribution of formal axioms and ontology meta-data to the prediction of protein–protein interactions and gene–disease associations. We find that the domain-specific background knowledge provided by the Gene Ontology and other ontologies significantly improves the performance of ontology-based prediction models. Furthermore, we find that the labels, synonyms and definitions in ontologies can also provide background knowledge that may be exploited for prediction. The axioms and meta-data of different ontologies contribute to improving data analysis in a context-specific manner. Our results have implications for the further development of formal knowledge bases and ontologies in the life sciences, in particular as machine learning methods are more frequently being applied. Our findings motivate the need for further development, and the systematic, application-driven evaluation and improvement, of formal axioms in ontologies.

Availability and implementation: https://github.com/bio-ontology-research-group/tsoe.

Supplementary information: Supplementary data are available at Bioinformatics online.


The ChEBI Ontology
We downloaded ChEBI in the OWL format from http://purl.obolibrary.org/obo/chebi.owl on April 26, 2018. The ChEBI ontology formally describes relations between molecular entities, in particular small chemical compounds (Degtyarenko et al., 2007). It contains a total of 432,822 logical axioms and 92,015 classes.

The Plant Ontology (PO)
We downloaded the OWL version of PO from http://purl.obolibrary.org/obo/po.owl on April 26, 2018. This version of PO contains 4,835 axioms and 1,649 classes. PO provides a formal description of the vocabulary related to external and internal plant anatomy and plant development phases. It is mainly used to associate plant structures and development to gene expression and phenotype data (Cooper et al., 2013).

The Cell Type Ontology (CL)
We downloaded CL in OWL from http://purl.obolibrary.org/obo/cl.owl on April 26, 2018. CL contains 17,958 axioms and 3,862 classes. It is an ontology that describes cell types for major animal and plant organisms (Bard et al., 2005).

Phenotype and Trait Ontology (PATO)
The OWL version of PATO was downloaded on April 26, 2018 from http://purl.obolibrary.org/obo/pato.owl. This version contains 5,644 logical axioms and 2,251 different classes. PATO provides a systematic description of phenotypes through the concepts and relationships defined by its axioms (Gkoutos et al., 2005).

Sequence Ontology (SO)
We obtained SO from http://purl.obolibrary.org/obo/so.owl on November 25, 2018. This version of SO contains 5,443 logical axioms and 2,2234 classes. SO consists of a set of classes and relations that describe the parts of a genomic annotation (Eilbeck et al., 2005).

Common Anatomy Reference Ontology (CARO)
CARO was obtained from http://purl.obolibrary.org/obo/caro.owl on November 25, 2018. This version contains 209 axioms and 158 classes. CARO serves as a template to unify the structure of anatomy ontologies (Haendel et al., 2008).

Protein Ontology (PR)
We downloaded PR from http://purl.obolibrary.org/obo/pro_reasoned.owl on November 4, 2018. This ontology contains 1,312,362 axioms and 400,923 classes. PR formally represents protein-related entities and their relations at different levels of specificity (Natale et al., 2010). Table 2 summarizes the number of axioms in GO-Plus describing relations to each of these ontologies and shows an example of such axioms for each ontology.

Node2vec
For comparison, we applied node2vec (?) to the ontology graph and the class-entity relations to obtain embeddings for the biological entities used later in our analysis. Figure 3 shows the workflow used to apply node2vec in this work. Figure 1 shows the detailed ROC curves obtained when using node2vec to predict PPIs based on GO and GO-Plus for the human, yeast and Arabidopsis datasets, using cosine similarity and a neural network. Figure 2 shows the ROC curves obtained for gene-disease association prediction using node2vec, comparing PhenomeNET combined with GO (PhenomeNET + GO) to PhenomeNET combined with GO-Plus (PhenomeNET + GO-Plus) for the human and mouse datasets.
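As an illustration of how node2vec explores such a graph, the following minimal sketch generates second-order biased random walks over a toy graph that mixes ontology edges with class-entity annotation edges. It is not the pipeline used in this work: the node names, the edge list, the unweighted undirected graph, and the parameter values `p` and `q` are all assumptions chosen for the example. In node2vec, walks of this kind are subsequently passed to a skip-gram model to learn the embeddings, a step omitted here.

```python
import random


def build_graph(edges):
    """Build undirected adjacency sets from (node, node) pairs."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def node2vec_walk(graph, start, length, p=1.0, q=1.0, rng=None):
    """Generate one second-order biased walk of at most `length` nodes.

    Unnormalized transition weights from the current node, given the
    previous node: 1/p to return to the previous node, 1 to move to a
    node adjacent to the previous node, and 1/q to move further away.
    """
    if rng is None:
        rng = random.Random(0)
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbors = sorted(graph[current])  # sorted for reproducibility
        if not neighbors:
            break
        if len(walk) == 1:
            # First step: no previous node yet, so choose uniformly.
            walk.append(rng.choice(neighbors))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:
                weights.append(1.0 / p)
            elif candidate in graph[previous]:
                weights.append(1.0)
            else:
                weights.append(1.0 / q)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Hypothetical toy graph: subclass edges plus annotation edges.
edges = [
    ("class_a", "class_root"),  # class_a SubClassOf class_root
    ("class_b", "class_root"),  # class_b SubClassOf class_root
    ("protein_1", "class_a"),   # protein_1 annotated with class_a
    ("protein_2", "class_a"),   # protein_2 annotated with class_a
    ("protein_2", "class_b"),   # protein_2 annotated with class_b
]
graph = build_graph(edges)
rng = random.Random(42)
walks = [node2vec_walk(graph, node, length=5, p=1.0, q=0.5, rng=rng)
         for node in sorted(graph)]
```

A value of `q` below 1 favours outward, depth-first-like exploration, while a small `p` makes the walk more likely to step back to the node it just left.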

Evaluation metrics
We used the ROC (receiver operating characteristic) curve (Yin and Vogel, 2017) along with the AUC (area under the ROC curve) as a quantitative measure to assess the performance of each predictive method. For PPI prediction and gene-disease association prediction, the positive pairs are those available from the STRING network and from the MGI_DO.rpt file of the MGI database, respectively. Negative pairs are down-sampled from the set of all unknown associations so that, for both tasks, the set of negatives is equal in size to the set of positive pairs.
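The evaluation setup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code used in this work: the entity names and scores are invented, pairs are treated as ordered, and all unknown pairs are enumerated explicitly, which is only feasible for toy inputs. The AUC is computed through its rank interpretation, the probability that a randomly chosen positive pair scores higher than a randomly chosen negative pair, with ties counted as one half.

```python
import random


def sample_negatives(positives, entities_a, entities_b, seed=0):
    """Draw as many unknown pairs as there are positive pairs."""
    rng = random.Random(seed)
    positive_set = set(positives)
    # Every pair not listed as positive is treated as "unknown".
    unknown = [(a, b) for a in entities_a for b in entities_b
               if (a, b) not in positive_set]
    if len(unknown) < len(positive_set):
        raise ValueError("not enough unknown pairs to balance the positives")
    return rng.sample(unknown, len(positive_set))


def roc_auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical gene-disease toy data.
genes = ["gene_1", "gene_2", "gene_3"]
diseases = ["disease_1", "disease_2", "disease_3"]
positives = [("gene_1", "disease_1"), ("gene_2", "disease_2")]
negatives = sample_negatives(positives, genes, diseases)

# Hypothetical confidence scores for positive and negative pairs.
auc = roc_auc([0.9, 0.3], [0.5, 0.1])
```

The pairwise formulation is quadratic in the number of pairs; for large datasets a sort-based computation gives the same value more efficiently.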

Cosine similarity
One way to perform prediction tasks using ontology-based embeddings is to calculate the similarity between each pair of vectors and use the obtained similarity as a confidence score to predict whether two entities are associated. To do so, we use cosine similarity as the similarity measure between the obtained vectors. The cosine similarity $\cos_{sim}$ between two vectors $A$ and $B$ is calculated as

$$\cos_{sim}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

where $A \cdot B$ is the dot product of $A$ and $B$, and $\|A\|$ and $\|B\|$ are their Euclidean norms.

Table 2 (fragment). Ontology, number of GO-Plus axioms relating to it, and an example axiom.
[ontology and axiom count missing]: 'Anatomical structure development' EquivalentTo ('Developmental process' and (results in development of 'anatomical structure'))
PR, 1,914: 'tyrosine 3-monooxygenase kinase activity' SubClassOf (has input some ('tyrosine 3-monooxygenase'))
NCBITaxon, 1,136: 'chloroplast proton-transporting ATP synthase complex assembly' SubClassOf (only_in_taxon Viridiplantae)
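The cosine-similarity scoring described in this section can be sketched in a few lines. The example below is illustrative only: the function names, entity names and embedding values are assumptions, and real embeddings would have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return (a . b) / (||a|| * ||b||) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_candidates(query, embeddings):
    """Sort candidate entities by decreasing similarity to the query."""
    scores = {name: cosine_similarity(query, vector)
              for name, vector in embeddings.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical three-dimensional embeddings.
embeddings = {
    "protein_B": [1.0, 2.0, 0.0],
    "protein_C": [-2.0, 1.0, 0.0],
    "protein_D": [2.0, 4.0, 0.0],
}
ranking = rank_candidates([1.0, 2.0, 0.0], embeddings)
```

Because the measure depends only on the angle between vectors, `protein_B` and `protein_D` receive essentially the same score despite differing in length, while the orthogonal `protein_C` is ranked last.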