Annotation of phenotypes using ontologies: a gold standard for the training and evaluation of natural language processing systems

Abstract Natural language descriptions of organismal phenotypes, a principal object of study in biology, are abundant in the biological literature. Expressing these phenotypes as logical statements using ontologies would enable large-scale analysis of phenotypic information from diverse systems. However, considerable human effort is required to make these phenotype descriptions amenable to machine reasoning. Natural language processing tools have been developed to facilitate this task, and the training and evaluation of these tools depend on the availability of high-quality, manually annotated gold standard data sets. We describe the development of an expert-curated gold standard data set of annotated phenotypes for evolutionary biology. The gold standard was developed for the curation of complex comparative phenotypes for the Phenoscape project. It was created by consensus among three curators and consists of entity–quality expressions of varying complexity. We use the gold standard to evaluate annotations created by human curators and those generated by the Semantic CharaParser tool. Using four annotation accuracy metrics that can account for any level of relationship between terms from two phenotype annotations, we found that machine–human consistency, or similarity, was significantly lower than inter-curator (human–human) consistency. Surprisingly, allowing curators access to external information did not significantly increase the similarity of their annotations to the gold standard or have a significant effect on inter-curator consistency. We found that the similarity of machine annotations to the gold standard increased after new relevant ontology terms had been added. Evaluation by the original authors of the character descriptions indicated that the gold standard annotations came closer to representing their intended meaning than did either the curator or machine annotations.
These findings point toward ways to better design software to augment human curators, and the use of the gold standard corpus will allow training and assessment of new tools to improve phenotype annotation accuracy at scale.


INTRODUCTION
[Figure: Examples of Entity–Quality (EQ) annotations of varying complexity from the present study. A illustrates a simple EQ annotation; B shows an EQ annotation in which the quality term relates two entities to each other; and C provides an example of an entity that does not correspond to a term in an existing ontology, but is instead a complex logical expression post-composed from multiple ontology terms.]

[…] training of automated NLP systems, e.g., (13, 14, 15). Another limitation was the use of performance measures that did not fully account for the continuum of similarity possible between semantic phenotype annotations. While these authors recognized that phenotypes annotated with parent and daughter terms in the ontology bear some partial resemblance, here we introduce semantic similarity measures that can account for any level of relationship between the terms from two phenotype annotations.

The present work describes the development of an expert-curated Gold Standard dataset of annotated phenotypes for evolutionary biology that is the best available given current constraints in semantic representation. The Gold Standard was developed for the annotation […] costs required for manual annotation, "silver standard" corpora have also been created, in which automatically generated annotations are grouped into a single corpus (23, 24). As […]

Inter-curator consistency has been used by several studies as a baseline against which to evaluate the performance of automated curation software (25, 26, 27) […]

Semantic similarity measures between annotation sources (e.g., different curators) were aggregated at the level of the individual character state, and across all character states (Figure 1). Aggregation of pairwise (EQ-to-EQ) annotations by character state is necessary because a curator may generate more than one EQ annotation for a given character state.
This is illustrated by Figure 1, where Curator A generated three EQs and Curator B generated two EQs for State i. To measure the overall similarity between two annotation sources (e.g., Curator A to Curator B in Figure 1, top), we first compute a similarity score between corresponding character state pairs as the best match (maximum score) among all pairwise comparisons between EQs for the same character state (Maximum Character State Similarity in Figure 1). We then compute the similarity between two annotation sources by taking the arithmetic mean of the pairwise character state similarity scores across all character state pairs (Mean Curator Similarity in Figure 1, bottom).

We treat each EQ annotation as a node in an ad hoc EQ ontology. Creating the complete cross-product of the component ontologies would necessarily include all possible subsumers but would be prohibitive. As a memory-saving measure, we developed a computationally efficient approach to identify subsumers for EQ annotations on an ad hoc basis, as follows.

Figure 1: Similarity of annotations between two curators is calculated across multiple character states (e.g., states 1–3, bottom). First, the maximum character state similarity is calculated at the level of a single character state, and is the best match (maximum score) in pairwise comparisons across that state's EQ annotations. Mean curator similarity is then calculated as the mean of the maximum similarities across all character state pairs.
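The two-level aggregation described above (best match per character state, then the mean across states) can be sketched as follows. The data and the `eq_sim` lookup are hypothetical stand-ins for real EQ annotations and a real semantic similarity measure:

```python
# Sketch of character-state aggregation: per-state best match, then the mean
# across states. eq_sim stands in for any pairwise EQ similarity measure.

def max_state_similarity(eqs_a, eqs_b, eq_sim):
    """Best match (maximum score) among all pairwise EQ comparisons
    for a single character state."""
    return max(eq_sim(a, b) for a in eqs_a for b in eqs_b)

def mean_curator_similarity(states_a, states_b, eq_sim):
    """Arithmetic mean of the per-state maxima across all character states."""
    scores = [max_state_similarity(states_a[s], states_b[s], eq_sim)
              for s in states_a]
    return sum(scores) / len(scores)

# Toy example: pairwise similarities given directly as a lookup table.
sim_table = {("eq1", "eqX"): 0.2, ("eq1", "eqY"): 0.9,
             ("eq2", "eqX"): 0.5, ("eq2", "eqY"): 0.4}
eq_sim = lambda a, b: sim_table[(a, b)]

curator_a = {"state_i": ["eq1", "eq2"]}  # Curator A: two EQs for state i
curator_b = {"state_i": ["eqX", "eqY"]}  # Curator B: two EQs for state i
print(mean_curator_similarity(curator_a, curator_b, eq_sim))  # 0.9
```

With a single character state, the mean equals the best match for that state (0.9, from the eq1–eqY pair); with several states, each contributes its own best-match score to the mean.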

The Jaccard similarity (J_sim) between nodes N_1 and N_2 in an ontology graph is defined as the ratio of the number of nodes in the intersection of their subsumers over the number of nodes in the union of their subsumers (44). Defining S(N_j) to be the set of nodes that subsume N_j:

J_sim(N_1, N_2) = |S(N_1) ∩ S(N_2)| / |S(N_1) ∪ S(N_2)|
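A minimal sketch of J_sim over subsumer sets, using a hypothetical toy is_a hierarchy (not the real Uberon/PATO ontologies) given as a child-to-parents mapping:

```python
# Toy is_a hierarchy: child -> list of parents (hypothetical terms).
parents = {
    "dorsal fin": ["fin"],
    "anal fin": ["fin"],
    "fin": ["appendage"],
    "appendage": [],
}

def subsumers(node):
    """All superclasses of node, including node itself (reflexive closure)."""
    seen, stack = set(), [node]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents[n])
    return seen

def jaccard(n1, n2):
    """|S(N1) ∩ S(N2)| / |S(N1) ∪ S(N2)| over subsumer sets."""
    s1, s2 = subsumers(n1), subsumers(n2)
    return len(s1 & s2) / len(s1 | s2)

# 'dorsal fin' and 'anal fin' share {fin, appendage} out of 4 distinct nodes.
print(jaccard("dorsal fin", "anal fin"))  # 0.5
```

Identical nodes score 1.0, and nodes sharing only a distant ancestor score close to 0, which is the graded behavior that makes J_sim suitable for partial matches between annotations.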

Precision and Recall are commonly used to evaluate the performance of information retrieval systems. Traditionally, these two measures do not attempt to account for imperfect matches; information is either retrieved or it is not. For ontology-based annotations, partial information retrieval is possible because the information to be retrieved is the semantics of the annotated text, rather than a particular term. To account for this, here we use two metrics, Partial Precision (PP) and Partial Recall (PR), to measure the success of semantic information retrieval by a test curator (C_T) relative to a reference curator (C_R), where a curator can be understood as either human or software. While other variants of semantic precision and recall are used in the literature (46, 47), the measures we use here specifically use semantic similarity, in this case J_sim, to quantify partial matches between annotations. In contrast to our approach, (46) and (47) […] of semantics that are retrieved by C_T relative to the number of C_R annotations. Thus, both PP and PR have a range of [0, 1]. PP will decrease due to extra annotations by C_T that are dissimilar from those in C_R, while PR will decrease due to extra annotations in C_R that are lacking from C_T. Both use J_sim to measure semantic similarity and are computed at the character-state level rather than the individual EQ annotation level. Using C_R and C_T as an example, they are calculated as:

PP(C_R, C_T) = (1/Y) Σ_{j=1..Y} max_{i=1..X} J_sim(EQ_i, EQ_j)
PR(C_R, C_T) = (1/X) Σ_{i=1..X} max_{j=1..Y} J_sim(EQ_i, EQ_j)

where i = 1..X indexes the EQs from C_R and j = 1..Y indexes the EQs from C_T.

[…] generated by a particular annotation source were presented to the authors. We used two statistics to test for differences among author preferences for the different annotation sources. The first is a χ²-type statistic:

A = Σ_{i=1..t} Σ_{j=1..t} [obs(i, j) − X(i, j)]² / X(i, j)

where t = 5 is the number of possible ranks, obs(i, j) is the number of times rank j was assigned to annotation source i, the expected number of observations is X(i, j) = n/t for factor i assigned rank j, and n is the number of observations.
A was tested against 408 a χ 2 distribution for significance with (t − 1) 2 degrees of freedom. The null hypothesis is 409 that all author preferences for all annotation sources will be equally frequent.
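The Partial Precision and Partial Recall measures defined earlier can be sketched as follows, with J_sim replaced by a hypothetical precomputed similarity lookup over toy reference and test EQ labels:

```python
# Sketch of Partial Precision / Partial Recall at the character-state level.
# sim stands in for J_sim between a reference EQ and a test EQ.

def partial_precision(ref_eqs, test_eqs, sim):
    # Mean best-match similarity of each test EQ against the reference EQs:
    # extra, dissimilar test annotations drag this down.
    return sum(max(sim(r, t) for r in ref_eqs) for t in test_eqs) / len(test_eqs)

def partial_recall(ref_eqs, test_eqs, sim):
    # Mean best-match similarity of each reference EQ against the test EQs:
    # reference annotations with no good test match drag this down.
    return sum(max(sim(r, t) for t in test_eqs) for r in ref_eqs) / len(ref_eqs)

# Toy similarities: t1 matches r1 exactly; t2 matches nothing.
sims = {("r1", "t1"): 1.0, ("r1", "t2"): 0.0,
        ("r2", "t1"): 0.5, ("r2", "t2"): 0.0}
sim = lambda r, t: sims[(r, t)]

print(partial_precision(["r1", "r2"], ["t1", "t2"], sim))  # (1.0 + 0.0)/2 = 0.5
print(partial_recall(["r1", "r2"], ["t1", "t2"], sim))     # (1.0 + 0.5)/2 = 0.75
```

The toy data illustrates the asymmetry described above: the spurious test annotation t2 lowers PP, while the partially matched reference annotation r2 lowers PR.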

Friedman's statistic, F, was used to test whether the mean ranks of the different annotation sources differed from chance:

F = [12 / (n t (t + 1))] Σ_{i=1..t} R_i² − 3 n (t + 1), with rank sums R_i = Σ_{j=1..t} j · obs(i, j)

where t = 5 is the number of annotation sources, i = 1..t indexes the annotation source, j = 1..t indexes the ranks that can be assigned to an annotation, obs(i, j) is the number of times rank j was assigned to factor i, and n is the number of observations, as before. F was tested against a χ² distribution for significance with t − 1 = 4 degrees of freedom.

[…] (Table 3), a total of 111 anatomical terms and 12 synonyms, and 20 quality terms and two synonyms, were added to the public versions of Uberon and PATO, respectively.
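Friedman's statistic can be sketched from rank data as below. This uses the standard rank-sum formulation, which should be equivalent to the obs(i, j) form above; the rank data are hypothetical:

```python
# Hedged sketch of Friedman's chi-square statistic from rank data.

def friedman_statistic(ranks):
    """ranks: n observations, each a list assigning ranks 1..t to the
    t annotation sources. Returns the Friedman chi-square statistic."""
    n = len(ranks)
    t = len(ranks[0])
    # Rank sum R_i for each annotation source i.
    R = [sum(obs[i] for obs in ranks) for i in range(t)]
    return 12.0 / (n * t * (t + 1)) * sum(r * r for r in R) - 3.0 * n * (t + 1)

# Toy data: 3 observations ranking t = 3 sources; source 0 always ranked first.
ranks = [[1, 2, 3], [1, 3, 2], [1, 2, 3]]
print(friedman_statistic(ranks))  # 14/3 ≈ 4.667
```

The statistic would then be compared against a χ² distribution with t − 1 degrees of freedom, as in the text (t − 1 = 4 for the five annotation sources studied here).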

The remaining subset of terms created by curators in the Merged ontology was not added to the public ontology versions, either because a different term was chosen for the GS annotation of a particular character or because the term was determined to be invalid after discussion among curators.

Using J_sim and I_n (see Section 3.5) to measure semantic similarity between the four […]

[Figure: Computing the subsumers of an EQ annotation. Step 1: split the annotation into its individual components. Step 2: obtain the superclasses of the individual components. Step 3: combine the E-Q-RE superclasses with the class expression superclasses to obtain the EQ subsumers.]

We computed consistency among curators for the EQ annotations generated for each character state. Figure 5 shows the mean inter-curator consistency scores across three pairwise […] to the Naïve round. Each curator changed EQ annotations between these rounds for more than 50% of character states. Among the EQs that were different between the two rounds,
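The three-step subsumer computation can be illustrated with a toy sketch. The real approach operates over OWL class expressions (including relational E-Q-RE forms), which this simplification reduces to an entity and a quality over a hypothetical mini-hierarchy:

```python
# Sketch of the three-step subsumer computation for an EQ annotation,
# using a hypothetical is_a hierarchy (child -> parents).
parents = {
    "pectoral fin": ["fin"], "fin": ["anatomical structure"],
    "anatomical structure": [],
    "round": ["shape"], "shape": ["quality"], "quality": [],
}

def superclasses(term):
    """Reflexive superclass closure of a term."""
    seen, stack = set(), [term]
    while stack:
        x = stack.pop()
        if x not in seen:
            seen.add(x)
            stack.extend(parents[x])
    return seen

def eq_subsumers(entity, quality):
    # Step 1: split the EQ annotation into its components (here: E and Q).
    # Step 2: obtain the superclasses of each component.
    # Step 3: combine the component superclasses into subsumers of the
    # whole EQ, formed only for this annotation rather than as a full
    # cross-product of the component ontologies.
    return {(e, q) for e in superclasses(entity) for q in superclasses(quality)}

subs = eq_subsumers("pectoral fin", "round")
print(len(subs))  # 3 entity superclasses x 3 quality superclasses = 9
```

Generating subsumers per annotation like this, rather than materializing the complete cross-product of Uberon and PATO, is what keeps the memory cost manageable.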

29% were more complex, 33% were less complex, and 38% retained the same complexity in the Knowledge round.

Due to the lack of significant differences between inter-curator consistency in the Naïve and […] Incomplete EQs refer to those statements that are only partially matched to ontology terms, e.g., either the E or the Q term is matched. In the case of post-compositions, some parts needed in the composition are not matched to an ontology term. Human–machine comparisons involving character states with incomplete EQs were awarded a similarity score of 0.

We found that machine–human consistency was significantly lower than inter-curator consistency. […]

Figure 5 shows the resulting PP, PR, J_sim, and I_n scores comparing SCP annotations generated with the Initial, Merged, or Augmented ontologies (plus, unfilled square, and filled square markers, respectively) to annotations from the human Knowledge round (as noted above, no statistically significant differences were found in SCP similarity to human annotations between the Naïve and Knowledge rounds). However, almost universally, the scores among the similarity metrics increased as the ontologies progressed from Initial to Augmented and then from Augmented to Merged. The one exception is Partial Precision, which declined from the Augmented to the Merged ontology. All these increases, and the one decrease, were found to be statistically significant with two-sided paired Wilcoxon rank sum tests at the Bonferroni-corrected threshold of α = 0.0008 (Table 5).

Table 5: Comparison of Semantic CharaParser annotations using Initial, Augmented, and Merged ontologies to measure the effect of ontology completeness on SCP–human consistency. Shown are p-values from two-sided paired Wilcoxon rank sum tests. Only SCP similarity to human-generated annotations from the Knowledge round is shown. Consistency between SCP annotations and human annotations was significantly lower than human inter-curator consistency.
Across all metrics, SCP annotation similarity to human annotations increased significantly from the Initial to the Augmented ontology and again from the Augmented to the Merged ontology, except for PP, which decreased from Augmented to Merged. Detailed results are in Supplementary Materials, Tables 3 and 4.

Similarly, in some cases, character states describing the presence of a structure are not annotated directly in the Gold Standard. This is because presence can be inferred using machine reasoning on annotations to different attributes (e.g., shape) of the structure (53).

[…] (34), the annotation in the Gold Standard consists of a single EQ phenotype: E: 'horn of hemipenis', Q: 'multicuspidate'. The presence of 'horn of hemipenis' is inferred from the assertion describing its shape and did not require a separate EQ annotation.

In other cases, "coarse"-level annotations were used that did not include every concept in the character state, due to limited expressivity of the EQ formalism. For example, take the […]

More complex annotations can be made using a less restrictive annotation tool (e.g., Protégé) rather than the EQ templates available in Phenex. However, allowing increased complexity when annotating in EQ format is likely to increase inter-curator variability.

[…] knowledge had no effect on inter-curator consistency and did not further differentiate them from […] This was true despite the fact that curators changed annotations considerably between the […] Although we expected increased annotation complexity when curators were at liberty to bring in additional knowledge, this was not borne out by the data.

These results indicate that lack of access to external knowledge is not one of the factors that contributes to SCP's low performance with respect to human curators. This is encouraging, because lack of access to external knowledge during machine curation would be a challenge to remedy.

Our results indicate that using more complete ontologies can significantly improve machine performance (Figures 4 and 5). This is encouraging because ontology completeness is continually improved through the synergistic efforts of the ontology and curator communities.

This finding leads to specific ideas for how the curation workflow could be optimized […] post-composed terms to be generated. As mentioned in Section 5.6, our results show that more comprehensive input ontologies will lead to improved performance of SCP.

The Gold Standard dataset for EQ phenotype curation developed herein is a high-quality resource that will be of value to the sizable community of biocurators annotating phenotypes using the EQ formalism. As illustrated here, the Gold Standard enables assessment of how well a machine can perform EQ annotation and of the impact of using different ontologies on that task. At present, machine-generated annotations are less similar to the Gold Standard than those of an expert human curator. The continued use of this corpus as a Gold Standard will enable training and evaluation of machine curation software in order to ultimately make phenotype annotation accurate at scale.