Abstract

The development of efficient text-mining tools promises to boost the curation workflow by significantly reducing the time needed to process the literature into biological databases. We have developed a curation support tool, neXtA5, that provides a search engine coupled with an annotation system directly integrated into a biocuration workflow. neXtA5 assists curation with modules optimized for the various curation tasks: document triage, entity recognition and information extraction.

Here, we describe the evaluation of neXtA5 by expert curators. We first assessed the annotations of two independent curators to provide a baseline for comparison. To evaluate the performance of neXtA5, we submitted requests and compared the neXtA5 results with the manual curation. The analysis focuses on the usability of neXtA5 to support the curation of two types of data: biological processes (BPs) and diseases (Ds). We evaluated the relevance of the papers proposed as well as the recall and precision of the suggested annotations.

The evaluation of the precision of document triage by neXtA5 showed that both curators agree with neXtA5 for 67% (BP) and 63% (D) of abstracts, while curators agree with each other on accepting or rejecting an abstract ~80% of the time. Hence, the precision of the triage system is satisfactory.

For concept extraction, curators approved 35 (BP) and 25% (D) of the neXtA5 annotations. Conversely, neXtA5 successfully annotated up to 36 (BP) and 68% (D) of the terms identified by curators. The user feedback obtained in these tests highlighted the need for improvement in the ranking function of neXtA5 annotations. Therefore, we transformed the information extraction component into an annotation ranking system. This improvement results in a top precision (precision at first rank) of 59 (D) and 63% (BP). These results suggest that when considering only the first extracted entity, the current system achieves a precision comparable with expert biocurators.

Introduction

Biomedical databases support many aspects of biological research, from getting basic information about a gene or a protein, to complex applications for data analysis. The usefulness of these databases critically depends on the amount of information, its correct interpretation and the regular updating of the content. For the vast majority of databases, these curatorial tasks are done manually by curators with expertise in the specific domain of interest of the database. To give an appreciation of the scope of the task, the volume of biomedical literature in PubMed, a free literature search service developed and maintained by the National Center for Biotechnology Information, currently containing 28 million citations, has increased at a sustained growth rate of ∼4% over the past 20 years (1).

It has been stated repeatedly that manual curation is inadequate to keep up with the volume of information published (for example in (2)). Meanwhile, no fully automated tools have been successfully implemented in the annotation workflow of major databases. The essential features required for a complete or at least partial replacement of manual curation include accurate prioritization of the literature to serve database-specific curation tasks, correct detection of bioentities (named-entity recognition) as well as recall and precision rates approaching manual curation. Moreover, since best practices in curated databases require the assignment of unique identifiers to entities derived from biomedical ontologies, automated tools should be able to convert natural language into these controlled languages. Tools able to perform those tasks can be used to perform literature triage, bioentity identification and normalization, relationship extraction [typically between a gene product and a disease (D) or a biological process (BP), for instance] and association of supporting evidence qualifiers (3). These tools would facilitate and accelerate the curation process, hence improving its cost-effectiveness and throughput.

The ideal tool for retrieving biomedical information would display a user-friendly interface, provide a powerful search tool from databases containing up-to-date biomedical data, allow a search within specific sections of articles, highlight terms of interest, display results that could be filtered and ranked, create annotations and respond quickly to requests. Existing text-mining tools exhibit some of these features but none has all the required functionality, as we show in our analysis of currently available text-mining-supported curation tools (Table 1). We assessed Textpresso Central (4), PubMed (5), NextBio, PolySearch (6), GoPubMed (7) and PubTator (8) and evaluated all parameters listed in Table 1. We also looked at the workflow of other text-mining tools, such as Argo (9–12), Egas (13), EXTRACT (14), MetastasisWay (15), Ontogene (16) and RegulonDB (17), but because they are dedicated to specific biomedical fields (and not appropriate for our use cases), we did not include them in our comparative study. The functionalities important to the curation workflow must be close in quality to that of manual annotation. However, direct comparison is not always possible as automatic systems exhibit characteristics that do not align one-to-one with curation tasks as performed by humans. More importantly, the digitalization of curation workflows may require challenging existing end-users’ practices and well-established workflows (18); data stewardship and capture need revision in order to also keep track of materials rejected by biocurators (wrong annotations, irrelevant articles etc.). Nevertheless, for the annotations proposed by the system, a precision of 60–70% seems a minimal, yet demanding, target to meet the curators’ expectations. Similar quantitative targets also apply to triage tasks. Considering that a 100% manual triage is not achievable, any improvement over existing tools is welcome. Indeed, triage tasks are a bottleneck and cannot be performed without using general-purpose search engines such as PubMed or Europe PubMed Central (PMC).

Table 1

Comparison of some existing text-mining tools

The performance of the main parameters important for the curation workflow is indicated by the degree of shading: white means feature not available; light gray, medium performance; and dark gray, very good performance.

neXtProt (19) is a knowledgebase focused on human proteins, which complements UniProtKB (20) by extending the content and tools, supporting use cases specifically relevant to human proteins. neXtProt manually annotates various aspects of protein function, variants and phenotypes (19, 21). To do this, we have developed a curation tool, the BioEditor, that allows curators to capture biomedical data. Annotations are structured in triplets, in accordance with the neXtProt BioEditor annotation data model. The triplets are composed of a subject (the protein being annotated); an object describing a gene ontology (GO) term, a D, an interaction partner etc.; and a relation describing how the subject and the object are related.
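The triplet structure described above can be pictured with a small data class. This is only a sketch: the class name, field names and the relation label are assumptions made for illustration and do not reflect the actual BioEditor schema. The example pairs the PIM1 accession with a GO identifier mentioned later in this article purely as placeholder values, not as a curated annotation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One subject-relation-object triplet (hypothetical field names)."""

    subject: str   # the protein being annotated, e.g. a neXtProt accession
    relation: str  # how the subject and the object are related
    obj: str       # a GO term, a disease, an interaction partner, ...


# Placeholder values for illustration only; "involved_in" is an assumed label.
example = Annotation(subject="NX_P11309", relation="involved_in", obj="GO:0051726")
print(example)
```

Making the triplet immutable (`frozen=True`) lets annotations be stored in sets and compared directly, which is convenient when proposed and validated annotations are later matched against each other.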

We have developed an automatic article-processing tool that addresses our specific curation needs, neXtA5 (22, 23). neXtA5 provides a search engine coupled with an annotation system, directly integrated into the workflow of curators. Thus, neXtA5 assists curation with specific modules optimized for the various curation tasks: document triage, entity annotation and relationship extraction. The tool performs literature retrieval and prioritization and creates annotations. The curator queries the system by entering a human gene name and an axis of interest. For the purposes of this study, two axes were evaluated: GO BP as well as Ds. The system returns a ranked list of abstracts and concepts for the relevant axis for each of the papers. The curator can select the relevant articles/gene/concept combination and validate/refine/reject annotations proposed by the system.

In previous work, we have optimized the ranking algorithm of neXtA5 for the triage task. The tool exhibits significant improvements of 191–261% compared to PubMed (22, 23). The present article describes the testing and evaluation of neXtA5 by expert curators. To evaluate the accuracy and performance of neXtA5, we submitted specific requests and then compared the results obtained from manual curation to the results given by the neXtA5 application. The analysis is focused on the usability of neXtA5 on two types of annotations: BPs and Ds, respectively defined as GO concepts (24, 25) and National Cancer Institute Thesaurus concepts (https://ncit.nci.nih.gov/). We have evaluated the relevance of the papers proposed as well as the recall and precision of the concepts extracted.

Methods and results

neXtA5 software infrastructure

The neXtA5 system was developed with Java/JavaScript technologies to improve the scientific literature curation process as it is currently performed with neXtProt.

Publication retrieval and concept extraction

SIB Text Mining houses the complete MEDLINE collection locally, updated on a weekly basis, in an information system named BioMed, that pre-indexes the collection using the Terrier and ElasticSearch platforms (23, 26) according to vocabularies relevant to the axes of interest. Again, here we focused on GO BP and Ds. BioMed services support the maintenance of several premier molecular biology databases, including Europe PMC’s SciLite or UniProt’s UPCLASS (27–29). Indexed papers are analyzed and concepts from the ontology of interest are extracted and stored in the BioMed database, as well as human gene names obtained from the neXtProt application programming interface (API). Once the information is stored, BioMed applies a combination of weighting schemas, which includes a vector space model representation (30), and the Okapi BM25 scoring function, which was tuned and tested during Text Retrieval Conference (TREC) competitions (31). This results in two outputs: (i) a ranked list of abstracts and (ii) for each abstract, a ranked list of concepts for the axis of interest. The ranking function is described in a previous publication (22).
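For readers unfamiliar with Okapi BM25, the following minimal sketch shows the textbook form of the scoring function on tokenized documents. It is not the BioMed implementation: the parameter values `k1=1.2` and `b=0.75` are common defaults assumed here, the non-negative IDF variant is one of several in use, and the tuned settings are those described in the cited publications.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query (textbook BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Non-negative IDF variant (assumed choice for this sketch).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus with made-up token lists.
toy_docs = [["pim1", "kinase", "cell", "cycle"], ["fyn", "kinase"], ["autophagy"]]
toy_scores = bm25_scores(["pim1", "cell"], toy_docs)
```

Only the first toy document contains the query terms, so it is the only one with a non-zero score.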

Document prioritization

The list of documents provided by the search engine is further ranked with a score based on a linear combination of factors; each search axis was tuned specifically to fit the curation model of neXtProt curators as detailed in (22, 23). This final score is calculated on the basis of the search engine score, combined with the range of concepts found in the paper and the term frequency–inverse document frequency (TF–IDF).
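A linear combination of this kind can be sketched as follows. The three inputs mirror the factors named above, but the function names, the dictionary keys and especially the weights are placeholders assumed for illustration; the tuned, axis-specific values are reported in (22, 23), not here.

```python
def final_score(engine_score, n_distinct_concepts, tfidf, weights=(1.0, 0.5, 0.5)):
    """Weighted sum of the three factors; the weights are placeholders."""
    w_engine, w_concepts, w_tfidf = weights
    return w_engine * engine_score + w_concepts * n_distinct_concepts + w_tfidf * tfidf


def rank_documents(docs, weights=(1.0, 0.5, 0.5)):
    """Return documents sorted by decreasing final score."""
    return sorted(
        docs,
        key=lambda d: final_score(d["engine"], d["concepts"], d["tfidf"], weights),
        reverse=True,
    )


# Hypothetical documents with made-up factor values.
candidates = [
    {"id": "doc-A", "engine": 2.0, "concepts": 1, "tfidf": 1.0},
    {"id": "doc-B", "engine": 2.0, "concepts": 3, "tfidf": 1.0},
]
ranked = rank_documents(candidates)
```

With equal search engine and TF–IDF values, the document covering more distinct concepts ("doc-B") is ranked first.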

Figure 1

Activity diagram of the literature curation process using neXtA5.

User interface

We have implemented a web-based curation interface that connects the BioEditor curation database with a set of APIs. The first screen is dedicated to the user input, with customized intake fields to refine the original query. The second panel displays the result of the triage function, with the final score granted to each document. Finally, in a third screen, a list of automatically generated annotations is proposed for each document. Each entry can be accepted as it stands, rejected or modified as needed. At the end, the curator can submit the annotation to the BioEditor. The work can also be saved at any time and completed subsequently. Indeed, the graphical user interface (GUI) is also linked to a historical database that keeps track of the curation process and results, which can also serve to provide relevance feedback. This history enables the system to remember every processed publication and remove it from upcoming searches (using the same query).

neXtA5 user interface

The workflow of the neXtA5 curation-support tool is shown in Figure 1.

The neXtA5 user interface is designed to assist specific biocuration tasks (Figure 2). The user performs a query, which is a gene name and an annotation axis. Additional features include the ability for users to exclude specific references that will not be retrieved by the system (e.g. publications that were previously processed or publications of low interest). Users can also provide keywords that must be ‘excluded’, for instance because they result in too many false positives, or ‘added’, in which case they will receive more weight during the ranking step. Finally, advanced options allow the user to restrict the search based on a range of publication dates and to set the maximum number of publications to retrieve.

Figure 2

neXtA5 user interface for query page.

The output of the query is a list of publications, ranked according to the relevance score developed in (22, 23). The list displays relevant information about the publication, including the PMID, the title, the year of publication, the relevance score and the annotation status. Different annotation statuses are possible: ‘not done’, ‘partial’ (when some but not all the annotations proposed by the system have been reviewed by the curator) or ‘completed’ (when every automatic annotation has been manually reviewed).
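The three statuses follow directly from how many of the proposed annotations have been reviewed. The function below is a minimal sketch of that rule; its name and signature are assumptions and are not taken from the neXtA5 code base.

```python
def annotation_status(n_proposed, n_reviewed):
    """Derive the status of a publication from its review counts."""
    if n_reviewed == 0:
        return "not done"
    if n_reviewed < n_proposed:
        return "partial"
    return "completed"
```

For example, a publication with six proposed annotations of which two have been reviewed would be reported as ‘partial’.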

From this ranked list, the curator can select a paper to curate; this opens another page in the user interface displaying the list of potential annotations identified by neXtA5. The potential annotations are presented in table form, showing the subject (which corresponds to the protein of interest), the relation, the object (concept) and the evidence code (Eco). For each annotation, when the user clicks on the ‘Show’ button (in the ‘Details’ column on the right), the abstract appears, highlighting the sentence from which the annotation was derived in blue and underlining the concept (Figure 3). Here, three operations are possible, from a pull-down menu in the ‘Action’ column; the curator can accept, modify or reject the annotations created by neXtA5. The curator can also change the relation linking subject and object as well as the Eco (currently these are set to default values in the interface); however, changes in the relation or the Eco do not impact the type of action; if the concept was not changed, then the annotation is considered as ‘accepted’.

Figure 3

neXtA5 user interface for curation. From the abstract of an article, neXtA5 extracts relevant concepts and displays a list of potential annotations. Here, the annotations related to PIM1 for the BPs and extracted from the abstract of (32) are shown.

neXtA5 usability study

To evaluate the usability of neXtA5 as a curation support system, we measured the recall and precision of the annotations proposed by the system as compared to manual curation. The precision corresponds to the fraction of relevant instances among the retrieved instances, while recall is the fraction of relevant instances that have been retrieved over the total number of relevant instances. Here, ‘instances’ can correspond to either documents or concepts.
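As a concrete reading of these two definitions, the short sketch below computes precision and recall from a set of retrieved instances and a set of relevant instances. The function name and the toy identifiers are assumptions for illustration; instances may be documents or concepts.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of instance identifiers."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Toy example: 4 instances retrieved, 3 relevant, 2 in common.
p, r = precision_recall({"a", "b", "c", "d"}, {"a", "b", "e"})
```

Here two of the four retrieved instances are relevant (precision 0.5) and two of the three relevant instances were retrieved (recall 2/3).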

Four experienced curators from the neXtProt team reviewed the neXtA5 output. The evaluation focused on neXtA5 annotation for 12 different proteins: CDK2 (NX_P24941), CSK (NX_P41240), FYN (NX_P06241), IRAK4 (NX_Q9NWZ3), LRRK2 (NX_Q5S007), LYN (NX_P07948), PIM1 (NX_P11309), RIPK2 (NX_O43353), SGK1 (NX_O00141), STK11 (NX_Q15831), SYK (NX_P43405) and ZAP70 (NX_P43403). The proteins were selected on the basis of having sufficient literature to allow proper evaluation of the system, i.e. >100 papers in a PubMed search, while avoiding the gene normalization problem, i.e. the gene name is not used as a synonym for another gene or as an acronym for a term used elsewhere in the literature. Examples of proteins we avoided include BTK (used in orthopedic papers as an acronym for ‘below the knee’) and ABL1 (used for ABL1 and ABL2 in older literature). The latter could have been controlled using the date range, while the former can be handled by excluding the word ‘knee’. Having a certain number of different targets ensures that we cover a wide range of biological research areas, to increase the number of distinct concepts reported in the literature. This was intended to control for biases, for example in the concept extraction step (as certain concepts have labels that are more difficult to extract by automated tools) and in the gene name extraction step (certain genes may have an abnormally high rate of false positives or false negatives, for example if a synonym is shared with another gene name or a concept or if the main gene name is not widely used in the literature).

Moreover, we ensured that each abstract was reviewed by two different curators, so as to have a measure of confidence in the evaluation of the annotations proposed by the automatic system, the rationale being that when two curators do not agree, an error by neXtA5 should be less penalized.

Table 2

Semantic classification of concepts annotated by the curators or proposed by neXtA5

| Semantic classification | GO terms |
| --- | --- |
| 1 | Reactive oxygen species biosynthetic process |
| | Reactive oxygen species metabolic process |
| | ROS generation |
| 2 | S phase |
| | DNA replication |
| | Regulation of cell cycle |
| 3 | Autophagy |
| | Autophagosome assembly |
| | Autophagosome formation |

Setting the baseline: inter-curator agreement

Since curation is a subjective process to some extent, before comparing neXtA5’s performance as evaluated by curators, we determined the agreement between different curators for the tasks we evaluated for neXtA5.

Strategy for assessing agreement with respect to concept extraction

Since the BP branch of the GO has nearly 30 000 classes, the selection of 2 different terms by 2 curators does not automatically imply a disagreement. The evaluation must take into account how related two terms are to decide whether two curators (or a curator and the automatic system) recognized a similar concept or not. To do this, we manually reviewed all annotated concepts (both by curators and by neXtA5) for all abstracts and manually assigned each concept to a semantic class, numbered from 1 to n for each abstract. This is illustrated in Table 2. In this example, curators identified 9 different GO terms, which we classified into three semantic classes, labeled 1, 2 and 3. Concepts falling in the same semantic class were considered equivalent in our evaluation.

Here, we have decided to use a manual semantic classification approach rather than the hierarchical structure of the GO. While the hierarchy of the GO could be used for this purpose (as in cases 1 and 3 in Table 2), in other cases GO terms that represent the same experiment correspond to completely different areas of the tree, as shown in case 2 in Table 2. We have grouped the three GO terms S phase (GO:0051320), DNA replication (GO:0006260) and regulation of cell cycle (GO:0051726) into the same semantic classification group by manual classification, whereas these concepts belong to three different branches of the GO, as shown in Figure 4.
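The equivalence rule can be illustrated with the terms listed in Table 2. The lookup table and helper below are a sketch only: in the study the classes were assigned by hand for each abstract, whereas this example flattens them into a single dictionary, and the names `SEMANTIC_CLASS` and `same_concept` are assumptions.

```python
# Manual class assignments copied from Table 2 (keys lower-cased for matching).
SEMANTIC_CLASS = {
    "reactive oxygen species biosynthetic process": 1,
    "reactive oxygen species metabolic process": 1,
    "ros generation": 1,
    "s phase": 2,
    "dna replication": 2,
    "regulation of cell cycle": 2,
    "autophagy": 3,
    "autophagosome assembly": 3,
    "autophagosome formation": 3,
}


def same_concept(term_a, term_b):
    """Two terms are considered equivalent if they share a semantic class."""
    class_a = SEMANTIC_CLASS.get(term_a.lower())
    class_b = SEMANTIC_CLASS.get(term_b.lower())
    return class_a is not None and class_a == class_b
```

Under this rule, ‘S phase’ and ‘DNA replication’ count as the same concept even though they sit in different branches of the GO, while ‘Autophagy’ and ‘S phase’ do not.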

Figure 4

Ancestor charts of the GO terms from semantic classification 2, shown in Table 2 [S phase (GO:0051320), DNA replication (GO:0006260) and regulation of cell cycle (GO:0051726)], using https://www.ebi.ac.uk/QuickGO/.

(i) Inter-curator agreement test for precision of document retrieval

We first evaluated the inter-curator agreement with respect to the relevance of abstracts proposed by neXtA5, the so-called literature triage task. For this task, we determined the fraction of the first top-ranking 20 papers proposed by neXtA5 that were deemed relevant by both curators (assessed by whether or not they had identified relevant concepts in the abstract). The criteria for selecting an abstract as relevant for annotation were that it had information indicating that there was data in the full text paper relevant to the axis of interest. To exclude papers with general statements (rather than actual data), we specified the following guidelines: exclude statements from titles and from the introductory part of the abstract (highlighted in Figure 5); and do not capture any ‘hypothesis’ type information, such as ‘We hypothesized that the protein X performs process Y.’ Examples of such sentences include ‘Since activation of Ras oncogenes is a common oncogenic event leading to the activation of multiple effector pathways, we explored if Ras could induce Fyn expression.’ (33); ‘The fact that IRAK4, another IRAK family member necessary for the IL-1 pathway, is able to phosphorylate IRAK in vitro suggests that IRAK4 might be the IRAK kinase.’ (34); ‘The mechanism of activation for IRAK4 is currently unknown, and little is known about the role of IRAK4 kinase in cytokine production, particularly in different human cell types.’ (35); ‘In this study, we analyzed the relative PTPN22 and CSK expression in peripheral blood from 89 RA patients and 43 controls to determine if the most relevant PTPN22 (rs2488457, rs2476601 and rs33996649) and CSK (rs34933034 and rs1378942) polymorphisms may influence on PTPN22 and CSK expression in rheumatoid arthritis (RA).’ (36).

Figure 5

neXtA5 user interface for curation. One of the guidelines for the curators to select relevant documents was to not consider statements from titles and from the introductory part of the abstract. Here, the introduction of the abstract of (37) related to FYN function (BP axis) is highlighted in yellow.

For the 12 proteins, a total of 242 abstracts were analyzed for each axis (for 12 targets, we expected to analyze 240 abstracts; however, in some cases abstracts with the same score were presented in a different order, which led to the annotation of 2 additional abstracts). As shown in Table 3, in 83% of cases for BP and in 80% of cases for D, both curators made the same decision with respect to the relevance of an abstract for the axis of interest.

Table 3

Inter-curator agreement analysis

| | BPs | Ds |
| --- | --- | --- |
| Papers accepted by both curators | 162 (67%) | 152 (63%) |
| Papers rejected by both curators | 39 (16%) | 48 (17%) |
| Papers rejected by just one curator | 41 (17%) | 42 (20%) |
| Total papers analyzed | 242 | 242 |

(ii) Inter-curator agreement test for precision of concept retrieval

The precision of concept retrieval corresponds to the number of relevant terms extracted in each document. We assessed this by determining the rate at which both curators extracted the same concepts from an abstract. Again, specific curation guidelines were given: when similar descriptors are proposed, use the most accurate one, i.e. choose the child term in preference to the parent term (for example, reject the annotation suggesting ‘Neoplasm’ when ‘Ovarian carcinoma’ is also mentioned in another annotation); annotations describing techniques (such as ‘immunohistochemistry’) are acceptable as an indication of experimental data in the full text paper; and annotations describing negative evidence are included as relevant for annotation. If a concept was modified from the original concept, it had to be within the same branch of the ontology.

For this task, 45 abstracts of the BP axis and 51 abstracts of the D axis were annotated by two curators with BP and D terms, respectively (while the expected number of annotated papers for this task is 48, the actual number varies because the papers chosen by different curators for annotation may differ). This corresponds to a minimum of four abstracts per curator and per protein, with a few additional abstracts to ensure that at least two curators reviewed each abstract (the additional abstracts correspond to cases where curators made different decisions with respect to the relevance of an abstract for an axis). For the 45 abstracts annotated for the BP axis by both curators, at least 1 common term was found in 42 abstracts (93% of abstracts, Figure 6A). The overall average inter-curator agreement rate with respect to concepts, i.e. the average proportion of concepts annotated by both curators relative to all concepts found by either curator, was 60%. For the D axis, out of the 51 abstracts annotated by both curators, the 2 curators found at least 1 common term in 48 abstracts (94% of abstracts; Figure 6B). The overall average inter-curator agreement rate with respect to concepts was 87%.
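The agreement rate defined above, concepts found by both curators divided by concepts found by either, can be sketched as below. Concepts are represented by their semantic class numbers. Averaging per abstract is an assumption of this sketch, as is the convention that two empty annotation sets count as full agreement; the toy data are made up.

```python
def agreement_rate(classes_a, classes_b):
    """Shared concepts divided by all concepts found by either curator."""
    a, b = set(classes_a), set(classes_b)
    union = a | b
    # Assumed convention: two empty annotation sets count as full agreement.
    return len(a & b) / len(union) if union else 1.0


def mean_agreement(abstract_pairs):
    """Average the per-abstract agreement rate over all abstracts."""
    rates = [agreement_rate(a, b) for a, b in abstract_pairs]
    return sum(rates) / len(rates)


# Three made-up abstracts: full, partial and no agreement.
toy_pairs = [({1, 2}, {1, 2}), ({1, 2, 3}, {1}), ({1}, {2})]
toy_mean = mean_agreement(toy_pairs)
```

The three toy abstracts give rates of 1, 1/3 and 0, for a mean of 4/9.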

Figure 6

Inter-curator agreement with respect to concepts in BP (A) and D (B) axes showing the proportion of common concepts found by both curators. The number indicated is the number of common concepts identified by both curators (0–4 for BP; 0–6 for Ds).

Hence, the inter-curator agreement is ~80% with respect to relevance of abstracts, regardless of the axis (Table 3), and curators find at least 1 common concept in over 90% of the abstracts (Figure 6). On average, 60% of the concepts in an abstract were identified by both curators for BP and 87% for Ds. This may reflect the greater complexity of GO compared to D terminology, which likely hampers annotation consistency.

neXtA5 evaluation

We then evaluated the precision and the recall of the neXtA5 system. Precision was evaluated at the level of both document retrieval and information extraction; recall was evaluated against the manually extracted terms, taken as the set of expected true positives.

(i) neXtA5 precision for document retrieval

Using the data from task (i) for inter-curator agreement, we can derive the fraction of the abstracts retrieved by neXtA5 and that curators assessed as relevant for the axis of interest. We find that both curators agree with neXtA5 for 67% of the abstracts suggested in the BP axis and for 63% of the abstracts in the D axis. Moreover, for 15% of the abstracts, both curators judged that the abstract was not relevant for the axis of interest (Table 3).

(ii) neXtA5 precision for information extraction

To determine the fraction of relevant concepts that neXtA5 retrieved, we manually evaluated each of the annotations proposed by neXtA5 for the 20 first abstracts, for each of the 12 target proteins (in cases where all concepts were rejected, additional abstracts were annotated until we reached 20 evaluated abstracts). Again, each abstract was evaluated independently by 2 curators, for a total of 254 abstracts. From these 254 abstracts, a total of 3175 annotations were proposed by the neXtA5 system. For the BP axis, curators approved or modified the proposed descriptor (a modification is a change of term within the same branch of the GO) for 35% of the terms; hence, 65% of the descriptors were considered as non-relevant. For the D axis, curators approved or modified the proposed descriptors for 25% of the cases and rejected 75% of the descriptors (Table 4).
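The precision figures follow from counting both accepted and modified descriptors as relevant. The short check below reproduces them from the counts reported in Table 4; the function name is an assumption for illustration.

```python
def extraction_precision(accepted, modified, total):
    """Fraction of proposed descriptors that curators accepted or modified."""
    return (accepted + modified) / total


# Counts taken from Table 4.
bp_precision = extraction_precision(accepted=699, modified=413, total=3175)
d_precision = extraction_precision(accepted=1094, modified=146, total=4967)

print(f"BP: {bp_precision:.0%}, D: {d_precision:.0%}")
```

Rounded to whole percentages, this gives the 35% (BP) and 25% (D) reported in the text.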

Table 4

Precision analysis for BP and D axes

| | Total number of descriptors analyzed | Accepted | Modified | Rejected | Precision |
| --- | --- | --- | --- | --- | --- |
| BP | 3175 | 699 (22%) | 413 (13%) | 2061 (65%) | 35% |
| Ds | 4967 | 1094 (22%) | 146 (3%) | 3727 (75%) | 25% |

Table 5

Average number of terms found by curators (common terms and total terms) and by neXtA5 for BP and D axes

| | BPs | Ds |
| --- | --- | --- |
| Number of concepts identified by at least one curator and neXtA5 | 1.1 | 1.2 |
| Manual curator (average number of concepts/papers) | 2.4 | 1.5 |
| neXtA5 (average number of concepts/papers) | 6.2 | 6.0 |

neXtA5 recall for annotations

To assess recall, curators manually extracted descriptors (independently of the neXtA5 information extraction module) from the first 4 abstracts for each of the 12 target proteins, as described in task (ii). Again, two curators performed the task for each abstract. We evaluated neXtA5 with two different criteria: (i) based only on the descriptors identified by both curators or (ii) based on the descriptors identified by either curator. The latter assessment is the best evaluation for an automated system; if a descriptor is identified manually, regardless of whether this assignment may be disputable, we do not expect an automatic system to be capable of such nuanced judgement.

For the BP axis, neXtA5 successfully identified 27% of the descriptors found by both curators and 36% of the terms identified by either curator (Supplementary Data Table 1). For the D axis, neXtA5 identified 42% of the terms found by both curators and 68% of the terms identified by either curator.
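The two recall criteria differ only in the reference set: the intersection of the two curators' descriptors for criterion (i) and their union for criterion (ii). The sketch below makes this explicit for a single abstract; the function names and the toy class numbers are assumptions, and how per-abstract values were aggregated in the study is not specified here.

```python
def recall(system, reference):
    """Fraction of reference descriptors that the system also proposed."""
    reference = set(reference)
    return len(set(system) & reference) / len(reference) if reference else 0.0


def recall_both_and_either(system, curator_a, curator_b):
    """Recall against (i) descriptors of both curators and (ii) of either curator."""
    a, b = set(curator_a), set(curator_b)
    return recall(system, a & b), recall(system, a | b)


# Made-up semantic class numbers for one abstract.
r_both, r_either = recall_both_and_either(
    system={1, 3, 7}, curator_a={1, 2}, curator_b={1, 3}
)
```

In this toy case the system finds the single descriptor shared by both curators (recall 1.0 under criterion (i)) and two of the three descriptors found by either curator (recall 2/3 under criterion (ii)).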

Discussion

Improvement of the manual annotation

Our results show an inter-annotator agreement (IAA) of ~80% with respect to relevance of abstracts, regardless of the axis (Table 3), and curators found at least 1 common concept in over 90% of the abstracts (Figure 6). There is little data in the literature where inter-curator agreement was evaluated, so it is difficult to judge whether this is expected. A recent study, showing the mining of clinical attributes of genomic variants using Egas, a web-based platform for text-mining-assisted literature curation, presented an overall IAA of 74% (13), while 2 other studies investigating the text-mining assisted biocuration workflows in Argo exhibited an IAA of 68.12% or varying between 67% and 84% (9, 10). Looking at some events of divergent decisions by the two curators, it seems that in most cases there was a drift from the curation guidelines and that if we return to the guidelines we can more often agree on the decision.

Performance of neXtA5

We have developed neXtA5, a system that enhances the biocuration workflow by prioritizing research articles for specific tasks, and evaluated its performance with respect to document triage, precision and recall compared with manual annotation. These parameters are essential to develop a tool that can be used in the daily workflow of curated biological databases. We evaluated the effectiveness of the system to support the curation of GO BPs and Ds.

With respect to document retrieval, ~15% of the documents proposed by neXtA5 are not relevant for the task at hand. This is quite acceptable, given that neXtProt curators routinely use PubMed to retrieve literature, which returns a much higher fraction of non-relevant documents because it only allows users to specify keywords, not a general domain of interest. Moreover, this 15% is similar to the rate at which curators disagree with each other on the relevance of a document (17–20%; Table 3), suggesting that the current triage effectiveness is approaching a theoretical upper bound.

For the concept extraction task, neXtA5 had a precision rate of 35% for BP and 25% for D and a recall rate of 27% for BP and 42% for D. It must be noted that neXtA5 retrieves 2.6 times more descriptors compared to curators in the BP axis (Table 5). Indeed, neXtA5 finds an average of 6.2 concepts per abstract for the 45 abstracts annotated by both curators for the recall test, while curators find 2.4 terms and 1.1 common terms on average. In the D axis, neXtA5 finds an average of 6 concepts per abstract for the 45 abstracts annotated by both curators, while curators find 1.5 terms and 1.2 common terms on average. Therefore, neXtA5 finds four times more concepts than curators for the D axis. This high level of identified descriptors contributes to the low precision rate of neXtA5.

While the precision and recall performance do not yet allow for completely automated annotation, the fraction of relevant terms certainly makes the system a valuable enhancement to manual curation tasks.

Potential improvements of neXtA5

While performing the evaluations, the curators, drawing on their extensive annotation experience, noticed some recurring issues that should be addressed to enhance the performance of neXtA5.

Heterogeneity of neXtA5 concept extraction by annotation target

We noticed significant heterogeneity in the precision of concept extraction among the different targets. For instance, in the BP axis, only 17% of the terms proposed for ZAP70 by neXtA5 were accepted or modified by the curators compared to 49% of the terms proposed for LRRK2 (Table 6).

Table 6

Precision analysis for BP (A) and D (B) axes

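| (A) BPs: protein | Number of terms analyzed per protein | Precision | (B) Ds: protein | Number of terms analyzed per protein | Precision |
|---|---|---|---|---|---|
| LRRK2 | 247 | 49% | LYN | 398 | 41% |
| SGK1 | 301 | 43% | SYK | 570 | 36% |
| SYK | 262 | 40% | ZAP70 | 250 | 31% |
| IRAK4 | 236 | 39% | PIM1 | 444 | 30% |
| LYN | 265 | 38% | FYN | 402 | 23% |
| FYN | 343 | 36% | RIPK2 | 194 | 23% |
| PIM1 | 327 | 35% | IRAK4 | 351 | 22% |
| CDK2 | 333 | 33% | CDK2 | 504 | 21% |
| RIPK2 | 145 | 32% | LRRK2 | 452 | 19% |
| CSK | 318 | 29% | SGK1 | 491 | 19% |
| STK11 | 156 | 27% | STK11 | 635 | 18% |
| ZAP70 | 242 | 17% | CSK | 276 | 18% |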

This discrepancy might be due to ambiguous synonyms (formation, growth etc.), terms that are too vague (signaling, signal transduction, signaling cascade, regulation, carcinogenesis, tumor, autoimmune disease etc.), technical terms (RNA interference, RNAi, knockout mice etc.) or terms that are not relevant for the axis of interest (pathogenesis, memory, methylation, phosphorylation, localization, point mutations, gene variant, accumulation, sensitivity etc.; Table 7).

Table 7

List of rejected terms by the curators in BP (A) and D (B) axes

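Table 7A List of rejected terms by the curator in biological process axis

| Unique concept | Proposed concept | Proposed synonym | Rejected | Modified | Accepted | Total |
|---|---|---|---|---|---|---|
| GO:0023052 | Signaling | Signaling | 154 (56%) | 63 (23%) | 56 (21%) | 273 |
| GO:0032502 | Developmental process | Developmental process | 112 (86%) | 13 (10%) | 5 (4%) | 130 |
| GO:0065007 | N/A | Biological regulation | 110 (81%) | 26 (19%) | 0 (0%) | 136 |
| GO:0016310 | Phosphorylation | Phosphorylation | 110 (48%) | 55 (24%) | 66 (29%) | 231 |
| GO:0007165 | Signal transduction | Signal transduction | 90 (76%) | 11 (9%) | 17 (14%) | 118 |
| GO:0006351 | Transcription and DNA-templated | Transcription and DNA-templated | 88 (79%) | 6 (5%) | 18 (16%) | 112 |
| GO:0009058 | Biosynthetic process | Biosynthetic process | 83 (80%) | 19 (18%) | 2 (2%) | 104 |
| GO:0040007 | Growth | Growth | 74 (80%) | 16 (17%) | 2 (2%) | 92 |
| GO:0010467 | Gene expression | Gene expression | 55 (76%) | 1 (1%) | 16 (22%) | 72 |
| GO:0006915 | Apoptotic process | Apoptotic process | 44 (52%) | 4 (5%) | 36 (43%) | 84 |
| GO:0009405 | N/A | Pathogenesis | 42 (100%) | 0 (0%) | 0 (0%) | 42 |
| GO:0051726 | Regulation of cell cycle | Regulation of cell cycle | 40 (82%) | 2 (4%) | 7 (14%) | 49 |
| GO:0007049 | Cell cycle | Cell cycle | 37 (65%) | 4 (7%) | 16 (28%) | 57 |
| GO:0006954 | Inflammatory response | Inflammatory response | 36 (62%) | 2 (3%) | 20 (34%) | 58 |
| GO:0006283 | Transcription-coupled nucleotide-excision repair | TCR | 34 (77%) | 9 (20%) | 1 (2%) | 44 |
| GO:0016246 | N/A | RNA interference | 31 (100%) | 0 (0%) | 0 (0%) | 31 |
| GO:0008283 | Cell proliferation | Cell proliferation | 31 (60%) | 1 (2%) | 20 (38%) | 52 |
| GO:0009056 | Catabolic process | Catabolic process | 26 (59%) | 10 (23%) | 8 (18%) | 44 |
| GO:0033673 | N/A | Negative regulation of kinase activity | 26 (100%) | 0 (0%) | 0 (0%) | 26 |
| GO:0051179 | Localization | Localization | 24 (69%) | 10 (29%) | 1 (3%) | 35 |
| GO:0008152 | Metabolic process | Metabolic process | 22 (88%) | 0 (0%) | 3 (12%) | 25 |
| GO:0016049 | Cell growth | Cell growth | 21 (81%) | 1 (4%) | 4 (15%) | 26 |
| GO:0045087 | Innate immune response | Innate immune response | 19 (76%) | 0 (0%) | 6 (24%) | 25 |
| GO:0001816 | Cytokine production | Cytokine production | 17 (52%) | 1 (3%) | 15 (45%) | 33 |
| GO:0008219 | Cell death | Cell death | 16 (57%) | 2 (7%) | 10 (36%) | 28 |
| GO:0006412 | Translation | Translation | 16 (53%) | 1 (3%) | 13 (43%) | 30 |
| GO:0042110 | T cell activation | T-cell activation | 14 (70%) | 0 (0%) | 6 (30%) | 20 |
| GO:0051320 | S phase | S phase | 13 (46%) | 5 (18%) | 10 (36%) | 28 |
| GO:0046903 | Secretion | Secretion | 13 (50%) | 7 (27%) | 6 (23%) | 26 |
| GO:0006914 | Autophagy | Autophagy | 13 (65%) | 0 (0%) | 7 (35%) | 20 |
| GO:0030154 | Cell differentiation | Cell differentiation | 12 (86%) | 2 (14%) | 0 (0%) | 14 |
| GO:0032259 | N/A | Methylation | 12 (100%) | 0 (0%) | 0 (0%) | 12 |
| GO:0006260 | DNA replication | DNA replication | 11 (55%) | 0 (0%) | 9 (45%) | 20 |
| GO:0009293 | N/A | Transduction | 11 (79%) | 3 (21%) | 0 (0%) | 14 |
| GO:0006810 | N/A | Transport | 11 (100%) | 0 (0%) | 0 (0%) | 11 |
| GO:0046960 | Sensitization | Sensitization | 11 (92%) | 0 (0%) | 1 (8%) | 12 |
| GO:0016311 | N/A | Dephosphorylation | 10 (100%) | 0 (0%) | 0 (0%) | 10 |

Table 7B (Ds) List of rejected terms by the curator in disease axis

| Unique concept | Proposed concept | Proposed synonym | Rejected | Modified | Accepted | Total |
|---|---|---|---|---|---|---|
| C2991 | Disease or Disorder | Condition | 148 (89%) | 11 (7%) | 8 (5%) | 167 |
| C3262 | Neoplasm | Tumor | 100 (66%) | 12 (8%) | 39 (26%) | 151 |
| C45576 | N/A | Mutation | 90 (100%) | 0 (0%) | 0 (0%) | 90 |
| C9305 | Malignant neoplasm | Cancer | 90 (74%) | 2 (2%) | 29 (24%) | 121 |
| C3114 | Hypersensitivity | Sensitivity | 50 (94%) | 1 (2%) | 2 (4%) | 53 |
| C3137 | Inflammation | Inflammation | 49 (73%) | 1 (1%) | 17 (25%) | 67 |
| C18264 | Pathogenesis | Pathogenesis | 46 (96%) | 1 (2%) | 1 (2%) | 48 |
| C120860 | N/A | Accumulation | 43 (100%) | 0 (0%) | 0 (0%) | 43 |
| C18078 | Carcinogenesis | Tumorigenesis | 36 (71%) | 4 (8%) | 11 (22%) | 51 |
| C26845 | Parkinson’s disease | Parkinson’s disease | 33 (79%) | 0 (0%) | 9 (21%) | 42 |
| C19296 | N/A | Deletion | 32 (100%) | 0 (0%) | 0 (0%) | 32 |
| C50753 | N/A | Staining | 30 (100%) | 0 (0%) | 0 (0%) | 30 |
| C3324 | Peutz–Jeghers syndrome | Peutz–Jeghers syndrome | 29 (83%) | 0 (0%) | 6 (17%) | 35 |
| C14339 | N/A | Knockout mice | 27 (100%) | 0 (0%) | 0 (0%) | 27 |
| C20200 | N/A | Outcome | 26 (100%) | 0 (0%) | 0 (0%) | 26 |
| C45581 | Gene amplification abnormality | Amplification | 26 (96%) | 0 (0%) | 1 (4%) | 27 |
| C3671 | N/A | Injury | 25 (86%) | 4 (14%) | 0 (0%) | 29 |
| C53802 | Adverse event associated with the gastrointestinal system | Gastrointestinal | 25 (83%) | 0 (0%) | 5 (17%) | 30 |
| C42077 | Cellular infiltrate | Infiltration | 24 (89%) | 0 (0%) | 3 (11%) | 27 |
| C17666 | N/A | Germline mutations | 23 (100%) | 0 (0%) | 0 (0%) | 23 |
| C75004 | Invasion | Invasion | 22 (79%) | 1 (4%) | 5 (18%) | 28 |
| C55998 | N/A | Platelets | 19 (100%) | 0 (0%) | 0 (0%) | 19 |
| C3161 | Leukemia | Leukemia | 19 (79%) | 0 (0%) | 5 (21%) | 24 |
| C53791 | Adverse event associated with infection | Infection | 18 (51%) | 14 (40%) | 3 (9%) | 35 |
| C54685 | Tissue adhesion | Adhesion | 17 (94%) | 0 (0%) | 1 (6%) | 18 |
| C94604 | N/A | Mouse model | 16 (100%) | 0 (0%) | 0 (0%) | 16 |
| C39723 | Immune system finding | Immune system | 16 (94%) | 0 (0%) | 1 (6%) | 17 |
| C19987 | Cancer progression | Cancer progression | 16 (89%) | 0 (0%) | 2 (11%) | 18 |
| C4089 | Polyposis | Polyposis | 16 (89%) | 1 (6%) | 1 (6%) | 18 |
| C93210 | Inflammatory disorder | Inflammatory diseases | 16 (76%) | 0 (0%) | 5 (24%) | 21 |
| C19151 | Metastasis | Metastases | 16 (36%) | 5 (11%) | 24 (53%) | 45 |
| C53809 | Adverse event associated with the vascular system | Vascular | 15 (88%) | 0 (0%) | 2 (12%) | 17 |
| C17609 | Tumor progression | Tumor progression | 15 (83%) | 0 (0%) | 3 (17%) | 18 |
| C3208 | Lymphoma | Lymphoma | 15 (68%) | 0 (0%) | 7 (32%) | 22 |
| C16897 | N/A | Necrosis | 14 (100%) | 0 (0%) | 0 (0%) | 14 |
| C27990 | Toxicity | Toxicity | 14 (93%) | 0 (0%) | 1 (7%) | 15 |
| C36117 | Invasive lesion | Invasive | 14 (70%) | 2 (10%) | 4 (20%) | 20 |
| C62200 | N/A | Point mutation | 13 (100%) | 0 (0%) | 0 (0%) | 13 |
| C39725 | Immunodeficiency | Immunodeficient | 13 (93%) | 0 (0%) | 1 (7%) | 14 |
| C120867 | N/A | Bacteria | 13 (72%) | 5 (28%) | 0 (0%) | 18 |
| C102283 | N/A | Extracted | 12 (100%) | 0 (0%) | 0 (0%) | 12 |
| C17354 | N/A | Frameshift mutation | 12 (100%) | 0 (0%) | 0 (0%) | 12 |
| C28193 | N/A | Syndrome | 12 (100%) | 0 (0%) | 0 (0%) | 12 |
| C2873 | N/A | Aneuploidy | 12 (100%) | 0 (0%) | 0 (0%) | 12 |
| C45582 | N/A | Duplication | 12 (100%) | 0 (0%) | 0 (0%) | 12 |
| C18016 | Loss of heterozygosity | Allelic loss | 12 (92%) | 0 (0%) | 1 (8%) | 13 |
| C14174 | N/A | Metastatic | 12 (86%) | 2 (14%) | 0 (0%) | 14 |
| C50774 | Tissue degeneration | Degeneration | 12 (86%) | 0 (0%) | 2 (14%) | 14 |
| C2916 | Carcinoma | Carcinomas | 12 (80%) | 2 (13%) | 1 (7%) | 15 |
| C3340 | Polyp | Polyps | 12 (75%) | 1 (6%) | 3 (19%) | 16 |
| C2950 | Cytogenetic abnormality | Chromosomal aberration | 11 (92%) | 0 (0%) | 1 (8%) | 12 |
| C3117 | Hypertension | Hypertension | 11 (73%) | 0 (0%) | 4 (27%) | 15 |
| C4872 | Breast carcinoma | Breast carcinomas | 11 (39%) | 0 (0%) | 17 (61%) | 28 |
| C120945 | N/A | Inclusions | 10 (100%) | 0 (0%) | 0 (0%) | 10 |
| C17212 | N/A | Cell transformation | 10 (100%) | 0 (0%) | 0 (0%) | 10 |
| C18133 | N/A | Missense mutations | 10 (100%) | 0 (0%) | 0 (0%) | 10 |
| C3101 | N/A | Inherited disease | 10 (100%) | 0 (0%) | 0 (0%) | 10 |
| C3174 | N/A | Chronic myelogenous leukemia | 10 (100%) | 0 (0%) | 0 (0%) | 10 |
| C48189 | N/A | Genome instability | 10 (100%) | 0 (0%) | 0 (0%) | 10 |
| C48275 | N/A | Fatal | 10 (100%) | 0 (0%) | 0 (0%) | 10 |
| C8509 | Primary neoplasm | Primary tumor | 10 (71%) | 0 (0%) | 4 (29%) | 14 |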

Terms always rejected are highlighted in grey. The list is limited to terms proposed at least 30 times by the system. The proposed label does not necessarily correspond to the primary class label; it may be the term synonym identified by neXtA5.

A few concepts considered by the annotators (~4%) were chosen from terms not indexed by the named entity recognition module. This minor inconsistency in the input may have contributed to some of the discrepancy between the manual and neXtA5 annotations.

Highly rejected terms

We also noticed for both axes that certain terms are frequently rejected, while others are always rejected (highlighted terms; Table 7, Supplementary Data Table 2). These include synonyms with multiple meanings (formation, growth etc.), terms that are too vague (signaling, signal transduction, regulation, developmental process, carcinogenesis, tumor, autoimmune disease, genome instability, outcome etc.), technical terms (RNA interference, RNAi, knockout mice, staining etc.) and terms that are not relevant for the axis of interest (such as pathogenesis, memory, methylation, phosphorylation, dephosphorylation, localization, point mutation, accumulation, sensitivity etc.). One possible approach to alleviate this problem would be to put these terms on a blacklist so that they are no longer proposed as annotations. Ideally, these terms would also be excluded from the prioritization step, which would have the added advantage of improving triage.
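The blacklist idea can be sketched in a few lines. This is an illustrative example only: the annotation structure (a dictionary with `concept_id` and `label` keys), the function name and the choice of excluded identifiers are assumptions, with the identifiers taken from terms that were always rejected in Table 7.

```python
# Hypothetical blacklist built from terms the curators always rejected
# (identifiers from Table 7; the selection itself is illustrative).
EXCLUDED_CONCEPTS = frozenset({
    "GO:0009405",  # pathogenesis
    "GO:0016246",  # RNA interference
    "GO:0032259",  # methylation
    "C45576",      # mutation
    "C14339",      # knockout mice
})


def filter_annotations(annotations, excluded=EXCLUDED_CONCEPTS):
    """Drop proposed annotations whose concept identifier is blacklisted."""
    return [a for a in annotations if a["concept_id"] not in excluded]


proposed = [
    {"concept_id": "GO:0006915", "label": "apoptotic process"},
    {"concept_id": "GO:0016246", "label": "RNA interference"},
    {"concept_id": "GO:0006914", "label": "autophagy"},
]
kept = filter_annotations(proposed)
print([a["label"] for a in kept])
```

Filtering on the concept identifier rather than on the matched string means that every synonym of a blacklisted concept is suppressed at once, which may or may not be desirable for ambiguous synonyms.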

Improvements to the user interface

In addition to improvements to the document triage and concept extraction algorithms, the users identified several improvements to the user interface that would facilitate the workflow.

In the current neXtA5 user interface, annotations are displayed according to the position of the descriptor in the text. This was one of the initial specifications of the project, intended to improve readability and allow curators to know exactly where concepts were extracted from the text. However, while neXtA5 is able to suggest relevant descriptors, those descriptors are scattered among many irrelevant or trivial ones. After performing the usability study, we realized that ranking the evidence could provide a complementary view. In the current GUI, both views are available; the default remains the linear view, which seems more intuitive. Such complementary revisions are an expected outcome of usability studies.

It would therefore be much more efficient from an interaction point of view to display annotations based on their estimated relevance. We experimented with improvements to the ranking function for each axis, and the resulting impact on performance seems promising. This additional assessment was performed using the TREC_EVAL tool (38), and the results report the relevance of the annotations proposed by the system at top ranks (P0 for the precision at first rank and P5 for the precision over the first five descriptors returned).
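To make the two measures concrete, the following sketch computes precision at a given rank for one ranked list of descriptors. It is a minimal illustration rather than the TREC_EVAL implementation; the function name, the example terms and the convention of dividing by k even when fewer than k terms are returned are assumptions.

```python
def precision_at_k(ranked_terms, relevant_terms, k):
    """Fraction of the first k proposed terms that curators judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_terms[:k]
    hits = sum(1 for term in top_k if term in relevant_terms)
    return hits / k


# Hypothetical ranked output for one abstract and the curator-approved terms.
ranked = ["apoptotic process", "signaling", "autophagy", "growth", "cell death"]
relevant = {"apoptotic process", "autophagy"}

p_first = precision_at_k(ranked, relevant, 1)  # corresponds to P0 in the text
p_five = precision_at_k(ranked, relevant, 5)   # corresponds to P5 in the text
print(p_first, p_five)
```

Averaging these per-abstract values over all evaluated abstracts would give figures comparable in form to those reported in Table 8.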

For GO BP, we used a machine-learning approach to improve the ranking of the annotations displayed by neXtA5. We used GOCat, a large multiclass multilabel categorizer (39) that exploits more than 100 000 curated citations from the Gene Ontology Annotation (GOA) database (https://www.ebi.ac.uk/GOA/downloads) and aims at inferring GO annotations for any textual input (abstracts, sentences etc.) it receives. As GOCat learns from GOA, the GO concepts it proposes model a manual curation task. The GOCat system showed highly competitive results during the BioCreative 2014 competition, which explored a GO automatic annotation task (40). In neXtA5, GOCat output is used to promote GO descriptors identified in the input text. With GOCat, neXtA5 performance improves from 0.48 to 0.63 for P0 (+31%) and from 0.28 to 0.35 for P5 (+25%) (Table 8).
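One simple way to picture the promotion step is sketched below. It is not the neXtA5 or GOCat implementation: the function name, the score dictionary and the rule that unpredicted concepts keep their text order after the predicted ones are all assumptions made for illustration.

```python
def promote(extracted, categorizer_scores):
    """Order extracted concepts by categorizer score, highest first.

    Concepts absent from the categorizer output get a score of 0.0, so they
    keep their original text order and follow all positively scored ones.
    """
    indexed = list(enumerate(extracted))
    indexed.sort(key=lambda pair: (-categorizer_scores.get(pair[1], 0.0), pair[0]))
    return [concept for _, concept in indexed]


# GO identifiers in order of appearance in a hypothetical abstract.
extracted = ["GO:0023052", "GO:0006915", "GO:0040007", "GO:0006914"]
# Hypothetical categorizer confidence for the same abstract.
scores = {"GO:0006914": 0.9, "GO:0006915": 0.6}

reranked = promote(extracted, scores)
print(reranked)
```

Here the two concepts supported by the categorizer (autophagy and apoptotic process) move to the top, while the vague terms signaling and growth fall to the bottom of the list without being removed.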

Table 8

Results of learning to rank applied to annotations

| | Baseline P0 | Baseline P5 | Re-ranking P0 | Re-ranking P5 |
|---|---|---|---|---|
| BPs | 0.48 | 0.28 | 0.63 | 0.35 |
| Ds | 0.48 | 0.17 | 0.59 | 0.22 |

For Ds, we used a simple TF–IDF scoring function to estimate the importance of every single annotation. The basic assumption is that concepts that are important from the curator’s perspective tend to occur repeatedly in the corpus of texts (i.e. the meaningful entities detected by neXtA5 would be repeated across the abstracts). However, these high-frequency concepts may also be common English words; therefore, the raw frequency of occurrence must be balanced by the inverse document frequency, which is derived from the frequency of the concept in a large sample of MEDLINE. As presented in Table 8, this simple approach results in a significant precision gain of +23% at first rank.
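The scoring principle can be sketched as follows. This is a generic TF–IDF formulation with a smoothed logarithmic inverse document frequency, chosen for illustration; the exact weighting used in neXtA5 is not specified here, and the counts, document frequencies and sample size are hypothetical.

```python
import math


def tf_idf(term_count, doc_freq, n_docs):
    """Term frequency weighted by a smoothed inverse document frequency."""
    return term_count * math.log((1 + n_docs) / (1 + doc_freq))


def rank_concepts(concept_counts, doc_freqs, n_docs):
    """Return concepts sorted by decreasing TF-IDF score."""
    scored = {
        concept: tf_idf(count, doc_freqs.get(concept, 0), n_docs)
        for concept, count in concept_counts.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


# Occurrences in the abstracts retrieved for one protein (hypothetical).
counts = {"Peutz-Jeghers syndrome": 3, "condition": 4, "lymphoma": 2}
# Number of documents containing each concept in a background sample
# of 1,000,000 citations (hypothetical figures).
background = {"Peutz-Jeghers syndrome": 500, "condition": 400_000, "lymphoma": 20_000}

ranking = rank_concepts(counts, background, n_docs=1_000_000)
print(ranking)
```

Although "condition" is the most frequent term in the toy corpus, its high background frequency pushes it to the bottom of the ranking, which is the behavior the inverse document frequency is meant to provide.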

Perspective

Our results show that the neXtA5 system performs well enough to improve the manual curation workflow. The average consensus between curators covers 60% of the concepts for the BPs and 87% for the Ds. neXtA5 is intended to reach similar performance, and prior experiments already show that the exclusion of specific terms and the re-ranking of annotations strongly affect its precision without negatively affecting recall.

From abstracts to full-text articles

neXtA5 was developed and optimized using abstracts. The analysis of full-text articles is necessary for this tool to be usable in a production setting. Full-text papers pose many problems (3): most are only available in PDF format, which requires optical character recognition (OCR) preprocessing, and some are not available at all because of journal access policies. Certain sections, most notably the introduction and the discussion, are of lower interest for curation, because of the type of information they contain and their redundancy, respectively. For these reasons, the abstract is probably the most useful part of a research article for article prioritization. This is not the case for concept extraction; the neXtProt curators (like those of most other curated databases) extract data directly from experimental results, so the full text of the paper must be reviewed. One intermediate solution would be to allow curators to paste text into a form, which neXtA5 would then use for concept extraction. This would avoid the problem of automatically recognizing article sections, which is notoriously difficult (41), while making use of the system’s strength in recognizing concepts.

Perspectives for neXtA5

The results of this study convinced us that neXtA5 is a valuable addition to our curation pipeline, and we are in the process of implementing neXtA5 in the BioEditor curation tool. We are now considering customizing the curation-support platform to support use cases from other manually curated resources, such as the detection of positional information (post-translational modifications and variants). These use cases focus more heavily on triage, which is both the most mature component of the platform and the service most needed by professional curators. Further developments are ongoing to apply the system to a wider range of curated databases, including core resources of Elixir (https://www.elixir-europe.org/) such as DisProt (42), which will require developing text-mining services to recognize less-studied entities such as sequence positions. The annotation services will also be expanded to support the annotation of full-text contents. Indeed, while triage is performed mostly on abstracts, the authoring of curated annotations does require the use of full-text contents.

Finally, we are committed to developing neXtA5 according to state-of-the-art methodologies. Our work (29) and that of other groups (43) indicate that machine-learning-assisted triage methods could improve the document retrieval process, outperforming manual curators at least for specific tasks. As machine learning does better than other strategies only in cases where the available body of data is sufficiently large, this approach is currently limited to a few data types. We will continue to explore all appropriate algorithms for our use cases and adjust our algorithms as new developments occur that could justify changes in strategy.

Software availability

A demo version of neXtA5 is available at http://candy.hesge.ch/nextA5. The manual judgements on which this study is based are included in Table 3.

Acknowledgements

We thank the reviewers for their valuable comments.

Funding

Swiss National Fund (SNF #153437).

Conflict of interest. None declared.

Database URL: http://candy.hesge.ch/nextA5; https://nextprot.org

References

1. Lu, Z. (2011) PubMed and beyond: a survey of web tools for searching biomedical literature. Database (Oxford), 2011, baq036.

2. Baumgartner, W.A.J., Cohen, K.B., Fox, L.M. et al. (2007) Manual curation is not sufficient for annotation of genomic databases. Bioinformatics, 23, i41–i48.

3. Hirschman, L., Burns, G.A.P.C., Krallinger, M. et al. (2012) Text mining for the biocuration workflow. Database (Oxford), 2012, 1–10, bas020.

4. Müller, H.M., Van Auken, K.M., Li, Y. et al. (2018) Textpresso Central: a customizable platform for searching, text mining, viewing, and curating biomedical literature. BMC Bioinformatics, 19, 94.

5. NCBI Resource Coordinators (2016) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 46, D8–D13.

6. Liu, Y., Liang, Y. and Wishart, D. (2015) PolySearch2: a significantly improved text-mining system for discovering associations between human diseases, genes, drugs, metabolites, toxins and more. Nucleic Acids Res., 43, W535–W542.

7. Doms, A. and Schroeder, M. (2005) GoPubMed: exploring PubMed with the gene ontology. Nucleic Acids Res., 33, W783–W786.

8. Wei, C.H., Kao, H.Y. and Lu, Z. (2013) PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Res., 41, W518–W522.

9. Rak, R., Batista-Navarro, R.T., Rowley, A. et al. (2014) Text-mining-assisted biocuration workflows in Argo. Database (Oxford), 2014, 1–4.

10. Wang, Q., Abdul, S., Almeida, L. et al. (2016) Overview of the interactive task in BioCreative V. Database (Oxford), 2016, 1–18.

11. Batista-Navarro, R., Carter, J. and Ananiadou, S. (2016) Argo: enabling the development of bespoke workflows and services for disease annotation. Database (Oxford), 2016, 1–11.

12. Fu, X., Batista-Navarro, R., Rak, R. et al. (2015) Supporting the annotation of chronic obstructive pulmonary disease (COPD) phenotypes with text mining workflows. J. Biomed. Semantics, 6, 8.

13. Matos, S., Campos, D., Pinho, R. et al. (2016) Mining clinical attributes of genomic variants through assisted literature curation in Egas. Database (Oxford), 2016, 1–9.

14. Pafilis, E., Buttigieg, P.L., Ferrell, B. et al. (2016) EXTRACT: interactive extraction of environment metadata and term suggestion for metagenomic sample annotation. Database (Oxford), 2016, 1–7.

15. Dai, H.J., Su, C.H., Lai, P.T. et al. (2016) MET network in PubMed: a text-mined network visualization and curation system. Database (Oxford), 2016, 1–10.

16. Gama-Castro, S., Rinaldi, F., López-Fuentes, A. et al. (2014) Database (Oxford), 2014, 1–13.

17. Rinaldi, F., Lithgow, O., Gama-Castro, S. et al. (2017) Strategies towards digital and semi-automated curation in RegulonDB. Database (Oxford), 2017, 1–11.

18. Ruch, P. (2017) Text mining to support gene ontology curation and vice versa. Methods Mol. Biol., 1446, 69–84.

19. Gaudet, P., Michel, P.-A., Zahn-Zabal, M. et al. (2017) The neXtProt knowledgebase on human proteins: 2017 update. Nucleic Acids Res., 45, D177–D182.

20. The UniProt Consortium (2017) UniProt: the universal protein knowledgebase. Nucleic Acids Res., 45, D158–D169.

21. Hinard, V., Britan, A., Schaeffer, M. et al. (2017) Annotation of functional impact of voltage-gated sodium channel mutations. Hum. Mutat., 38, 485–493.

22. Mottin, L., Gobeill, J., Pasche, E. et al. (2016) neXtA5: accelerating annotation of articles via automated approaches in neXtProt. Database (Oxford), 2016, 1–9.

23. Mottin, L., Pasche, E., Gobeill, J. et al. (2017) Triage by ranking to support the curation of protein interactions. Database (Oxford), 2017, 1–11.

24. The Gene Ontology Consortium (2017) Expansion of the gene ontology knowledgebase and resources. Nucleic Acids Res., 45, D331–D338.

25. Ashburner, M., Ball, C.A., Blake, J.A. et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics, 25, 25–29.

26. Gobeill, J., Gaudinat, A., Pasche, E. et al. (2015) Deep question answering for protein annotation. Database (Oxford), 2015, 1–9.

27. Europe PMC Consortium (2015) Europe PMC: a full-text literature database for the life sciences and platform for innovation. Nucleic Acids Res., 43, D1042–D1048.

28. Venkatesan, A., Kim, J.H., Talo, F. et al. (2016) SciLite: a platform for displaying text-mined annotations as a means to link research articles with biological data. Wellcome Open Res., 1, 25.

29. Teodoro, D., Mottin, L., Gobeill, J. et al. (2017) Improving average ranking precision in user searches for biomedical research datasets. Database (Oxford), 2017, 1–18.

30. Salton, G., Wong, A. and Yang, C.S. (1975) A vector space model for automatic indexing. Commun. ACM, 18, 613–620.

31. Gobeill, J., Gaudinat, A., Pasche, E. et al. (2014) Full-texts representations with medical subject headings, and co-citations network reranking strategies for TREC 2014 Clinical Decision Support Track. University of Applied Sciences Geneva, Switzerland. http://www.dtic.mil/docs/citations/ADA618744 (31 May 2018, date last accessed).

32. Wang, M., Okamoto, M., Domenico, J. et al. (2012) Inhibition of Pim1 kinase prevents peanut allergy by enhancing Runx3 expression and suppressing T(H)2 and T(H)17 T-cell differentiation. J. Allergy Clin. Immunol., 130, 932–944.e12.

33. Yadav, V. and Denning, M.F. (2011) Fyn is induced by Ras/PI3K/Akt signaling and is required for enhanced invasion/migration. Mol. Carcinog., 50, 346–352.

34. Qin, J., Jiang, Z., Qian, Y. et al. (2004) IRAK4 kinase activity is redundant for interleukin-1 (IL-1) receptor-associated kinase phosphorylation and IL-1 responsiveness. J. Biol. Chem., 279, 26748–26753.

35. Cushing, L., Stochaj, W., Siegel, M. et al. (2014) Interleukin 1/toll-like receptor-induced autophosphorylation activates interleukin 1 receptor-associated kinase 4 and controls cytokine induction in a cell type-specific manner. J. Biol. Chem., 289, 10865–10875.

36. Remuzgo-Martínez, S., Genre, F., Castañeda, S. et al. (2017) Protein tyrosine phosphatase non-receptor 22 and C-Src tyrosine kinase genes are down-regulated in patients with rheumatoid arthritis. Sci. Rep., 7, 10525.

37. An, L., Song, L., Zhang, W. et al. (2014) The aspartic acid of Fyn at 390 is critical for neuronal migration during corticogenesis. Exp. Cell Res., 328, 419–428.

38. Zhou, W., Smalheiser, N.R. and Yu, C. (2006) A tutorial on information retrieval: basic terms and concepts. J. Biomed. Discov. Collab., 1, 2.

39. Gobeill, J., Pasche, E., Vishnyakova, D. et al. (2013) Managing the data deluge: data-driven GO category assignment improves while complexity of functional annotation increases. Database (Oxford), 2013, bat041.

40. Mao, Y., Van Auken, K., Li, D. et al. (2014) Overview of the gene ontology task at BioCreative IV. Database (Oxford), 2014, 1–14. doi: https://academic.oup.com/database/article/doi/10.1093/database/bau086/2634979.

41. Liakata, M., Saha, S., Dobnik, S. et al. (2012) Automatic recognition of conceptualization zones in scientific articles and two life science applications. Bioinformatics, 28, 991–1000.

42. Piovesan, D., Tabaro, F., Mičetić, I. et al. (2017) DisProt 7.0: a major update of the database of disordered proteins. Nucleic Acids Res., 45, D1123–D1124.

43. Lee, K., Famiglietti, M.L., McMahon, A. et al. (2018) Scaling up data curation using deep learning: an application to literature triage in genomic variation resources. PLoS Comput. Biol., 13, 1–14, e1006390.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Supplementary data