PEDL+: protein-centered relation extraction from PubMed at your fingertip

Abstract
Summary: Relation extraction (RE) from large text collections is an important tool for database curation, pathway reconstruction, or functional omics data analysis. In practice, RE often is part of a complex data analysis pipeline requiring specific adaptations like restricting the types of relations or the set of proteins to be considered. However, current systems are either non-programmable web sites or research code with fixed functionality. We present PEDL+, a user-friendly tool for extracting protein–protein and protein–chemical associations from PubMed articles. PEDL+ combines state-of-the-art NLP technology with adaptable ranking and filtering options and can easily be integrated into analysis pipelines. We evaluated PEDL+ in two pathway curation projects and found that 59% to 80% of its extractions were helpful.
Availability and implementation: PEDL+ is freely available at https://github.com/leonweber/pedl.


Details on the RE Models
The PPA model that we use for PEDL+ differs in a few aspects from the model described in Weber et al. (2020). First, we updated the training data by adding more distantly supervised data and by removing one directly supervised dataset to improve the consistency of the annotations. Specifically, in addition to PID, the training data now includes PPAs derived from the PathwayCommons representations of Panther (Mi and Thomas, 2009) and NetPath (Kandasamy et al., 2010). From all databases used, we include the two additional PPA types interacts-with, which represents physical protein-protein interactions observed in high-throughput experiments, and catalysis-precedes, which is annotated when the head protein controls a reaction with a product that is used as substrate in another reaction controlled by the tail protein. For the directly supervised data, we removed the BioNLP Epigenetics dataset because its annotation guidelines differ substantially from those of the other datasets, which led to many false negative annotations when all datasets were combined. Additionally, we made the MyGeneInfo-based normalization more lenient by introducing support for non-SwissProt proteins and for protein mentions that are mapped to more than one UniProt ID, which allows mapping a larger fraction of gene mentions to UniProt IDs. As we only include PPAs for proteins that can be resolved to UniProt, we obtain more PPAs per dataset and thus end up with more directly supervised PPAs than in the dataset described in Weber et al. (2020). See Table SM 2 for statistics on the updated dataset.
We additionally introduced some minor modifications to the model architecture described in Weber et al. (2020). First, we use the concatenation of the final-layer embeddings of the entity start markers <e1> and <e2> instead of [CLS] to represent a text span for subsequent classification, because Baldini Soares et al. (2019) suggest that this can lead to more accurate extractions. Second, we use LogSumExp instead of the maximum to aggregate the scores from the score matrix into the evidence prediction, which benefits from better gradient flow.
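The two architecture changes can be illustrated with a minimal sketch in plain Python. It is not the PEDL+ implementation: the function names, the toy values, and the assumed layout of the score matrix (one row per text span, one column per PPA type) are illustrative assumptions. The sketch shows why LogSumExp pooling passes gradient to every span: its gradient with respect to the span scores is the softmax of those scores, whereas maximum pooling only passes gradient to the single highest-scoring span.

```python
import math


def pair_representation(token_embeddings, e1_start, e2_start):
    """Concatenate the final-layer embeddings at the <e1> and <e2> marker positions."""
    return list(token_embeddings[e1_start]) + list(token_embeddings[e2_start])


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list of floats."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def softmax(values):
    """Gradient of logsumexp with respect to each input value."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(score_matrix, pooling="logsumexp"):
    """Pool span-level scores into one score per relation type.

    Assumed layout: score_matrix[i][j] is the score of span i for type j.
    """
    columns = list(zip(*score_matrix))
    if pooling == "max":
        return [max(col) for col in columns]
    return [logsumexp(col) for col in columns]


# Toy example: 3 text spans, 2 relation types (all values are made up).
scores = [[2.0, -1.0], [1.5, 0.0], [-3.0, 0.5]]
max_pooled = aggregate(scores, pooling="max")
lse_pooled = aggregate(scores, pooling="logsumexp")
# Share of the gradient each span receives for the first relation type.
span_weights = softmax([row[0] for row in scores])
```

With maximum pooling, only the first span would receive gradient for the first relation type. With LogSumExp, all three spans receive a non-zero share given by `span_weights`.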
For CPAs, we retrain a model on the DrugProt shared task data (Miranda et al., 2021).

Evaluation Projects
We provide the gene sets and results for both projects as supplementary files.
In the first project, two curators sought to develop models based on ordinary differential equations and Boolean logic that describe the role of cellular senescence in B-cell lymphoma. For this, they used PEDL+ to connect a recently proposed transcriptomic signature for cellular senescence in diffuse large B-cell lymphoma patients (Reimann et al., 2021) to in-house models of B-cell development based on the models of Roy et al. (2019) and Thobe et al. (2021). Here, we used the MeSH filter Lymphoma, B-Cell.
In the second project, a third curator developed a Boolean model for the intrinsic pathway of apoptotic regulation with a specific focus on the role of the BCL-2 family. They used PEDL+ to extract PPAs in two ways: (1) among 15 different members of the BCL-2 family and (2) between these BCL-2 family members and a list of putative upstream regulators of apoptosis based on the models of Roy et al. (2019) and Thobe et al. (2021). In this project, we provided the annotator with results without any MeSH filter and, additionally, results for (1) filtered by the MeSH term Lymphoma, B-Cell.
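The two query modes of the second project, pairs within one protein set and pairs between two sets, can be sketched as follows. This is an illustrative enumeration only, not the PEDL+ interface: the helper names and the placeholder identifiers are hypothetical, and the sketch assumes that both directions of a pair are queried because PPAs have a head and a tail.

```python
from itertools import chain, permutations, product


def pairs_within(proteins):
    """All ordered (head, tail) pairs among the members of one set."""
    return list(permutations(proteins, 2))


def pairs_between(group_a, group_b):
    """All ordered (head, tail) pairs across two sets, in both directions."""
    forward = product(group_a, group_b)
    backward = product(group_b, group_a)
    return [(head, tail) for head, tail in chain(forward, backward) if head != tail]


# Placeholder identifiers; the actual gene sets are in the supplementary files.
family = [f"BCL2_family_member_{i}" for i in range(1, 16)]
regulators = ["regulator_A", "regulator_B", "regulator_C"]

within = pairs_within(family)                 # query mode (1)
between = pairs_between(family, regulators)   # query mode (2)
```

Under this assumption, 15 family members yield 15 × 14 = 210 ordered pairs for mode (1), and each additional regulator adds 30 pairs for mode (2).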

Error Analysis
The results of the error analysis can be found in Figure SM 1. The most frequently cited reasons for incorrect PPA extractions were that PEDL+ erroneously extracted a PPA from a sentence that does not state it and that it assigned the wrong type to a PPA. For the unhelpful PPAs, the results are inconclusive because annotators provided only 16 annotations in total and the counts per category are close together. Cited reasons were that articles discussed the PPA in the context of a disease that is irrelevant to the curation context, that the extracted PPA is known to be an indirect interaction, that the article did not provide sufficient biochemical evidence for the PPA, and that the PPA is only true in specific contexts, e.g. when a protein is mutated or a drug is administered.
Figure SM 1: Results of the error analysis for incorrect (top) and unhelpful (bottom) PPAs. Incorrect: 'No PPA' means that the article does not confirm the PPA, 'Type' that the article states a PPA for the correct protein pair, but the wrong type of PPA was extracted. 'Normalization' refers to cases in which the PPA was correctly extracted but PubTator Central assigned at least one wrong gene ID for the pair. 'Negation' describes cases in which the two proteins and the PPA type are correct, but the existence of the PPA is explicitly negated in the article. 'Direction' means that the head and tail of the PPA should be swapped. Unhelpful: 'Indirect' means that the interaction is indirect but the curator was interested only in direct interactions, 'insufficient evidence' means that there was not enough biochemical evidence to support the plausibility of the PPA, 'wrong disease' refers to cases in which the PPA was specific to a disease that is irrelevant to the curation context, and 'context missing' to cases where the PPA is only valid in certain contexts, such as when the protein is mutated or a drug is administered.

Table SM 1: Comparison of different text mining tools for PPA extraction. Speed is measured in seconds per protein pair, estimated on a sample of 100 random related protein pairs. 'Filter by MeSH' indicates whether the tool allows filtering results based on the MeSH terms of the articles in which the PPA was found. 'PMID lookup' refers to the ability to find the PubMed articles in which two proteins occur together. 'Evidence span' denotes the tools that provide the text snippet supporting the PPA for quick verification by a user.
Table SM 2: 'Pairs' is the total number of protein pairs with at least one PPA (pos.) and without any PPA (neg.). 'Spans' states the average number of text spans per protein pair for pairs with at least one PPA (pos.) and without any PPA (neg.).
PPA types (with table abbreviations where given): controls-expression-of, controls-phosphorylation-of ('phosph.'), controls-state-change-of ('state'), controls-transport-of ('transport'), interacts-with ('interacts'), catalysis-precedes ('catalysis').
The CPA model is based on the single-model baseline without entity descriptions in Weber et al. (2022). Unfortunately, the licensing of RoBERTa-large-PM-M3-Voc (the strongest base model in our evaluation in Weber et al. (2022)) prohibits commercial use. Thus, we replace it with LinkBERT-base (Yasunaga et al., 2022) because of its strong performance in BioNLP tasks. We fine-tune it for 3 epochs on the training portion of DrugProt and obtain an F1 score of 78.7% on the development set, which is comparable to the best single-model configuration reported in Weber et al. (2022).
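For readers unfamiliar with how an F1 score on relation extraction output is computed, the following is a minimal sketch. It assumes micro-averaged F1 over sets of (document, chemical, protein, relation type) tuples. The text does not state the averaging scheme, so this is an assumption, and the function name and all example tuples are made up for illustration.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over relation tuples (assumed scoring scheme)."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up tuples: (document id, chemical id, protein id, relation type).
gold = {
    ("doc1", "chem1", "prot1", "INHIBITOR"),
    ("doc1", "chem2", "prot1", "SUBSTRATE"),
    ("doc2", "chem3", "prot2", "AGONIST"),
}
predicted = {
    ("doc1", "chem1", "prot1", "INHIBITOR"),
    ("doc1", "chem2", "prot1", "SUBSTRATE"),
    ("doc2", "chem3", "prot2", "INHIBITOR"),  # right pair, wrong type
}
score = micro_f1(gold, predicted)
```

In this toy example, two of three predictions are correct and two of three gold relations are recovered, so precision, recall, and F1 all equal 2/3.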