GePI: large-scale text mining, customized retrieval and flexible filtering of gene/protein interactions

ABSTRACT
We present GePI, a novel Web server for large-scale text mining of molecular interactions from the scientific biomedical literature. GePI leverages natural language processing techniques to identify genes and related entities, interactions between those entities and biomolecular events involving them. GePI supports rapid retrieval of interactions based on powerful search options to contextualize queries targeting (lists of) genes of interest. Contextualization is enabled by full-text filters constraining the search for interactions to either sentences or paragraphs, with or without pre-defined gene lists. Our knowledge graph is updated several times a week, ensuring that the most recent information is available at all times. The result page provides an overview of the outcome of a search, with accompanying interaction statistics and visualizations. A table (downloadable in Excel format) gives direct access to the retrieved interaction pairs, together with information about the molecular entities, the factual certainty of the interactions (as expressed verbatim by the authors), and a text snippet from the original document that verbalizes each interaction. In summary, our Web application offers free, easy-to-use, and up-to-date monitoring of gene and protein interaction information, along with flexible query formulation and filtering options. GePI is available at https://gepi.coling.uni-jena.de/.


INTRODUCTION
The molecular interactions between genes or gene products (e.g. gene binding or positive regulatory events) are the subject of a large body of scientific research that is rapidly growing and typically communicated in scientific papers. Manually curated interaction databases such as HPRD (1), INTACT (2) and BIOGRID (3) contain millions of interactions extracted from thousands of publications. Such databases offer structured, high-quality, and detailed interaction data enabling continuous scientific hypothesis testing and discovery. However, only a small fraction of the complete set of documents in PubMed (PM) (with over 35M citations) and PubMed Central (PMC) (with more than 5.1M full texts in its Open Access (OA) subset, as of March 2023) is covered by such databases through manual curation efforts. Therefore, automatic means of harvesting such information from the biomedical literature on a larger scale have become a focus of research in recent years.
STRING (4-6), because of its long-term development history dating back to the beginning of this millennium (7), is one of the most popular tools in the biomolecular community for gene and protein interaction retrieval and features text mining as one of its data input channels. Researchers mainly from the field of natural language processing (NLP) approached the automatic analysis of biomedical literature in a series of challenge competitions, where tools for the extraction of molecular events were rigorously evaluated against shared benchmark datasets (for a survey, see (8)). Some of the front-runner systems were subsequently integrated into Web servers, such as BIOCONTEXT (9) or EVEXDB (10).
We here introduce an alternative NLP-based system, Gene and Protein Interactions (GEPI), for the quick and versatile retrieval of up-to-date molecular interactions from the biomedical literature contained in PM and the OA subset of PMC. GEPI comes as a Web application with a graphical user interface that allows easy access also to non-programming end users. It can be queried with one or two (unlimited) lists of gene/protein IDs and related entity classes. Boolean full-text queries can be specified to filter for the sentence or paragraph context in which interactions occur. Additional filter options for restricting search to document titles and section headings provide further options to constrain the retrieval of gene/protein associations. These full-text filters can also be used without the necessity to provide input genes at all. An assessment of the degree of factuality (11, 12) for each interaction or event is derived from verbatim signals in the documents where authors express their (un)certainty for an observed or stipulated interaction (using hedging expressions such as 'maybe' or '[our data] suggest that'). Such factuality tags may be used, e.g. to filter out negated or low-certainty interaction statements.
As a response to the often raised request for recency of information, GEPI includes automated procedures for updating the indexed literature from PubMed and PMC multiple times a week. The obtained results are shown in a dashboard which includes base statistics and visualizations, as well as a comprehensive downloadable table summarizing all identified interactions and their associated literature passages.

BACKGROUND
The retrieval of bio-molecular interactions from scientific documents requires gene recognition (GR), where spans of text that correspond to names or identifiers for gene entities are identified, e.g., Arp5 as a gene in the text passage '[...] the protein level of Arp5 was markedly reduced [...]'. To uniquely identify gene mentions, database IDs are assigned to them in the gene normalization (GN) step. A number of tools handle both tasks, e.g., GNAT (13), GENO (14), GNORMPLUS (15), the system proposed in (16) and others. After gene and protein mentions have been recognized, semantic relations between pairs of genes/proteins must be identified (the correlate of factual assertions in documents). Again, NLP community challenges were the drivers for a number of powerful relation extractors, such as TEES (17), JREX (18), BIOSEM (19), VERSE (20) or DEEPEVENTMINE (21).
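The two steps can be illustrated with a deliberately naive sketch. It uses a plain dictionary lookup, which is an assumption made for illustration only; tools such as GNORMPLUS rely on considerably more sophisticated models. The lexicon and the IDs in it (e.g. "GENE:0001") are hypothetical placeholders, not real database identifiers.

```python
import re

# Hypothetical lexicon: lower-cased gene name -> placeholder ID.
# These IDs are invented for illustration and are NOT real database IDs.
LEXICON = {"arp5": "GENE:0001", "akt1": "GENE:0002"}

def recognize_and_normalize(text, lexicon=LEXICON):
    """Return (start, end, mention, gene_id) for each lexicon name in text.

    Gene recognition (GR) is the span detection; gene normalization (GN)
    is the assignment of an ID to the detected span.
    """
    hits = []
    for match in re.finditer(r"[A-Za-z][A-Za-z0-9-]*", text):
        gene_id = lexicon.get(match.group().lower())
        if gene_id is not None:
            hits.append((match.start(), match.end(), match.group(), gene_id))
    return hits

sentence = "the protein level of Arp5 was markedly reduced"
print(recognize_and_normalize(sentence))
```

The character offsets returned here play the role of the "spans of text" mentioned above; a downstream relation extractor would operate on such spans.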
For the choice of GEPI components, we evaluated existing NLP solutions with focus on the following criteria: open source availability, performance, integratability into our NLP infrastructure, execution speed, and currency of the employed databases for GN. Considering the optimal combination of these requirements, for gene recognition and normalization, we equipped GEPI with GNORMPLUS, a freely available tool that can be applied to arbitrary documents, both abstracts and full texts. It was already applied to and evaluated on the whole of PubMed and PMC in PUBTATOR CENTRAL (22), demonstrating its large-scale applicability. The tool has been continuously updated since its first release, thus ensuring its currency. Using the same assessment criteria from above, for event extraction, GEPI runs BIOSEM. It showed competitive overall performance in major Shared Tasks, with very high precision values, resulting in a high fraction of relevant events. Given the requirement of correctly identifying both gene occurrences and their interactions, this ensures that GEPI generates meaningful results, thus minimizing the number of false positives.
Figure 1 depicts the components of the GEPI application ecosystem. The input to the pipeline consists of XML documents from PubMed and PMC. Mentions of genes, gene products, families, protein complexes and molecular events between those entities are extracted by an NLP pipeline (depicted in Supplementary Figure S1) and indexed into an ElasticSearch (ES) cluster. A Neo4j graph database stores structured gene information that includes relationships expressing orthology, family or group membership, protein complexes and their subunits, and gene ontology annotations. This information (taken from FamPlex, GO, NCBI Gene, etc.) is incorporated in the NLP pipeline for the resolution of ambiguous gene groups, families and protein complexes.
The events and interactions in ES are connected to the entities in Neo4j via unique IDs. Together they form a knowledge graph that feeds the GEPI Web application. To ensure high result reliability, GEPI restricts molecular interactions to those occurring within a single sentence. Finally, the Web application leverages the gene database in Neo4j (holding stable terminological background knowledge) and the ES index (holding the continuously harvested results of relation extraction) to serve user requests. More detailed descriptions of the NLP pipeline and our data model can be found in Supplementary Material S1. An evaluation report on the gene interaction extraction components is provided in Supplementary Material S2. Evaluation results are shown in Supplementary Tables S1 and S2.
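The idea of linking event records to gene records through shared IDs can be sketched with in-memory data. This is a minimal illustration under stated assumptions: the record layouts, field names (`arg_ids`, `family`) and all values are hypothetical and do not reflect the actual ES index or Neo4j schema, which are described in Supplementary Material S1.

```python
# Hypothetical event records, standing in for documents in the search index.
EVENTS = [
    {"event_id": "e1", "type": "binding", "arg_ids": ["g1", "g2"]},
    {"event_id": "e2", "type": "regulation", "arg_ids": ["g2", "g3"]},
    {"event_id": "e3", "type": "binding", "arg_ids": ["g1", "g9"]},  # g9 unknown
]

# Hypothetical gene records, standing in for nodes in the graph database.
GENES = {
    "g1": {"symbol": "A", "family": None},
    "g2": {"symbol": "B", "family": "FAM1"},
    "g3": {"symbol": "C", "family": "FAM1"},
}

def join_events_with_genes(events, genes):
    """Attach gene records to each event by looking up the shared IDs."""
    joined = []
    for event in events:
        args = [genes[gid] for gid in event["arg_ids"] if gid in genes]
        # Keep only events whose arguments could all be resolved.
        if len(args) == len(event["arg_ids"]):
            joined.append({**event, "args": args})
    return joined

def events_for_family(events, genes, family):
    """Return IDs of events with at least one argument in the given family."""
    return [
        event["event_id"]
        for event in join_events_with_genes(events, genes)
        if any(gene["family"] == family for gene in event["args"])
    ]

print(events_for_family(EVENTS, GENES, "FAM1"))
```

Keeping the stable gene knowledge separate from the frequently updated event records means that only the latter need to be refreshed when new literature arrives.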

Gene/protein search: finding molecular information
Input to the GEPI interface is provided by a query form; the results of query processing are based on all interactions stored in the knowledge graph (its current size is depicted in Table 1).
The query form provides two input panels for lists of genes and related entities we refer to as A-list and B-list, respectively. Further input fields allow the specification of a diverse set of filters described below. Input to the A- and B-lists may consist of NCBI GENE (23) IDs and symbols, UNIPROT (24) IDs, FAMPLEX (25) protein family and complex identifiers or names, HGNC (26) gene group names, or Gene Ontology (GO) (27) terms. Family and group query items will include their members in the result, whereas GO term queries will include genes that have been annotated with the respective terms. Specifying only A items will result in an open search. This mode retrieves gene and protein interactions between the entries of A and any other interaction partners from the interaction database. If a second set of gene/protein identifiers is entered for the B-list, a closed search is performed. This mode yields only interaction items that have an element of A as one and an element of B as their other argument.
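The difference between the two search modes can be sketched over a toy list of interaction pairs. The function name `search` and the symbols G1 to G5 are hypothetical; the sketch only mirrors the open/closed semantics described above and says nothing about how GEPI evaluates queries internally.

```python
def search(interactions, a_list, b_list=None):
    """Select interaction pairs by open or closed search semantics.

    Open search (no B-list): one argument must be in A, the other is free.
    Closed search (B-list given): one argument in A and the other in B.
    """
    a = set(a_list)
    b = set(b_list) if b_list else None
    results = []
    for x, y in interactions:
        if b is None:
            keep = x in a or y in a
        else:
            keep = (x in a and y in b) or (x in b and y in a)
        if keep:
            results.append((x, y))
    return results

# Hypothetical interaction pairs for illustration.
INTERACTIONS = [("G1", "G2"), ("G1", "G3"), ("G4", "G5"), ("G3", "G1")]

print(search(INTERACTIONS, ["G1"]))          # open search
print(search(INTERACTIONS, ["G2"], ["G1"]))  # closed search
```

Note that the closed search is symmetric in this sketch: an element of A may appear as either argument, as long as the other argument comes from B.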
Adding genes to A- and B-lists in a search query is optional. If left out, the result of such a 'full-text-only' search comprises all molecular events in the interaction database that match the respective query filters. This query type can be used to identify published molecular interactions associated with specific filter terms, such as a condition (e.g., 'elderly') or a disease (e.g., 'obesity') without restrictions on the associated genes.
The query form offers two input fields to specify full-text queries either on the sentence or paragraph level. They act as context-sensitive query filters for interactions. Sentence-level context can be used to specify filter terms occurring in addition to an interaction in a sentence. The paragraph level widens the textual window around the interaction statement and allows the retrieval of interactions based on relevant context keywords beyond sentence boundaries. Both filter queries can be connected by Boolean AND or OR operators. Another full-text filter is sensitive to document titles and section headings and can be applied to restrict the gene/protein interaction search to, e.g., 'Results' sections only, since 'Introduction' and 'Background' sections commonly refer to established (and thus less interesting) prior knowledge. Further filter options may restrict the search results to specific organisms, interaction types or factuality levels.
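How a sentence-level and a paragraph-level filter combine under AND or OR can be sketched as follows. This is a simplified illustration: it assumes each filter is a single term matched by case-insensitive substring search, whereas the actual filters accept Boolean full-text queries. The record fields and example texts are hypothetical.

```python
def contains(text, term):
    """Case-insensitive substring match, a stand-in for a full-text query."""
    return term.lower() in text.lower()

def filter_by_context(records, sentence_term=None, paragraph_term=None,
                      operator="AND"):
    """Keep records whose sentence/paragraph context matches the filters."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    kept = []
    for record in records:
        checks = []
        if sentence_term is not None:
            checks.append(contains(record["sentence"], sentence_term))
        if paragraph_term is not None:
            checks.append(contains(record["paragraph"], paragraph_term))
        if not checks:  # no filter given: keep everything
            kept.append(record)
        elif operator == "AND" and all(checks):
            kept.append(record)
        elif operator == "OR" and any(checks):
            kept.append(record)
    return kept

# Hypothetical interaction records with their textual context.
RECORDS = [
    {"pair": ("G1", "G2"), "sentence": "G1 binds G2 in obesity.",
     "paragraph": "Elderly patients were studied. G1 binds G2 in obesity."},
    {"pair": ("G3", "G4"), "sentence": "G3 activates G4.",
     "paragraph": "Elderly mice were used. G3 activates G4."},
    {"pair": ("G5", "G6"), "sentence": "G5 inhibits G6.",
     "paragraph": "G5 inhibits G6."},
]

both = filter_by_context(RECORDS, "obesity", "elderly", operator="AND")
either = filter_by_context(RECORDS, "obesity", "elderly", operator="OR")
print([r["pair"] for r in both], [r["pair"] for r in either])
```

Because the paragraph contains the sentence, a paragraph-level term can match context that a sentence-level term would miss, which is the widening effect described above.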
After the retrieval process, GEPI results are presented in a dashboard. This includes aggregated overview panels and a table of complete primary data where each retrieved interaction is listed individually within its document context. The aggregation panels summarize the publication state with respect to the input and include frequency-sorted pie and bar charts of interaction partners and Sankey diagrams revealing interaction frequencies. We distinguish two types of Sankey diagrams. The first shows the most frequent interactions in the retrieval result. The second summarizes second-degree interactions, offering information on interaction partners occurring in more complex interactions than provided in the output table. Finally, the table panel provides a list of direct associations between genes or proteins associated with the current query. It discloses detailed information about the gene mentions in the text, the NCBI GENE IDs and symbols they were mapped to, and the sentence in which each interaction occurs. Where applicable, the table offers links to the source databases of the molecular entity, e.g. NCBI GENE for genes or HGNC for gene groups, to quickly obtain full gene names, gene descriptions and further resources.
The primary data table can be downloaded as an Excel workbook, including query details and the resulting interaction data. It also lists the occurrence counts of the gene or protein symbols that take part in the interactions and how often each symbol was found in interaction with another symbol. Thus, the Excel workbook documents the query, the query result, and additional statistics enabling efficient downstream analysis. It can also be used for the manual curation of results, e.g. further filtering or the removal of erroneous result items.
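The kind of statistics described here can be approximated from a list of interaction pairs with the standard library. This is a minimal sketch, assuming that interactions are treated as unordered pairs; the symbols are hypothetical and the layout of the actual workbook is not reproduced.

```python
from collections import Counter

def interaction_statistics(pairs):
    """Count how often each symbol and each unordered symbol pair occurs."""
    symbol_counts = Counter()
    pair_counts = Counter()
    for a, b in pairs:
        symbol_counts.update([a, b])
        # Sort so that (a, b) and (b, a) are counted as the same pair.
        pair_counts[tuple(sorted((a, b)))] += 1
    return symbol_counts, pair_counts

# Hypothetical result rows for illustration.
PAIRS = [("G1", "G2"), ("G2", "G1"), ("G1", "G3")]

symbol_counts, pair_counts = interaction_statistics(PAIRS)
print(symbol_counts.most_common())
print(pair_counts.most_common())
```

Frequency-sorted counts of this kind are a natural starting point for the manual curation step, since frequently reported pairs can be reviewed first.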

USE CASES INVOLVING GEPI
We prepared two use cases that feature GEPI's search and filtering facilities. The first one presents a search for interaction information about specific marker genes in the context of the devastating invasive pulmonary aspergillosis disease caused by the fungal opportunistic pathogen Aspergillus fumigatus (see Supplementary Material S3) (28). The second one describes the application of GEPI to investigate the Ferritin H chain in the context of disease tolerance in sepsis (see Supplementary Material S4) (29). A detailed walk-through of how to use data from these studies with GePI is provided in Supplementary Materials S5 and S6. The description of the second use case includes a verification step of GEPI results leveraging STRING (see Supplementary Data S7 for the raw protein pair confidence values given by STRING). We also specify the inputs to and outputs from GEPI in both use cases for reference in Supplementary Data S8.

DISCUSSION
Automated molecular event extraction from scientific literature is an area of active research to by-pass large coverage gaps in manually curated life science databases. STRING, for instance, brings together a multitude of biological infrastructure resources, including curated interaction and genome sequence databases. One of STRING's seven information channels is concerned with textual data accessed from PM and the PMC OA subset. STRING's text mining facilities collect information for the interaction of protein pairs in two fundamental ways. Firstly, statistical methods are leveraged to find over-represented proteins in documents with respect to the user input. This is used to calculate an aggregated measure of confidence that a unique pair of proteins is associated in general, instead of identifying explicit association descriptions of a protein pair in individual documents (5). This can be witnessed in STRING's text mining viewer where at least two input proteins are highlighted in each citation but are not necessarily mentioned in any interaction. Secondly, a modern NLP approach is used to find protein pairs explicitly described to physically interact, i.e., to build protein complexes (5). For such cases, the text mining viewer shows explicit verbalizations of physical interactions in PM abstracts.
W240 Nucleic Acids Research, 2023, Vol. 51, Web Server issue
Unlike STRING, GEPI is an exclusively NLP-driven tool designed towards the identification of interacting gene/protein pairs or events involving single genes/proteins from biomedical documents. As a safe-guard mechanism, GEPI supplies the source text passage with each result item. By design, GEPI cannot render interaction data potentially (only) stored in structured tables, e.g. Supplementary data, external to the publication text. GEPI covers a broader range of interaction types, including binding (corresponding to STRING's physical interaction), regulation and activation, thus substantially widening the scope of high-quality interaction descriptions extracted from the literature. GEPI's additional capabilities of searching for a closed pair of gene lists and filtering by document context make GePI a complementary tool to STRING, which (unlike GEPI) processes several types of structured resources (databases and terminologies). While STRING calculates confidence scores aggregated from a diverse set of resources for unique protein pairs, GEPI's focus lies on the high-quality extraction of interactions explicitly described in publications together with factuality ratings extracted from the textual context of interaction descriptions in the form of hedging expressions.
In this regard, GEPI is much closer in spirit to BIOCONTEXT and EVEXDB. However, BIOCONTEXT and EVEXDB allow only single gene queries to identify their interaction partners. Even worse, the BIOCONTEXT Web server has gone offline in the meantime. EVEXDB allows for queries of single genes or a pair of genes. It provides interaction information including statistics and interaction members, as well as text snippets of the identified interactions grouped by interaction type (regulation, binding, etc.). To the best of our knowledge, EVEXDB has not been updated since 2013 and thus features mostly outdated knowledge. Neither tool offers further context filtering, nor non-programmatic means to search for more than two genes at once.
GEPI's value for the scientific community is also evidenced by the usage of prior versions of its NLP pipeline for a variety of experimental studies. For these studies, we developed NLP modules (30) that allow us to extract molecular event information from the whole of PM and PMC (OA subset) at any given point of time. Previous instances of our NLP pipeline were already successfully applied to identify potential interaction partners of proteins found in phosphoproteomics experiments including the 5' adenosine monophosphate-activated protein kinase (AMPK) complex (31). By using full-text filter capabilities, we identified interactions between members of the Akt family and pyruvate dehydrogenase kinase 2 (PDK2) in the context of cellular stress (32). We also used our NLP engine to gain literature-based insight into the current state of potential interaction partners on kinases of the PI3K/Akt signaling network based on quantitative phosphoproteomics as input for our NLP service (33). We furthermore combined results from our event extraction system with large-scale gene expression analysis and multiple validation experiments to generate a molecular signature of the activation of aryl hydrocarbon receptor (AHR) (34). Finally, our efforts supported the analysis of peripheral blood samples of children to investigate childhood asthma (35).
The above-mentioned usage scenarios demonstrate the flexible applicability of our NLP engines and their potential to complement life-science data set analysis and hypothesis generation. Besides its core functionality, the recognition of gene and protein interactions in huge literature repositories, GEPI excels with flexible options for expressive query formulation and a wide range of filtering functions to constrain its result sets, which can then be channeled to its end users by simple-to-(re)use reporting devices (Excel sheets). Furthermore, GEPI operates on up-to-date textual data (based on short-term update cycles in the range of a few days from the current date) and runs efficiently (as evidenced by its timely responses).
Despite the added value offered by GEPI, there is also room for improvement. On the preprocessing level, BIOSEM was used for its superior specificity and processing performance but has since been matched by deep learning-based approaches (DL) that might identify additional interaction events, e.g., DEEPEVENTMINE. However, these DL approaches have a high demand for computational power, and frequent updates incorporating new literature can become prohibitive, with updates consuming significant resources and taking prolonged periods of time. Despite the focus on high precision of GEPI's NLP components, false positive results cannot be ruled out. For further result verification we recommend querying curated databases, e.g. BIOGRID, or comparing GEPI results with those of STRING, given the same query. Indeed, we see a clear complementary relation between GEPI and STRING: the interactions explicitly described in the literature found by GEPI can be corroborated with STRING's diverse set of evidence channels to obtain a high-certainty list of interactions, each of which has directly linked literature support. We also recommend inspecting the specific text portions provided by GEPI for interactions not found by BIOGRID or STRING. If their relevance is confirmed, these may feature interactions currently not included in the other databases.
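The recommended cross-check can be sketched as a simple partition of text-mined pairs by an external confidence score. Everything here is an assumption for illustration: the function name `corroborate`, the score table, the symbols and the threshold of 0.7 are hypothetical, and obtaining real scores from STRING or BIOGRID is outside the scope of the sketch.

```python
def corroborate(mined_pairs, external_scores, min_score=0.7):
    """Split mined pairs into externally supported ones and ones to review.

    external_scores maps an unordered pair (frozenset) to a confidence
    value in [0, 1]; pairs without an entry are treated as unsupported.
    """
    supported, review = [], []
    for a, b in mined_pairs:
        score = external_scores.get(frozenset((a, b)), 0.0)
        if score >= min_score:
            supported.append((a, b))
        else:
            review.append((a, b))
    return supported, review

# Hypothetical data: pairs found by text mining and external scores.
MINED = [("G1", "G2"), ("G1", "G3"), ("G4", "G5")]
SCORES = {frozenset(("G2", "G1")): 0.9, frozenset(("G1", "G3")): 0.4}

supported, review = corroborate(MINED, SCORES)
print(supported, review)
```

Pairs in the review list are not necessarily wrong; as noted above, their source sentences should be inspected manually, since they may describe interactions missing from the other databases.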

CONCLUSION
We introduced GEPI, a Web application for the fully automated extraction of molecular events from the biomedical literature. The automatically populated and updated knowledge graph includes interactions mined from the whole of PubMed and PubMed Central OA subset with regular updates to keep up with newest developments in the literature. We presented powerful query and filtering options for contextualized retrieval of interactions between genes, proteins, families and protein complexes. The interaction results can be downloaded as an Excel workbook that contains all identified relation pairs and statistics about the frequency of the interaction partners and the interactions themselves. For each result interaction, its textual source is provided for cross-checking. The NLP pipelines we presented here and their accessibility via a Web application open our NLP service to the broad life science community to effectively foster new scientific discoveries.

SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.