GeneGPT: Augmenting Large Language Models with Domain Tools for Improved Access to Biomedical Information

While large language models (LLMs) have been successfully applied to various tasks, they still face challenges with hallucinations. Augmenting LLMs with domain-specific tools such as database utilities can facilitate easier and more precise access to specialized knowledge. In this paper, we present GeneGPT, a novel method for teaching LLMs to use the Web APIs of the National Center for Biotechnology Information (NCBI) for answering genomics questions. Specifically, we prompt Codex to solve the GeneTuring tests with NCBI Web APIs by in-context learning and an augmented decoding algorithm that can detect and execute API calls. Experimental results show that GeneGPT achieves state-of-the-art performance on eight tasks in the GeneTuring benchmark with an average score of 0.83, largely surpassing retrieval-augmented LLMs such as the new Bing (0.44), biomedical LLMs such as BioMedLM (0.08) and BioGPT (0.04), as well as GPT-3 (0.16) and ChatGPT (0.12). Our further analyses suggest that: (1) API demonstrations have good cross-task generalizability and are more useful than documentations for in-context learning; (2) GeneGPT can generalize to longer chains of API calls and answer multi-hop questions in GeneHop, a novel dataset introduced in this work; (3) Different types of errors are enriched in different tasks, providing valuable insights for future improvements.


Introduction
Large language models (LLMs) such as PaLM (Chowdhery et al., 2022) and GPT-4 (OpenAI, 2023) have shown great success on a wide range of general-domain Natural Language Processing (NLP) tasks. They also achieve state-of-the-art (SOTA) performance on domain-specific tasks like biomedical question answering (Singhal et al., 2022; Liévin et al., 2022; Nori et al., 2023). However, since there is no intrinsic mechanism for autoregressive LLMs to "consult" with any source of truth, they can generate plausible-sounding but incorrect content (Ji et al., 2023). To tackle the hallucination issue, various studies have proposed to augment LLMs (Mialon et al., 2023) by either conditioning them on retrieved relevant content (Guu et al., 2020; Lewis et al., 2020; Borgeaud et al., 2022) or allowing them to use other external tools such as program APIs (Gao et al., 2022; Parisi et al., 2022; Schick et al., 2023; Qin et al., 2023).
In this work, we propose to teach LLMs to use the Web APIs of the National Center for Biotechnology Information (NCBI). NCBI provides API access to its entire collection of biomedical databases and tools, including the Entrez Programming Utilities (E-utils) and the Basic Local Alignment Search Tool (BLAST) URL API (Altschul et al., 1990; Schuler et al., 1996; Sayers et al., 2019). Enabling LLMs to use NCBI Web APIs can provide easier and more precise access to biomedical information, especially for users who are inexperienced with the database systems. More importantly, Web APIs relieve users from locally implementing functionalities, maintaining large databases, and heavy computation burdens, because the only requirement for using Web APIs is an internet connection.
We introduce GeneGPT, a novel method that prompts Codex (Chen et al., 2021) to use NCBI Web APIs by in-context learning (Brown et al., 2020). GeneGPT consists of two main modules: (a) a specifically designed prompt that consists of documentations and demonstrations of API usage, and (b) an inference algorithm that integrates API calls into the Codex decoding process. We evaluate GeneGPT on GeneTuring (Hou and Ji, 2023), a question answering (QA) benchmark for genomics, and compare GeneGPT to a variety of other LLMs such as the new Bing, ChatGPT, and BioGPT (Luo et al., 2022). GeneGPT achieves the best performance on eight GeneTuring tasks with an average score of 0.83.
In summary, our contributions are three-fold: 1. We introduce GeneGPT, a novel method that uses NCBI Web APIs to answer biomedical questions. To the best of our knowledge, this is the first study on augmenting LLMs with domain-specific Web API tools.
2. GeneGPT achieves state-of-the-art performance on eight tasks in the GeneTuring benchmark, largely surpassing retrieval-augmented and biomedical LLMs.
3. We conduct experiments to further characterize GeneGPT, including ablation, probing, and error analyses. We also contribute a novel GeneHop dataset, and use it to show that GeneGPT can perform chain-of-thought API calls to answer multi-hop genomics questions.

GeneGPT
In this section, we first introduce the general functions and syntax of NCBI Web APIs (§2.1). We then describe two key components of GeneGPT: its prompt design for in-context learning (§2.2) and the inference algorithm (§2.3).

NCBI Web APIs
We utilize the NCBI Web APIs of E-utils, which provide access to biomedical databases, and the BLAST tool for DNA sequence alignment. Web API calls are implemented with the urllib library in Python.
E-utils. It is the API for accessing the Entrez portal (Schuler et al., 1996), a system that covers 38 NCBI databases of biomedical data such as genes and proteins (Sayers et al., 2019).

Prompt design
2. Documentations (Dc.) are natural language descriptions of the API functionality, general syntax, and argument choices. We include one for the E-utils API (Dc.1) and one for the BLAST tool (Dc.2).
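As a concrete illustration of the E-utils URL syntax described in Documentation 1 (Dc.1), a small URL-building helper might look like the following sketch. The function name and defaults are our own, not part of GeneGPT; only the endpoint pattern and parameter names come from the documentation.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def build_eutils_url(util, db, retmax=10, term=None, ids=None):
    """Compose an E-utils URL following the pattern in Dc.1.

    `util` is one of "esearch", "efetch", "esummary"; `db` is e.g.
    "gene", "snp", or "omim". esearch takes a search term, while
    efetch/esummary take database id(s).
    """
    params = {"db": db, "retmax": retmax, "retmode": "json"}
    if util == "esearch":
        params["term"] = term          # search term -> database id(s)
    else:
        params["id"] = ",".join(ids)   # id(s) -> records or summaries
    return f"{EUTILS_BASE}/{util}.fcgi?{urlencode(params)}"
```

For instance, `build_eutils_url("esearch", "gene", term="BRCA1")` (an arbitrary example term) yields a URL that can then be fetched with `urllib.request.urlopen`, which is how GeneGPT executes calls at inference time.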
3. Demonstrations (Dm.) are concrete examples of using NCBI Web APIs to solve questions. Based on questions in the GeneTuring tasks, we manually write four demonstrations that cover four functions (esearch, efetch, esummary, blastn) and four databases (gene, snp, omim, nt) of E-utils and BLAST. The API URLs and the call results are marked up by "[ ]", with a special "->" symbol inserted in between that serves as an indicator for API calls.
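To make the markup concrete, here is a minimal sketch of how such a demonstration string could be assembled. The helper and the example values are hypothetical; only the "[url]->[result]" and "Answer: " conventions come from the prompt design described above.

```python
def format_demo(question, api_calls, answer):
    """Render one in-context demonstration in the GeneGPT markup:
    each API call appears as "[url]->[result]", and the final
    "Answer: ..." line is followed by a blank line, which serves
    as the answer indicator during decoding."""
    lines = [f"Question: {question}"]
    for url, result in api_calls:
        lines.append(f"[{url}]->[{result}]")
    lines.append(f"Answer: {answer}\n")
    return "\n".join(lines)
```

Several demonstrations formatted this way, concatenated with the documentations and the test question, form the full GeneGPT prompt.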

Test question:
The specific test question is then appended to the end of the prompt.
While the initial GeneGPT uses all documentations and demonstrations (denoted as GeneGPTfull in Table 2), we find through analyses in §4.1 that GeneGPT can work well with only two demonstrations (denoted as GeneGPT-slim) on all tasks.

Inference algorithm
The GeneGPT inference algorithm is briefly shown in Algorithm 1. Specifically, we first append the given question to the prompt (described in §2.2) and feed the concatenated text to Codex (code-davinci-002; Chen et al., 2021) with a temperature of 0. We choose to use Codex for two reasons: (1) it is pre-trained with code data and shows better code understanding abilities, which is crucial for generating the URLs and interpreting the raw API results; (2) its API has the longest context length (8k tokens) among all available models, so that we can fit the demonstrations in.
We discontinue the text generation process when the special "->" symbol is detected, which is the indicator for an API call request. We then extract the last URL and call the NCBI Web API with it. The raw execution results are appended to the generated text, which is fed back to Codex to continue the generation. When "\n\n", an answer indicator used in the demonstrations, is generated, we stop the inference and extract the answer after the generated "Answer: ".
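The loop described above can be sketched as follows. Here `generate` and `call_api` stand in for the Codex completion request and the urllib fetch, respectively; all names are illustrative, not the paper's implementation.

```python
import re

def genegpt_infer(generate, call_api, prompt, max_calls=10):
    """Interleave LM decoding with API execution.

    `generate(text)` returns the model's continuation, stopping at
    either the "->" call indicator or the "\n\n" answer indicator.
    `call_api(url)` returns the raw result of fetching that URL.
    """
    text = prompt
    for _ in range(max_calls):
        text += generate(text)
        if text.endswith("->"):
            # Extract the last bracketed URL and execute the call;
            # the raw result is appended inside "[ ]" markup.
            url = re.findall(r"\[(https?://[^\]]+)\]", text)[-1]
            text += "[" + call_api(url) + "]"
        else:
            break  # "\n\n" reached: the answer is complete
    # The answer follows the final "Answer: " in the generation.
    return text.rsplit("Answer: ", 1)[-1].strip()
```

With stubbed `generate` and `call_api` functions this loop reproduces the detect-call-resume behavior; in the real system `generate` would hit the Codex API with "->" and "\n\n" as stop sequences.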

GeneTuring
The GeneTuring benchmark (Hou and Ji, 2023) contains 12 tasks, and each task has 50 question-answer pairs. We use 9 GeneTuring tasks that are related to NCBI resources to evaluate the proposed GeneGPT model; QA samples are shown in Appendix B. The chosen tasks are classified into four modules and briefly described in this section.
Nomenclature: This module is about gene names. We use the gene alias task and the gene name conversion task, where the objective is to find the official gene symbols for their non-official synonyms.

Genomics location:
These tasks concern the locations of genes, single-nucleotide polymorphisms (SNPs), and their relations. We include the gene location, SNP location, and gene SNP association tasks. The first two ask for the chromosome location (e.g., "chr2") of a gene or an SNP, and the last asks for the related genes of a given SNP.
Functional analysis: This module asks for gene functions. We use the gene disease association task, where the goal is to return related genes for a given disease, and the protein-coding genes task, which asks whether a gene is protein-coding or not.
Sequence alignment: These tasks query specific DNA sequences. We use the DNA sequence alignment to human genome task and the DNA sequence alignment to multiple species task. The former maps a DNA sequence to a specific human chromosome, while the latter maps a DNA sequence to a specific species (e.g., "zebrafish").

Compared methods
We evaluate two settings of GeneGPT, a full setting (GeneGPT-full) where all prompt components are used, as well as a slim setting (GeneGPT-slim) inspired by our ablation and probing analyses ( §4.1) where only Dm.1 and Dm.4 are used.

Evaluation
For the compared methods, we directly use the results reported in the original benchmark, which were evaluated manually.
To evaluate our proposed GeneGPT method, we follow the general criteria but perform automatic evaluations. Specifically, for all nomenclature and genomics location tasks, we only consider exact matches between model predictions and the ground truth as correct. For the gene disease association task, we measure recall as in the original dataset, but based on exact individual gene matches. For the protein-coding genes task and the DNA sequence alignment to multiple species task, we also consider exact matches as correct after applying a simple vocabulary mapping that converts model-predicted "yes"/"no" to "TRUE"/"NA" and Latin species names to their informal names (e.g., "Saccharomyces cerevisiae" to "yeast"), respectively. For the DNA sequence alignment to human genome task, we give a correct chromosome mapping with an incorrect position mapping a score of 0.5 (e.g., chr8:7081648-7081782 vs. chr8:1207812-1207946), since the original task does not specify a reference genome. Overall, our evaluation of GeneGPT is stricter than the original evaluation of other LLMs in Hou and Ji (2023), which was performed manually and might consider non-exact matches as correct.
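As an illustration, the 0.5-credit rule for the human genome alignment task could be implemented as the following hypothetical scorer (the function name is our own):

```python
def score_human_genome_alignment(pred, gold):
    """Score a "chr:start-end" prediction against the gold mapping:
    1.0 for an exact match, 0.5 if only the chromosome agrees
    (positions may differ across reference genomes), 0.0 otherwise."""
    if pred == gold:
        return 1.0
    pred_chr = pred.split(":", 1)[0]
    gold_chr = gold.split(":", 1)[0]
    return 0.5 if pred_chr == gold_chr else 0.0
```

The other tasks reduce to plain exact-match or recall computations after the vocabulary mapping described above.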

Main results
Table 2 shows the performance of GeneGPT on the GeneTuring tasks in comparison with other LLMs. For GeneGPT, tasks with "*" in Table 2 are one-shot, where one instance is used as the API demonstration; the other tasks are zero-shot.
For the compared LLMs, all tasks are zero-shot.
Nomenclature: GeneGPT achieves state-of-the-art (SOTA) performance on both the one-shot gene alias task, with an accuracy of 0.84, and the zero-shot gene name conversion task, with an accuracy of 1.00. On average, GeneGPT outperforms the new Bing by a large margin (0.92 vs. 0.76). All other GPT models have accuracy scores of less than 0.10 on the nomenclature tasks.
Genomics location: GeneGPT also achieves SOTA performance on all genomics location tasks, including the gene SNP association task (1.00), the gene location task (0.66), and the SNP location task (1.00). While the new Bing is comparable to GeneGPT on gene location (0.61 vs. 0.66), its performance on the two SNP-related tasks is close to 0. Again, most other LLMs score less than 0.10. Notably, while all genomics location tasks are zero-shot for GeneGPT-slim, it performs comparably to GeneGPT-full, which uses one gene SNP association demonstration. This indicates that API demonstrations have strong cross-task generalizability.
Functional analysis: The new Bing performs better on functional analysis tasks than the proposed GeneGPT (average score: 0.91 vs. 0.84), probably because many web pages related to gene functions can be retrieved by the Bing search engine. We also note that other LLMs, especially GPT-3 and ChatGPT, perform moderately well, and much better than they perform on other tasks. This might also be because many gene-function-related texts are included in their pre-training corpora.

Sequence alignment: GeneGPT, with an average score of 0.66, performs much better than all other models including the new Bing (0.00), which essentially fails on the sequence alignment tasks. This is not very surprising, since sequence alignment is easy with the BLAST tool but almost impossible for an auto-regressive LLM even with retrieval augmentation, as the input sequences are too specific to be indexed by a search engine.
Although evaluated under a stricter setting (§3.3), GeneGPT achieves a macro-average performance of 0.83, which is much higher than other compared LLMs including the new Bing (0.44). Overall, GeneGPT achieves new SOTA performance on both one-shot tasks and 6 out of 7 zero-shot tasks, and is outperformed by the new Bing only on the gene disease association task.

Discussions
We have shown that GeneGPT largely surpasses various LLMs on the GeneTuring benchmark. In this section, we further characterize GeneGPT by studying three research questions (RQ): RQ1: What is the importance of each prompt component in GeneGPT?

RQ2: Can GeneGPT answer multi-hop questions by chain-of-thought API calls?
RQ3: What types of errors does GeneGPT make on each studied task?
For the ablation tests, we remove each component from GeneGPT-full and then evaluate the resulting prompt. The results are shown in Figure 2 (left). Notably, performance on the DNA-to-genome and species alignment tasks decreases significantly only without the BLAST demonstration (Dm.4), and is not affected by the ablation of the BLAST documentation (Dc.2). While ablating other components also decreases the performance, most ablations only affect one relevant task (e.g., Dm.1 and gene name conversion), which indicates a high level of redundancy among the prompt components.
For the probing experiments, we evaluate GeneGPT with only one prompt component to study each component's individual capability. The results are shown in Figure 2 (right). Overall, GeneGPT with only one documentation (Dc.1 or Dc.2) fails on all tasks. Surprisingly, with only one demonstration of the gene alias task (Dm.1) in the prompt, GeneGPT performs comparably to GeneGPT-full on all tasks except the alignment ones. On the other hand, GeneGPT with only the BLAST demonstration (Dm.4) performs well on the two alignment tasks, which is somewhat expected. These results suggest that GeneGPT with only two demonstrations (Dm.1 and Dm.4) in the prompt can generalize to all tasks in the GeneTuring benchmark. We denote this setting as GeneGPT-slim, and the results in Table 2 show that with only two demonstrations, it outperforms GeneGPT-full and achieves state-of-the-art overall results on GeneTuring.

RQ2: Multi-hop QA on GeneHop
Questions in the GeneTuring benchmark are single-hop and only require one step of reasoning, e.g., "Which gene is SNP rs983419152 associated with?". However, many real-world biomedical questions are multi-hop and need more steps to answer (Jin et al., 2022). For example, to answer "What is the function of the gene associated with SNP rs983419152?", the model should first get the associated gene name and then find its functions. To test GeneGPT's capability of answering multi-hop questions, we present GeneHop, a novel dataset that contains three new multi-hop QA tasks based on the GeneTuring benchmark: (a) SNP gene function, which asks for the function of the gene associated with a given SNP; (b) disease gene location, where the task is to list the chromosome locations of the genes associated with a given disease; (c) sequence gene alias, which asks for the aliases of the gene that contains a specific DNA sequence. Each task in GeneHop contains 50 questions, and the collection pipeline is detailed in Appendix C.
For all tasks, we append the chain-of-thought instruction "Let's decompose the question to sub-questions and solve them step by step." after the test question (Wei et al., 2022b). Figure 3 shows an example of GeneGPT answering Task (a). In this case, GeneGPT successfully decomposes the multi-hop question into two sub-questions, and sub-question 2 is based on the answer of sub-question 1. Interestingly, GeneGPT uses a shortcut to answer sub-question 2: instead of first calling esearch and then calling esummary, GeneGPT finds the gene id in the API call results of sub-question 1 and directly calls esummary. This capability is not shown in the prompt but elicited by chain-of-thought API calls.

Figure 4 shows another example of GeneGPT answering Task (b), where GeneGPT successfully decomposes the multi-hop question and correctly calls the required APIs. Notably, the answering chain involves 3 sub-questions and 4 API calls, which are longer than all in-context demonstrations (1 single-hop question and at most 2 API calls). This ability to generalize to longer chains of thought is an important aspect of GeneGPT's flexibility and usefulness for real-world applications. We manually evaluate the results predicted by GeneGPT and compare them to the new Bing, which is the only baseline LLM that performs well on the single-hop GeneTuring benchmark due to its retrieval augmentation feature. The evaluation criteria are described in Appendix D.
As shown in Table 3, while the new Bing outperforms GeneGPT on the disease gene location task, it mostly uses webpages that contain both the disease and location information, without multi-hop reasoning. The new Bing fails on the other two tasks since the input information (SNP or sequence) is not indexed by Bing and can only be found in specialized databases. GeneGPT, on the other hand, performs moderately well on all three tasks and achieves a much higher average score (0.50 vs. 0.24).

RQ3: Error analysis
We manually study all errors made by GeneGPT and classify them into five types. Table 4 shows the count of each error type on the evaluated tasks: E1: using the wrong API or not using APIs, e.g., using the gene instead of the omim database for diseases; E2: using the right API but with wrong arguments, e.g., passing terms to id; E3: failing to extract the answer from the API result, most commonly seen in gene function extraction; E4: a correct API call whose results do not contain the answer, i.e., the question is not answerable with NCBI databases; and O, which includes other unclassified errors. Specific error examples are shown in Appendix E.
Our results suggest that different tasks have specific, enriched error types: simple tasks (alias and location) fail mostly because of E4; E1 only happens in disease-related tasks; alignment tasks face more issues with BLAST interfaces and reference genomes (O); and multi-hop tasks in GeneHop tend to have E2 and E3 in their reasoning chains.

Related work
Large language models: Recent studies have shown that scaling pre-trained LMs leads to performance improvements and potentially emergent abilities on various NLP tasks (Brown et al., 2020; Kaplan et al., 2020; Wei et al., 2022a; Chowdhery et al., 2022; OpenAI, 2023). However, such auto-regressive LLMs are still susceptible to hallucinations and can generate erroneous content (Ji et al., 2023). Augmenting LLMs with external tools is a possible solution to this issue (Mialon et al., 2023).

Biomedical question answering:
It is an essential step in clinical decision support (Ely et al., 2005) and biomedical knowledge acquisition (Jin et al., 2022). LLMs have been successfully applied to various biomedical QA tasks that are knowledge- or reasoning-intensive (Singhal et al., 2022; Liévin et al., 2022; Nori et al., 2023). However, auto-regressive LLMs fail on data-intensive tasks that require the model to precisely store and recite database entries, such as the GeneTuring benchmark (Hou and Ji, 2023). Retrieval augmentation also falls short, since specialized databases are usually not indexed by commercial search engines. GeneGPT solves this task by tool augmentation.

Conclusions
We present GeneGPT, a novel method that teaches LLMs to use NCBI Web APIs. It achieves SOTA performance on 8 GeneTuring tasks and can perform chain-of-thought API calls. Our results indicate that database utility tools might be superior to relevant web pages for augmenting LLMs to faithfully serve various biomedical information needs.

C GeneHop collection
The GeneHop dataset contains three multi-hop tasks: SNP gene function, disease gene location, and sequence gene alias. We describe the collection of these tasks in this section. Table 6 shows several question-answer samples from the GeneHop dataset.

SNP gene function:
The question template for this task is "What is the function of the gene associated with SNP {snp}? Let's decompose the question to sub-questions and solve them step by step.". We re-use the 50 {snp} instances from the gene SNP association task in the original GeneTuring benchmark. The ground-truth answer for the gene function is manually annotated: for each SNP, we first get its corresponding gene from the annotations of the gene SNP association task. We then check the gene information page and select its functional summary as the ground-truth answer.
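Question construction for this task amounts to simple template filling; a minimal sketch (the helper name is our own):

```python
# Template taken from the task description; {snp} is the only slot.
TEMPLATE = ("What is the function of the gene associated with SNP {snp}? "
            "Let's decompose the question to sub-questions and solve them "
            "step by step.")

def make_snp_function_question(snp):
    """Fill the SNP gene function template with one rsID."""
    return TEMPLATE.format(snp=snp)
```

The other two GeneHop tasks can be built analogously from their respective templates and the re-used GeneTuring inputs.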

Figure 1 :
Figure 1: Left: GeneGPT uses NCBI Web API documentations and demonstrations in the prompt for in-context learning. Right: Examples of GeneGPT answering GeneTuring and GeneHop questions with NCBI Web APIs.

Figure 2 :
Figure 2: Performance changes of the ablation (left) and probing (right) experiments as compared to GeneGPT-full.

Figure 3 :
Figure 3: GeneGPT uses chain-of-thought API calls to answer a multi-hop question in GeneHop.

Figure 4 :
Figure 4: GeneGPT uses chain-of-thought API calls to answer a multi-hop question in GeneHop.

Figure 5 :
Figure 5: Documentation 1 (Dc.1) of the GeneGPT prompt. Dc.1 describes the functionality, general syntax, and argument choices of the NCBI E-utils API.

Figure 8 :
Figure 8: Demonstration 2 (Dm.2) of the GeneGPT prompt. The instance is chosen from the gene SNP association task in the GeneTuring benchmark. Links are the actually called Web API URLs. Readers can directly click a link and get the API call result, which is inserted in the prompt.

Table 2 :
Performance of GeneGPT compared to other LLMs on the GeneTuring benchmark. *One-shot learning for GeneGPT. Bolded and underlined numbers denote the highest and second-highest performance, respectively.
GeneHop question (disease gene location): List chromosome locations of the genes related to Cleft palate with ankyloglossia. Let's decompose the question to sub-questions and solve them step by step.

Table 3 :
Performance of multi-hop QA on GeneHop. We only compare GeneGPT with the new Bing since other LLMs cannot even answer single-hop questions well.

Table 4 :
Counts of GeneGPT errors on different tasks.
Documentation 1 (Dc.1): You can call E-utils by: "[https://eutils.ncbi.nlm.nih.gov/entrez/eutils/{esearch|efetch|esummary}.fcgi?db={gene|snp|omim}&retmax={}&{term|id}={term|id}]". esearch: input is a search term and output is database id(s). efetch/esummary: input is database id(s) and output is full records or summaries that contain name, chromosome location, and other information. Normally, you need to first call esearch to get the database id(s) of the search term, and then call efetch/esummary to get the information with the database id(s). Database: gene is for genes, snp is for SNPs, and omim is for genetic diseases.
BLAST maps a specific DNA sequence to its chromosome location among different species. You need to first PUT the BLAST request and then GET the results using the RID returned by PUT.
Figure 6: Documentation 2 (Dc.2) of the GeneGPT prompt. Dc.2 describes the functionality, general syntax, and argument choices of the BLAST API.
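The PUT-then-GET protocol in Dc.2 can be sketched as follows. The helper names are ours; the `CMD`, `PROGRAM`, `DATABASE`, `QUERY`, and `RID` parameters follow the public BLAST URL API, and the RID is parsed from the "RID = ..." line echoed in the PUT response.

```python
import re
from urllib.parse import urlencode

BLAST_BASE = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"

def build_put_url(sequence, program="blastn", database="nt"):
    """Submission URL: CMD=Put with the query DNA sequence."""
    params = {"CMD": "Put", "PROGRAM": program,
              "DATABASE": database, "QUERY": sequence}
    return f"{BLAST_BASE}?{urlencode(params)}"

def extract_rid(put_response_text):
    """Parse the request id (RID) from the PUT response page."""
    match = re.search(r"RID\s*=\s*(\S+)", put_response_text)
    return match.group(1) if match else None

def build_get_url(rid):
    """Retrieval URL: CMD=Get with the RID returned by PUT."""
    return f"{BLAST_BASE}?{urlencode({'CMD': 'Get', 'FORMAT_TYPE': 'Text', 'RID': rid})}"
```

In practice the GET request may need to be retried until BLAST finishes processing the submitted sequence; GeneGPT simply issues these URLs like any other API call.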
Question: Which gene is SNP rs1217074595 associated with? [https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=snp&retmax=10&
Figure 7: Demonstration 1 (Dm.1) of the GeneGPT prompt. The instance is chosen from the gene alias task in the GeneTuring benchmark. Links are the actually called Web API URLs. Readers can directly click a link and get the API call result, which is inserted in the prompt.