MedCPT: Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval

Information retrieval (IR) is essential in biomedical knowledge acquisition and clinical decision support. While recent progress has shown that language model encoders perform better semantic retrieval, training such models requires abundant query-article annotations that are difficult to obtain in biomedicine. As a result, most biomedical IR systems only conduct lexical matching. In response, we introduce MedCPT, a first-of-its-kind Contrastively Pre-trained Transformer model for zero-shot semantic IR in biomedicine. To train MedCPT, we collected an unprecedented scale of 255 million user click logs from PubMed. With such data, we use contrastive learning to train a closely integrated pair of retriever and re-ranker. Experimental results show that MedCPT sets new state-of-the-art performance on six biomedical IR tasks, outperforming various baselines including much larger models such as the GPT-3-sized cpt-text-XL. In addition, MedCPT also generates better biomedical article and sentence representations for semantic evaluations. As such, MedCPT can be readily applied to various real-world biomedical IR tasks.


Introduction
Information retrieval (IR) is an important step in biomedical knowledge discovery and clinical decision support (Ely, et al., 2005; Gopalakrishnan, et al., 2019). However, most IR systems in biomedicine are keyword-based and will miss articles that are semantically relevant but have no lexical overlap with the input query. Recent progress in IR and deep learning has shown that dense retrievers, which encode and match queries and documents as dense vectors, can perform better semantic retrieval than traditional sparse (lexical) retrievers such as BM25 (Karpukhin, et al., 2020; Khattab and Zaharia, 2020; Lin, et al., 2022; Nogueira and Cho, 2019). They are typically based on pre-trained transformers (Vaswani, et al., 2017) that are further fine-tuned with task-specific data. However, dense retrieval models trained on general-domain datasets do not generalize well to domain-specific IR tasks (Thakur, et al., 2021), and domain-specific datasets remain limited in scale and diversity, restricting the creation of generalizable models (Roberts, et al., 2014; Roberts, et al., 2017; Tsatsaronis, et al., 2015; Voorhees, et al., 2021). As a result, there is a pressing need for pre-trained models that can perform well across various biomedical IR tasks.
In response, we propose bioMedical Contrastive Pre-trained Transformers (MedCPT), a novel model trained with an unprecedented scale of 255M query-article pairs from PubMed search logs. MedCPT is the first biomedical IR model that includes a pair of retriever and re-ranker closely integrated by contrastive learning. Unlike previous separately developed models, which suffer from a discrepancy between the two modules (Gao, et al., 2021), the MedCPT re-ranker is trained with negatives sampled from the pre-trained MedCPT retriever. This matches the inference-time article distribution, where the MedCPT re-ranker re-ranks the articles returned by the MedCPT retriever. As shown in Figure 1, we perform zero-shot evaluation on a wide range of biomedical IR tasks. For document retrieval, MedCPT (330M) achieves state-of-the-art (SOTA) performance on three individual biomedical tasks and the overall average in BEIR (Thakur, et al., 2021), outperforming much larger models such as Google's GTR-XXL (4.8B) (Ni, et al., 2021) and OpenAI's cpt-text-XL (175B) (Hirschman, et al., 2012). For article representation, we also show that the MedCPT article encoder sets new SOTA performance on the RELISH similar article dataset (Brown, et al., 2019) and the MeSH prediction task in SciDocs (Cohan, et al., 2020). For sentence representation, MedCPT performs the best or second best among compared methods on the BIOSSES (Sogancioglu, et al., 2017) and MedSTS (Wang, et al., 2020) datasets for semantic evaluations. As such, MedCPT can be readily applied to a variety of biomedical applications such as searching relevant documents, retrieving similar sentences, recommending related articles, as well as providing domain-specific retrieval augmentation for large language models (Jin, et al., 2023).

Materials and Methods
Query-article relevance data collection from PubMed search logs We collected anonymous query-article clicks from three years (2020-2022) of PubMed search logs to train MedCPT. The raw logs contain 167M unique queries and 23M unique PubMed articles. We first filtered out navigational queries, such as author and journal title searches, with Field Sensor (Yeganova, et al., 2018). After filtering, 87M informational queries and 17M articles remain. Based on the user click information, we generated 255M relevant query-article pairs to train the MedCPT retriever. However, most of these queries are short keyword queries, and matching them to the clicked articles is a relatively simple task. As such, we use a more difficult subset that requires deeper semantic understanding to train the MedCPT re-ranker, which aims to distinguish harder negatives among the top-ranking articles returned by the retriever. Specifically, we further filtered out 79M keyword queries from the informational query set, defined as queries that either contain only one word or whose clicked articles all contain exact mentions of the whole input query. In the end, 7.7M non-keyword queries (e.g., short sentences) and 5.2M articles remain, from which we generated 18.3M relevant query-article pairs to train the MedCPT re-ranker.
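To make this filtering step concrete, below is a minimal Python sketch of the keyword-query heuristic described above (one-word queries, or queries mentioned verbatim in every clicked article). The data structures and function names are illustrative assumptions, not the authors' production pipeline.

```python
def is_keyword_query(query: str, clicked_docs: list[str]) -> bool:
    """Keyword-query heuristic from the text: the query has only one word,
    or every clicked article contains an exact mention of the whole query.
    (Illustrative reimplementation, not the authors' production filter.)"""
    if len(query.split()) <= 1:
        return True
    q = query.lower()
    return all(q in doc.lower() for doc in clicked_docs)

# Toy click log: {query: list of clicked article texts (title + abstract)}.
click_log = {
    "bnt162b2": ["Safety and efficacy of the BNT162b2 mRNA Covid-19 vaccine."],
    "does aspirin prevent stroke": [
        "Aspirin for primary prevention of cardiovascular events and stroke."],
}

# Partition the click data: all pairs train the retriever; the harder,
# non-keyword subset trains the re-ranker.
retriever_pairs, reranker_pairs = [], []
for query, docs in click_log.items():
    for doc in docs:
        retriever_pairs.append((query, doc))
        if not is_keyword_query(query, docs):
            reranker_pairs.append((query, doc))
```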
MedCPT architecture MedCPT includes a first-stage retriever and a second-stage re-ranker. The retriever includes a query encoder (QEnc in Figure 1) and a document encoder (DEnc), both initialized with PubMedBERT and contrastively trained with query-article pairs and in-batch negatives, as shown in Figure 2 (A). This bi-encoder architecture is scalable because millions of articles can be encoded offline; only one query encoding and one nearest-neighbor search are required during real-time inference. The re-ranker is a cross-encoder (CrossEnc) that is computationally more expensive but also more accurate, due to the cross-attention between query and article tokens. It is only applied to the top articles returned by the retriever and generates the final article ranking.
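As an illustration of how such a bi-encoder can be trained with in-batch negatives (Figure 2A), here is a minimal PyTorch sketch. The PubMedBERT checkpoint name and the [CLS] pooling are assumptions for illustration; the exact pooling and training details of the released model may differ.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Assumed initialization checkpoint; the paper states both encoders are
# initialized with PubMedBERT.
CKPT = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"
tokenizer = AutoTokenizer.from_pretrained(CKPT)
q_enc = AutoModel.from_pretrained(CKPT)  # QEnc
d_enc = AutoModel.from_pretrained(CKPT)  # DEnc

def embed(encoder, texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    # [CLS] pooling is an assumption; other pooling strategies are possible.
    return encoder(**batch).last_hidden_state[:, 0]

def in_batch_contrastive_loss(queries, docs):
    """queries[i] is relevant to docs[i]; every other document in the
    batch serves as an (approximately random) in-batch negative."""
    q = embed(q_enc, queries)            # (B, h)
    d = embed(d_enc, docs)               # (B, h)
    scores = q @ d.T                     # (B, B) inner-product similarities
    labels = torch.arange(len(queries))  # positives lie on the diagonal
    return F.cross_entropy(scores, labels)
```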
MedCPT re-ranker The MedCPT re-ranker is a cross-encoder, denoted as CrossEnc. Similarly, CrossEnc is also initialized with PubMedBERT. The MedCPT re-ranker predicts the relevance between a query $q$ and a document $d$ by passing them together into a single CrossEnc. Specifically, as shown in Figure 2 (B), each training instance for the MedCPT re-ranker contains a query $q_i$, a clicked document $d_i^{+}$, and a list of $n$ irrelevant (not clicked) documents $\{d_{i,j}^{-} \mid j = 1, 2, 3, \ldots, n\}$. Following (Gao, et al., 2021), we use local negatives to train the MedCPT re-ranker instead of in-batch negatives. Specifically, unlike the in-batch negative documents used by the MedCPT retriever, which are approximately random samples, the local negative documents are sampled from rank $r_1$ to rank $r_2$ of the top documents retrieved by the pre-trained MedCPT retriever through maximum inner product search (MIPS), which ensures that the MedCPT re-ranker can distinguish the hard negatives returned by the retriever. The loss $\mathcal{L}_i$ for the instance is a negative log-likelihood loss:

$$\mathcal{L}_i = -\log \frac{\exp\left(\mathrm{CrossEnc}(q_i, d_i^{+})\right)}{\exp\left(\mathrm{CrossEnc}(q_i, d_i^{+})\right) + \sum_{j=1}^{n} \exp\left(\mathrm{CrossEnc}(q_i, d_{i,j}^{-})\right)}$$

We take a weighted sum of the instance-level losses and optimize the final loss by gradient-based methods. More details on MedCPT inference and configuration are shown in Appendix A.
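A minimal sketch of this loss, assuming a Hugging Face cross-encoder with a single-logit relevance head; the checkpoint name and function signatures are illustrative assumptions:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CKPT = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"  # assumed, as above
tokenizer = AutoTokenizer.from_pretrained(CKPT)
cross_enc = AutoModelForSequenceClassification.from_pretrained(CKPT, num_labels=1)

def reranker_loss(query: str, pos_doc: str, local_negs: list[str]) -> torch.Tensor:
    """Negative log-likelihood of the clicked document under a softmax over
    CrossEnc scores for {d+} plus n local negatives, which should be sampled
    from ranks r1..r2 of the pre-trained retriever's results."""
    docs = [pos_doc] + local_negs
    batch = tokenizer([query] * len(docs), docs, padding=True,
                      truncation=True, return_tensors="pt")
    scores = cross_enc(**batch).logits.squeeze(-1)  # (1 + n,) relevance scores
    return -F.log_softmax(scores, dim=0)[0]         # positive sits at index 0
```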

Results
MedCPT achieves state-of-the-art performance on biomedical IR tasks Benchmarking-IR (BEIR) (Thakur, et al., 2021) is a standardized evaluation benchmark for zero-shot IR systems. We evaluate MedCPT on all five biomedical tasks in the BEIR benchmark. Appendix C describes the evaluation details and Table 1 shows the evaluation results.
First, MedCPT improves over its initialization, PubMedBERT, by huge margins; the latter essentially fails on the retrieval tasks. Overall, MedCPT sets new SOTA performance on 3/5 tasks, surpassing the compared sparse (Dai and Callan, 2020; Nogueira, et al., 2019; Zhang, et al., 2015), dense (Hofstätter, et al., 2021; Izacard, et al., 2021; Karpukhin, et al., 2020; Xiong, et al., 2020), and late-interaction (Khattab and Zaharia, 2020) retrievers on all of the compared tasks. As shown in the BEIR paper, BM25 is a strong baseline that generalizes to biomedical IR tasks. Notably, MedCPT is still better than BM25 with a cross-encoder re-ranker on 4/5 of the evaluated tasks, showing its effectiveness at retrieving relevant articles for biomedical queries. BM25 with the re-ranker is only better on the TREC-COVID dataset, which might be due to annotation biases (Thakur, et al., 2021). We further compare MedCPT with more recent large dual-encoder retrievers, represented by Google's GTR and OpenAI's cpt-text, whose model sizes range from millions to billions of parameters. MedCPT outperforms all sizes of the GTR model. While the GPT-3-sized (175B) (Brown, et al., 2020) cpt-text-XL is better than MedCPT on NFCorpus, MedCPT outperforms cpt-text-XL on TREC-COVID and SciFact despite being about 500 times smaller. This indicates that small models trained on domain-specific datasets can still achieve better in-domain zero-shot performance than much larger general-domain retrievers.
MedCPT generates better biomedical article representations We evaluate the MedCPT article encoder on the RELISH article similarity task (Brown, et al., 2019). RELISH is an expert-annotated dataset that contains 196k article-article relevance annotations for 3.2k query articles, as described in Appendix D. Table 2 shows the evaluation results on RELISH. The MedCPT article encoder (DEnc) outperforms all other models, including SPECTER (Cohan, et al., 2020) and SciNCL (Ostendorff, et al., 2022), which are specifically trained with article-article citation information. Compared to its base PubMedBERT model, the MedCPT article encoder improves performance by over 10%. We also evaluate the MedCPT article encoder on SciDocs (Cohan, et al., 2020), as described in Appendix E, which covers scientific domains from biomedicine to engineering. The MedCPT article encoder achieves SOTA performance on the MeSH prediction subtask and is comparable to SOTA methods on the overall score, showing its effectiveness on biomedical tasks and its generalizability to other scientific domains.
MedCPT generates better biomedical sentence representations We evaluate the MedCPT query encoder on two sentence similarity datasets: BIOSSES in the biomedical domain (Sogancioglu, et al., 2017) and MedSTS in the clinical domain (Wang, et al., 2020). Appendix F introduces the evaluation details and Table 3 shows the evaluation results. On BIOSSES, MedCPT performs the best among all compared models, surpassing the second-best model SciNCL by 5% relative performance (0.893 vs. 0.847). On MedSTS, MedCPT ranks second, with performance comparable to the highest-ranking model BioSentVec (Chen, et al., 2019) (0.765 vs. 0.767), which uses the external clinical corpus MIMIC-III (Johnson, et al., 2016) for its model training. Overall, our results show that the MedCPT query encoder can effectively encode biomedical and clinical sentences in a way that reflects their semantic similarities.

Discussions
MedCPT is trained only with query-article click data derived from PubMed user logs, yet it generalizes well and achieves SOTA performance on many biomedical IR tasks in the BEIR benchmark, which indicates that query-article pairs in the PubMed search logs can serve as high-quality training data for modeling general-purpose information needs in biomedicine. Furthermore, while not explicitly trained with query similarity or article similarity data, the MedCPT query encoder and article encoder still achieve SOTA performance on sentence similarity and article similarity tasks, respectively. This shows that the contrastive objective can train not only a dense retriever, but also the individual query and document encoders for tasks related to information-seeking behaviors. As such, MedCPT has broad implications in a variety of real-world scenarios: enhancing algorithms for biomedical literature search such as PubMed's Best Match (Fiorini, et al., 2018), where case studies in Appendix G show that MedCPT retrieves more semantically relevant articles than other commonly used literature search engines; improving similar-article recommendation algorithms in literature search (Lin and Wilbur, 2007); and facilitating sentence-to-sentence retrieval tasks such as sentence-level literature search (Allot, et al., 2019).
Although transformer-based retrieval and re-ranking models such as MedCPT can return more comprehensive results, they are not as controllable or explainable as sparse retrievers such as BM25. For example, when a user searches for the gene "MAP3K3", MedCPT may also return articles that only mention "MAP3K7", which might not match the original information need. In addition, the semantic similarity scores between a query-article pair are not easily explainable. As such, one potential future direction is to develop hybrid dense-sparse retrieval systems that harvest the advantages of both approaches (Ma, et al., 2020; Shin, et al., 2023), e.g., via score interpolation as sketched below.
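As a concrete illustration of such a hybrid, one simple scheme linearly interpolates normalized BM25 and dense retriever scores; the weighting and min-max normalization below are illustrative choices, not from the paper:

```python
def hybrid_scores(bm25: dict, dense: dict, alpha: float = 0.5) -> dict:
    """Linearly interpolate min-max-normalized BM25 and dense scores per
    document ID. alpha and the normalization are illustrative choices."""
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        return {d: (s - lo) / ((hi - lo) or 1.0) for d, s in scores.items()}
    b, v = norm(bm25), norm(dense)
    return {d: alpha * b.get(d, 0.0) + (1 - alpha) * v.get(d, 0.0)
            for d in set(b) | set(v)}
```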
To summarize, we use large-scale PubMed search logs to contrastively train MedCPT, the first integrated retriever and re-ranker model for biomedical information retrieval. Systematic zero-shot evaluations show that MedCPT achieves the highest performance on six different biomedical information retrieval tasks, spanning query-to-article retrieval as well as semantic article and sentence representation. We anticipate that MedCPT will have a broad range of applications and will significantly enhance access to biomedical information, making it a valuable tool for researchers and practitioners alike.
Appendix A: MedCPT inference and configuration
At inference time, MedCPT first encodes the input query with QEnc, retrieves the top candidate articles by maximum inner product search against the pre-computed DEnc article embeddings, and scores each retrieved query-article pair with CrossEnc. Finally, we sort the retrieved articles by $\mathrm{CrossEnc}(q, d)$ from the highest to the lowest and return the sorted results.
We implemented MedCPT using PyTorch (Paszke, et al., 2019) and the Hugging Face Transformers library (Wolf, et al., 2020). The hidden dimension of MedCPT is $h = 768$, as in the BERT-base configuration. We use the Adam optimizer (Kingma and Ba, 2014) without weight decay to train both the retriever and the re-ranker, with a learning rate of 2e-5 and an epsilon of 1e-8. For the MedCPT retriever, we set the batch size $B = 32$ and the loss weight $\lambda = 0.8$, and we also apply gradient accumulation of 8 steps. We train the retriever for 100k steps with 10k warm-up steps. For the MedCPT re-ranker, we use $n = 31$ local negatives per query, sampled from rank $r_1 = 50$ to rank $r_2 = 200$ of the retriever results. We train the re-ranker for 10k steps with 1,000 warm-up steps. We apply a cosine learning rate schedule after the warm-up steps. During inference, the numbers of retrieved and re-ranked articles vary for specific tasks. We implemented MIPS with the FlatIP index of the Faiss library (Johnson, et al., 2019).
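For reference, a minimal sketch of exact MIPS with the Faiss FlatIP index mentioned above; the embedding arrays are random placeholders standing in for DEnc/QEnc outputs:

```python
import faiss
import numpy as np

h = 768  # hidden dimension, as in the BERT-base configuration
doc_emb = np.random.rand(10_000, h).astype("float32")  # placeholder DEnc outputs

index = faiss.IndexFlatIP(h)  # exact (brute-force) maximum inner product search
index.add(doc_emb)            # articles are encoded and indexed offline

query_emb = np.random.rand(1, h).astype("float32")  # placeholder QEnc output
scores, doc_ids = index.search(query_emb, 100)      # top-100 candidates to re-rank
```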

Large language model retrievers
We also compare MedCPT with two large language model retrievers: Google's GTR (Ni, et al., 2021) and OpenAI's cpt-text (Hirschman, et al., 2012). Unlike most dense retrievers, which are based on the 110M-parameter BERT-base model, GTR and cpt-text use much larger language model encoders. Specifically, GTR is based on T5 (Raffel, et al., 2020) and its largest variant has 4.8B parameters, while cpt-text is based on GPT-3 (Brown, et al., 2020) and its largest variant has 175B parameters. Both GTR and cpt-text are pre-trained on large-scale Web corpora with in-batch negatives, and are further fine-tuned with supervised datasets such as MS MARCO (Bajaj, et al., 2016). In comparison, MedCPT is trained only on user click data from PubMed logs without using any supervised datasets.

Figure 1. A high-level overview of this work. MedCPT contains a query encoder (QEnc), a document encoder (DEnc), and a cross-encoder (CrossEnc). The query encoder and document encoder compose the MedCPT retriever, which is contrastively trained with 255M query-article pairs and in-batch negatives from PubMed logs. The cross-encoder is the MedCPT re-ranker, which is contrastively trained with 18M non-keyword query-article pairs and local negatives retrieved by the MedCPT retriever. MedCPT achieves state-of-the-art performance on various biomedical information retrieval tasks under zero-shot settings, including query-article retrieval, sentence representation, and article representation.

Figure 2. Overview of the MedCPT training process. (A) Training the MedCPT query encoder (QEnc) and document encoder (DEnc) using a contrastive loss with query-document pairs and in-batch negatives; (B) training the MedCPT cross-encoder (CrossEnc) using a contrastive loss with query-document pairs and local negatives sampled from the pre-trained MedCPT retriever.

Table S2. Top-three retrieved article titles from MedCPT and widely used literature search engines for three case-study queries. The results of PubMed, Google Scholar, and Semantic Scholar were collected on Mar 25, 2023. Bolded text denotes lexical matching, while bolded and underlined text denotes wrong semantic matching. Titles in "[…]" denote articles in non-English languages.

Appendix H: Scaling properties of MedCPT
In Figure S2, we study the scaling properties of MedCPT. Specifically, we evaluate the MedCPT retriever performance, measured by NDCG@10, on four biomedical tasks in the BEIR benchmark (TREC-COVID, SciDocs, SciFact, NFCorpus). As shown in the figure, the performance of MedCPT increases log-linearly as the number of training logs increases, and stabilizes at the end of training with 255M query-article pairs. The model needs to be trained on at least 150M query-article pairs to stabilize and consistently outperform BM25, although it should be noted that BM25 appears to be a strong baseline since many IR datasets favor BM25 due to the exposure bias in annotation. Practically, 255M query-article pairs are the most we can obtain from the new PubMed, and training on them already takes about one month of computation on a server with 8 Nvidia V100 GPUs, costing roughly 15,000 US dollars. In conclusion, it is necessary to train on large amounts of data, but the marginal gain decreases because the performance-training size curve follows a logarithmic law.

Figure S2. The average NDCG@10 performance on biomedical tasks in the BEIR benchmark of MedCPT retrievers trained with different sizes of PubMed user logs. The performance increases log-linearly as the number of training logs increases.

Table 1. Zero-shot performance of MedCPT on biomedical subtasks of the BEIR benchmark. Bolded, underlined, and italicized numbers denote the highest, 2nd highest, and 3rd highest, respectively. COVID: TREC-COVID; NFC: NFCorpus; Avg.: average.

Table 2. Evaluation results of the MedCPT article encoder on the RELISH dataset. Bolded, underlined, and italicized numbers denote the highest, 2nd highest, and 3rd highest, respectively. All numbers are percentages. Avg.: average.

Table 3. Evaluation results (Pearson's correlation coefficients) of the MedCPT query encoder on the BIOSSES and MedSTS datasets. Bolded, underlined, and italicized numbers denote the highest, 2nd highest, and 3rd highest, respectively. All numbers are percentages. Details of the compared methods are described in Appendix B.