Towards precise PICO extraction from abstracts of randomized controlled trials using a section-specific learning approach

Abstract Motivation Automated extraction of population, intervention, comparison/control, and outcome (PICO) information from randomized controlled trial (RCT) abstracts is important for evidence synthesis. Previous studies have demonstrated the feasibility of applying natural language processing (NLP) to PICO extraction. However, performance is not optimal due to the complexity of PICO information in RCT abstracts and the challenges involved in its annotation. Results We propose a two-step NLP pipeline to extract PICO elements from RCT abstracts: (i) sentence classification using a prompt-based learning model and (ii) PICO extraction using a named entity recognition (NER) model. First, the sentences in abstracts were categorized into four sections, namely background, methods, results, and conclusions. Next, the NER model was applied to extract the PICO elements from the sentences within the title and methods sections, which include >96% of the PICO information. We evaluated our proposed NLP pipeline on three datasets: the EBM-NLPmod dataset, a randomly selected and re-annotated set of 500 RCT abstracts from the EBM-NLP corpus; a dataset of 150 Coronavirus Disease 2019 (COVID-19) RCT abstracts; and a dataset of 150 Alzheimer's disease (AD) RCT abstracts. The end-to-end evaluation reveals that our proposed approach achieved an overall micro F1 score of 0.833 on the EBM-NLPmod dataset, 0.928 on the COVID-19 dataset, and 0.899 on the AD dataset when measured at the token level, and an overall micro F1 score of 0.712 on the EBM-NLPmod dataset, 0.850 on the COVID-19 dataset, and 0.805 on the AD dataset when measured at the entity level. Availability and implementation Our code and datasets are publicly available at https://github.com/BIDS-Xu-Lab/section_specific_annotation_of_PICO.


Introduction
Healthcare providers rely on evidence from the published biomedical literature to assess the effectiveness of new treatments or interventions for diseases. With the evidence and insights gained from well-designed clinical studies such as randomized controlled trials (RCTs) in PubMed (Frost et al. 2020, Abani et al. 2021, Emani et al. 2021), it is possible to explore the best treatment options for diseases such as Coronavirus Disease 2019 (COVID-19) (Emani et al. 2021). However, the ever-growing number of RCTs has made it challenging for clinicians to keep up to date with new evidence. In 2020, a total of 29 256 RCT abstracts (80 per day) were added to PubMed. This increased to 29 983 RCT abstracts (82 per day) in 2021, followed by 28 482 RCT abstracts (78 per day) in 2022. These estimates show a continuously growing number of RCT-related publications, and it has become extremely difficult for clinicians to absorb knowledge from these publications to provide the best care for their patients. Currently, more than 595 000 RCT abstracts are in PubMed.
In 1995, the PICO (P: Population, I: Intervention, C: Comparison/Control, O: Outcomes) framework (Richardson et al. 1995) was introduced to formulate well-defined research questions and facilitate the literature search for evidence-based medicine (EBM). Manual synthesis of PICO-based evidence from the published literature is both time-consuming and costly. Hence, automated approaches were developed for extracting PICO elements. Earlier works were based on rule-based approaches and machine learning (ML) approaches (e.g. Support Vector Machine, Random Forest, Conditional Random Field) that rely heavily on hand-crafted features (Huang et al. 2006, Demner-Fushman and Lin 2007, Boudin et al. 2010, Chabou and Iglewski 2015, 2018). Recently, deep learning-based (DL) approaches have been applied to overcome the tedious feature engineering task and boost performance (Jin and Szolovits 2018b, Zhang et al. 2020). Both ML and DL approaches rely heavily on an annotated corpus to achieve high performance. The time and cost involved in manual annotation by domain experts, the distribution of PICO elements within RCT abstracts, and the variations in PubMed abstracts (e.g. structured and unstructured) are the major reasons for the lack of a large, publicly available annotated corpus for PICO extraction. Nye et al.
(2018) released the EBM-NLP corpus with 4993 RCT abstracts annotated (using crowdsourcing) with PICO elements at two levels of granularity. The first level is the named span level, where the phrases containing Population, Intervention, and Outcome (PIO) information are annotated. The second level further distinguishes the named spans with more fine-grained labels (e.g. distinguishing P according to gender, age, condition, etc.). Though the EBM-NLP corpus has revived the PICO extraction task, it has certain limitations. The overall named entity recognition (NER) model performance is not satisfactory (a token-level F1 score of 68% on the span-level labels and 48% on the hierarchical labels). There is no entity-level evaluation or information retrieval evaluation at the abstract level. Based on our observations, probable reasons for the low performance are: (i) Information regarding PICO elements is scattered throughout the abstract, with interventions and outcomes often presented multiple times with different lexical variants. (ii) PICO elements in RCTs can be complex, making the annotation task complex. For example, not all intervention mentions may refer to the clinical trial under consideration (e.g. mentions of interventions from prior clinical trials), and identifying the distinct outcomes of a particular intervention in multi-arm trials is often difficult. (iii) PICO elements can vary widely across different diseases, hence requiring large datasets to learn meaningful patterns. For example, an intervention may be a pharmacologic substance (e.g. doxorubicin) for diseases such as cancer or a music-based therapy for certain neurological disorders. (iv) The EBM-NLP corpus utilized a crowdsourcing approach for the annotation process. The expert annotation is limited to only 200 abstracts. The remaining abstracts in the corpus were annotated by annotators recruited via Amazon Mechanical Turk (AMT), who were not well trained. The inter-annotator agreement (IAA) is low (0.50 for P, 0.59 for I, and 0.51 for O) even among the medical experts. The annotation schema is also complex, and it is difficult for annotators to comprehend all the details. This results in annotation inconsistencies.
We hypothesized that selecting only the specific sections of an RCT abstract and title that contain relevant PICO information for annotation can significantly reduce the annotation complexity, the annotation effort, and the issues mentioned above. In the current study, we analyzed the distribution of PICO elements in different sections of RCT abstracts. Our findings suggested that the Title and Methods sections cover most of the PICO elements. Consequently, we developed a two-step natural language processing (NLP) pipeline to identify and extract the PICO elements. To assess its generalizability, we further evaluated it on three different datasets, namely the EBM-NLPmod dataset, the Alzheimer's disease (AD) dataset, and the COVID-19 dataset. The major contributions of this work are: 1) A novel two-step NLP pipeline that classifies the sentences from a PubMed abstract into different sections (background, methods, results, and conclusions) and extracts the PICO elements.
2) The EBM-NLPmod dataset, derived from the EBM-NLP corpus. The dataset includes 500 randomly selected RCT abstracts whose PICO elements were re-annotated to overcome the limitations of the EBM-NLP corpus.

Sentence classification
Jin and Szolovits (2018a,b) treated PICO detection as a sequential sentence classification task rather than a single-sentence classification task. They utilized neural network architectures such as long short-term memory (LSTM) (Jin and Szolovits 2018b) and bi-directional long short-term memory (Bi-LSTM) (Jin and Szolovits 2018a) to encode the contextual content from the preceding and succeeding sentences to improve the prediction for the current sentence. Recently, several deep learning approaches that utilize pre-trained language models such as SciBERT and BERT have been applied to improve the performance of sentence classification (Cohan et al. 2019).

Recognition of PICO elements
NER is used to identify the PICO elements (Nguyen et al. 2017, Nye et al. 2018, Kang et al. 2019, 2021, Zhang et al. 2020, Dhrangadhariya et al. 2021, Liu et al. 2021b). Nye et al. (2018) presented two baseline models, a linear CRF model and an LSTM-CRF model, for identifying PICO elements in the EBM-NLP corpus. A recent study showed improved performance on PICO extraction when the NER model was first pre-trained on the EBM-NLP corpus and further fine-tuned with additional data annotated by the authors themselves (Kang et al. 2019). Zhang et al. (2020) proposed an approach combining sentence classification, disease entity recognition, and disease mapping using various deep learning models (convolutional neural network, Bi-LSTM, etc.) for extracting P and O elements. To alleviate the reliance on time-consuming manual annotation by experts, a span detection approach for PICO extraction that uses only low-quality, crowd-sourced, sentence-level annotations as inputs was proposed by Liu et al. (2021b). The authors applied a masked span prediction task in which input spans were replaced with predefined mask tokens and a pre-trained neural language model [BLUE (Peng et al. 2019)] was used to infer which spans contribute most to the PICO sentence classification results using the EBM-NLP corpus. A multi-task learning approach that learns and recognizes both coarse-grained descriptions (e.g. 40 children aged 7-11 with autism spectrum disorder) and constituent finer semantic units (e.g. "40" indicates "sample size", "7-11" indicates "age", and "autism spectrum disorder" indicates "condition") was explored by Dhrangadhariya et al.
(2021). In that study, the EBM-NLP corpus was utilized as it provides multi-level annotation: the span-level (level 1) annotation corresponds to the coarse-grained descriptions, and the other levels of annotation focus on specific semantic units. Recently, the Easy Data Augmentation (Wei and Zou 2019) technique incorporating Unified Medical Language System (UMLS) knowledge (including synonym replacement, random insertion, random swap, and random deletion) was evaluated on PICO extraction (Kang et al. 2021).

Materials and methods
The overview of our NLP pipeline is illustrated in Fig. 1. First, the sentences in both structured and unstructured RCT abstracts were classified into background, methods, results, and conclusions using our recent work on sentence classification (Hu et al. 2022). Next, the P, I, C, and O elements were extracted using a NER model.
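As a rough illustration, the two-step pipeline can be sketched as follows. Here `classify_fn` and `ner_fn` are hypothetical stand-ins for the prompt-based sentence classifier and the PubMedBERT NER model, not the actual implementations; the toy functions exist only to make the sketch self-contained.

```python
# Sketch of the two-step pipeline: (1) classify each sentence into a
# section, (2) run PICO NER only on the title and methods sentences.

def extract_pico(title, sentences, classify_fn, ner_fn):
    """Return PICO entities from the title and methods sentences only."""
    selected = [title]  # the title usually carries P and I elements
    for sent in sentences:
        if classify_fn(sent) == "methods":
            selected.append(sent)
    entities = []
    for sent in selected:
        entities.extend(ner_fn(sent))
    return entities

# Toy stand-ins for demonstration only.
def toy_classifier(sent):
    return "methods" if "randomized" in sent else "background"

def toy_ner(sent):
    return [("COVID-19 patients", "P")] if "patients" in sent else []
```

In the real pipeline, sentences classified as results or conclusions are simply skipped, which is what restricts annotation and extraction to the Title and Methods sections.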

Distribution of PICO elements in RCT abstracts
We conducted a preliminary evaluation of the distribution of PICO elements across different sections of the RCT abstract. We randomly selected 30 RCT abstracts and reviewed them manually to identify the unique mentions of PICO elements in different sections of the abstracts. The purpose was to identify the specific sections with high coverage of PICO elements. We hypothesize that annotating only these sections may reduce the redundancy, ambiguity, and time and effort involved in manual annotation. We also hypothesize that our proposed approach incurs only a minimal loss of information. Titles of RCT abstracts usually include the P and I elements; they provide a precise and accurate description of the study and are easy to annotate. Our analysis shows that the Methods section achieved the highest coverage (95.2%) among all the sections. The Title and Methods sections together achieve a coverage of 96.7% (Table 1). The PICO elements in the results section are often duplicates of those in the methods section. Thus, we considered only the Title and Methods sections for annotation. The annotation can easily be extended to other sections with our sentence classification model. While our sentence classification model (Hu et al. 2022) shows high performance in predicting the methods, results, and conclusions sections, the classifier has great difficulty distinguishing the background and objective sections: the difference between sentences from the background and objective sections is less obvious than between other sections. We therefore relabeled the sentences from the objective section as background and redefined the section labels as background, methods, results, and conclusions.

Evaluation data
We created three datasets to evaluate and report the performance of the sentence classification model. In our dataset splits, we have meticulously ensured that the test set used for the NER task does not overlap with the training data for sentence classification. Our approach eliminated potential data leakage and ensured an unbiased evaluation of both tasks.

Section-specific annotation
Annotating a high-quality dataset is a labor-intensive task, and identifying domain experts is challenging. Though crowdsourcing approaches have shown some promising results on corpus generation, the IAA (e.g. for the EBM-NLP corpus) is relatively low. This results in suboptimal performance of a NER model.
Initially, we followed the annotation guidelines from EBM-NLP to annotate the PICO elements. The IAA reported using Cohen's kappa coefficient was only 0.3 (see Supplementary Materials S1.2 for the equation for Cohen's kappa coefficient). Further investigation revealed three major reasons for the low kappa coefficient: (i) The original annotation guidelines from EBM-NLP are complex and complicated. They first necessitate annotating the Participant, Intervention, and Outcome elements at the span level, and then annotating the specific details at a granular level. For instance, Participant is annotated at the span level and the specific details about the participant (i.e. age, gender, and condition) are annotated at the granular level. Likewise, Intervention is annotated at the span level and the specific details about the intervention (i.e. physical, non-physical, and Control) are annotated at the granular level. Note that Control is annotated as a subtype of Intervention. (ii) The original annotation guidelines lack specific rules for defining entity boundaries for the PICO elements. Beyond defining the PICO elements with examples of inclusion and exclusion, the guidelines only say to "mark the longest contiguous text that includes such a description." While differences in span boundaries between annotators are a major reason behind low IAA, our experience in annotating documents for other studies has shown that specific rules regarding modifiers, articles, prepositions, and overlapping entities improve the IAA. (iii) The repeated mentions of interventions and outcomes across different sections of the abstract lead to ambiguity and missing annotations, especially for complex interventions and multi-arm trials. For example, consider a study comparing a multicomponent community health program versus usual care on several health outcomes. As we move along different sections of the abstract, we may come across repeated mentions of these interventions, including a detailed mention, a mention of a specific component, an abbreviation of the program, and even generic references to it.
We resolved the issues observed in the original annotation guidelines by: (i) annotating the PICO elements at a single level; (ii) enriching the annotation guidelines with a set of linguistic rules to define the boundaries (see Supplementary Materials S3); and (iii) retaining the sentences only from the Title and Methods sections (see Supplementary Materials S3). This is based on our preliminary experiments as shown in Table 1. Annotating the PICO elements mentioned in the Title and Methods sections took only 60 s per abstract. This is significantly lower than the time required to annotate the PICO elements mentioned in all the sections of a PubMed abstract (i.e. 146 s). Switching from hierarchical annotation to single-level annotation was based on several limitations and challenges that we noticed in the original multi-level annotation. First, hierarchical annotation significantly increases the time, effort, and cost of the annotation process. Second, prior research using the EBM-NLP corpus has reported issues with the fine-grained classification of PICO elements used for the hierarchical annotation. One such issue concerned the "Intervention" element. According to Dhrangadhariya et al. (2021), even human annotators find it difficult to classify certain interventions as education or psychological. Their experiments showed the lowest performance, an F1-score of 0.31, on the "Intervention" class. From the error analysis, the authors suspect that ambiguities arising from the split of coarse-grained PICO into fine-grained PICO classes were one of the causes of such low performance.
Using the revised annotation guidelines, two annotators with medical backgrounds re-annotated the EBM-NLPmod dataset and annotated two additional datasets related to COVID-19 and AD. The IAA was calculated for each PICO element and for the entire dataset using Cohen's kappa coefficient. The approach achieved Cohen's kappa coefficients of 0.714, 0.808, 0.701, and 0.790 for the P, I, C, and O components, respectively, and 0.746 for all PICO elements.
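For reference, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance from each annotator's label marginals. A minimal sketch over token labels (the exact equation used in the paper is in Supplementary Materials S1.2):

```python
# Cohen's kappa for two annotators over a shared sequence of labels.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # observed agreement: fraction of positions where both annotators agree
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    # chance agreement from each annotator's label marginals
    p_e = sum(ca[lab] / n * cb[lab] / n for lab in set(ca) | set(cb))
    return (p_o - p_e) / (1 - p_e)
```

With this convention, 1.0 indicates perfect agreement and values near 0 indicate agreement no better than chance.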

Prompt-based learning for sentence classification
Recent works show the feasibility of using natural language prompts for tuning pre-trained language models for specific downstream tasks (Petroni et al. 2019, Ding et al. 2021, Liu et al. 2021a). In our prior research (Hu et al. 2022), we applied prompt-based learning to classify sentences from RCT abstracts. In brief, our approach classifies sentences by predicting the mask position in RCT abstracts using prompt-based learning, whereas other existing approaches use traditional machine learning and deep learning to classify sentences. The performance of our sentence classification approach surpasses that of the previous state-of-the-art approach using the Hierarchical Sequential Labeling Network (HSLN) (Jin and Szolovits 2018a). A more detailed description of our method is in Supplementary Materials S1.1. We applied the model from our previous work to classify the sentences in the RCT abstracts from the EBM-NLPmod dataset, the COVID-19 dataset, and the AD dataset. The parameters used in our prompt-based learning approach were set as follows: dropout = 0.5, batch size = 8, learning rate = 6e-6, optimizer = AdamW, and learning rate decay = 0.01. We evaluated our prompt-learning model and compared it with the HSLN architecture on the EBM-NLPmod dataset, COVID-19 dataset, and AD dataset independently. We used the standard evaluation metrics: precision (P), recall (R), and F1 scores.
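To illustrate the formulation (the template wording and verbalizer below are hypothetical, not the exact ones from Hu et al. 2022): each sentence is wrapped in a cloze-style template, the masked language model predicts the word at the [MASK] position, and a verbalizer maps that word back to a section label.

```python
# Sketch of the prompt-based formulation for sentence classification.
# The template and verbalizer are illustrative assumptions only.

TEMPLATE = "{sentence} The section of this sentence is [MASK]."

VERBALIZER = {  # maps predicted mask words to section labels
    "background": "background",
    "methods": "methods",
    "results": "results",
    "conclusions": "conclusions",
}

def build_prompt(sentence):
    """Wrap a sentence in the cloze template fed to the masked LM."""
    return TEMPLATE.format(sentence=sentence)

def decode_label(mask_word):
    """Map the word predicted at [MASK] back to a section label."""
    return VERBALIZER[mask_word.lower()]
```

The masked-LM call itself is omitted here; in practice the model scores only the verbalizer words at the mask position and the highest-scoring word determines the section.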

NER for PICO extraction
A recent work on identifying P, I, and O elements using LSTM-CRF within BERT achieved a 0.68 F1-score on the EBM-NLP corpus when evaluated at the token level. Another work by Gu et al. (2022) used the pre-trained model from PubMedBERT and achieved a 0.73 F1 score on the same corpus at the same token level. We used the same experimental setting (i.e. data, training parameters) and evaluation script provided by the EBM-NLP corpus and reported the performance at the token level using micro-averaged precision, recall, and F1-score (Supplementary Results Table S3). In addition, we also reported the performance at the entity level by matching the exact spans.
We trained the NER models for the EBM-NLPmod dataset, COVID-19 dataset, and AD dataset using PubMedBERT for five epochs with a learning rate of 1e-5 and a batch size of 32. We also experimented with other pre-trained models, including BERT (Devlin et al. 2018), BioBERT (Lee et al. 2020), BioM-ALBERT (Alrowili and Vijay-Shanker 2021), and BioM-ELECTRA (Gu et al. 2022). PubMedBERT performed best among all pre-trained models in our previous study.
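As background on the training format, the annotated character-level PICO spans must be converted into token-level IOB (Inside/Outside/Beginning) labels before token-classification fine-tuning. A simplified sketch using whitespace tokenization (an assumption for illustration; the actual models use subword tokenizers):

```python
# Convert character-offset PICO spans to token-level IOB labels.

def spans_to_iob(text, spans):
    """spans: list of (start, end, label) character offsets in `text`."""
    tokens, labels, pos = [], [], 0
    for tok in text.split():
        start = text.index(tok, pos)  # locate this token in the text
        end = start + len(tok)
        pos = end
        label = "O"
        for s, e, lab in spans:
            if start >= s and end <= e:
                # first token of a span gets B-, the rest get I-
                label = ("B-" if start == s else "I-") + lab
                break
        tokens.append(tok)
        labels.append(label)
    return tokens, labels
```

The resulting `(tokens, labels)` pairs are the usual input format for fine-tuning a pre-trained encoder with a token-classification head.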

NER evaluation
We evaluated the performance of the NER models on two levels: (i) token level and (ii) entity level. For the token-level evaluation, we used the original evaluation script from the EBM-NLP corpus. The script excludes all the "Outside" labels in Inside, Outside, Beginning (IOB) tagging for a fair comparison. The token-level evaluation may not be the best approach because many biomedical named entities include multiple tokens, and the goal is to identify the whole entity. For example, the Participant element "86 hospitalized COVID-19 patients" was partially identified as "86 hospitalized" and "patients" in the token-level evaluation; the condition, "COVID-19," was omitted. This results in an incomplete representation of the Participant element. For the entity-level evaluation, the entire entity span is viewed as the Participant element, so the evaluation preserves the complete information. The entity-level evaluation is therefore more reliable than the token-level evaluation. For both types of evaluation, we calculated precision (P), recall (R), and F1 scores for each PICO element, as well as the micro-averaged overall scores for P, R, and F1. These micro-averaged overall scores are computed using the sums of True Positives (TP), False Positives (FP), and False Negatives (FN) over all PICO elements. The formulas for these scores are provided in Supplementary Materials S1.3.
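The micro-averaged scores pool TP, FP, and FN across all PICO classes. A minimal sketch that works for both granularities, where an "item" is a labeled token at the token level or an exact (start, end, label) span at the entity level (a simplification of the formulas in Supplementary Materials S1.3):

```python
# Micro-averaged F1 over sets of labeled items (tokens or entity spans).

def micro_f1(gold, pred):
    """gold, pred: collections of hashable items, e.g. (index, label)
    pairs at the token level or (start, end, label) at the entity level."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)           # items predicted and annotated
    fp = len(pred - gold)           # predicted but not annotated
    fn = len(gold - pred)           # annotated but missed
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```

Note how the two granularities diverge: a partially identified entity still earns token-level credit for its matched tokens, but at the entity level the inexact span counts as both a false positive and a false negative.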

End-to-end evaluation
In addition to evaluating the sentence classification model and the NER system individually, we also performed an end-to-end evaluation to assess their combined performance. For a more direct comparison with the standalone NER module, we ran our two-step pipeline on the identical dataset used to evaluate the standalone NER module and used the same evaluation metrics.

Performance of the sentence classification module
Our approach achieved an overall F1 score of 0.953 on the EBM-NLPmod dataset, 0.931 on the COVID-19 dataset, and 0.962 on the AD dataset. The F1 score for the Methods section alone is 0.949 for the EBM-NLPmod dataset, 0.923 for the COVID-19 dataset, and 0.955 for the AD dataset (Table 3). A comparison between our sentence classification approach and the existing state-of-the-art approach using the HSLN architecture validates the superior performance of our approach. A detailed comparison of both approaches and their performance is in Supplementary Results Table S4.

Performance of standalone NER module
Table 4 presents the number of P, I, C, and O elements in the three datasets. Table 5 and Supplementary Results Table S5 show the performance of PubMedBERT on PICO extraction (at the token level and entity level, respectively) on all three datasets. We observed that our pipeline achieves higher performance on the AD and COVID-19 datasets than on the EBM-NLPmod dataset. Our approach looks promising, with F1 scores >0.8 for the token-level evaluation and >0.68 for the entity-level evaluation on all three datasets.

End-to-end evaluation of PICO extraction system
The token-level F1 scores for the EBM-NLPmod dataset, COVID-19 dataset, and AD dataset were 0.833, 0.928, and 0.899, respectively. Similarly, the entity-level F1 scores for these datasets were 0.712, 0.850, and 0.805. The performance of our two-step NLP pipeline is better across all datasets when compared to the standalone NER module.

Discussion and future work
Our section-specific annotation schema aimed to reduce annotation inconsistencies by classifying sentences before NER, decreasing annotation complexity and time. Although focused on the methods section, it can be extended to other sections if needed. Our method balances minimizing complexity and information loss, covering 95.2% of all entities in the Methods section and improving inter-annotator agreement. Our system has significantly enhanced PICO extraction performance, but with a modest impact on the COVID-19 dataset, possibly due to its specificity and contemporary nature.
However, there is further scope for improvement. To perform error analysis, we evaluated our NER results at the entity level by partial match. The models achieved 0.848, 0.924, and 0.899 for the EBM-NLPmod, COVID-19, and AD datasets, respectively. Detailed performance by entity type and the confusion matrix are shown in Supplementary Results Table S7 and Supplementary Results Fig. S2. The F1 score for the Control element is lower when compared to the other elements across all three datasets (Tables 5 and 6). The confusion matrix shows that 17% of Controls are confused with the Intervention element of PICO. In several studies, the Control and Intervention elements are the same, and it is difficult to distinguish them without proper context. In most cases, a proper understanding of whether an intervention was applied to a "study group" or a "control group" is required to distinguish between the Control and Intervention elements. The sentence, "Patients with severe COVID-19 were randomly divided into two groups: the standard treatment group and the standard treatment plus hUC-MSC infusion group," says that the standard treatment was given to both groups. The first mention was annotated as the Control and the second mention was annotated as the Intervention element. This may be confusing to the model because of the lack of context to learn the difference between the Intervention and Control elements. To ameliorate this confusion between the Intervention and Control elements, we need to strategically incorporate more contextual information. This might be achieved by performing NER on all the sentences from each abstract together, rather than processing each sentence in isolation. This approach has the potential to provide the model with richer contextual information. We also observed that the numerous Control elements labeled as "Placebo" lead to model overfitting. For instance, in the sentence "either 3 weeks of taper and 5 weeks of placebo only or continuing use of risperidone," the Control is "5 weeks of placebo" as per the annotation guidelines. However, the model only partially identified "placebo" as the Control. Such issues can be resolved by incorporating more specific annotation rules, such as "do not include any prepositional phrases preceding a Control element." Nearly 19% of the "Inside" labels were misclassified as the "Outside" label in the Participant element, while the "Beginning" label of the Participant element is well identified. This suggests that the NER model struggles to discern the ending position of the Participant element. The issue may arise from our annotation guidelines, which include the rule of annotating the longest noun phrase for the Participant element. Clearer boundary rules might address this; for example, we could stipulate that "only one prepositional phrase in a Participant element should be included." The overall performance of our PICO extraction system also depends on the performance of the sentence classification model. While our sentence classification model categorizes sentences into general categories based on rhetorical roles, this could be extended to classifying sentences directly into P, I, C, and O categories. The NICTA-PIBOSO corpus incorporated P, I, and O categories in addition to background and study design, with every sentence belonging to a single category. However, it is very common for a sentence to include multiple PICO elements. Hence, developing a multi-label sentence classification model for identifying PICO categories would be more beneficial than the existing classification model.
The broader applicability of our approach is yet to be established. Testing models trained on specific datasets such as AD and COVID-19 on other diseases may offer insights. Utilizing transfer learning could save further annotation time.
Many existing studies (Nye et al. 2018, Zhang et al. 2020), including our work, use only abstracts to identify the PICO elements from PubMed articles. However, certain interventions and outcomes are reported only in the full text of an article; our approach will miss the interventions and outcomes mentioned only there. Developing a system that can extract PICO elements from both PubMed abstracts and full-text articles, or applying our pipeline to the full text, might be useful.
In the future, we plan to develop an interface that takes a PubMed identifier or an abstract as input and returns a list of extracted PICO elements as output. The Cochrane Review currently provides a PICO search engine associated with their reviews. However, those PICO elements are manually curated from a limited number of articles.

Conclusions
We presented a new two-step extraction approach to extract PICO elements from RCT abstracts. We modified the annotation guidelines to improve the annotation quality and IAA and to reduce annotation complexity for PICO extraction. By annotating the methods section and title alone, we not only reduced the annotation complexity for PICO extraction but also achieved much higher performance in retrieving unique PICO elements, without much loss of information, on a subset of RCT abstracts from the EBM-NLP corpus. We verified the usability and reliability of our system by applying and evaluating it on unseen datasets.

Figure 1 .
Figure 1. Overview of the PICO extraction system. (a) The two-step PICO extraction system that includes sentence classification and NER of PICO elements. (b) Training and test datasets for sentence classification and NER.

Table 1 .
Distribution of PICO entities in different sections of the abstract for a random selection of 30 abstracts.

Table 2 .
Number of sentences in each class for the sentence classification evaluation dataset.

Table 3 .
Performance of the prompt-based sentence classification model on the evaluation datasets of COVID-19, AD, and EBM-NLPmod.

Table 4 .
Statistics of PICO elements in all three corpora.

Table 5 .
Entity-level evaluation of the standalone NER model on all three corpora by exact match.

Table 6 .
An end-to-end entity-level evaluation on all three datasets.