Evaluation of an artificial intelligence clinical trial matching system in Australian lung cancer patients

Abstract

Objective: The objective of this technical study was to evaluate the performance of an artificial intelligence (AI)-based system for clinical trial matching in a cohort of lung cancer patients at an Australian cancer hospital.

Methods: A lung cancer cohort was derived from clinical data from patients attending an Australian cancer hospital. Ten phase I–III clinical trials registered on clinicaltrials.gov and open to lung cancer patients at this institution were used for assessments. The trial matching system's performance was compared to a gold standard for trial eligibility established by clinician consensus.

Results: The study included 102 lung cancer patients. The trial matching system evaluated 7252 patient attributes (per-patient median 74, range 53–100) against 11 467 individual trial eligibility criteria (per-trial median 597, range 243–4132). Median time for the system to run a query and return results was 15.5 s (range 7.2–37.8). In establishing the gold standard, clinician interrater agreement was high (Cohen's kappa 0.70–1.00). On a per-patient basis, the performance of the trial matching system for eligibility was as follows: accuracy, 91.6%; recall (sensitivity), 83.3%; precision (positive predictive value), 76.5%; negative predictive value, 95.7%; and specificity, 93.8%.

Discussion and Conclusion: The AI-based clinical trial matching system allows efficient and reliable screening of cancer patients for clinical trials, with 95.7% accuracy for exclusion and 91.6% accuracy for overall eligibility assessment; however, clinician input and oversight are still required. The automated system shows promise as a clinical decision support tool to prescreen a large patient cohort and identify subjects suitable for further assessment.


BACKGROUND AND SIGNIFICANCE
Prospective clinical trials are the gold standard for assessing the potential harms and benefits of new cancer treatments. However, clinical trial recruitment is challenging and time-consuming.1,2 As of 2018, only about 6% of patients with a cancer diagnosis in the state of Victoria, Australia, were recruited to clinical trials, with rates unchanged for more than a decade.
While the global increase in the availability of clinical trials should facilitate greater trial recruitment, a key deterrent is the tedious manual task of matching patients to clinical trials or identifying a cohort of patients for a trial. Both processes require detailed knowledge of patient characteristics and trial eligibility criteria, a challenge given the growing number of clinical trials available and the complexity of trial designs. Eligibility criteria for clinical trials can narrowly define the study population and thus limit the number of patients who can enroll in clinical trials. The time required to screen patients for trials can further limit enrollment.
Studies demonstrate that incorporating clinical decision support into the management of oncology patients and automating referrals to clinical trials show promise for increasing patient referrals.1,3-9 IBM® (Cambridge, MA, United States) Watson for clinical trial matching (CTM) is a software platform developed to identify potential trials for individual patients or potential trial candidates for individual trials. CTM uses natural language processing (NLP) to ingest trial and patient information from unstructured sources and matches patients to trials for which they might be eligible through machine learning (ML) techniques. Previous studies have shown that CTM10 can reduce the screening time for clinical trials and increase trial enrollment.3 CTM determines the degree of eligibility based on the patient clinical attributes entered. Features of the tool include the ability to classify patients as "Exclude" (patient not eligible) or "Consider" (patient potentially eligible), based on patient attributes and on whether the patient has unmet criteria that are modifiable conditions. This mimics real-world practice and the circumstances faced by clinical trial screeners, in which individual records are examined successively against increasingly specific criteria to identify modifications that can increase a patient's chance of matching to a trial.
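The Exclude/Consider scheme described above can be illustrated with a minimal sketch. This is not CTM's actual implementation, which is proprietary; the types and function below are purely hypothetical, capturing only the decision rule the text describes (an unmet non-modifiable criterion rules a patient out, while unmet modifiable criteria leave the patient potentially eligible):

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    """One unmet eligibility criterion (hypothetical model, not CTM's)."""
    name: str
    modifiable: bool  # eg, an out-of-range lab value that may yet normalize

def classify_patient(unmet: list[Criterion]) -> str:
    """Classify a patient against one trial, mimicking the Exclude/Consider
    scheme described in the text."""
    # Any unmet, non-modifiable criterion rules the patient out.
    if any(not c.modifiable for c in unmet):
        return "Exclude"
    # Otherwise the patient is potentially eligible, possibly after
    # addressing the modifiable conditions.
    return "Consider"

print(classify_patient([]))  # Consider
print(classify_patient([Criterion("prior chemotherapy", modifiable=False)]))  # Exclude
print(classify_patient([Criterion("low hemoglobin", modifiable=True)]))  # Consider
```

Successive screening, as described above, would simply re-run such a classification as increasingly specific criteria are resolved for each record.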
The objective of this retrospective study was to evaluate the performance of CTM eligibility determinations against a selected pool of active lung cancer trials for a cohort of potentially eligible patients in an Australian cancer hospital.

MATERIALS AND METHODS

Participants
The lung cancer cohort was derived from a clinical dataset of patients attending an Australian specialist cancer hospital and enrolled in a prospective observational cohort study, the Thoracic Malignancies Cohort (TMC). The hospital institutional review board (IRB) approved the TMC (study no. 17/70, approved January 4, 2012) as a prospective observational study; consenting patients are followed from diagnosis or first hospital presentation at 3-month intervals until death or loss to follow-up. The IRB approval and patient consent allow the TMC data to be used in other IRB-approved studies, including the current study (study no. 17/152L, approved October 24, 2017).
Eligibility criteria for this study included diagnosis of small cell lung cancer (SCLC) or non-small cell lung cancer (NSCLC) between 2012 and 2018. A pragmatic sample of approximately 100 cases was selected based on most recent study follow-up, in reverse chronological order from September 1, 2018. This data extraction date was selected to allow at least 3 months of study follow-up at the time of data extraction (December 1, 2018), ensuring at least 1 routine follow-up and complete data acquisition. Follow-up date, rather than diagnosis or first presentation date, identified cases that defined a representative sample of patients attending lung oncology clinics during the study period. These selection criteria yielded an unselected patient cohort who may be considered for a range of clinical trials across all stages of disease and clinical time points within a patient's journey, including newly diagnosed patients, treatment-naïve patients, previously treated patients, and patients receiving ongoing treatment.

Data collection
Clinical trial eligibility criteria were extracted for 10 phase I-III cancer clinical trials registered on clinicaltrials.gov that were open to lung cancer patients at the Peter MacCallum Cancer Centre in Melbourne, Australia. Because inclusion and exclusion criteria in ClinicalTrials.gov often require clarification, additional detail required for the CTM protocol ingestion was obtained from the relevant trial protocols. For example, an exclusion criterion in ClinicalTrials.gov may state "no therapy allowed," whereas the institutional protocol may clarify this as "no previous chemotherapy or radiotherapy allowed." CTM allows users to ingest eligibility criteria from ClinicalTrials.gov and modify them as needed. Watson's NLP algorithms extracted inclusion and exclusion criteria from a protocol library of Portable Document Format files containing the previously extracted criteria for each trial. The trial data intake was optimized over 3 rounds of trial ingestion and evaluated by experts to validate ingestion protocols prior to study inception.
Clinical data for included patients were extracted from the TMC study database and medical records. De-identified patient attributes such as histological diagnosis, stage, and prior therapies were manually entered in CTM. The TMC database collects the following variables at diagnosis: TNM staging according to the 7th and 8th editions of the UICC staging criteria (as relevant for year of cancer diagnosis); histological subtype (adenocarcinoma, squamous cell carcinoma, large cell carcinoma, NSCLC not otherwise specified, carcinoid, SCLC); mutation status (epidermal growth factor receptor [EGFR], anaplastic lymphoma kinase [ALK], Kirsten rat sarcoma viral oncogene homolog [KRAS], BRAF, MET); PD-L1 expression; comorbidities according to the Simplified Comorbidity Score,11 including tobacco consumption, diabetes mellitus, renal insufficiency, respiratory comorbidity, cardiovascular comorbidity, neoplastic comorbidity, and alcoholism; Eastern Cooperative Oncology Group performance status (PS); weight loss within 3 months of diagnosis (0-10%, 11-15%, >15%); smoking history (current, past, never); smoking magnitude (pack-years); sex; and age. Longitudinal data include cancer treatment (chemotherapy, immunotherapy, targeted therapy, radiotherapy, surgery), patient status, and response to therapy. Results of specific diagnostic tests not mandated by the TMC study are collected and reported if testing is performed as part of routine care.

Statistical analysis
This study tested the overall performance (including ML and NLP) of the CTM eligibility determination process. CTM-processed clinical trial eligibility criteria were checked and refined by 2 clinicians (a medical oncologist and a pharmacist) prior to commencement of matching. Accuracy of NLP processing of eligibility criteria from trial protocol extracts was not evaluated, because the eligibility criteria in ClinicalTrials.gov were modified to include additional protocol details, including laboratory results not included in the database. Once matching was complete, a timed query was executed using a cloud-based instance of CTM. Each patient was assessed for eligibility against potential trials and classified by CTM as "Exclude" (patient not eligible) or "Consider" (patient potentially eligible).
A gold standard for trial eligibility was determined for each patient and the 10 cancer trials by manual review of patient attributes entered into CTM (not the full medical record) by 2 clinicians, with discrepancies discussed to achieve consensus. Accuracy (agreement), recall (sensitivity), specificity, precision (positive predictive value [PPV]), and negative predictive value (NPV) of CTM trial classification were measured against this gold standard. CTM performance was further characterized by counts (per trial and overall) of the total number of individual inclusion/exclusion criteria assessed by CTM and the proportion of assessments that agreed with the gold standard. Interrater reliability between the clinicians involved in the manual review leading to a consensus gold standard was measured by Cohen's kappa with reported standard error.
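The metrics above follow directly from the 2x2 confusion matrix of CTM classifications ("Consider"/"Exclude") against the gold standard. A minimal sketch, using the per-patient counts reported in the Results (221 "Consider" classifications of which 52 were incorrect, and 799 "Exclude" classifications of which 34 were incorrect):

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard 2x2 confusion-matrix metrics used in this study."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "recall": tp / (tp + fn),       # sensitivity
        "precision": tp / (tp + fp),    # positive predictive value (PPV)
        "npv": tn / (tn + fn),          # negative predictive value
        "specificity": tn / (tn + fp),
    }

def cohens_kappa(a: list, b: list) -> float:
    """Cohen's kappa for two raters' binary (0/1) labels:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n            # observed agreement
    pe = (sum(a) / n) * (sum(b) / n) \
         + ((n - sum(a)) / n) * ((n - sum(b)) / n)        # chance agreement
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

# Counts from the Results section: TP = 221 - 52, TN = 799 - 34.
m = binary_metrics(tp=169, fp=52, tn=765, fn=34)
print(round(m["accuracy"] * 100, 1))   # 91.6
print(round(m["recall"] * 100, 1))     # 83.3
print(round(m["precision"] * 100, 1))  # 76.5
print(round(m["npv"] * 100, 1))        # 95.7
```

This reproduces the headline per-patient figures reported in the Results; the function and rater encoding are illustrative, not part of the study's actual analysis code.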

RESULTS
A total of 102 lung cancer patients were included in the study and assessed for eligibility against 10 lung cancer clinical trials. Patient attributes and trial features are summarized in Table 1 and Figure 1, respectively. More detailed trial descriptions are available in Supplementary Table S1.
The median time for CTM to run a query and return eligibility determinations for 102 patients against 10 trials was 15.5 s (range 7.2-37.8 s). In establishing the gold standard comparator, clinician interrater agreement was high (Cohen's kappa 0.70-1.00, Table 2), with disagreement due to selection error or overlooked features/criteria rather than material disagreement. Agreement was lowest for trial 10, in which clinician interpretation of exclusion criteria relating to "relevant" driver mutations was discrepant. Differences were reconciled on discussion, which determined that patients with driver mutations would remain potentially eligible, due to the lack of detail provided for classification of excluded mutations. This detail was included only in trial appendices and thus was neither available to CTM for processing nor considered in clinician assessments. Excluding this trial, agreement levels were 97% (Cohen's kappa 0.82-1.00). On a per-patient basis, the accuracy of CTM for eligibility classification across trials was 91.6%; recall (sensitivity), 83.3%; precision (PPV), 76.5%; NPV, 95.7%; and specificity, 93.8%. When considering trials classified as "Exclude" by CTM, accuracy was 95.7%, with 34 of 799 trials incorrectly excluded. Conversely, for trials classified as "Consider" by CTM, accuracy was 76.5%, with 52 of 221 labeled "Consider" that should have been excluded. CTM accuracy for individual trials ranged from 77% to 100% (Table 2). Note: Attributes are presented as entered into CTM, with all fields accurately representing data from the prospective clinical database except for race, which was available for all patients but entered in CTM for only a portion of patients due to data entry omission.
Among all trials, 1490 trial eligibility criteria (inclusion/exclusion) were listed as "not met" by CTM (90% agreement with gold standard), 1231 were "met" (96% agreement), 8088 were identified as requiring further action to make a decision and listed as "action needed" (89% agreement), 136 were identified as "unmet modifiable" (90% agreement), and 522 consent criteria were reviewed (81% agreement). The number of data elements and criteria varied by trial, with trial 6 representing an umbrella multicohort design with a notably higher number of criteria. Agreement between CTM and the gold standard was lower for trials 6 and 7, reflecting the complexity of trial 6 and CTM's difficulty interpreting technicalities of the prior-radiotherapy exclusion in trial 7. For each clinical trial, the total number of items assessed, their CTM classifications, and accuracy are detailed in Table 3.
In general, false positives and false negatives were the result of incorrect interpretation of eligibility criteria (IF THEN logic) by CTM. The most common cause of false positives and false negatives was misclassification of metastatic status in the context of progressive disease (ie, non-metastatic primary staging but metastatic at trial screening), summarized in Table 4.

DISCUSSION
This study is the first to evaluate performance of CTM eligibility determinations outside of the United States. In our unselected patient cohort, CTM software was able to reliably exclude ineligible patients from trial consideration (>95% accuracy), but less accurately identified eligible patients (77%). One contributing factor that limited CTM's ability to determine eligibility was that for the 102-patient cohort, 8088 data items were identified as requiring further action (data input or clinician interpretation of eligibility). In routine use of CTM in clinical practice, this reconciliation process is part of the normal workflow, but it was not done as part of this retrospective study.
CTM performance in this cohort exceeded that of a previous study of CTM undertaken at Mayo Clinic in the United States in which accuracy was reported as 87.6% for 4 breast and 74.9% for 3 lung cancer trials. 12 In the Mayo study, CTM used NLP to process unstructured electronic health record (EHR) documents to ingest patient data, whereas in this study, clinical data were entered into CTM directly. Both manual entry and NLP processes can introduce errors. While intended clinical utilization of CTM includes ingestion of patient data from an EHR, the hospital in our study did not yet have an integrated EHR. Therefore, our study tested CTM's decision-making algorithms for trial eligibility, rather than its NLP capabilities in the patient ingestion process.
Our study evaluated combined NLP and ML performance of CTM for evaluation of eligibility criteria but did not separately evaluate these components. Zhang et al13 have reported on NLP classification methods for eligibility of HIV-positive patients for interventional cancer trials and eligibility of HIV-positive and pregnant women for general interventional trials. F2 scores (a weighted harmonic mean of precision and recall that weights recall more heavily) ranged from 77% to 91% for these methods. Relative to a comparative trial matching platform from the Cincinnati Children's Hospital Medical Center (CCHMC), CTM demonstrated significantly greater precision but lower recall and NPV. CCHMC developed and implemented its own clinical trial eligibility screening algorithm, reporting outcomes on 55 trial protocols and 215 pediatric oncology patients.8 Employing methods similar to those used in our current study, CCHMC oncologists conducted manual medical record review for a randomly selected patient subset to generate a gold standard for performance assessment. In the CCHMC study, the best reported performance for matching trials to patients was 36% precision (vs 77% in the current study), 100% recall (vs 83%), 100% NPV (vs 96%), and 95.5% specificity (vs 94%). There are open-source tools to help with the process of clinical trial matching; however, the need for labeled data for NLP training and large datasets for ML can create obstacles to the success of open-source tools, many of which are developed in academic settings.14 Integration of clinical trial alert systems with EHRs has shown the potential to increase overall enrollment in trials, despite the alert fatigue noted by these studies.15-18 Tools such as Deep 6 AI,19 Mendel.ai,20 Antidote,21 Smart Patients,22 and Synergy23 are examples of artificial intelligence trial matching systems using ML and NLP. However, to the best of our knowledge, there are no studies directly comparing these tools to each other.
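The F-beta family of scores referenced above has a standard closed form: F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where beta = 2 weights recall four times as heavily as precision. A small sketch, applied for illustration to this study's own per-patient precision and recall (the paper itself does not report an F2 score, so the value below is not a study result):

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score: weighted harmonic mean of precision and recall.
    beta > 1 emphasizes recall; beta = 1 gives the familiar F1 score."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustration only: F2 computed from this study's precision (76.5%)
# and recall (83.3%); this figure is not reported in the paper.
print(round(f_beta(0.765, 0.833), 3))  # 0.818
```

Because beta = 2 discounts precision, an F2 comparison favors screeners that rarely miss an eligible patient, which is often the desired trade-off in trial prescreening.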
Publications in this area are mostly abstracts using limited datasets. Mendel.ai has published a retrospective study assessing the ability of its software to increase identification of eligible patients for 3 studies.24 For 2 of the studies, an additional 24% and 50% of potentially eligible patients were identified. By comparison, our work analyzed a significantly larger number of studies, including phase I-III trials across a variety of treatment settings and modalities, as available on ClinicalTrials.gov and recruiting at our institution. We demonstrate the feasibility of developing and implementing an automated patient-trial classification system and highlight the clinical need for such systems, given the increasing challenge of manual matching and the trend toward increasingly complex, risk-based eligibility criteria that are not necessarily clinically intuitive. Ours is also the first such report using real-world data at an Australian cancer hospital. Importantly, we highlight the need for clinician input and oversight to support automated systems and remind the informatics community of the technical, intuitive, and nuanced clinician interpretations and decisions required to fully assess trial eligibility for an individual patient. CTM was designed and intended to be used as a clinical decision support tool to aid, rather than replace, clinicians in determining trial eligibility.
The current study has several limitations. First, although CTM is capable of processing structured and unstructured information from an EHR, only the matching components (NLP and ML performance) of the CTM system were evaluated, because an EHR was not integrated with CTM for this study. Second, we did not separately evaluate the NLP and ML performance of CTM for eligibility assessment. Third, the study included a relatively small number of patients at a single center. Fourth, though reduction in manual labor is a benefit of automated systems such as CTM, the manual entry of data in this study (necessitated by the lack of an integrated EHR) did not accurately reflect standard processes and so is not likely to be representative of results using automated systems.
A strength of the study is its rigorous gold standard for eligibility, established by consensus of 2 clinicians with high interrater reliability. We also recognize that the clinician-derived gold standard used here is not feasible for larger-scale studies, which would likely require automated or semi-automated EHR data extraction, methods that have their own limitations. We therefore highlight the high value of a small dataset in which every case underwent human review as a strength of this study.

CONCLUSION
This study demonstrated that CTM allows efficient and reliable screening of Australian lung cancer patients for clinical trials, with 96% accuracy in exclusion and 92% accuracy in assessing overall potential eligibility. Many patient attributes remained unknown after CTM analysis, highlighting the need for clinician input and oversight in assessing nuances of patient characteristics against individual criteria.

FUNDING
This work was funded by IBM.