Systematic review and validation of clinical models predicting survival after oesophagectomy for adenocarcinoma

Abstract
Background: Oesophageal adenocarcinoma poses a significant global health burden, yet the staging used to predict survival has limited ability to stratify patients by outcome. This study aimed to identify published clinical models that predict survival in oesophageal adenocarcinoma and to evaluate them using an independent international multicentre dataset.
Methods: A systematic literature search (title and abstract) using the Ovid Embase and MEDLINE databases (from 1947 to 11 July 2020) was performed. Inclusion criteria were studies that developed or validated a clinical prognostication model to predict either overall or disease-specific survival in patients with oesophageal adenocarcinoma undergoing surgical treatment with curative intent. Published models were validated using an independent dataset of 2450 patients who underwent oesophagectomy for oesophageal adenocarcinoma with curative intent.
Results: Seventeen articles were eligible for inclusion in the study. Eleven models were suitable for testing in the independent validation dataset and nine of these were able to stratify patients successfully into groups with significantly different survival outcomes. Area under the receiver operating characteristic curves for individual survival prediction models ranged from 0.658 to 0.705, suggesting poor-to-fair accuracy.
Conclusion: This study highlights the need to concentrate on robust methodologies and improved, independent validation to increase the likelihood of clinical adoption of survival prediction models.


Introduction
Globally, oesophageal cancer affects 746 000 patients and is associated with 459 000 deaths annually 1,2 . While squamous cell carcinoma (SCC) remains the predominant global histological subtype, in recent years, the incidence of oesophageal adenocarcinoma has exceeded that of SCC in many parts of North America and Europe 3 .
Surgical resection of the oesophagus, with or without chemoradiotherapy or chemotherapy, offers the principal curative treatment modality for oesophageal adenocarcinoma but is only appropriate in a minority of patients who present with localized disease. For those patients who do undergo potentially curative surgical resection, overall 5-year survival is typically between 20 and 30 per cent and seldom exceeds 50 per cent 4,5 . The associated morbidity and long-term sequelae of oesophageal resection serve as additional obstacles in the treatment of oesophageal cancer. The desire to establish greater equipoise of risk and benefit in the surgical management of oesophageal adenocarcinoma has meant that the ability to accurately predict survival after oesophagectomy is of particular clinical significance.
The eighth edition of the American Joint Committee on Cancer (AJCC) and the International Union for Cancer Control (UICC) TNM classification is principally based on anatomical tumour extent and remains the most widely adopted method of prognostication in oesophageal cancer 6 . Nevertheless, TNM staging criteria do not acknowledge other pathological, demographic, and clinical variables that are also known to affect survival 7,8 . In oesophageal cancer, as in other malignancies, predictive models of survival have been developed in an effort to improve prognostication. Such models are intended to support clinical decision-making and to better inform patients of their envisaged disease outcome. However, it is notable that few of these models are ever routinely used in clinical practice.
This systematic review aimed to identify published clinical and pathological models that predict survival in patients undergoing potentially curative surgical resection for oesophageal adenocarcinoma. Where possible, the performance of identified models was assessed using a prospectively collected multicentre dataset.

Search strategy
This systematic review was conducted in accordance with the recommendations of the Cochrane Library and MOOSE guidelines 9 . A systematic literature search using the Ovid Embase and MEDLINE databases (from 1947 to 11 July 2020) was performed to identify studies reporting predictive models of survival in oesophageal adenocarcinoma. Details of the search strategy are provided in Table S1. The titles and abstracts of identified articles were screened by three independent reviewers (A.S., P.R.B., and B.V.) for potentially relevant studies that were subsequently subject to full-text review. To identify further potentially relevant studies, the reference lists of included articles were hand searched.
Inclusion criteria were studies that developed or validated a clinical prognostication model to predict either overall or disease-specific survival in patients with oesophageal adenocarcinoma undergoing surgical treatment with curative intent. Models that included patients with both adenocarcinoma and SCC were included if the former was the predominant tumour subtype used to develop the model. Likewise, studies including patients receiving therapies with curative intent, other than surgery, were included if surgery was the predominant treatment modality within the study cohort. No restrictions were made regarding patient demographics, surgical approach, use of (neo)adjuvant therapies, or study design. Models developed for either pre- or postoperative use were also included. Models that included experimental metabolic and/or genetic biomarkers that are either not currently routinely available or not used within clinical practice were excluded. Models that were developed through the use of artificial neural networks were also excluded, owing to their unsuitability for independent validation. Non-English-language articles and conference abstracts without an associated published full-text article were excluded. Any disagreement regarding a study's inclusion was resolved by a fourth reviewer (C.P.). Three reviewers (A.S., B.V., and A.O'S.) independently extracted data from included studies.

Definitions
Oesophageal adenocarcinoma was defined as a histologically specific malignancy affecting the oesophagus and/or gastroesophageal junction (Siewert type I and II) and oesophagectomy as surgery to resect all or part of the oesophagus through either an open, hybrid, or totally minimally invasive approach but not including endoscopic techniques. A prognostic model was defined as a multivariable tool designed to predict patient survival (overall or disease specific) or a surrogate factor that was shown by the authors to correlate directly and reliably with patient survival.

Methodological quality assessment
The methodological quality of the included studies was assessed using the Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies (CHARMS) checklist 10 . This checklist groups 35 key items into 11 domains that may be extracted from individual studies for the purpose of critical appraisal.

Model validation
Models were validated using an independent dataset of 2450 patients who underwent oesophagectomy for oesophageal adenocarcinoma with curative intent. Data were acquired from the Oesophageal Cancer Clinical and Molecular Stratification Consortium (OCCAMS; 1088 patients) 11,12 , Predicting Outcomes of Esophageal Malignancy Biomarker Consortium (POEM; 811 patients), and from a high-volume North American Centre (Virginia Mason Medical Center; 551 patients). Local institutional review board approval was obtained by all participating centres for the purpose of sharing anonymised data. Characteristics of the validation dataset are provided in Table 1.

Statistical analysis
Model validation was dependent on the concordance of test variables between the model and available variables within the validation dataset. Validation was performed according to the published eligibility criteria within each study. Missing data were dealt with by imputing the mode for categorical variables, and the median for continuous variables 13 . Risk stratification scoring systems were assessed using Kaplan-Meier curves and the log-rank test. Individualized prediction scoring models were evaluated by plotting calibration curves of predicted against actual survival. Where possible, model performance was further evaluated using receiver operating characteristic (ROC) curves and the corresponding area under the curve (AUC). Statistical analysis was performed using SPSS statistics version 27.0 (IBM, Armonk, New York, USA), with P < 0.05 considered to signify statistical significance.
[Figure panel key (figure not reproduced): a, reference 19; b, Cao et al. 21; c, Zhou et al. 23; d, Gabriel et al. 24; e, Xie et al. 26; f, Liu et al. 28; g, Du et al. 30.]
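The single-value imputation rule described above (mode for categorical variables, median for continuous variables) can be illustrated with a minimal Python sketch. The study's analysis was run in SPSS, so this is an illustration of the rule only, not the authors' code; the function name `impute_column`, the `kind` argument, and the example values are hypothetical.

```python
from statistics import median, mode


def impute_column(values, kind):
    """Fill missing entries (None) with the mode or the median of the observed values.

    kind: "categorical" uses the mode, anything else uses the median.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to impute from
    fill = mode(observed) if kind == "categorical" else median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical example columns (not taken from the validation dataset)
grades = ["G2", None, "G3", "G2", "G1"]  # tumour grade, categorical
ages = [61, 70, None, 58, 66]            # age in years, continuous

imputed_grades = impute_column(grades, "categorical")  # missing grade becomes "G2"
imputed_ages = impute_column(ages, "continuous")       # missing age becomes 63.5
```

Filling with a single value is simple and keeps every patient in the analysis, but it narrows the spread of the imputed variable, which is worth bearing in mind when many values are missing.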
Further study characteristics and details of variables assessed and utilized within models are provided in Tables S2 and S3. The commonest features included in models to predict survival were T stage (11 studies), tumour grade (10 studies), N stage (nine studies), and patient age (eight studies). In five studies, patient cohorts were prospectively sourced solely for the development of the intended prognostic models. The remaining studies used pre-existing data registries, including the Surveillance, Epidemiology, and End Results Program (SEER) 18,21,23,26,28,30 , the National Cancer Database (NCDB) 24 , trial datasets 19 , and internal hospital datasets 22,31 . The inclusion of two studies in which the proposed model was intended to predict lymph node metastasis 16,20 was based on the concurrent association of this outcome with survival that was demonstrated by the authors.

Appraisal of studies
Critical appraisal of included studies was performed in accordance with the CHARMS checklist. Eleven studies adopted either a prospective or retrospective cohort design, while seven studies used data from a national data registry. One study used a retrospective case-control design. A summary of the risk of bias within included studies, as determined by the CHARMS checklist, is provided in Table S4. Notably, none of the 17 papers achieved the highest CHARMS standards in terms of sample size, missing data, and model development, and only one paper achieved the best rating in candidate predictors. In contrast, interpretation and discussion were performed well in 15 of 17 studies.

Assessment of model performance
Of the 17 studies identified from the literature search, 11 presented models that were suitable for testing against the independent validation dataset 16-21,23,24,26,28,30 . Details of models that were assessed, including a comparison of survival outcomes with the validation dataset, are presented in Table 3 and Figs 1 and 2, with full details provided in Table S5. With the exception of the models reported by Barbour et al. 16 (Fig. 1a) and Eil et al. 18 (Fig. 1c,d), all models were able to predict survival successfully in patients within the independent validation dataset. The model published by Davison et al. 20 demonstrated poor discrimination of groups I/II and III versus group IV (Fig. 1f). Individual predictions of survival were generally more accurate for longer surviving patients, with ROC-AUC ranging from 0.658 to 0.705 (Table 3); however, some models tended to under-predict survival 19,24 .
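The group-level comparisons reported above rest on Kaplan-Meier estimates and the log-rank test. As a minimal sketch of those two calculations for two risk groups, the Python below uses only the standard library. The follow-up times, event flags, and function names are hypothetical and are not drawn from the validation dataset or from the SPSS analysis.

```python
from math import erfc, sqrt


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (events: 1 = death, 0 = censored)."""
    surv, curve = 1.0, []
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for x in times if x >= t)
        deaths = sum(1 for x, e in zip(times, events) if x == t and e)
        surv *= 1 - deaths / at_risk
        curve.append((t, surv))
    return curve


def logrank(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, P value) on 1 degree of freedom."""
    event_times = sorted(
        {t for t, e in zip(times_a + times_b, events_a + events_b) if e}
    )
    obs_minus_exp, var = 0.0, 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)  # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)  # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n  # observed minus expected deaths in A
        if n > 1:
            var += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if var == 0:
        raise ValueError("log-rank variance is zero; groups cannot be compared")
    chi2 = obs_minus_exp ** 2 / var
    # Survival function of a chi-square variable with 1 degree of freedom
    return chi2, erfc(sqrt(chi2 / 2))


# Hypothetical follow-up in months for a low-risk and a high-risk group
low_t, low_e = [60, 48, 36, 60, 54], [0, 1, 0, 0, 1]
high_t, high_e = [6, 12, 18, 24, 30], [1, 1, 1, 1, 0]

high_curve = kaplan_meier(high_t, high_e)
chi2, p_value = logrank(low_t, low_e, high_t, high_e)
```

With more than two risk groups, as in several of the models assessed, the same observed-minus-expected logic extends to a multi-group test with additional degrees of freedom; the two-group form is shown here only because it is the shortest correct instance.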

Discussion
This is the first study to systematically identify and attempt to validate published models for the prediction of survival in oesophageal adenocarcinoma. Owing to the generally poor prognosis of oesophageal cancer 2 , it is vital to have accurate prognostic information so that patients and clinicians can make informed treatment choices. This makes a clear case for the benefit of well-constructed and reliable predictive models. While the TNM system stratifies patients into groups with significantly different outcomes 32 , the majority of patients fall into stage III, limiting its real-world utility. This systematic review identified 17 published prediction models that assessed a combined 50 variables. These studies were of variable methodological quality, with none reaching the highest standards in several core assessment criteria: sample size, missing data, and model development 10 .
Ten of the 17 models were derived from existing cohorts of patients and seven from national registries. Eleven of the 17 models could be tested using a large multicentre cohort of patients from the UK, USA, Ireland, and the Netherlands, including cohorts from the OCCAMS Consortium 11,12 and the newly formed POEM Biomarkers Consortium. This mixed cohort represents a good test of their predictive power. It was reassuring that nine of the 11 models successfully validated; however, complete separation of all prognostic groups was not seen in two studies 16,20 . It is also notable that one of the models that failed to validate 16 had one of the smallest assessment cohorts (85 patients) and was one of the oldest studies (1991 to 2008).
Models that predicted individual survival did reasonably well but with a tendency to under-predict short-term survival. The AUCs for the models clustered around 0.65 to 0.70, which is considered poor-to-fair accuracy. This raises the question of how these models could be used in clinical scenarios. It is noteworthy that none of the models presented herein is currently in routine clinical use. Existing models may have limited use on an individual patient basis but could potentially be used to stratify patients into different treatment groups: neoadjuvant chemotherapy and surgery versus surgery alone, for example. It is notable that many were designed for a subset of patients with oesophageal adenocarcinoma (e.g. T1 or neoadjuvant chemoradiotherapy), which potentially restricts their utility. However, in one such model, Davison et al. were able to subdivide the T1 group to some degree, identifying a poorer prognosis group (IV) that may warrant more aggressive treatment 20 .
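To make the AUC figures discussed above concrete, the following Python sketch computes the AUC as the proportion of survivor and non-survivor pairs in which the survivor received the higher predicted survival probability, with ties counted as half. It assumes a binary outcome at a fixed time point and ignores censoring; the source does not state how censored patients were handled in the ROC analysis, so that simplification, the function name `roc_auc`, and the example data are assumptions made for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive case outranks a random negative case.

    labels: 1 for the positive outcome, 0 otherwise; scores: predicted probabilities.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = alive at 5 years, score = predicted 5-year survival probability
alive_5y = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [0.80, 0.55, 0.40, 0.60, 0.30, 0.45, 0.70, 0.20]

auc = roc_auc(alive_5y, predicted)  # 13 of 16 pairs correctly ordered
```

On this reading, an AUC of 0.5 corresponds to ordering pairs no better than chance and 1.0 to perfect ordering, so values of 0.65 to 0.70 mean that roughly two in three survivor and non-survivor pairs are ranked correctly.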
A further limitation of existing models is, for the most part, their reliance upon knowledge of pathological staging. For patients and clinicians, it would be more informative to have information regarding prognosis at the time of diagnosis, when it may have the largest impact on clinical decision-making and treatment planning. Future work should therefore place a greater emphasis on the identification of pretreatment prognostic markers.
The strengths of this study include the carefully constructed systematic review following published Cochrane Library and MOOSE guidelines 9 . The validation cohort was large, mixed, and multinational; however, as the individual databases differed in terms of the data recorded it was not possible to use all the patients to validate every model. In particular, the small numbers of patients used to test the model published by Barbour et al. 16 may have been linked to its failure to validate; however, the test set was larger than the cohort used to generate the original model. It was not possible to assess the performance of all identified models, owing to a limited number of variables within the validation dataset. A further limitation was the decision to exclude models that included novel biomarkers such as immunohistochemistry and genetic markers 11 . Such models are likely to have an important role to play in the future, but it was decided to concentrate on those models that could be immediately implemented by surgical centres and therefore, by definition, were models that could be tested easily.
This systematic review and validation of models designed to predict survival in oesophageal adenocarcinoma demonstrates that the models already have potential to stratify patients in a more granular way than standard TNM staging. However, none has yet achieved widespread adoption, with this study being the first time any of these models have been tested in other cohorts. To develop a robust and reliable model it is important to avoid as many of the potential sources of bias as possible and to generate and test the model in a large and multicentre cohort. This avoids the danger of overfitting the data to local outcomes 33 . While these 17 models have added to the field, this has not been translated into widespread adoption, and therefore they have failed to alter management decisions or improve outcomes in the real world. This must be the goal for any predictive model. While we did not include models that included biomarkers, it is notable that, despite many being proposed 34 , none of these are in widespread use either, very likely for the same reasons that these clinical models have not been adopted.
Future work developing and validating predictive models in oesophageal cancer must concentrate on robust and bias-free methodologies, suitably sized and representative patient cohorts, and, most importantly, external validation of the models in independent patient groups. If the work is carried out in this way, the likelihood of adoption will increase, and with it the chance of improving patients' outcomes.