Prognostic models for surgical-site infection in gastrointestinal surgery: systematic review

Abstract
Background Identification of patients at high risk of surgical-site infection may allow clinicians to target interventions and monitoring to minimize associated morbidity. The aim of this systematic review was to identify and evaluate prognostic tools for the prediction of surgical-site infection in gastrointestinal surgery.
Methods This systematic review sought to identify original studies describing the development and validation of prognostic models for 30-day SSI after gastrointestinal surgery (PROSPERO: CRD42022311019). MEDLINE, Embase, Global Health, and IEEE Xplore were searched from 1 January 2000 to 24 February 2022. Studies were excluded if prognostic models included postoperative parameters or were procedure specific. A narrative synthesis was performed, with sample-size sufficiency, discriminative ability (area under the receiver operating characteristic curve), and prognostic accuracy compared.
Results Of 2249 records reviewed, 23 eligible prognostic models were identified. A total of 13 (57 per cent) reported no internal validation and only 4 (17 per cent) had undergone external validation. Most identified operative contamination (57 per cent, 13 of 23) and duration (52 per cent, 12 of 23) as important predictors; however, there remained substantial heterogeneity in other predictors identified (range 2–28). All models demonstrated a high risk of bias due to the analytic approach, with overall low applicability to an undifferentiated gastrointestinal surgical population. Model discrimination was reported in most studies (83 per cent, 19 of 23); however, calibration (22 per cent, 5 of 23) and prognostic accuracy (17 per cent, 4 of 23) were infrequently assessed. Of the four externally validated models, none displayed ‘good’ discrimination (area under the receiver operating characteristic curve greater than or equal to 0.7).
Conclusion The risk of surgical-site infection after gastrointestinal surgery is insufficiently described by existing risk-prediction tools, which are not suitable for routine use. Novel risk-stratification tools are required to target perioperative interventions and mitigate modifiable risk factors.


Introduction
Surgical-site infection (SSI) represents the most common complication after gastrointestinal surgery, affecting as many as one in nine patients in high-income countries and one in three patients in low- and middle-income countries 1. Reducing the incidence and severity of SSI remains a high-priority issue for patients, surgical teams, and healthcare systems 2,3, due to the substantial contribution of SSI towards postoperative morbidity and mortality 1,4, quality of life 5, and healthcare costs 6.
The capability to accurately predict patients who are at high risk of SSI has several potential advantages. At an individual level, this would allow individualized preoperative assessment of the risk of SSI and prioritization of evidence-based interventions that could lead to iatrogenic harm (for example antibiotic prophylaxis) or are resource intensive (for example increased postoperative monitoring) towards patients at highest risk 7. However, there are also wider benefits, including improving the efficiency of clinical trials on SSI by facilitating the selection of patients most likely to benefit from the trial intervention 8 and allowing risk adjustment to facilitate the fair comparison of SSI rates across different sites and populations 9.
However, while clinical risk-prediction tools have increasingly been developed across all areas of medicine, frequently these fail to align with methodological recommendations 10. They often lack validation outside the original cohorts, meaning their clinical utility remains uncertain 11. Efforts to develop predictive tools for SSI have been ongoing for decades 12,13, yet none has been widely adopted to predict individual risk for patients undergoing gastrointestinal surgery 7. Furthermore, to the best of the authors' knowledge, no previous systematic reviews have been conducted to determine what models have been developed and if these are suitable for wider adoption. Therefore, the aim of this systematic review was to identify and assess the quality of existing prognostic tools for the prediction of SSI within gastrointestinal surgery populations.

Methods
A systematic review was performed according to a predefined protocol (registered on PROSPERO: CRD42022311019) and reported according to the PRISMA guidelines 14.

Search strategy and information sources
The search strategy was developed to identify prognostic models that predict the occurrence of SSI after gastrointestinal surgery (Appendix S1). A comprehensive search of MEDLINE, Embase, Global Health, and IEEE Xplore was performed on 24 February 2022. This search was supplemented through hand searching citation and reference lists from relevant articles. The searches were limited to publications in the English language due to practical constraints and restricted to the year 2000 onwards to ensure relevance to current surgical practice (unless the model was subsequently validated).
Studies were eligible for inclusion if they developed or externally validated a model that sought to predict risk of SSI after gastrointestinal surgery in adults using preoperative and/or operative characteristics. Models that included patients undergoing non-gastrointestinal surgical procedures were eligible if these also included gastrointestinal surgical procedures. Furthermore, models that were developed before 2000 but externally validated afterwards were eligible. The exclusion criteria were: development or validation performed for non-adult patients (less than 18 years) where development and performance were not separate for adults and children; inclusion of postoperative (including administrative data) or context-specific parameters (time interval or individual sites) in the risk model, or the predictors included were not reported; primary outcome was not SSI (that is a composite outcome of postoperative infections or complications); and procedure-specific risk models (for example appendicectomy).

Study selection and data extraction
After the removal of duplicate publications, titles and abstracts were screened, and full texts of relevant publications uploaded onto the Covidence online systematic review tool 15 for review against these eligibility criteria. Data fields of interest were extracted from eligible papers, related to study characteristics (year of publication, setting, sample size, inclusion criteria, and techniques for model development), SSI (definition used, number of cases, and method and time frame of follow-up), and the risk model itself (validation status, modelling techniques, clinical parameters included, and any metrics reported regarding prognostic accuracy and model performance). Data extracted were stored on a research electronic data capture ('REDCap') server 16. Study screening and data extraction were completed independently by two of the reviewers (K.A.M., J.S., S.L., T.G., and A.R.), with any disagreements resolved through a consensus-based approach.

Quality assessment and data synthesis
Quality assessment of eligible studies was performed using 'Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis' (TRIPOD) reporting guidelines 17 and the 'Prediction model Risk Of Bias ASsessment Tool' (PROBAST) 18.
A narrative (descriptive) synthesis of results was performed. The data extracted were summarized using frequencies and percentages for dichotomous variables and using medians and interquartile ranges for continuous variables. SSI event rates with 95 per cent confidence intervals were also calculated, where possible. No meta-analysis was planned or performed. Furthermore, the minimum sample size required for developing a multivariable prediction model was calculated for the observed SSI rate and number of candidate predictors evaluated (if reported), and compared with the development-cohort sample size 19. This was performed irrespective of whether logistic regression or other modelling techniques were applied.
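The minimum-sample-size comparison described above can be illustrated with a short Python sketch. This is not the authors' code (the review reports using the pmsampsize package in R) and it covers only two of the criteria usually applied to binary outcomes: limiting overfitting to a target shrinkage factor, and estimating the overall event rate precisely. The function names, the shrinkage target of 0.9, the margin of error of 0.05, and the conservative choice of 15 per cent of the maximum Cox-Snell R-squared are all assumptions made for illustration.

```python
import math


def max_cox_snell_r2(event_rate: float) -> float:
    """Largest Cox-Snell R-squared attainable for a binary outcome with this event rate."""
    log_lik_null = (event_rate * math.log(event_rate)
                    + (1 - event_rate) * math.log(1 - event_rate))
    return 1 - math.exp(2 * log_lik_null)


def n_for_shrinkage(n_predictors: int, r2_cs: float, shrinkage: float = 0.9) -> int:
    """Sample size so that the expected shrinkage factor is at least `shrinkage`."""
    return math.ceil(n_predictors / ((shrinkage - 1) * math.log(1 - r2_cs / shrinkage)))


def n_for_overall_risk(event_rate: float, margin: float = 0.05) -> int:
    """Sample size to estimate the overall event rate to within +/- `margin`."""
    return math.ceil((1.96 / margin) ** 2 * event_rate * (1 - event_rate))


def minimum_sample_size(event_rate: float, n_predictors: int,
                        r2_fraction: float = 0.15) -> int:
    """Largest requirement across the two criteria sketched here (assumed R-squared)."""
    r2_cs = r2_fraction * max_cox_snell_r2(event_rate)
    return max(n_for_shrinkage(n_predictors, r2_cs), n_for_overall_risk(event_rate))


# Hypothetical development cohort: 10 per cent SSI rate, 10 candidate predictors.
required = minimum_sample_size(event_rate=0.10, n_predictors=10)
print(f"Minimum sample size under these assumptions: {required}")
```

A development cohort smaller than the returned value would be flagged as underpowered under these assumptions; a full calculation would also consider further criteria and the anticipated model performance.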
Model performance was compared using the area under the receiver operating characteristic (ROC) curve (AUC) and summarized using the geometric mean and range. Prognostic accuracy summary statistics (sensitivity, specificity, positive predictive value, and negative predictive value) were reported. An AUC of less than 0.6 was considered to indicate 'poor' model discrimination, an AUC of 0.6 to less than 0.7 was considered to indicate 'moderate' model discrimination, an AUC of 0.7 to less than 0.8 was considered to indicate 'good' model discrimination, and an AUC of greater than or equal to 0.8 was considered to indicate 'excellent' model discrimination 20. Models should also be 'well calibrated', that is, able to accurately predict the outcome of interest across the spectrum of risk; for example, resource wastage or even iatrogenic harm may occur if there is overestimation for patients at low risk and underestimation for patients at high risk. Reporting calibration intercept (calibration-in-the-large) and slope (an intercept of 0 and slope of 1 indicating 'perfect' calibration) was considered an appropriate method to assess calibration, in line with current best practice 21. However, assessment of calibration through a Brier score or Hosmer-Lemeshow test was also extracted. All statistical analyses were performed using RStudio version 4.1.1 (R Foundation for Statistical Computing, Vienna, Austria), with packages including tidyverse, finalfit, pmsampsize 22, and predictr 23.
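The discrimination bands and the geometric-mean summary described above can be expressed compactly. The following Python sketch is illustrative only (the review's analyses were run in R); the function names and the example AUC values are hypothetical and are not taken from any included study.

```python
import math


def classify_auc(auc: float) -> str:
    """Map an AUC onto the discrimination bands used in this review."""
    if auc < 0.6:
        return "poor"
    if auc < 0.7:
        return "moderate"
    if auc < 0.8:
        return "good"
    return "excellent"


def geometric_mean(values: list[float]) -> float:
    """Geometric mean, computed on the log scale for numerical stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical AUCs from three models (not real study data).
aucs = [0.62, 0.75, 0.88]
for auc in aucs:
    print(f"AUC {auc:.2f}: {classify_auc(auc)}")
print(f"Geometric mean {geometric_mean(aucs):.3f}, range {min(aucs):.3f}-{max(aucs):.3f}")
```

The lower bound of each band is inclusive, so an AUC of exactly 0.7 is classed as 'good' and an AUC of exactly 0.8 as 'excellent'.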

Characteristics of included studies
The models were typically based on single-centre studies (44 per cent, 10 of 23), with no models developed using prospective national or international data. Furthermore, of the nine studies based on national or international data, 56 per cent (five of nine) were based on the same US-based registry (National Surgical Quality Improvement Program ('NSQIP')). Overall, almost all models (83 per cent, 19 of 23) were developed using data from high-income countries, with the remainder from upper-middle-income countries (Brazil, China). There was also no patient-public involvement identified across included models.
Furthermore, there were important differences in the underlying populations included, with most involving patients either undergoing colorectal (48 per cent, 11 of 23) or general surgical (30 per cent, 7 of 23) procedures, with the others also including some or all other surgical specialties (22 per cent, 5 of 23). Whereas most models included all procedures irrespective of operative urgency, a minority included only elective (17 per cent, 4 of 23) or emergency (13 per cent, 3 of 23) procedures.
Within the included studies, SSI was typically defined according to the Centers for Disease Control and Prevention (CDC) criteria (57 per cent, 13 of 23), with the rest providing no clear definition (26 per cent, 6 of 23) or alternative definitions based on administrative codes or other clinical signs (17 per cent, 4 of 23) (Table 1). However, even among studies using the CDC criteria, only a minority used the full definition (46 per cent, 6 of 13), with the rest including a combination of superficial, deep, or organ-space SSI.
Overall, there was a wide variation in the SSI rates reported between studies (1.0 per cent 13 to 25.6 per cent 33), with the highest rates observed in colorectal populations and the lowest rates observed in multi-specialty populations (Fig. 2). However, there was no clear pattern in the SSI rate observed according to the definition used.
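Where a study reported only the number of SSI events and the cohort size, a 95 per cent binomial confidence interval for the event rate can be derived directly. The Python sketch below uses the Wilson score interval; this choice is an assumption, as the review does not state which binomial method was applied, and the counts shown are hypothetical.

```python
import math


def wilson_interval(events: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for 95 per cent)."""
    if total <= 0 or not 0 <= events <= total:
        raise ValueError("events must lie between 0 and total, and total must be positive")
    p = events / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical cohort: 10 SSI events among 100 patients.
lower, upper = wilson_interval(10, 100)
print(f"SSI rate 10.0 per cent (95 per cent c.i. {100 * lower:.1f} to {100 * upper:.1f})")
```

The Wilson interval is used here because it behaves sensibly for the low event rates seen in some multi-specialty cohorts, where the simpler normal approximation can produce limits below zero.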

Variable selection
Predictive factors were predominantly identified via logistic regression (74 per cent, 17 of 23) (Table 1), with a minority involving novel machine-learning-based approaches (17 per cent, 4 of 23). However, there were substantial methodological concerns with most variable selection approaches used, with a majority conducting selection based on: stepwise approaches (52 per cent, 12 of 23), univariable significance (9 per cent, 2 of 23), or expert opinion alone (13 per cent, 3 of 23). Where reported, only 42.9 per cent (9 of 21) of derivation cohorts achieved the minimum sample size required to develop a predictive model based on the SSI rate reported and the number of candidate predictors being explored (Fig. S1).
There was substantial heterogeneity in the number and examples of predictors identified across models (median 7, range 2-28). However, commonly identified predictors involved operative and patient-specific factors, with most models highlighting the importance of operative contamination (57 per cent, 13 of 23) and duration (52 per cent, 12 of 23) (Fig. 3 and Table S1). Overall, three models were solely formed of predictors available before operation, whereas the majority of models also

Quality assessment of included studies
Adherence to the TRIPOD reporting guidelines was mixed, with consistently poor reporting for methods of blinding to outcome or predictors, how missing data were handled, and reporting of model results and evaluation (Fig. S2). All models demonstrated a high risk of bias, principally due to the analytic approach and outcome definition, and none displayed high applicability to an undifferentiated gastrointestinal surgical population (Fig. S3).

Model discrimination and prognostic accuracy
Of the 23 unique models identified, discrimination was reported in 61 per cent (14 of 23) for the derivation cohort, with the majority reporting 'good' or 'excellent' discrimination (79 per cent (11 of 14), AUC geometric mean 0.831, AUC range 0.620-0.991) (Fig. 4a and Table S2). Of the 15 studies that subsequently conducted internal validation, 67 per cent (10 of 15) reported discrimination. There was a reduction in models reporting 'good' or 'excellent' discrimination when internal validation was performed (60 per cent (6 of 10), AUC geometric mean 0.735, AUC range 0.620-0.878).
In comparison, of the four models that underwent external validation, the AUC remained greater than or equal to 0.7 for one model (the NNIS model) in 17 per cent (2 of 12) of cohorts (Fig. 4b and Table S2). However, there was no evidence to confirm this as statistically significant 'good' discrimination for SSI with an AUC greater than or equal to 0.7.
However, only a minority of models assessed model calibration (22 per cent, 5 of 23) or prognostic accuracy (17 per cent, 4 of 23) during development or internal validation (Table S2). These were less frequently reported in external validation studies (prognostic accuracy 6 per cent (1 of 23), calibration 11 per cent (2 of 23)). Even when reported, calibration was insufficiently assessed, with only one model reporting the calibration intercept and slope in line with current best practice.
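For reference, the calibration intercept and slope referred to above are conventionally obtained by regressing the observed outcome on the model's linear predictor in the validation cohort. The notation below is a standard formulation given for illustration and is not taken from any included study; $\hat{p}_i$ denotes the predicted SSI risk for patient $i$ and $Y_i$ the observed outcome.

```latex
% Linear predictor from the model being validated
\mathrm{LP}_i = \operatorname{logit}(\hat{p}_i) = \ln\!\left(\frac{\hat{p}_i}{1-\hat{p}_i}\right)

% Calibration slope: the coefficient \beta in
\operatorname{logit}\{\Pr(Y_i = 1)\} = \alpha + \beta\,\mathrm{LP}_i

% Calibration-in-the-large: the intercept \alpha with the slope fixed at 1 (LP as an offset)
\operatorname{logit}\{\Pr(Y_i = 1)\} = \alpha + \mathrm{LP}_i

% Perfect calibration corresponds to \alpha = 0 and \beta = 1
```

A slope below 1 suggests predictions that are too extreme (typical of overfitting), while a negative intercept suggests systematic overestimation of risk.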

Discussion
This systematic review identified 23 original models developed for the prediction of SSI that were relevant to patients undergoing gastrointestinal surgery. Like many predictive models developed for other outcomes of interest 10, significant concerns have been identified around methodological quality, evaluation, and clinical relevance to an undifferentiated gastrointestinal surgical population. Furthermore, even among the four (17 per cent) models that had undergone external validation 12,13,24,36, none significantly exceeded the a priori threshold for 'good' discrimination (AUC greater than or equal to 0.7) and almost all did not report assessment of calibration or prognostic accuracy.
Predicting patients at high risk of SSI would facilitate shared perioperative decision-making and allow targeting of resources to those most likely to benefit. Yet, despite over 30 years of research into the prediction of SSI, this remains a challenging task for several reasons 45. SSI is inherently multifactorial, with numerous risk factors previously established 46, encompassing an interplay of patient, operative, and hospital-based determinants. This leads to many candidate predictors, increasing the minimum sample size required, meaning development becomes limited to data from large-scale registries or prospective studies. A breadth of prognostic factors were identified in this systematic review, both in number and type across different models. While the most frequently identified prognostic factors likely represent those of greatest relevance for prediction of SSI, these should be interpreted with caution given the highlighted methodological concerns and heterogeneity in underlying populations. Future risk-prediction models may seek to prioritize investigation of these factors, although they will need to consider the potential for confounders, statistical error, and/or collinearity if not accounted for in the original models. Furthermore, these should use robust modelling approaches for variable selection (penalized regression or machine-learning approaches, rather than univariable selection or stepwise regression) or internal validation (bootstrapping or cross-validation, over random data-set splitting) 18.
While the CDC definition of SSI represents an established gold-standard definition, this is inherently subjective and requires in-person assessment 47. Particularly with the adoption of enhanced recovery after surgery ('ERAS') programmes, SSI increasingly occurs after hospital discharge 48. Therefore, studies conducted retrospectively that lack robust follow-up, or in areas of poor healthcare access, may underestimate the true event rate. These concerns were seen in many of the models identified, which may partly explain poor prediction in those externally validated. However, even with an established gold standard, there remained heterogeneity in the outcome of interest given that a significant minority of models used a non-CDC outcome or considered only specific subtypes of SSI. This limits the comparability between models and signals a lack of consensus on what aspect(s) of SSI should be the intended predictive target.
Finally, the sample size and study design of included studies often did not meet required expectations. Only a minority of models were based on estimates of a minimum sample size, all of which were based on retrospective data (Fig. S1 and Table 1). As multicentre prospective studies may be expensive and/or complex to conduct, retrospective data should only be used when there are sufficient event rates, data quality, and modelling approaches to account for inherent biases 18.
Assessment of model performance was limited by the overall poor quality of reporting (particularly prognostic accuracy and model calibration), as well as the high risk of bias and scarce external validation. Overall, most models performed well with regard to derivation and internal validation, with the highest discrimination observed in models based on machine learning 31,43. These are increasingly common in the literature 49, with 30 per cent (3 of 10) of SSI models developed in the last 5 years using machine-learning approaches. Although at an early stage within healthcare, these machine-learning-based models have theoretical benefits, including better handling of non-linearity and the incorporation of interaction terms, with the potential to improve predictive accuracy 50. However, these can require significantly more data to achieve stability, are prone to overfitting, are less transparent for patients and clinicians, and often do not provide clinically significant enhancement to discrimination over well-conducted regression approaches 51,52.
While several models appear promising, confirmation using external validation is essential before these can be trusted or used in clinical practice 11. Despite this, there continue to be difficulties with reproducibility across the broader prediction literature 53. This is reflected here, with only one in six models having undergone external validation. This may be due in part to the quantity, heterogeneity, and complexity of variables identified in models, posing practical challenges to validation and clinical usage if these data are not routinely recorded or available 54,55. Even when this was performed, as expected there was a reduction in the observed discrimination compared with the derivation cohorts 53. Of all models, the NNIS model remains the most validated model in the literature, likely in part due to its simplicity, being among the first published 13, and its use in risk adjustment to allow inter-centre comparison of SSI rates 7. While it did not demonstrate 'good' discrimination on external validation, it still displayed the highest discrimination reported. Nevertheless, particularly as the prognostic accuracy and calibration are unclear, the clinical utility remains low. Furthermore, it should be noted that almost all model derivation and validation has occurred in the context of high-income countries. It remains unclear whether any models can be generalized to low- and middle-income countries, which continue to experience the greatest burden of SSI 1. Additional external validation or development of models relevant to these contexts is needed to ensure equitable benefit.
This systematic review has several key strengths. It has comprehensively identified and evaluated predictive models previously developed for SSI after gastrointestinal surgery. Each model has been compared with current best practice regarding the reporting quality, risk of bias, minimum sample size, and practice of external validation; this allows a clear understanding of the suitability for the original purpose intended, as well as for prediction in undifferentiated populations of patients undergoing gastrointestinal surgery. This also provides a clear framework of standards for future models to meet. However, there are several important limitations to this study. First, the search was limited to English-language papers and databases, and so the systematic review may not encompass every possible model developed globally. Second, only models that were not procedure specific were included. While these may share common prognostic factors relevant to broader populations, it was anticipated this would be limited due to procedure-specific variables and poor performance when transported to a broader population 56.
Prognostic models should be deliverable within the clinical context, have a clear target population with utility within the clinical decision process, and be demonstrated to be acceptable to patients and clinicians. Across the wider predictive-model literature, there is a gulf between the number of models developed and those adopted into routine practice 57. No models have been recommended for individual risk assessment of SSI within guidelines or routinely adopted on a large-scale basis 7,58,59. Indeed, there is limited evidence to support the use of any in undifferentiated gastrointestinal surgical patients (and even within the original subgroups of interest). Therefore, work to address this gap in prognostic models validated for a global gastrointestinal surgical population is underway by the National Institute for Health and Care Research ('NIHR') Global Health Research Unit on Global Surgery. There are numerous evidence-based interventions already available before, during, and after surgery that modify the risk of SSI and minimize associated harm 7,58,59. Yet, without an adequate understanding of how to stratify patients according to their risk, shared decision-making and the appropriate allocation of targeted enhanced monitoring and perioperative interventions for SSI remain challenging 32. Further, comprehensive external validation of existing models or novel, validated prognostic tools are needed to better differentiate risk of SSI across a global population of gastrointestinal surgery patients.

Fig. 2 Rate of 30-day surgical site infection outcome reported in included studies, by definition used. Where not reported, 95% binomial confidence intervals were calculated. CDC, Centers for Disease Control and Prevention.

Fig. 4 Discriminatory performance of all prognostic models to predict surgical site infection in the 30-day postoperative period. Where not reported, no AUC is displayed.

Table 1 Characteristics of included studies describing an original risk score
Procedures; LASSO, least absolute shrinkage and selection operator; CNN, convolutional neural network.