Systematic review and narrative synthesis of surgeons' perception of postoperative outcomes and risk

Background: The accuracy with which surgeons can predict outcomes following surgery has not been explored in a systematic way. The aim of this review was to determine how accurately a surgeon's 'gut feeling' or perception of risk correlates with patient outcomes and available risk scoring systems.
Methods: A systematic review was undertaken in accordance with PRISMA guidelines. A narrative synthesis was performed in accordance with the Guidance on the Conduct of Narrative Synthesis In Systematic Reviews. Studies comparing surgeons' preoperative or postoperative assessment of patient outcomes were included. Studies that made comparisons with risk scoring tools were also included. Outcomes evaluated were postoperative mortality, general and operation-specific morbidity and long-term outcomes.
Results: Twenty-seven studies comprising 20 898 patients undergoing general, gastrointestinal, cardiothoracic, orthopaedic, vascular, urology, endocrine and neurosurgical operations were included. Surgeons consistently overpredicted mortality rates and were outperformed by existing risk scoring tools in six of seven studies comparing area under receiver operating characteristic (ROC) curves (AUC). Surgeons' prediction of general morbidity was good, and was equivalent to, or better than, pre-existing risk prediction models. Long-term outcomes were poorly predicted by surgeons, with AUC values ranging from 0⋅51 to 0⋅75. Four of five studies found postoperative risk estimates to be more accurate than those made before surgery.
Conclusion: Surgeons consistently overestimate mortality risk and are outperformed by pre-existing tools; prediction of longer-term outcomes is also poor. Surgeons should consider the use of risk prediction tools when available to inform clinical decision-making.


Introduction
Surgical procedures all carry associated risks. It is therefore important that surgeons are able to make accurate predictions of potential benefit and risk, including immediate mortality and morbidity, as well as long-term outcomes, to enable balanced decision-making and fully informed consent. Risks can also be estimated after surgery, based on additional perioperative and intraoperative data, which allows contemporary prediction of outcome. There are numerous risk prediction models that enable the surgeon to quantify risk based on measurable parameters 1 -5 . However, there are inherent limitations in using a generalized risk prediction model, which may not include clinical data pertinent to the individual case in question, leading to variability in model accuracy 6 -10 . As a result, risk prediction tools are generally used in tandem with the surgeon's 'gut feeling' of overall risk and anticipated outcome ('clinical gestalt'). Several disparate factors influence surgeons' perception of outcome: patient factors, such as their perceived fitness, their pathology and planned procedure; setting factors, such as the experience of other members of staff; and surgeon factors, such as clinical knowledge, operative skill, previous significant surgical complications, and inclinations and attitudes 11 -13 . Anticipating surgical risk is subject to multiple biases, which make it challenging. These include the natural tendency toward anecdotal recall and the availability heuristic (the likelihood of making a decision based on how easily the topic or examples come to mind) 14,15 . Some studies 16 -18 support the accuracy and reproducibility of surgeons' predictions, whereas others 19 -22 demonstrate less favourable results. The complexity of synthesizing risk perceptions is significant and incompletely understood 23,24 . The accuracy of surgeons' prediction has not been explored previously in a systematic manner.
The aim of this review was thus to determine, from the available evidence, whether a surgeon's gut feeling or perception of risk correlates with postoperative outcomes, and to compare this prediction with currently available risk scoring systems, where available.

Methods
This systematic review was undertaken in accordance with the PRISMA guidelines 25,26 . MEDLINE (via PubMed), Embase, the Cochrane Library Database, and the Cochrane Collaboration Central Register of Controlled Clinical Trials were searched with no date or language restrictions, with the last search date on 9 July 2018. The search term used was ('Surgeons' [Mesh] OR 'General Surgery/manpower*' [MeSH]) AND ('perception' OR 'intuition' OR 'predict*' OR 'decision making' [mesh]). There was no restriction on publication type. This search was complemented by an exhaustive review of the bibliography of key articles, and also by using the Related Articles function in PubMed of included papers. Results were restricted to human research published in English.

Inclusion and exclusion criteria
All studies of patients undergoing surgery in which a preoperative or postoperative surgeon assessment (or proxy assessment) of a postoperative outcome was performed were included. This included articles that reported general risk (such as mortality) or a surgery-specific risk (for example anastomotic leakage). Studies that made comparisons with established risk scoring tools were also included. Papers or abstracts in English, or non-English papers with an English abstract, were included.
Papers describing the risk assessment of 'theoretical' cases, or patient vignettes in a situation distant from clinical practice (such as a conference), were excluded, as were studies in which surgeons' assessment of risk was compared with an established risk scoring tool, without data on actual patient outcome.

Data extraction and assessment of study quality
Three authors independently extracted data and assessed the methodological quality of the studies, with all data extraction independently checked by the senior author.
The following baseline data were extracted from each study: first author, year of publication, data collection period, geographical location, study design and type (single or multiple centres, number of surgeons involved in risk estimation, whether consecutive patients were enrolled), surgical specialty, whether other risk scoring systems were used for comparison and, if so, whether the assessor was blinded to this result. Data extracted regarding the assessment of risk included: risk outcome assessed; timing of risk estimation (preoperative or postoperative); type of risk assessment by surgeons (qualitative, quantitative, continuous scale such as a visual analogue scale (VAS), or composite score); absolute value of risk event predicted by surgeon and by scoring system; absolute value of risk occurrence rate; summary data on outcome reported, including area under the curve (AUC) of receiver operating characteristic (ROC) curves, observed : expected (O : E) or predicted : observed (P : O) ratios, or any other summary data.
When data were available, AUCs were extracted with their 95 per cent confidence intervals. AUCs greater than 0⋅9 were considered as indicating high performance, 0⋅7-0⋅9 as moderate performance, 0⋅5-0⋅7 as low performance, and less than 0⋅5 as indicating risk assessment no better than chance alone 27,28 .
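The summary measures described above can be illustrated with a minimal sketch. It is not the analysis code of any included study: the function names, the example data and the handling of values falling exactly on a band boundary (0⋅5, 0⋅7 and 0⋅9) are assumptions made for illustration. The AUC is computed as the proportion of event/non-event patient pairs in which the patient with the event was assigned the higher predicted risk, and the O : E ratio as observed events divided by the sum of predicted risks.

```python
from typing import Sequence


def auc_from_pairs(risks: Sequence[float], events: Sequence[int]) -> float:
    """AUC as the fraction of (event, non-event) pairs ranked correctly.

    A tie in predicted risk counts as half a correctly ranked pair.
    """
    with_event = [r for r, e in zip(risks, events) if e == 1]
    without_event = [r for r, e in zip(risks, events) if e == 0]
    if not with_event or not without_event:
        raise ValueError("need at least one event and one non-event")
    score = 0.0
    for p in with_event:
        for n in without_event:
            if p > n:
                score += 1.0
            elif p == n:
                score += 0.5
    return score / (len(with_event) * len(without_event))


def observed_expected_ratio(risks: Sequence[float], events: Sequence[int]) -> float:
    """O : E ratio; a value below 1 means risk was overpredicted."""
    return sum(events) / sum(risks)


def performance_band(auc: float) -> str:
    """Bands as defined in this review; boundary handling is an assumption."""
    if auc > 0.9:
        return "high"
    if auc >= 0.7:
        return "moderate"
    if auc >= 0.5:
        return "low"
    return "no better than chance"


if __name__ == "__main__":
    # Hypothetical predicted mortality risks and observed deaths (1 = died).
    risks = [0.05, 0.10, 0.20, 0.40, 0.60, 0.80]
    deaths = [0, 0, 0, 1, 0, 1]
    auc = auc_from_pairs(risks, deaths)
    print(auc, performance_band(auc))  # 0.875 moderate
    print(f"{observed_expected_ratio(risks, deaths):.2f}")  # 0.93
```

In this hypothetical series the predictions discriminate moderately well (AUC 0⋅875) while slightly overpredicting the number of deaths (O : E ratio below 1), which shows why discrimination and calibration were extracted separately.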
Risk predictions made by pre-existing tools, such as the Physiological and Operative Severity Score for the enumeration of Mortality and morbidity (POSSUM) 1 , Portsmouth-POSSUM (P-POSSUM) 4 or Continuous Improvement in Cardiac Surgery Program (CICSP) 5 , were compared with outcome when given. Internal prediction models, where authors would derive significant predictive co-variables from their data set and assess the accuracy of these co-variables within the same data set, were not evaluated as they lacked validity.
Study quality was assessed using the Newcastle-Ottawa (NO) score 29,30 . The NO score assigns points based on: the quality of patient selection (maximum 4 points); comparability of the cohort (maximum 2 points); and outcome assessment (maximum 3 points). Studies that scored 6 points or more were considered to be of higher quality. The outcomes evaluated were: postoperative mortality; postoperative general morbidity; postoperative procedure-specific morbidity; and long-term outcome (typically operation-specific). Further comparative analyses of outcomes included comparison of preoperative and postoperative predictions, and of predictions made by consultants and surgical trainees.
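The Newcastle-Ottawa scoring rule used for study quality can be expressed as a short sketch. The class and field names are hypothetical; only the three domain maxima (4, 2 and 3 points) and the threshold of 6 points for higher quality are taken from the text.

```python
from dataclasses import dataclass

# Domain maxima as stated in the Methods; the names are illustrative.
DOMAIN_MAXIMA = {"selection": 4, "comparability": 2, "outcome": 3}
HIGHER_QUALITY_THRESHOLD = 6


@dataclass(frozen=True)
class NewcastleOttawaScore:
    selection: int      # quality of patient selection, 0 to 4
    comparability: int  # comparability of the cohort, 0 to 2
    outcome: int        # outcome assessment, 0 to 3

    def __post_init__(self) -> None:
        # Reject points outside the permitted range for each domain.
        for name, maximum in DOMAIN_MAXIMA.items():
            value = getattr(self, name)
            if not 0 <= value <= maximum:
                raise ValueError(f"{name} must be between 0 and {maximum}")

    @property
    def total(self) -> int:
        return self.selection + self.comparability + self.outcome

    @property
    def higher_quality(self) -> bool:
        return self.total >= HIGHER_QUALITY_THRESHOLD


if __name__ == "__main__":
    study = NewcastleOttawaScore(selection=3, comparability=1, outcome=2)
    print(study.total, study.higher_quality)  # 6 True
```

The maximum possible total is 9 points, so the threshold of 6 corresponds to a study attaining two-thirds of the available points.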

Narrative synthesis
Given the marked heterogeneity in study design, patient population included, method of assessing risk and outcomes assessed, meta-analysis was deemed not appropriate. A narrative synthesis was therefore performed according to the Guidance on the Conduct of Narrative Synthesis In Systematic Reviews 31 . Three authors systematically summarized each article using bullet points to document key aspects of each study, focusing particularly on methods used and results obtained. The validity and certainty of the results were noted (whether appropriate statistical comparisons were used and, if so, their effect size and significance). The senior author identified and grouped common themes, divided larger themes into subthemes, tabulated a combined summary of the paper, and synthesized a common rubric for each theme. Consolidated reviewers' comments can be found in Table S1 (supporting information).

Results
A total of 584 articles were identified from the literature search, of which 48 were retrieved for evaluation. Papers were excluded on the basis of being duplicates (1) and being irrelevant based on the title (497) and abstract (38) (Fig. 1). Twenty-seven studies 16 -24,32-49 comprising 20 898 patients met the inclusion criteria and were included in the narrative synthesis (Appendix S1, supporting information).

Baseline characteristics and study design
Study demographics are shown in Table 1.
In all but one study 24 , surgeons overestimated the mortality risk. In six of seven studies comparing mortality estimates, surgeons (AUC range 0⋅68-0⋅91) were outperformed by risk prediction tools (AUC range 0⋅64-0⋅98). The most accurate assessment of mortality risk was in a series of 163 patients undergoing emergency general surgical operations 36 . Both surgeons and anaesthetists assessed risk, with anaesthetists (O : E ratio 0⋅93; AUC 0⋅907) performing marginally better than surgeons (O : E ratio 0⋅83; AUC 0⋅903). In cardiac surgery, surgeons rarely classified individuals as low risk, even when they were 19,37,49 . Four papers provided mortality estimates from risk scoring tools (POSSUM, two studies 17,36 ; P-POSSUM, one study 36 ; CICSP, two studies 18,49 ). These scoring tools provided a lower, and more accurate, absolute figure for mortality estimates, with a greater AUC value (when given) in all studies.
Surgeons overestimated risk in three studies 34,35,41 where data were provided, and underestimated risk in four studies 22,24,33,39 . One study 41 demonstrated that surgeons overpredicted complications in elective cases and underpredicted complications in emergency cases. Surgeons' accuracy in estimating morbidity varied considerably (AUC 0⋅4-0⋅92). The accuracy of prediction tools showed less variability (AUC 0⋅65-0⋅84). Surgeons' predictive accuracy was better than prediction tools in three 17,41,48 of five 17,41,48,22,24 comparative studies. Four papers provided morbidity estimates using POSSUM 17,35,41,48 and P-POSSUM 48 . Surgeons predicted morbidity better than POSSUM, but were comparable with P-POSSUM. P-POSSUM was found to be a better predictor than POSSUM by the authors of one study 48 .

Operation-specific morbidity
Three studies 20,22,39 comprising 2832 patients (all risk assessments made after surgery) evaluated operation-specific morbidity prediction (Table 2). Two 22,39 (274 patients) assessed surgeons' estimate of developing an anastomotic leak after primary anastomosis. Both showed surgeons' estimated leak rate was approximately half the actual leak rate, with a predictive power no better than that from chance alone. One study 22 found an online prediction tool for anastomotic leak (AUC 0⋅84, 95 per cent c.i. 0⋅67 to 1⋅00) to be superior to surgeons at estimating leak rates (AUC 0⋅4). Another study 20 investigated surgeons' ability to predict accurately the risk of postoperative hypocalcaemia (POH) and permanent hypoparathyroidism following thyroid surgery in 2558 patients. Limited data were available, but the more common hypocalcaemia (occurring 28⋅3 per cent of the time) was better predicted than the less frequent hypoparathyroidism (occurring 2⋅5 per cent of the time).
AUC values were poorly reported, but where available ranged from 0⋅51 to 0⋅75. A number of studies 17,21,34,40,42 found that surgeons significantly and consistently overestimated functional, analgesic and overall satisfaction outcomes after spinal, orthopaedic and neurosurgical operations. The only outcomes that were predicted accurately were ambulation at 90 days after emergency hip fracture surgery 17 and length of stay (LOS) 24 .
Of the five studies presenting AUC data, four 23,44,47,48 found that risk perception was better after than before surgery, although some of the improvements were small. One 24 found no difference in prediction accuracy before and after surgery. One study 47 demonstrated that patients with a significantly increased risk assessment after surgery (compared with before surgery) had higher mortality (6⋅3 versus 2⋅4 per cent respectively; P = 0⋅006), major complication (20⋅1 versus 11⋅0 per cent; P = 0⋅001) and all complications (48⋅3 versus 34⋅3 per cent; P = 0⋅001) rates.

Surgeon experience: consultant versus junior
Four papers 21,39,41,48 (2426 patients; 859 preoperative and 2426 postoperative risk assessments) assessed the difference in predictive accuracy between senior surgeons (consultants or attending surgeons) and surgeons in training. Outcomes assessed were morbidity 39,41,48 and functional status 21 . Three papers 21,39,48 (gastrointestinal surgery and neurosurgery) found a trend towards better predictions by surgeons in training, whereas one 41 (elective and emergency major hepatobiliary and gastrointestinal surgery) showed that senior surgeons were better than trainees in predicting outcomes.

Discussion
This systematic review and narrative synthesis examined the accuracy of surgeons' estimates in predicting outcomes. Surgeons' predictions of mortality in both general and cardiac surgery were good, with most of the AUCs presented in papers being greater than 0⋅7. Where data were presented, surgeons consistently overestimated mortality risk. Only one paper 36 assessed anaesthetic risk, and found that anaesthetists predicted mortality following emergency general surgery more accurately than surgeons. In cardiac surgery, surgeons rarely classified individuals as low risk even when they were 19,37,49 . Prediction tools (POSSUM, P-POSSUM and CICSP) consistently predicted mortality rate more accurately than surgeons, with lower absolute values. P-POSSUM performed exceptionally well in a single study 36 of emergency general surgery. Mortality overestimation was a consistent finding in a recent study 50 in which residents were given real-life clinical vignettes and asked to estimate risks. It has been suggested that the pessimism in predictions may allow patients to exceed surgeons' expectations (when pessimistic predictions are proven wrong), which is psychologically preferable to patients failing to meet a pre-established expectation 37 . These findings differ from physicians' estimates of mortality in the ICU. Radtke and colleagues 51 found that ICU physician estimates were as good as risk assessment tools, and either accurately or slightly underestimated mortality risk.
For general morbidity, surgeons were relatively good at predicting outcomes (AUC generally above 0⋅6 where data were given). Data on absolute risk rates were not given routinely, and when presented there was no consistent overprediction or underprediction of risk. One study 41 suggested that surgeons overpredicted complications in elective cases and underpredicted risk in emergency cases. Pre-existing scoring systems were better than surgeons' predictions in some studies 18,22 , but worse in others 33,35,41 . One study 48 demonstrated that surgeons' accuracy in predicting complications improved with feedback from previous predictions. General morbidity occurs shortly after surgery and is often audited and scrutinized by the operating surgeon; this provides constant feedback for fine-tuning individual surgeons' risk estimation.
Three studies 20,22,39 investigated surgeons' ability to predict specific surgical complications accurately. Two studies 22,39 showed that surgeons' predictions of anastomotic leak were exceptionally poor, predicting markedly fewer leaks than occurred, in contrast to a risk prediction tool, which performed well. Although there are several caveats to anastomotic leak predictions, foremost that it is exceptionally unusual to create an anastomosis with an expectation of a leak, the risk assessment tool can be used with good accuracy. The large study by Promberger and co-workers 20 showed that a more common complication was better predicted than a less frequent one, perhaps due to better pattern recognition by the surgeons.
Predictions of long-term outcomes following surgery are variable, in part due to marked heterogeneity, but clearly demonstrate poor predictive power of surgeons.
This summary is based predominantly on spinal surgery, orthopaedic surgery and neurosurgery, in which outcomes are recognized as being variable. Although this does limit generalizability, it may also be that surgeons do not routinely follow up patients for a long time (beyond 1 year), and therefore estimates of long-term outcomes are based on fewer patient encounters than more immediate surgical outcomes. It may also be due to confirmation bias, which is related to the overconfidence hypothesis 52 , when surgeons preferentially remember successful outcomes and forget failures, highlighting the importance of auditing patient outcomes.
This systematic review allowed comparisons between preoperative and postoperative risk predictions, and between senior surgeons and surgeons in training. However, only patients who had a surgical intervention were included, and so this review does not examine the risk assessment of patients managed without surgery, which comprises a large volume of the surgical workload.
A significant weakness of this review was the marked heterogeneity between the included studies, with significant differences in risk assessment methods, statistical analysis, assessment of outcome and data presentation, which precluded meta-analysis. Additionally, given the limited volume of data, it was impossible to perform separate analyses of individual surgical specialties, despite the risk that postoperative outcomes may be perceived significantly differently between various specialties depending on baseline event rate. Furthermore, the information available to the operating surgeon during risk evaluation was not always apparent and estimates may, therefore, have been prejudiced by the use of scoring schemes (such as P-POSSUM). Certain studies 24,32 used subjective outcome measures susceptible to bias. Risk predictions made before and after surgery were grouped together. Finally, a number of studies 16,18,21,23,33 -35,38,40-44 did not provide AUC data (or equivalent). It was therefore impossible to make meaningful statistical comparisons between studies, which might have been possible with a more focused review including only studies with AUC data.
This systematic review has several implications for surgical practice. Surgeons need to be aware of the global limitations of surgeons' judgement. The consistent finding of an increased prediction of mortality suggests surgeons tend towards more pessimistic predictions, which will invariably influence surgical decision-making and patient consent. Recall bias (caused by inconsistencies of recalled events), confirmatory bias (the tendency to interpret new evidence as confirmation of one's existing theories), anchoring bias (preference for reliance on information identified first during information-gathering), overconfidence bias (when a person's subjective confidence in their judgement is consistently greater than the objective accuracy of those judgements), self-serving bias (the tendency to attribute positive events to personal ability, whilst attributing negative events to external factors), as well as numerous others 14,15,52 , will hamper the surgeon's ability to predict outcomes accurately. Existing risk scoring tools, especially P-POSSUM and CICSP, appear to be of significant value and outperform surgeons in their estimation of mortality. However, they invariably cannot capture all variables affecting outcome, and should therefore be used as an adjunct to risk estimation. Recently, a machine-learning algorithm has been developed to predict postoperative outcomes 53 , with AUCs ranging from 0⋅82 to 0⋅94 (99 per cent c.i. 0⋅81 to 0⋅94) for morbidity and 0⋅77 to 0⋅83 (0⋅76 to 0⋅85) for mortality. This tool has the potential of using future data to refine its algorithm automatically and improve its predictive power.
Risk evaluation is a crucial step when the surgeon and patient decide whether to proceed with surgery. Detailed interviews have demonstrated that risk evaluation often occurs before a patient is seen for the first time, and has a profound influence on how likely surgery is to be offered and accepted 54 . Randomized data assessing surgeons' responses to various clinical vignettes showed that access to data from a well validated risk calculator reduced the variability of risk estimation and led to more accurate risk prediction 55 . This is crucial as a composite estimate of risk/benefit is a key determinant of a surgeon deciding whether to offer an operation 56,57 . Although this study did not include papers in which patients did not undergo an operative intervention, the implication of these results is that risk prediction tools could be of value in reducing heterogeneity between surgeons' willingness to offer patients surgery.
When making decisions, there is a clear difference between intuitive, unconscious, automatic thought and deliberate, conscious, analytical thought 58 , sometimes referred to as system 1 (rapid intuitive thinking that relies on personal experience, bias and heuristics) and system 2 (time-consuming deliberate thought requiring focus and dedication) thinking 59 . These systems can be viewed as two ends of a continuum, whereby an expert can move effortlessly from one to the other as the situation requires, described as fluidity. It is likely that unconscious intuition was evaluated predominantly in the included studies, and, where able, compared with a tool that would complement the analytical decision-making aspect. Senior physicians are recognized as using their intuition far more than a novice, in part to avoid overloading their conscious working memory and reduce the risk of burnout associated with excessive system 2 thinking 60,61 . This review highlights the potential value to be gained by using surgical intuition alongside predictive tools, which would complement deliberate and conscious system 2 thought. This decision-making can be further enhanced by regular multidisciplinary team case discussions and frequent reviews of surgical morbidity and mortality.