Benchmarking of robotic and laparoscopic spleen-preserving distal pancreatectomy by using two different methods

Abstract Background Benchmarking is an important tool for quality comparison and improvement. However, no benchmark values are available for minimally invasive spleen-preserving distal pancreatectomy, either laparoscopically or robotically assisted. The aim of this study was to establish benchmarks for these techniques using two different methods. Methods Data from patients undergoing laparoscopically or robotically assisted spleen-preserving distal pancreatectomy were extracted from a multicentre database (2006–2019). Benchmarks for 10 outcomes were calculated using the Achievable Benchmark of Care (ABC) and best-patient-in-best-centre methods. Results Overall, 951 laparoscopically assisted (77.3 per cent) and 279 robotically assisted (22.7 per cent) procedures were included. Using the ABC method, the benchmarks for laparoscopically assisted and robotically assisted spleen-preserving distal pancreatectomy respectively were: 150 and 207 min for duration of operation, 55 and 100 ml for blood loss, 3.5 and 1.7 per cent for conversion, 0 and 1.7 per cent for failure to preserve the spleen, 27.3 and 34.0 per cent for overall morbidity, 5.1 and 3.3 per cent for major morbidity, 3.6 and 7.1 per cent for pancreatic fistula grade B/C, 5 and 6 days for duration of hospital stay, 2.9 and 5.4 per cent for readmissions, and 0 and 0 per cent for 90-day mortality. Best-patient-in-best-centre methodology revealed milder benchmark cut-offs for laparoscopically and robotically assisted procedures, with operating times of 254 and 262.5 min, blood loss of 150 and 195 ml, conversion rates of 5.8 and 8.2 per cent, rates of failure to salvage spleen of 29.9 and 27.3 per cent, overall morbidity rates of 62.7 and 55.7 per cent, major morbidity rates of 20.4 and 14 per cent, POPF B/C rates of 23.8 and 24.2 per cent, duration of hospital stay of 8 and 8 days, readmission rates of 20 and 15.1 per cent, and 90-day mortality rates of 0 and 0 per cent respectively. 
Conclusion Two benchmark methods for minimally invasive distal pancreatectomy produced different values, and should be interpreted and applied differently.


Introduction
Benchmarking is a process in which the performance of best-in-class performers is measured to establish reference values to enable comparison of outcomes against those of the best in the industry 1 . Lately, there has been growing interest in this concept from the surgical field, considering that benchmarks can encourage surgeons to reach the highest possible level of clinical quality and not just perform to the average 2 .
Interestingly, the approach for defining benchmarks in surgery differs among published studies. The majority of the studies followed the best-patient-in-best-centre methodology, first described by Staiger et al. 2 , in which benchmarks are derived in a predefined low-risk population as the 75th percentile of the median proportion of outcomes across high-volume units. These benchmarks are supposed to mirror a realistic cut-off value, because not only the top few, but 75 per cent of that median proportion, achieved in high-volume centres, represent the benchmark. However, for the same reason, they may not be considered as intuitive or very strict.
In contrast, fewer studies have followed the Achievable Benchmark of Care (ABC™; University of Alabama, Birmingham, Alabama, USA) methodology 19 . This aims to present benchmarks as the best achievable outcomes derived from top performers in an unselected population from centres with different volumes. These benchmarks could be considered too ambitious; however, ABC benchmarks do not imply that the outcome always has been or can be achieved. Rather, they illustrate the gap between benchmark and personal performance, to encourage potential improvement knowing that the target level has been achieved 19 . Furthermore, these benchmarks are applicable to real-life surgical patients instead of only low-risk patients. To date, it remains unknown whether these two methodologies give similar results, and there is no clear consensus on the best benchmark methodology to apply.
This study aimed to establish benchmarks for laparoscopic and robotic spleen-preserving distal pancreatectomy (SPDP), integrating the ABC and best-patient-in-best-centre methodologies, and to investigate the impact of different methodologies in defining benchmarks and subsequent interpretation and guidance.

Study population and design
Data from patients undergoing either laparoscopically or robotically assisted SPDP for benign and premalignant lesions were extracted from a retrospective database of centres participating in the European Consortium on Minimally Invasive Pancreatic Surgery (E-MIPS) (2006-2019). The anonymous data were collected from the principal investigators of each centre using a Microsoft ® Excel ® (Microsoft, Redmond, Washington, USA) datasheet. All data were stored in the database and secured with a password. Consecutive patients, aged 18 years or above, were included. Patients who underwent intraoperative splenectomy but were intended for a spleen-preserving procedure were also included. Patients were excluded from a benchmark calculation if there were any missing data for that specific outcome.

Ethics
The study was conducted according to the principles of the Declaration of Helsinki (64th Fortaleza Brazil, October 2013), and in accordance with the Medical Research Involving Human Subjects Act and STROBE guidelines on reporting of observational studies 20 . The ethical board of Amsterdam UMC waived the need for informed consent owing to the retrospective design.

Variables and definitions
Preoperative variables included baseline characteristics, such as age, sex, American Society of Anaesthesiologists (ASA) fitness grade 21 , body mass index (BMI), previous abdominal surgery, and tumour size. Indicators of surgical performance were identified based on literature 11,22,23 and clinical relevance. Ten clinically relevant intraoperative and postoperative outcomes were selected for benchmarking. These included surrogate outcomes of both overall surgical quality (namely duration of operation, intraoperative blood loss, conversion, overall morbidity, major morbidity, duration of hospital stay, readmission, 90-day mortality) and procedure-specific quality (such as failure to preserve the spleen and postoperative pancreatic fistula (POPF)). Postoperative outcomes were recorded up to 90 days after surgery.
Conversion was defined as any procedure that started as minimally invasive but underwent unplanned or unintended laparotomy, or required hand assistance 24 . Overall morbidity included any postoperative complication according to the Clavien-Dindo classification; major morbidity was defined as that with a Clavien-Dindo grade of III or higher 25 . Clinically relevant grade B/C POPF was defined in accordance with the International Study Group of Pancreatic Surgery 26 . Spleen-preserving procedures were classified according to the Kimura 27 or Warshaw 28 method. Failure to preserve the spleen included patients in whom spleen preservation was intended before surgery, but intraoperative splenectomy was performed.

Statistical analysis
Categorical data are presented as proportions, normally distributed continuous data as mean values, and continuous data with a skewed distribution as median (i.q.r.). Normality of distribution was checked by the Kolmogorov-Smirnov test. Mann-Whitney U, χ² and Fisher's exact tests were used as appropriate to compare baseline characteristics. Statistical significance was set at two-sided P < 0.050. Data were analysed using SPSS ® for Windows ® version 26.0 (IBM, Armonk, NY, USA).

Best achievable outcome benchmarks (Achievable Benchmark of Care)
The benchmark calculation on the total unselected cohort was performed according to ABC methodology 19 . With this method, benchmark values represent the best achievable outcomes, calculated for the consecutive best performing centres for a specific outcome until at least 10 per cent of the patient pool across all centres is reached. A threshold of 10 per cent is used to ensure that best practice is measured reliably based on a few remarkable centres, thus avoiding inclusion of outcomes of average care, usually delivered by the majority. Exclusion of centres that provide fewer procedures is not necessary, as the calculation adjusts for the impact of a small sample size in a centre (adjusted performance fraction) without eliminating that centre.
First, the adjusted performance fraction was calculated by adding 1 to the number of events (numerator) and 2 to the number of patients (denominator), and then dividing the adjusted numerator by the denominator.
Second, the adjusted performance fractions for all the centres were sorted from the lowest (best performing centre) to the highest value. Centres included in the benchmark calculation were the centres with the consecutive lowest adjusted performance fraction until the sum of patients reached at least 10 per cent of the cohort for that specific outcome. The ABC for that outcome was calculated by dividing the sum of all events in the benchmark centres (numerator) by the sum of patients in the benchmark centres (denominator). For this purpose, only centres with at least one event in overall morbidity, major morbidity, conversions, and POPF were included in the analysis as for these outcomes an event rate of zero was not considered achievable in real-life practice. The corresponding 25th, 50th, and 75th percentiles were also reported for each outcome. For continuous outcomes, such as duration of hospital stay and operating time, ABCs were calculated as the 10th percentile of the median value across all centres.
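As an illustration only, the two steps described above for proportion-type outcomes can be sketched in Python. The centre labels and counts below are invented for the example and are not study data. The names `Centre`, `abc_benchmark`, and `require_event` are hypothetical, and taking the 10 per cent threshold over all centres with data for the outcome (before any event-based exclusion) is an assumption, as the text does not specify this detail.

```python
from dataclasses import dataclass


@dataclass
class Centre:
    name: str      # hypothetical centre label
    events: int    # patients with the outcome (e.g. conversion)
    patients: int  # patients with data available for this outcome


def adjusted_performance_fraction(c: Centre) -> float:
    """Step 1: add 1 to the events and 2 to the patients, then divide."""
    return (c.events + 1) / (c.patients + 2)


def abc_benchmark(centres, pool_fraction=0.10, require_event=False):
    """Step 2: pool the best-ranked centres until they cover at least
    pool_fraction of the cohort; return the pooled rate and centre names."""
    # Assumption: the cohort size counts every centre with data.
    total_patients = sum(c.patients for c in centres)
    if require_event:
        # Drop zero-event centres for outcomes where a rate of zero
        # is not regarded as achievable in practice.
        centres = [c for c in centres if c.events > 0]
    ranked = sorted(centres, key=adjusted_performance_fraction)

    chosen, covered = [], 0
    for c in ranked:
        chosen.append(c)
        covered += c.patients
        if covered >= pool_fraction * total_patients:
            break
    pooled_events = sum(c.events for c in chosen)
    return pooled_events / covered, [c.name for c in chosen]


# Invented example data (290 patients in total, so the threshold is 29).
centres = [
    Centre("A", 0, 4),
    Centre("B", 1, 20),
    Centre("C", 3, 40),
    Centre("D", 10, 60),
    Centre("E", 20, 100),
    Centre("F", 15, 66),
]
rate, names = abc_benchmark(centres)
print(names, f"{rate:.3f}")  # ['B', 'C'] 0.067
```

In this example, centre A has a raw event rate of zero but only four patients; its adjusted fraction (1/6) ranks it behind centres B and C, which shows how the adjustment limits the influence of small centres without excluding them.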

Best-patient-in-best-centre method
The benchmark calculation for the best-patient-in-best-centre method was performed as described by Staiger et al. 2 . In this methodology, benchmarks are calculated in a predefined low-risk patient cohort treated in expert centres. The benchmarks are represented as the 75th percentile of the medians for each centre for each outcome and considered as a cut-off value, not as best achievable results. These benchmarks reflect realistic and acceptable cut-off values that a performer is at least expected to achieve. Thus, individual values for performers below the benchmark value (75th percentile) indicate acceptable outcomes, whereas values above the benchmark indicate 'bad or worse' performance and may require closer attention and evaluation of the potential cause. Centres included in the best-patient-in-best-centre benchmark analysis were required to perform at least 10 minimally invasive distal pancreatectomies annually over the years that they provided patient data. Selection criteria for low-risk patients were based on those applied in the benchmark analysis for pancreatoduodenectomy 11 , with only the surgical criteria adapted to distal pancreatectomy (Table 1).
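A minimal sketch of this cut-off calculation for a continuous outcome is given below, again in Python and with invented numbers rather than study data. The function names, the dictionary layout, and the linear-interpolation rule for the percentile are assumptions made for illustration; the text does not state which percentile definition was used, and treating a value equal to the cut-off as acceptable is likewise an assumption.

```python
from statistics import median

MIN_ANNUAL_VOLUME = 10  # volume threshold stated in the text


def percentile(values, q):
    """q-th percentile using linear interpolation between order statistics
    (an assumed convention)."""
    xs = sorted(values)
    if not xs:
        raise ValueError("percentile of empty data")
    pos = (len(xs) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def benchmark_cutoff(centres, q=75):
    """75th percentile of the per-centre medians, restricted to centres
    that meet the annual volume threshold."""
    medians = [
        median(c["values"])
        for c in centres.values()
        if c["annual_volume"] >= MIN_ANNUAL_VOLUME
    ]
    return percentile(medians, q)


def is_acceptable(value, cutoff):
    """Values at or below the cut-off are treated as acceptable here."""
    return value <= cutoff


# Invented operating times (min) for low-risk patients, by centre.
centres = {
    "A": {"annual_volume": 14, "values": [180, 200, 220]},
    "B": {"annual_volume": 22, "values": [150, 160, 210, 240]},
    "C": {"annual_volume": 11, "values": [230, 250, 260]},
    "D": {"annual_volume": 30, "values": [190, 210, 300]},
    "E": {"annual_volume": 12, "values": [240, 270, 280, 310]},
    "F": {"annual_volume": 6, "values": [120, 130]},  # below threshold
}
cutoff = benchmark_cutoff(centres)
print(cutoff)  # 250.0
```

Centre F is excluded by the volume criterion, so the cut-off is taken from the five remaining centre medians (185, 200, 210, 250, and 275 min). Calling `percentile` with `q=10` on centre medians would correspond to the 10th-percentile rule that the ABC section describes for continuous outcomes.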

Unselected population
In the study interval, 1230 patients were scheduled for minimally invasive SPDP from 32 centres of the European Consortium on Minimally Invasive Pancreatic Surgery (Fig. S1). The overall rates of conversion, overall morbidity, and major morbidity were 8.2 per cent (101 patients), 50.6 per cent (622), and 13.9 per cent (170) respectively, with no significant differences between the laparoscopic and robotic approach. The median duration of operation was significantly longer in the robotically assisted SPDP group than in the laparoscopically assisted group (262.5 (210-340) versus 195 (150-254) min; P < 0.001), as was the median duration of hospital stay (8 (6-11)
Table 4 Baseline characteristics, and perioperative and postoperative outcomes for the low-risk cohort used in the best-patient-in-best-centre benchmark analysis. Columns: LSPDP (n = 602), RSPDP (n = 162), P*

ABC benchmarks
The best achievable outcome benchmarks for the unselected cohort with their percentile ranks for 10 clinically relevant intraoperative and postoperative domains are reported in Table 3. The ABCs for laparoscopically assisted SPDP among centres were 150 min for duration of operation, 55 ml for intraoperative blood loss, 3.5 per cent for conversion, 0 per cent for failure to preserve the spleen, 27.3 per cent for overall morbidity, 5.1 per cent for major morbidity, 3.6 per cent for POPF, 5 days for duration of hospital stay, 2.9 per cent for readmission, and 0 per cent for 90-day mortality. The ABCs for robotically assisted SPDP among centres were 207 min for duration of operation, 100 ml for intraoperative blood loss, 1.7 per cent for conversion, 1.7 per cent for failure to preserve the spleen, 34.0 per cent for overall morbidity, 3.3 per cent for major morbidity, 7.1 per cent for POPF, 6 days for duration of hospital stay, 5.4 per cent for readmission, and 0 per cent for 90-day mortality (Table 3).

Best-patient-in-best-centre benchmarks
From the total cohort of 1230 patients treated in 32 centres, 764 (62 per cent) low-risk patients treated in 23 centres were identified for the best-patient-in-best-centre benchmark analysis. Exclusion and inclusion criteria for the low-risk cohort are reported in Table 1. Of the 764 procedures, 602 (79 per cent) were laparoscopically assisted and 162 (21 per cent) were robotically assisted. Patient and operative characteristics are summarized in Table 4. The benchmark cut-offs for laparoscopically assisted SPDP were 254 min for duration of operation, 150 ml for intraoperative blood loss, 5.8 per cent for conversion, 29.9 per cent for failure to preserve the spleen, 62.7 per cent for overall morbidity, 20.4 per cent for major morbidity, 23.8 per cent for POPF, 8 days for duration of hospital stay, 20 per cent for readmission, and 0 per cent for 90-day mortality (Table 5).
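Robotically assisted SPDP benchmark cut-offs were 262.5 min for duration of operation, 195 ml for intraoperative blood loss, 8.2 per cent for conversion, 27.3 per cent for failure to preserve the spleen, 55.7 per cent for overall morbidity, 14 per cent for major morbidity, 24.2 per cent for POPF, 8 days for duration of hospital stay, 15.1 per cent for readmission, and 0 per cent for 90-day mortality (Table 6).
Table 5 Best-patient-in-best-centre 75th percentiles (benchmark cut-offs) and 25th percentiles in comparison to those for Achievable Benchmark of Care best achievable benchmarks and percentiles for laparoscopically assisted spleen-preserving distal pancreatectomy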

Discussion
This pan-European multicentre retrospective study identified benchmarks for 10 clinically relevant surgical outcomes after laparoscopically and robotically assisted SPDP using the two most widely established and validated methodologies 2,19 , applied to unselected and low-risk patients.
Based on an unselected population of 951 laparoscopically and 279 robotically assisted procedures from 32 European centres, the ABC best achievable values for both procedures were quite comparable for most parameters. The biggest differences were found in duration of surgery and POPF rates, favouring the laparoscopic approach. The superiority of the laparoscopic approach in terms of operating time has been noted in previous cohort studies and meta-analyses [29][30][31] . According to a recently published fistula risk score for distal pancreatectomy, longer operating times can increase the risk of POPF 32 , which might explain the higher POPF rate after robotically assisted SPDP. None of the other variables included in the new fistula risk score differed significantly between the two groups. For both procedures, conversion rates were low: 3.5 per cent for laparoscopically assisted and 1.7 per cent for robotically assisted SPDP. In recent studies 15,18,33 , conversion rates have varied from 0 to 9 per cent, making the ABC values of 3.5 and 1.7 per cent obtained here seem realistic. The lower ABC conversion rate for the robotic procedure aligns with the results of a recent meta-analysis 31 that reported lower conversion rates for robotically compared with laparoscopically assisted SPDP. This could be attributed to the features of the robotic system, allowing greater dexterity and three-dimensional vision.
A remarkable outcome of this study is the ABC rate of 0 per cent for laparoscopically assisted and 1.7 per cent for robotically assisted SPDP for failure to preserve the spleen. Centres with no events were included in the benchmark calculation, as an event rate of zero was considered feasible and what should be strived for, even in real-life practice. The current literature confirms the feasibility of such ambitious values, as 100 per cent rates of successful spleen preservation have been reported in previous studies [33][34][35] . Although other studies [29][30][31] have pointed towards the superiority of the robotic approach in terms of splenic preservation, the present findings do not confirm this.
Profound differences were noted between the ABC benchmarks for the unselected cohort and the best-patient-in-best-centre benchmarks for the low-risk cohort. As the best-patient-in-best-centre methodology aims to provide cut-off values rather than best achievable results, these benchmarks are more lenient than the ABC benchmarks and were more comparable to the ABC medians or 75th percentiles. The largest differences between the best-patient-in-best-centre 75th percentiles and the ABC 75th percentiles (that is, differences between the low-risk cohort and the total cohort) were found for conversion, morbidity, POPF, and readmission rates; the best-patient-in-best-centre benchmarks were lower and thus stricter. These findings suggest that a low-risk cohort mainly results in lower rates of conversion, morbidity, and readmission, and to a lesser extent affects parameters such as operating time, duration of hospital stay, and spleen preservation. Previous literature supports these findings given the exclusion criteria used for the low-risk cohort in which the best-patient-in-best-centre benchmarks were established; extended or multivisceral resections, ASA grade at least III, major previous abdominal surgery, and BMI 35 kg/m² or higher have been associated with higher rates of conversion and postoperative morbidity [36][37][38][39] . Excluding such patients may have resulted in better outcomes and thus led to stricter best-patient-in-best-centre 75th percentiles compared with the ABC 75th percentiles.
The differences in benchmark outcomes obtained with the two methodologies show that applying the concept of benchmarking necessitates realism in the choice of methodology and that the two approaches require markedly different interpretation. The present study has shown that the type of benchmark used in daily practice should depend on two factors. The first of these is the purpose of benchmarking. If the purpose is to compare outcomes between departments or hospitals, it is recommended to use best-patient-in-best-centre benchmark cut-offs to illustrate an accepted level of performance. Conversely, if the intention is to compare individual surgeons and motivate them towards superior performance, ABC benchmarks, representing the best achievable outcomes, would be more useful. Second, the type of cohort must be considered in the choice of benchmark. As the methodologies have been developed and validated in different patient cohorts, and the benchmark values in the present study have been obtained accordingly, the benchmarks should be applied to the appropriate patient cohort to generate reliable and equal comparisons. Discrepancy may arise when ABC benchmarks are applied to a low-risk cohort, as they will most likely be more easily achieved; on the other hand, an unselected population would be expected to perform outside the benchmark cut-offs when best-patient-in-best-centre benchmarks are applied.
The preferred type of benchmark remains debatable, as clarity on the most reliable or appropriate methodology is still lacking. Recently, a standardized methodology for establishing benchmarks based on a Delphi consensus was published 40 , endorsing the best-patient-in-best-centre methodology of the present study. However, even though many points were clarified during this consensus, uncertainties remain on the correct use of benchmarks in clinical practice and their generalizability to non-low-risk patients. The present authors wonder whether it is time for the surgical community to consider a new or modified benchmark method, one that considers both best achievable and acceptable results, balances between ABC and best-patient-in-best-centre methods, and can identify personalized benchmarks based on different risk groups. Clinical expertise and judgement on the subject of benchmarking are needed to ensure equal and accurate comparison of clinical outcomes in (pancreatic) surgery.
The results of this study should be interpreted in the light of several limitations. First, owing to the retrospective design and the wide time span of the study, confounding factors and changes in (post)operative policies and definitions of outcomes over time may have influenced outcomes, such as duration of hospital stay. In addition, because no data on the learning curve phase at the time of surgery were available, it is possible that surgeons provided data during different phases of their learning curve. As proficiency learning curves may be quite long, this could have biased the results 41 . Second, significant differences in outcomes such as major complications were found when comparing the Kimura and Warshaw methods, and might have influenced some benchmark outcomes. Third, only 279 robotically assisted procedures were collected in the database, which may be on the low side and therefore could raise doubts about the reliability of benchmarks for robotic procedures compared with the laparoscopic benchmarks. A major strength of this study is that the data were retrieved from a pan-European database, making the results generalizable and better reflecting the real-world scenario.

Funding
This research received no external funding.