Evaluation of objective tools and artificial intelligence in robotic surgery technical skills assessment: a systematic review

Abstract Background There is a need to standardize training in robotic surgery, including objective assessment for accreditation. This systematic review aimed to identify objective tools for technical skills assessment, providing evaluation statuses to guide research and inform implementation into training curricula. Methods A systematic literature search was conducted in accordance with the PRISMA guidelines. Ovid Embase/Medline, PubMed and Web of Science were searched. Inclusion criterion: robotic surgery technical skills tools. Exclusion criteria: non-technical, laparoscopy or open skills only. Manual tools and automated performance metrics (APMs) were analysed using Messick's concept of validity and the Oxford Centre for Evidence-Based Medicine (OCEBM) Levels of Evidence and Recommendation (LoR). A bespoke tool analysed artificial intelligence (AI) studies. The modified Downs–Black checklist was used to assess risk of bias. Results Two hundred and forty-seven studies were analysed, identifying: 8 global rating scales, 26 procedure-/task-specific tools, 3 main error-based methods, 10 simulators, 28 studies analysing APMs and 53 AI studies. Global Evaluative Assessment of Robotic Skills and the da Vinci Skills Simulator were the most evaluated tools at LoR 1 (OCEBM). Three procedure-specific tools, 3 error-based methods and 1 non-simulator APM reached LoR 2. AI models estimated skill or clinical outcomes, demonstrating superior accuracy in the laboratory, where 60 per cent of methods reported accuracies over 90 per cent, compared with accuracies of 67 to 100 per cent in real surgery. Conclusions Manual and automated assessment tools for robotic surgery are not well validated and require further evaluation before use in accreditation processes. PROSPERO: registration ID CRD42022304901


Introduction
Robotic surgery is increasingly being adopted due to improved vision, dexterity and surgical ergonomics. In selected procedures there is supportive evidence demonstrating non-inferiority and lower morbidity compared to laparoscopy [1][2][3][4][5]. Minimally invasive surgery (MIS) is complex, highly variable and requires technical skill, with unfavourable error profiles compared to industrial data 6. Meanwhile, the addition of new technology into the operating room, with novel technical and non-technical considerations, increases the potential for human error, and therefore patient risk 7. In the UK, 10-15 per cent of surgical patients experience adverse events, of which 50 per cent are preventable 8. Adverse events relating to robotic procedures (10 624) were reported in the USA between 2000 and 2013 9, while a global independent review on health technology hazards identified a lack of robotic surgical training as one of the top 10 risks to patients 10. This deficit is being addressed through development and standardization of basic and specialty curricula [11][12][13][14][15][16][17][18][19][20][21][22][23].
Robotic surgical procedures require high levels of experience. Evaluation of performance in surgery is shifting from time- and operative numbers-based assessment towards proficiency-based training and accreditation 24. To assist this, objective tools are frequently employed, but they must be fully evaluated if they are to be used as summative, high-stakes assessment instruments. Traditionally, proficiency in surgery was extrapolated from clinical outcomes such as histopathology, morbidity and mortality, yet these are subject to multifactorial influences. Intraoperative performance analysis has proved to be a fruitful area for

Selection of eligible studies
Four reviewers independently screened, reviewed and extracted data, with the primary investigator reviewing all articles. Disagreements were resolved through discussion with the corresponding author.
Included studies followed the PICO question:
• Population: participants being assessed on robotic technical skill.
• Intervention: an objective technical skill assessment tool or method is developed and/or implemented.
• Comparison: to other tools or measurement of assessment.
Exclusion criteria were solely laparoscopic and/or open skills assessment, or failure to retrieve the article or an English translation.

All studies
Study details including year, country, participant number, participant expertise level and evaluator type were extracted. Identified studies were grouped based on study and tool types into manual tools, automated performance metrics (APMs) and evaluation of statistical models or AI algorithms. These domains of technical skill assessment were devised using approaches employed by previous reviews and reflect different assessment methods. The manual domain is human assessment, with subgroups of global rating scales, procedure-specific and error-based tools. APMs are metrics produced by computer software, typically in virtual reality (VR) simulators. Finally, AI algorithms are mathematical models implemented to process input data and estimate skill or clinically related outputs, for example using kinematic data to predict postoperative urinary incontinence 36 or vision data to predict skill level (Fig. 1).

Manual and APM studies
Due to the heterogeneity in methods of technical skill assessment, different approaches were applied to facilitate evaluation. Manual tools and automated performance metrics (non-AI articles) were evaluated using Messick's validity concept 43 and the modified educational Oxford Centre for Evidence-Based Medicine (OCEBM) Levels of Evidence (LoE) and Levels of Recommendation (LoR) 44. Messick's concept views validity as a continuous process and a combination of the classical views of face, content, construct and predictive validity, internal consistency, and intra- and inter-rater reliability. Rather than viewing these separately, five aspects that need to be considered for an assessment tool to be valid were assessed (Table 1). The strength and significance of correlational analyses were also extracted using standardized definitions.

Artificial intelligence studies
AI specialists contributed to screening, data extraction and evaluation and a bespoke data extraction template was employed to standardize data capture.

Methodological quality assessment
Methodological quality assessment was performed using a modified Downs-Black checklist (Table S2) 45. Due to study heterogeneity some aspects were not applicable; therefore, taking a pragmatic approach, we modified the score in these circumstances, with a maximum score of 10 available. For AI studies, it was not feasible to apply a relevant methodological quality tool such as the Downs-Black checklist or the Medical Education Research Study Quality Instrument (MERSQI) 46, as most study designs are conceptual.
Tables 2-4 summarize the main tools in each domain of technical skill assessment, and the Supplementary Tables (Tables S3 to S8) describe each study's analysis. Summaries of the remaining tools can be viewed in Table S9.

Results
Two thousand, nine hundred and forty-four studies were identified from searches, with 85 identified from additional sources. Seven hundred and forty-nine duplicates were removed. Of 2280 studies that were screened and reviewed, 2033 were excluded, with 247 studies undergoing data extraction (Fig. 2). Two hundred and twenty-seven studies were classified as observational, including Delphi meetings, experimental, cohort and randomized studies not defined as randomized controlled trials (RCTs), while a total of 20 RCTs were identified. Of the included studies, 93 used global rating scales (GRS), 45 procedure- or task-specific tools, 43 error-based methods, 77 simulator automated performance metrics, 28 non-simulator automated performance metrics and 53 AI methods.

Global rating scale tools
Eight different GRS tools were identified (Table 2 and Table S3). Global Evaluative Assessment of Robotic Skills (GEARS) was the most utilized assessment method, assessed in 58 studies including 12 RCTs, giving a Level 1 recommendation, with 21 studies reporting excellent reliability and 3 reporting low/poor reliability. Interestingly, crowd-sourced GEARS ratings demonstrated excellent inter-rater reliability 47, and good to strong/excellent inter-observer group reliability compared to experts [48][49][50][51], as well as construct 48,52 and predictive validity 53. GEARS (all raters) demonstrated supportive evidence of 'relationship to other variables', including concurrent (17 studies), construct (25 studies) and predictive validity (3 studies).
Objective Structured Assessment of Technical Skills (OSATS), Global Operative Assessment of Laparoscopic Skills (GOALS) and Robotic-OSATS (R-OSATS) were used in a total of 34 studies and all received a Level 2 recommendation, despite only one of them being robotic-specific. GEARS has no robust data validating a benchmark for overall and domain scores, whereas GOALS and R-OSATS used the contrasting groups method 54 and the modified Angoff method 55, setting competency at 80 per cent and 70 per cent, respectively. All other tools identified have not been thoroughly evaluated, with LoR 3 or 4 (Table S9).


Error tools
Three main tools underwent multiple study evaluations (Table 2): the Fundamental Laparoscopic Skills (FLS) scoring system, the Generic Error Rating Tool (GERT) and Task-Performance Metrics. The most common error method was the cumulative number of errors, reported in 20 of 43 studies (46.5 per cent; Table S5). In 13 studies (69.7 per cent) a composite score was created, while a further study defined arbitrary task-specific time penalties. There is substantial variability in the definition and measurement of errors, robust evaluation of validity is often missing, and multiple tools were present in only single studies. The FLS scoring system gives a composite score and was used in 9 (20.9 per cent) studies. Task-Performance Metrics tools define errors and were used in 5 (11.6 per cent). The GERT tool assesses a surgical task group, error mode, number, description and mechanism of event, and was analysed in 2 (4.6 per cent) studies. All three methods reached Level 2 recommendation. FLS and Task-Performance Metrics tools both had evidence of internal structure and relationship to other variables, with excellent reliability, and concurrent and construct validity. There were no reports on predictive validity or benchmarking of these tools.

Simulator automated performance metrics
Ten different simulators were identified (Table S6, Table 3). Automated performance metrics in simulation environments have been thoroughly evaluated, with 39 (50.6 per cent) studies on the da Vinci Skills Simulator (dVSS), 17 (22.1 per cent) on the dV-Trainer (dV-T) and 9 (11.7 per cent) on the RobotiX Mentor. Sixteen (76.2 per cent) of the 20 RCTs in this review involved simulators. These three simulators have been validated in all five Messick domains, exhibiting concurrent and construct validity. dVSS and dV-T training also predicted better performance on the console in operative and dry model performances. In addition, more comprehensive evaluation in the consequence domain has been carried out for all three when compared to other assessment tools. Current evidence favours the dVSS at Level 1 recommendation. The dV-T, RobotiX Mentor, ProMIS hybrid surgical simulator, Robotic Surgery Simulator (RoSS) and 3D hydrogel models with 'Clinically Relevant Objective or Performance Metrics (CROMS/CRPMS)' all receive a Level 2 recommendation. The Versius trainer from CMR Surgical has currently been evaluated at Level 4 recommendation. Simulators [99][100][101]283 unlikely to be in wide usage were identified and excluded from Table 3. Notably, the vast majority of studies (70; 90.9 per cent) looked at basic skills, with only 6 (7.8 per cent) reviewing procedure-specific VR 52,84,99,[102][103][104].

Non-simulator automated performance metrics
Of the 28 included studies (Table 3 and Table S7), 16 used da Vinci Application Programming Interface (API) kinematic and system event data, with 6 (21.4 per cent) from the operating room and all within urology. Kinematic and system event data from the da Vinci API demonstrated evidence of excellent inter- and intra-rater reliability 169,206,207 (Table 3).

Artificial intelligence
Fifty-three AI studies were identified (Table 4 and Table S8). The number of participating surgeons across the AI studies varied from 1 106 to 77 107 (median = 8). Most studies employed the publicly available JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS) 109, which features 139 trials of suturing, knot-tying and needle-passing exercises from eight surgeons, with kinematic data. The most common level grouping was between expert and novice surgeons; however, there was significant heterogeneity in how this was defined. In most cases, expertise was defined by a surgeon's caseload, with wide variability, for example 50 to over 2500 cases 33,105. Other studies assigned assessment scores and grouped surgeons above a predefined threshold as experts and below as novices 108.
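As a purely illustrative sketch of the heterogeneous expertise definitions described above (the thresholds of 100 cases and a GEARS score of 18 are arbitrary assumptions, not values taken from the included studies), skill labels could be derived as follows:

```python
# Hypothetical sketch: deriving binary expert/novice labels from surgeon metadata.
# Threshold values are arbitrary assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class Surgeon:
    surgeon_id: str
    caseload: int        # number of prior robotic cases
    gears_score: float   # mean manual assessment score (GEARS total, 6-30)

def label_by_caseload(s: Surgeon, threshold: int = 100) -> str:
    """Label a surgeon as expert or novice from prior caseload."""
    return "expert" if s.caseload >= threshold else "novice"

def label_by_score(s: Surgeon, cutoff: float = 18.0) -> str:
    """Label a surgeon as expert or novice from a predefined assessment-score cut-off."""
    return "expert" if s.gears_score >= cutoff else "novice"

surgeons = [Surgeon("S1", 350, 24.5), Surgeon("S2", 12, 14.0)]
for s in surgeons:
    print(s.surgeon_id, label_by_caseload(s), label_by_score(s))
```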
Forty-one (77.4 per cent) studies evaluated their AI models using data obtained from simulators or dry lab simulations, while 12 (22.6 per cent) studies used data collected from real surgical procedures.The most frequently used dry lab data set (24/41; 58.5 per cent) was JIGSAWS.
The majority approached skill assessment as a classification task (28/53; 52.8 per cent), with the aim of predicting the participant's skill level. Twenty (37.7 per cent) studies estimated an assessment score (numerical regression) corresponding to an assessment tool. Notably, only three [118][119][120] attempted to estimate the individual domains of the tool, with the remainder predicting the total score.
A few studies adopted a different approach to assessing skill: ranking performance 121, estimating the clearness of the operating field 116, using stylistic behaviour labels 122 and linking skill levels to clinical outcomes in RARP 36,113,115,123.
Of the 53 studies, 20 (37.7 per cent) utilized video data, 29 (54.7 per cent) used kinematics, 7 (13.2 per cent) employed system events and 3 (5.6 per cent) used force data. Furthermore, a few others utilized clinical parameters such as BMI and prostate-specific antigen (PSA) 114, eye-tracking and electroencephalography (EEG) signals, electromyography (EMG) data and galvanic skin response (GSR) 124, surgical gesture sequences 114 and stylistic behaviour components 122,125,126. Among these studies, 33 (62.3 per cent) used a single input modality (for example, video only), while 20 (37.7 per cent) utilized two or more input modalities.
Twenty-six (49.1 per cent) studies used classic machine learning methods, with the support vector machine (SVM) being the most common (13/26; 50 per cent); most used APMs as input. Twenty-seven (51 per cent) employed deep learning methods, with 19 (35.8 per cent) using convolutional neural networks (CNNs). Video-based deep learning methods used a CNN to extract visual features, which are then fed either to a temporal model 110,111,[127][128][129][130][131] or to a simple classifier/regressor 108,112,116,131,132. Kinematic-based deep learning approaches use temporal convolutional networks (TCNs) 110,133,134, recurrent neural networks (RNNs) 129,135 or a combination of the two 119,136,137. Notably, deep learning approaches have gained popularity in surgical skill assessment (Fig. S1).
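The following minimal sketch illustrates the generic video-based pattern described above, in which a CNN extracts per-frame features that a temporal model aggregates into a skill prediction; it is not the architecture of any cited study, and the backbone choice, input shapes, hyperparameters and class count are all assumptions.

```python
# Minimal sketch (not a published model): per-frame CNN features -> LSTM -> skill logits.
import torch
import torch.nn as nn
from torchvision import models

class VideoSkillClassifier(nn.Module):
    def __init__(self, num_classes: int = 2, feat_dim: int = 512, hidden: int = 256):
        super().__init__()
        backbone = models.resnet18(weights=None)   # per-frame visual feature extractor
        backbone.fc = nn.Identity()                # keep the 512-d pooled features
        self.cnn = backbone
        self.temporal = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)  # e.g. expert vs novice logits

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, time, channels, height, width)
        b, t, c, h, w = video.shape
        feats = self.cnn(video.view(b * t, c, h, w)).view(b, t, -1)
        _, (h_n, _) = self.temporal(feats)
        return self.head(h_n[-1])                   # classify from the last hidden state

model = VideoSkillClassifier()
dummy = torch.randn(2, 8, 3, 224, 224)              # 2 synthetic clips of 8 frames each
print(model(dummy).shape)                            # torch.Size([2, 2])
```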
To evaluate their developed methods, most studies used classification accuracy and Spearman's correlation coefficient (SCC). For models tested on real surgical data, accuracy ranged from 67 to 100 per cent and SCC from 0.41 to 0.64; these results were generally inferior to those obtained on simulator/dry-lab data, where nearly 60 per cent of classification methods reported accuracy above 90 per cent, although only one 111 of 10 studies reported an SCC over 0.90.
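As a hedged illustration of these two evaluation metrics, the toy example below computes classification accuracy and SCC for synthetic predictions; the numbers are invented for demonstration only and do not correspond to any included study.

```python
# Toy illustration of the two most commonly reported metrics.
from sklearn.metrics import accuracy_score
from scipy.stats import spearmanr

# Skill-level classification (e.g. expert vs novice), synthetic labels
y_true = ["expert", "novice", "novice", "expert", "expert"]
y_pred = ["expert", "novice", "expert", "expert", "expert"]
print("Accuracy:", accuracy_score(y_true, y_pred))      # 4/5 correct -> 0.8

# Assessment-score regression (e.g. predicted vs rater totals), synthetic scores
scores_true = [12, 18, 25, 22, 15]
scores_pred = [14, 17, 24, 16, 20]
rho, p_value = spearmanr(scores_true, scores_pred)
print(f"SCC: {rho:.2f} (P = {p_value:.3f})")
```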

Discussion
This systematic review comprehensively analysed the current development and evaluation status of objective technical skills assessment tools in robotic surgery. Despite the plethora of publications, it is evident that full evaluation according to Messick's concept is sparse. This may explain the notable lack of reports showing their implementation within day-to-day practice or curricula. Many manual tools are lacking in scope and are arguably unsuitable for use as summative tools at their current validation status. Emerging evidence in AI has reached the first in-human studies, but these are predominantly conceptual and require full validation. The current review suggests that research efforts should be focused on validating and implementing existing instruments rather than seeking further robotic surgery assessment methods. GEARS and VR simulators offer clear opportunities for formative and summative assessment within basic skills curricula. Simulator studies demonstrated VR participants outperforming controls, or an improvement in post-VR curriculum assessments, in both the operative and laboratory setting. GEARS has not been formally benchmarked and, given that it is likely the best manual GRS tool for annotation alongside AI models, warrants further focused evaluation. Meanwhile, given that AI studies often use OSATS or modified GOALS, efforts are necessary to inform the computer science and surgical community to utilize GEARS instead for robotic global technical ratings. Chen et al. 28 highlighted gaps in the assessment domains of generic robotic skills assessments for GEARS, which provides an opportunity for modification and re-evaluation. VR simulators allow safe transference of basic skills and have defined competency benchmarks before progression to console training, broadly speaking a score between 80 and 90 per cent.
Procedure-specific VR and 3D-printed hydrogel models provide high-fidelity simulation, allowing an opportunity for standardized, safe progression to clinical training. These platforms avoid the possible ethical, religious and moral issues that can prevent the use of cadavers or live animals. Only six studies were identified evaluating procedure-specific VR, confirming the need for further development and evaluation of different operative VR and 3D model tasks. However, additional issues including training access and the financial implications of these platforms remain unstudied.
Procedure-specific tools can potentially act as excellent formative and summative assessments, often with higher reliability than GRS. Three tools (OSATS task-specific, RACE and Task-Performance Metrics) had the highest LoR; importantly, however, there are no reports demonstrating predictive validity or benchmarking of full procedural tools. Task-Performance Metrics were all developed through Delphi consensus as proficiency-based progression (PBP) assessment tools and achieved high reliability through trained expert raters undergoing reliability 'checks'. The tools' structure includes phases and subtasks for each procedure, and they can be commended for including operation-specific error metrics. Their intended application is within proficiency-based progression curricula. It is evident that there is a paucity of procedure-specific tools ready for implementation into robotic training curricula. They also lack scope, with the majority in urology, and so as a surgical community it is imperative to both develop tools missing for key operations and fully evaluate existing ones.
Error tools identified in this review typically used the cumulative number of errors and have not been fully evaluated within clinical settings. A key aspect of a surgeon's learning curve is to understand the 'what, where, when, how, why' and corrective mechanisms of an error, which no current study has reported. Granular methods of assessing surgeons' technical performance and errors are necessary to train AI, combined with global rating scales and procedure-specific tools, to fully understand the complexities of any operation. Tools should combine each aspect, with full comprehensive evaluation before implementation into training curricula. As demonstrated in this review, reliability can most likely be improved through expert, trained raters and quality assurance processes.
APMs and AI are emerging and promising tools to guide training and assessment in robotic surgery. APMs can be considered truly objective, yet need further focused evaluation to understand and benchmark important metrics for construct and predictive validity. While AI models performed well when analysing intraoperative surgical skill data, they generally perform better on simulator/dry lab data.
A significant proportion of the AI models tested on simulated data achieved accuracy rates above 90 per cent, while some models tested on real surgical data demonstrated perfect classification of surgical skill levels. Despite this, AI-based skill assessment is still at a conceptual stage, with four broad areas that need to be addressed: data sets, manual annotation, AI model evaluation and integration into clinical practice.
For the field of automated surgical skill assessment to advance, it is critical to assess models on real surgical data. Additionally, it is crucial to gather data from high-fidelity simulation tasks so that AI models can be evaluated for benchmarking and comparison of different methods. Efforts must focus on collecting large, publicly available, diverse data sets, including surgeons with differing levels of expertise and different robotic platforms, with matched clinical outcome data. Utilizing diverse data sets will ensure AI models are unbiased and can generalize effectively to unseen surgeons and tasks.
The identified AI studies used different ways to evaluate their methods, making direct comparisons challenging and reducing external validity. Testing models on the JIGSAWS data set has highlighted the performance gap between cross-validation schemes such as Leave-One-User-Out (LOUO) and more relaxed schemes such as Leave-One-Super-Trial-Out (LOSO). Before automated skill assessment can be used in clinical practice, however, it must first be ensured that models can generalize to unseen surgeons. To achieve this, evaluation should be performed with cross-validation schemes (for small data sets) or with large external test sets containing trials from unseen surgeons from different hospitals to ensure generalizability [138][139][140]. LOSO remains useful in situations where the performance of a specific surgeon is tracked for proficiency curve analyses.
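A minimal sketch of a LOUO evaluation is shown below, holding out all trials from one surgeon per fold so that testing is always on an unseen surgeon; the feature matrix, labels and the choice of an SVM classifier on APM-style features are synthetic placeholders rather than a reproduction of any included study.

```python
# Sketch of Leave-One-User-Out (LOUO) cross-validation with synthetic data.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))            # 40 trials x 6 automated performance metrics (synthetic)
y = rng.integers(0, 2, size=40)         # 0 = novice, 1 = expert (synthetic labels)
groups = np.repeat(np.arange(8), 5)     # 8 surgeons, 5 trials each

louo = LeaveOneGroupOut()
fold_accuracies = []
for train_idx, test_idx in louo.split(X, y, groups):
    clf = SVC(kernel="rbf").fit(X[train_idx], y[train_idx])   # train on 7 surgeons
    fold_accuracies.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))  # test on the held-out surgeon

print(f"Mean LOUO accuracy: {np.mean(fold_accuracies):.2f}")
```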
To credential surgeons as competent for independent practice, blinded expert video rating is considered an essential part of accreditation 21. This requires fully evaluated objective summative assessment tools. Often, surgeons undergoing robotic training are already credentialed, adding additional challenges to standardizing pathways and ensuring patient safety. Undoubtedly, there are many routes to competency, and now also emergent robotic systems to consider.
This review has highlighted many assessment domains, with their advantages, disadvantages and future research needs (Table 5). To achieve implementation of validated and reliable tools into curricula, collaboration between surgical societies is required. Through expert consensus and large, multicentre, international studies, single tools for each procedure should be developed and fully evaluated. Only then should they be implemented within curricula as formative and summative tools, or in the evaluation of APMs and AI.

Limitations
This comprehensive review standardized data extraction using Messick's concept and modified OCEBM guidelines. Nevertheless, due to marked study heterogeneity this was difficult at times, which was particularly evident when utilizing the OCEBM guidance, with previous systematic reviews disagreeing on studies' LoE. Furthermore, some studies received a higher LoE despite demonstrating less validity evidence than others. It is likely that the guidelines will require updating as surgical data science evolves. The application of methodological quality tools was found to be impractical for assessing AI studies, primarily because most are at a conceptual stage of development. Future research should focus on developing and piloting a new AI-specific study quality assessment tool.

Conclusion
A large number of manual, automated and artificial intelligence tools for robotic surgery exist. There is huge variability in the approach to assessment and the level of evaluation among all domains of robotic technical skill assessment, with few tools having been well validated. In addition, there is a lack of scope, and most tools are presently only used within the research setting, despite the unmet need for both objective formative and summative tools to inform learning and accreditation, respectively. Collaboration between surgical societies, AI scientists and industry, with large high-quality studies and open data sets, appears the most efficient way forward to aid diffusion and implementation of objective assessment tools into clinical practice to enhance training and patient safety.

Table 5 Summary of all assessment domains