Outcome selection, measurement and reporting for new surgical procedures and devices: a systematic review of IDEAL/IDEAL‐D studies to inform development of a core outcome set

Background Outcome selection, measurement and reporting for the evaluation of new surgical procedures and devices is inconsistent and lacks standardization. A core outcome set may promote the safe and transparent evaluation of surgical innovations. This systematic review examined outcome selection, measurement and reporting in studies conducted within the IDEAL (Idea, Development, Exploration, Assessment and Long‐term monitoring) framework to examine current practice and inform the development of a core outcome set for early‐phase studies of surgical procedures/devices. Methods Web of Science and Scopus citation searches were performed to identify author‐reported IDEAL/IDEAL‐D studies for any surgical procedure/device. Outcomes were extracted verbatim, including contextual information regarding outcome selection and measurement. Outcomes were categorized to inform a conceptual framework of outcome domains relevant to evaluating innovation. Results Some 48 studies were identified. Outcome selection, measurement and reporting varied widely across studies in different IDEAL stages. From 1737 outcomes extracted, 22 domains specific to evaluating innovation were conceptualized under seven broad categories: procedure completion success/failure; modifications; unanticipated events; surgeons' experiences; patients' experiences; resource use specific to the innovative procedure/device; and other innovation‐specific outcomes. Most innovation‐specific outcomes were measured and reported in only a small number of studies. Conclusion This review highlighted the need for guidance and standardization in outcome selection and reporting in the evaluation of new surgical procedures/devices. Novel outcome domains specific to innovation have been identified to establish a core outcome set for future evaluations of surgical innovations.


Introduction
Unlike pharmaceuticals, where the introduction and evaluation of new medicines into clinical practice is highly regulated, the process for introducing new procedures and devices in surgery is inconsistent and lacks standardization. outcome selection, measurement and reporting. Studies may measure benefits and harms in different ways, so that results cannot be directly compared or combined. Furthermore, investigators can choose which outcomes to measure and report, introducing bias. There is a need for standardized and transparent measurement and reporting of outcomes to improve safe and efficient evaluation of surgical innovation for introduction into clinical practice.
The development and use of a core outcome set (COS) -a minimum set of outcome domains to be measured and reported in all early-phase evaluations of innovative surgical procedures and devices -may be a solution to addressing inconsistent and heterogeneous outcome selection and reporting 3 . The methodology for the development of a COS in effectiveness studies is well established, including taxonomies to classify outcomes 4,5 . However, standard methods to identify outcomes relevant to later-phase effectiveness studies, when the intervention under evaluation is no longer being modified, may not identify outcomes specifically relevant to evaluating surgical innovation. Data sources that specifically focus on innovation are required to provide insight into outcome selection and identify potentially relevant and meaningful outcomes for the first step in developing a COS in this setting.
The IDEAL (Idea, Development, Exploration, Assessment and Long-term monitoring) framework describes a pathway for evaluating surgical innovations from first-in-man through to long-term evaluation 6 . The framework has also recently been adapted for evaluation of new medical devices (IDEAL-D) 7 . It is hypothesized that investigators who have engaged with the IDEAL framework when designing and conducting studies may have considered outcome selection and reporting specifically in the context of evaluation of surgical innovation. As such, these studies may provide significant insight into outcomes that have particular relevance to the process of innovation, offering a valuable data source to identify outcomes and contribute to the conceptualization of outcome domains to inform the development of a COS for early-phase studies.
This study aimed to examine outcome selection, measurement and reporting in author-reported IDEAL/IDEAL-D studies to identify current practice, and serve as one of multiple data sources to help conceptualize novel domains relevant to evaluating innovation to inform the development of a COS 8 .

Methods
A systematic review and content analysis of author-reported IDEAL/IDEAL-D studies was undertaken following PRISMA guidelines 9 , where applicable. This review is one of multiple data sources used to inform the wider study to develop a COS for studies of surgical innovation. A detailed protocol for this systematic review was not registered specifically as methods have been included in the COS development protocol, which has been published previously 8 .

Identification of studies
Electronic searches were performed to identify studies reported as following the IDEAL/IDEAL-D framework. Searches were undertaken in databases with citation tools (Web of Science and Scopus) to identify all publications citing any of ten key IDEAL/IDEAL-D papers 6,7,10-17 deemed significant in describing the framework by members of the IDEAL collaboration 18 . Searches were carried out in April 2019. No restrictions to study design or publication dates were applied. Search results were imported into reference management software and duplicates removed. Results were filtered to select and retain records with the text word IDEAL appearing in the title or abstract.

Selection of studies
Titles and abstracts were screened independently for eligibility by two reviewers. Reports or protocols for studies described as IDEAL or IDEAL-D in the title or abstract were included. Eligible studies were: studies of innovative invasive procedures; studies of innovative devices; or studies where the innovation was something other than an invasive procedure/device but the study involved an invasive procedure, for example radiological imaging for guided biopsy. The latter category of studies was included owing to the potential value of such studies for identifying outcomes relevant to evaluating innovation. Invasive procedures were defined using an existing published definition and included 'purposeful or deliberate access to the body gained via an incision, percutaneous puncture, where instrumentation is used in addition to the puncture needle, or instrumentation via a natural orifice' 19 . Systematic reviews, book chapters, secondary studies and studies that did not involve living human participants (such as cadaver, animal or simulation studies) were excluded. Studies using adaptions of the IDEAL framework that did not include or involve an invasive procedure or device (such as non-surgical complex interventions) were also excluded, along with letters, commentaries, editorials and conference proceedings. Screening results were compared for consistency and discrepancies resolved by discussion.
Full-text copies of potentially relevant publications were obtained and checked for eligibility. Reasons for excluding publications for which full-text copies had been obtained were recorded. Publications of uncertain relevance were discussed within the study team.

Data extraction
A standardized electronic data extraction form (REDCap software 20 ) was developed and piloted by the study team. Data extraction was performed by one reviewer with a second reviewer checking 10 per cent of publications for accuracy and completeness. Discrepancies were resolved by discussion and any additional verbatim outcomes identified during this process were extracted. Study characteristics were extracted including: year of publication, details of the innovation, author-reported IDEAL stage, number of patients, participating centres and surgeons, and geographical origin of the study. All measured and/or reported outcomes were extracted verbatim together with any contextual information regarding the selection and measurement of outcomes. Rationale for outcome selection was categorized as: detailed, with a hypothesized effect (for example, anticipated improvement in at least one outcome); detailed, but with no hypothesized effect for any outcomes; general rationale (for example 'to examine safety'); or no rationale provided 21 . Contextual information considered potentially relevant to evaluating the process of innovation was extracted, where applicable. This included verbatim text relating to: stopping the innovation or making changes to the procedure/device in future use or application; limitations in study interpretation or conclusion in relation to outcomes; outcome assessment in future studies; and assessment of the surgical learning curve 22 . In studies where modifications to the procedure or patients selected to undergo the procedure were reported, details of how these were reported (for example, text, graphs or tables) were extracted. Data were recorded in a specifically designed study database 20 .

Data analysis
Data analysis was undertaken in two stages. Extracted verbatim outcomes were categorized individually into domains to develop a preliminary framework of outcome domains relevant to innovation. Categorization was performed by one reviewer. During this process, the conceptual framework was modified iteratively until data saturation was reached and no new outcome domains were identified (all outcomes had been categorized). A provisional conceptual framework of domains was subsequently attained. A subsample of outcomes (81, from 2 publications selected at random) were independently categorized by a second reviewer with clinical expertise to ensure methodological rigour. Discrepancies were resolved by discussion with the wider study team.
In the second stage of analysis, all outcomes were recategorized using the provisional conceptual framework by a second independent reviewer. Source documents (IDEAL/IDEAL-D study publications) were referred to for context, for example, to determine whether clinical outcomes were anticipated or unanticipated. Outcomes that were not specific enough to categorize (for example, generic descriptions such as safety or feasibility) were coded as 'too broad for categorization'. Outcomes that did not specifically address the evaluation of the innovation but were routinely measured data variables of interest to the clinical specialty were coded as common data elements 23 .
Outcome domains within the provisional conceptual framework were further classified by the authors as being innovation-specific or shared with effectiveness studies. Innovation-specific domains included non-traditional outcome domains (different to outcome domains typically measured in later-phase studies) considered to be conceptually specific to evaluating the ongoing process of innovation when the procedure/device was new or evolving. Examples included modifications to the surgical procedure/device and whether the procedure was completed successfully. Domains shared with effectiveness studies encompassed outcomes that would commonly be measured in later-phase studies 4 considered not unique to evaluating innovation. Examples included mortality and patient-reported outcomes such as postoperative pain and physical function.
Study characteristics, measured/reported outcomes (based on the derived conceptual framework of domains) and contextual information relevant to selecting or measuring outcomes or evaluating innovation were compared across IDEAL stages specifically reported by the authors. Descriptive statistics were summarized using Microsoft Excel ® (Microsoft, Redmond, Washington, USA) and STATA ® statistical software (StataCorp, College Station, Texas, USA).
Assessment of factors relevant to systematic reviews of effectiveness (such as study quality or risk of bias) and data syntheses (for example meta-analyses of outcome data) were not appropriate given the exploratory aims of this review. Findings were synthesized and tabulated using descriptive statistics and a narrative summary directed at the study aims.

Results
After removal of duplicates, the search yielded 1207 records citing any of the ten key IDEAL/IDEAL-D publications. Records excluded: e.g. cadaver studies, not a study report/protocol n = 74 Full-text publications excluded n = 14 Not a primary study report n = 9 Not an IDEAL-badged study n = 3 Study does not involve an invasive procedure or device n = 2 Records screened after removal of duplicates n = 1207 Of these, 136 (11⋅3 per cent) included IDEAL as a text word in the title or abstract. Initial screening identified 62 of 136 records (45⋅6 per cent) for full-text eligibility assessment. Finally, 48 relevant publications (42 study reports, 6 protocols) were included in the analysis 24-71 (Fig. 1). Included publications are detailed in Table 1. The majority were published in 2017 or 2018 (14 and 13 respectively). Thirty-nine publications were from Europe. Twenty-four described studies of innovations in surgical procedures and ten described studies of four different new devices. The remaining 14 publications described non-surgical innovations that included a surgical procedure. The specific IDEAL stage(s) describing the evaluative stage of the innovation was generally reported. Two publications, however, did not specify the IDEAL stage. Frequencies of author-reported IDEAL stages were: stage 1 (Idea, 11); stage 2a (Development, 16); stage 2b (Exploration, 9); stage 3 (Assessment, 1); and multiple stages (9).

Characteristics of IDEAL/IDEAL-D studies by stage
Characteristics of studies across the different IDEAL stages (stated in 46) are summarized in Table 2. The number of centres involved, number of surgeons/operators and number of included patients were broadly in line with the recommendations of the IDEAL framework. For example, early-phase studies (such as IDEAL stages 1 and 2a) typically involved one or two centres, one or two surgeons and a small number of patients, with numbers increasing in stage 2b and stage 3 studies. The exception was one IDEAL-D stage 1 study 50 describing the use of a type of mesh for cystocele repair that included 37 women. The median number of outcomes per study was relatively similar across IDEAL stages 1, 2a and 2b (median 32, 38 and 36 respectively). The range in number of outcomes described per study, however, varied considerably. For example, in Year of publication

Country of study
Europe (non-UK) 33 Asia 8

UK 6
North America 1

Type of innovation
Innovative procedure 24 Device 10 Other innovation that involved surgery* 14 Author-reported IDEAL stage Multiple stages 9 Not stated 2 *Such as radiological imaging.
IDEAL stage 1 studies, the number of outcomes described per study ranged from six to 82.

Outcome selection, measurement and reporting in IDEAL/IDEAL-D studies
Most studies (37 of 48) provided a rationale for outcome selection. Often (20 studies) this was a general rationale (for example 'to examine safety'). Fewer studies (12) provided detail on the expected or anticipated effect on outcomes. Instead, outcomes were reported to be measured without explanation or elaboration regarding their selection. There were no observed differences in the level of detail provided in the rationale between studies of different IDEAL stages ( Table 2). Studies that mentioned stopping the innovation or making changes to the procedure/device in future applications of its use were broadly in line with the recommendations of the IDEAL framework. For example, 15 of 42 study reports (excluding protocols) included text to this effect, most of these being IDEAL stage 1 or 2a studies where innovations are less stable.
Limitations to the interpretation of findings or study conclusions in relation to outcomes were discussed in 18 of 42 study reports. These included, for example, the short follow-up time for measuring outcomes and being underpowered to detect potential adverse events with a low incidence. Recommendations/suggestions for outcome assessment in subsequent future studies were discussed in 30 of 42 study reports. These were primarily about generic outcomes, such as measuring efficacy and safety. Four studies provided specific recommendations for outcome assessment in future studies, including two that recommended the inclusion of patient-reported outcome measures. Fewer than half of the studies (20 of 48) mentioned the learning curve. The frequency of mentioning the learning curve did not differ by IDEAL stage.

Types of outcome
A total of 1737 outcomes were extracted from the 48 studies and categorized into outcome domains. The derived conceptual framework comprised 32 domains, 22 of which were considered to be conceptually specific to evaluating innovation and ten to be shared with effectiveness studies ( Table 3). The majority of outcomes (1098, 63⋅2 per cent) were categorized into domains shared with effectiveness studies (Table S1, supporting information), for example, outcomes assessing whether the overall desired effect of the procedure/device had been achieved, such as the number of positive surgical margins in a study of a new surgical procedure in oncology 28 . Other outcome domains shared with effectiveness studies included anticipated disadvantages, such as adverse events and complications, duration of the procedure, duration of hospital stay, and patient's physical/psychological experiences after the procedure. A smaller number of outcomes (552, 31⋅8 per cent) were categorized into the 22 domains considered to be conceptually specific to evaluating innovation. These are described in detail below under seven broad subheadings.

Innovation-specific outcome domains
Procedure completion success/failure Some 107 outcomes (6⋅2 per cent) were categorized as relating to the success or failure of performing the innovative procedure or using the innovative device (Table S1, supporting information). Examples of success included outcomes assessing whether all the steps of performing the innovative procedure were completed as planned. Examples of failure included the number of patients for which the planned innovative procedure was abandoned or

Reported context relating to outcomes and evaluating innovation
Text relating to stopping the innovation/making changes to the procedure/device in future applications of use † changed to an alternative procedure. Procedure completion success/failure outcomes were measured or reported in most studies (33 of 48) ( Table 4). Relative consistency was observed in the proportion of studies reporting success/failure outcomes across the IDEAL stages.

Modifications
Some 92 outcomes (5⋅3 per cent) were categorized as relating to modifications. This included modifications to the individual steps of the planned innovative procedure (for example, a technical refinement to the procedure, and the number or proportion of patients for whom this was required). Modification outcomes also included changes to any accompanying intervention related to the innovative procedure. For example, one study 42 exploring a new method of transvesical suprapubic externalization of ureteral stents reported that modification from local anaesthesia to midazolam sedation was required. Modifications also included changes to patient selection during the course of the study. Modifications were reported in 24 studies. The number of studies reporting modifications was broadly in line with the IDEAL framework, the majority being IDEAL stage   Outcomes relating to patient experience 1 4 1 0 0 Outcomes relating to resource use 9 9 6 0 7 The analysis includes 46 studies where the authors stated an IDEAL stage.
2a (innovations in early development). Most modifications were reported using narrative text alone (11 studies) and less frequently using flow diagrams, figures or graphs.

Unanticipated events (advantages and disadvantages)
Outcomes that related to an unexpected advantage or disadvantage were categorized as unanticipated events. These were further defined as occurring during or after the procedure. At least one unanticipated event was reported in eight of 42 studies (excluding protocols). These were mostly disadvantages, such as clinical complications (for example, peritonitis due to spontaneous perforation of a sigmoid diverticulum in a study of robot-assisted anterior partial prostatectomy 70 ). In one study 34 , an unanticipated event was reported as advantageous by serving as a useful safety check for the procedure.
Surgeon/operator experience of innovative procedure/device Some 65 outcomes (3⋅7 per cent) were categorized as relating to surgeons' experiences of performing the innovative procedure or experiences of operators using the new device (if not a surgeon). Examples included ergonomics, such as pain or discomfort when carrying out the procedure, or surgeons' reports of how difficult or easy it was to perform the procedure. Further outcomes categorized under this domain included descriptions of problematic points in undertaking the procedure and impressions of the learning experience. Although 23 studies included outcomes relating to surgeon or operator experiences, only three 33,45,58 reported that experiences were assessed using a formal process (including a rating scale/score, questionnaire and discussion in review meetings between cases). There was a slight trend for surgeon/operator experiences to be measured/reported in studies in the earlier stages of the innovation lifecycle (IDEAL stage 1 and 2a).

Patient experiences of innovative procedure/device
Overall, 215 outcomes (12⋅4 per cent) were categorized as assessing patients' experiences. The majority (200, 11⋅5 per cent) were measured after operation and were similar to those measured in effectiveness studies, such as patient-reported quality of life and pain. A minority of outcomes (15, 0⋅9 per cent; in 6 different studies), however, were specifically relevant to the experiences of undergoing the innovative procedure or receiving treatment with the innovative device. One study 39 of percutaneous nephrolithotomy under local anaesthesia, for example, reported patients' experiences specific to undergoing the innovative procedure itself, made possible because the procedure did not require the patient to be under general anaesthetic. Outcomes included patient 'disquietness' (restlessness or uneasiness) and good perioperative cooperation. Further examples of patients' experiences specific to undergoing the innovative procedure or receiving treatment with the innovative device related to psychological or emotional experiences. Details of how these outcomes were measured, however, were lacking. Most studies reporting innovation-specific patient experience outcomes were reported in IDEAL stage 2a ( Table 4).

Resource use specific to innovative procedure/device
A small number of outcomes (26, 1⋅5 per cent) measured resource use specific to the innovative procedure/device. Examples included the cost of the procedure/device or associated costs (such as equipment sterilization costs) and the number of further treatments required owing to complications that had developed as a result of the innovative procedure. Some 31 studies included any kind of outcome relating to resource use (Table 4); however, the majority of outcomes were shared with those typically measured in later-phase studies, such as duration of surgery (72 outcomes, 4⋅1 per cent) and duration of hospital stay (23, 1⋅3 per cent).

Other innovation-specific outcome domains
These domains included: details of patients suitable for the procedure in future; details of operator training/expertise necessary to perform the procedure in future; and mechanical /technical problems with the device. Although few outcomes were categorized into these domains (35, 2⋅0 per cent), they were considered relevant outcomes important for the evaluation of innovation.

Discussion
Although the majority of outcomes extracted reflected those traditionally measured and reported in effectiveness studies 4 , this review identified several novel innovation-specific outcome domains that reflect the dynamic process of surgical innovation. Broadly, these encompassed outcomes relating to procedure completion success/failure, modifications (to the procedure, concomitant interventions, or patient selection), unanticipated events, and innovation-specific surgeon/operator experience, patient experience and resource use. Although it is recognized that unanticipated events may also occur and are reported in effectiveness studies, they are generally rare and less likely to occur when the procedure has stopped evolving and has stabilized. Unanticipated events may indicate unexpected problems and, as such, be a key driver for modifying the procedure/device, particularly if they are related to patient safety. Unanticipated events were, therefore, considered to be of key importance for evaluating innovation and conceptualized as an innovation-specific domain. These findings served as one of multiple data sources to inform a Delphi survey and consensus study to develop a COS for early-phase studies of surgical procedures and devices 8 . The heterogeneity observed in the number of studies measuring and reporting outcomes of specific relevance to evaluating innovation highlights the need for more detailed guidance and improved standardization of outcomes that are important to measure when evaluating surgical innovation. The IDEAL framework provides clear guidance regarding the characteristics of studies at each stage of evaluation of surgical innovation, but uses only broad terms to describe the types of outcome that should be measured at different stages of evaluation. The lack of specific guidance around the selection and measurement of recommended outcomes may be one reason why the widespread uptake of IDEAL has been relatively limited 18,72 .
The review was conducted using robust and established methodology, but has limitations. Methods to identify IDEAL/IDEAL-D studies selected publications that had included IDEAL as a text word in the title or abstract. This pragmatic filtering step to identify relevant studies was based on the rationale that studies firmly engaging with the IDEAL framework, and therefore most likely to have more thoroughly considered outcomes in the context of evaluating innovation, may be more likely to describe their study as an IDEAL study in the title or abstract. It is accepted that some studies undertaken within the IDEAL framework may have been missed as a result of this search strategy. The number of studies identified (48) may be considered small given that the IDEAL framework was published 10 years ago. This number is, however, consistent with a recent review 18 examining the uptake and use of the IDEAL framework. The high proportion of studies published in recent years (2017-2018) suggests a recent increase in uptake of the IDEAL framework 18 . It is acknowledged that publications regarding themselves as IDEAL studies may not necessarily have followed the IDEAL framework from the outset. Authors may have been asked to report the IDEAL stage in their reports by journal editors and reviewers during the publication process. The selection and measurement of outcomes may, therefore, have been chosen independently of the IDEAL framework and recommendations. Analysis of data relied on author-reported information, which may not have always been precise. For example, comparison of studies across the IDEAL stages relied on the author-reported classification of IDEAL stage. It is recognized that classification of studies can be complex and author reports of IDEAL stage may not always be accurate 18 . Similarly, categorization of outcomes as unanticipated (rather than anticipated) also relied on author-reported information. If authors did not specifically report that an outcome was unanticipated, it was categorized as anticipated. It is possible, therefore, that the number of unanticipated events in the studies may have been higher. A final limitation is that individual outcomes were categorized into a single domain. In some instances, an outcome could have been included in more than one domain; for example, the overall desired effect of the procedure may have been measured by a patient-reported outcome. Reporting of the number of outcomes within the different outcome domains was, therefore, dependent on the judgement of the reviewers involved in the outcome categorization/checking process. Although exact reporting of the number of outcomes relevant to each domain was not a key objective in the present study, future studies where this is important could follow the methodology recommended for classifying outcomes from effectiveness studies and classify outcomes into multiple relevant domains 4 .
The identification of innovation-specific outcome domains is an important first step in developing a generic COS to standardize outcome selection, measurement and reporting in studies of novel surgical procedures. Such a COS would be appropriate for use before definitive evaluation within an RCT and is intended to support standardized evaluation of surgical procedures and medical devices throughout the stages of the innovation lifecycle, to facilitate safe and transparent introduction. The present review is one of several data sources to inform a conceptual framework for outcome domains for the COS to ensure its validity and comprehensiveness 8 . Given the diversity of procedures that could be performed, the feasibility of developing a generic COS for all innovative procedures could be questioned. The innovation-specific domains identified in this review, however, are sufficiently broad that they could be applied across a wide range of procedures and devices. An international consensus study involving a diverse range of key stakeholders to agree a final core set of domains for early-phase studies has now been completed 8 . This includes the development of clear reporting guidelines to ensure that the COS is adopted and reported in a meaningful way. A further challenge will be to determine how best to measure these key outcome domains. Engagement and creation of appropriate resources with key stakeholders, including surgeons from various disciplines, will be required to optimize the uptake and value of these tools. and Social Care, Royal College of Surgeons of England or MRC. Disclosure: The authors declare no conflict of interest.