Validation of knowledge acquisition for surgical process models.

Objective: Surgical Process Models (SPMs) are models of surgical interventions. The objectives of this study are to validate acquisition methods for Surgical Process Models and to assess the performance of different observer populations. Design: The study examined 180 SPMs of simulated Functional Endoscopic Sinus Surgeries (FESS), recorded with observation software. About 150,000 single measurements in total were analyzed. Measurements: Validation metrics were used for assessing the granularity, content accuracy, and temporal accuracy of structures of SPMs. Results: Differences between live observations and video observations are not statistically significant. Observations performed by subjects with a medical background gave better results than observations performed by subjects with a technical background. Granularity was reconstructed correctly at a rate of 90%, content at a rate of 91%, and the mean temporal accuracy was 1.8 s. Conclusion: The study shows the validity of video as well as live observations for modeling Surgical Process Models. For routine use, we recommend live observations due to their flexibility and effectiveness. If high precision is needed or the SPM parameters are altered during the study, video observations are the preferable approach.


I. INTRODUCTION
Surgery is a clinical specialty with a long history, but surgical techniques are learned in an apprentice-master model that leads to several surgical schools treating the same disease in different ways.
There is no explicit methodology available, which prevents an objective comparison of surgical strategies at a fine-grained level. Process models with fine-grained descriptions of surgical interventions give surgeons a powerful tool for discussing different surgical approaches, as well as scientifically sound models of their surgical work steps.
A detailed Surgical Process Model (SPM) may help in understanding a procedure, especially in difficult cases. Such a detailed model must be available for a broad variety of similar interventions to cover all clinically relevant deviations from the standard procedure.
Furthermore, a collection of verified and valid SPMs of surgical processes, especially for rare cases, could help in the implementation of new surgical techniques (e.g., minimally invasive surgery or computer assisted interventions) that require a detailed understanding of the intervention course in order to optimally assist the surgeon.
Surgical Process Models may be used to facilitate the development of technical components for surgical assist systems (SAS) (1; 2) and to support standardization efforts for desired functionalities of SAS, such as future extensions of Digital Imaging and Communications in Medicine (DICOM) for surgery (3; 4).
The ultimate purpose of SPMs is the generation of these descriptions for technical requirement analysis, evaluation, and systems comparison.
For the modeling, data must be at an adequate level of granularity. The modeling must address behavioral, anatomical, and pathological aspects and surgical instruments (5).
Accuracy is crucial. This is why the modeling must be rigorously validated. The objective of our study was the validation of data acquisition for SPMs. The research question was, "How accurate are observations of Surgical Processes by human observers?" We designed a rigorous validation strategy that assessed the accuracy of SPMs that were acquired from simulated interventions in a controlled environment. We studied several validation criteria: granularity, content accuracy, and temporal accuracy, using video and live observation as data acquisition strategies and using medical and technical students as the observer populations. For assessing the validation criteria, metrics have been defined and applied to the SPMs. Secondary questions of interest included time to complete observations, the subjective workload estimation of observers, and the level of surgical knowledge required by observers.
Neumuth Validation of KA for Surgical Process Models Page 4 of 27 JAMIA 2009

II. BACKGROUND
The amount of information available from surgical processes is large and complex, although the knowledge of the surgeon is mostly implicit and hidden from formal assessment. Data may be acquired by using two main strategies: sensor systems or, in a more classical way, human observation.
Only a few sensor technologies are available for application in the sensitive operating room (OR) environment. These technologies are not suitable for uniform acquisition of data such as work-step information, inter-device communication, human-device behavior or inter-human behavior for modeling due to missing information models, network communication, and interfaces. It is necessary to use human recognition and perception capabilities for parts of the data acquisition, which is a common strategy in biomedicine (6) and empirical social sciences (7).
Only a few approaches for modeling surgical processes are described in the literature. MacKenzie et al. [...] adequate for medical patient records, but they did not provide an overall measure of the accuracy.
Working definitions used in this article are strongly related to Business Process Modeling and Workflow Management Systems (13). By analogy, we define a Surgical Process (SP) as a set of one or more linked procedures or activities that collectively realize a surgical objective within the context of an organizational structure defining functional roles and relationships. The surgical objective is the correction of an undesirable state of the patient's body, which is performed in the organizational structure of a hospital. The responsible surgeon coordinates the performance of the surgical procedure. We define a Surgical Process Model (SPM) as a simplified pattern of a Surgical Process that reflects a predefined subset of interest of the SP in a formal or semi-formal representation (14). The working definitions are also provided to clarify the relationship to the frequently used term Surgical Workflow, which relates to the performance of a Surgical Process with support of a Workflow Management System (15).
The objective of this work was to perform a validation study for assessing data acquisition results of SPMs by human observers with specialized software. The SPs consisted of simulations of Functional Endoscopic Sinus Surgeries (FESS).

III. METHODS
First, the data acquisition software and its underlying ontological concepts are introduced. Then, the experimental setup and post-processing are described. The notion of variables that might influence a validation study for SPMs is discussed in a separate section. These variables were divided into three groups: extraneous variables that need to be held constant, independent variables that were manipulated according to the experimental design, and dependent variables that were affected by the manipulation of the independent variables. Finally, the validation metrics quantified the manipulation effects.

A. Data Acquisition Software and Fundamental Concepts
The data were acquired with a Java software application, the Surgical Workflow Editor (16; 17). The objective of the software is to devise ontological concepts used for describing the SP to the observer and [...] The data acquisition process begins with the definition of the structure of the SPM. The structure is described by the structural ontology and specifies how information of the SP is represented in the SPM.
During actual data acquisition, specific concepts of the observed SP, described by the content ontology (e.g., surgical actions, participants, or instruments) are instantiated by the observer.
Our structural ontology contains three types of flow objects (18): activities, state transitions, and events.
Each SPM consists of these flow objects.
Activities represent manual work steps performed during the interventions. To structure their content, we used the factual perspectives for workflow schema proposed in (19), modified them, and added the spatial perspective. An activity consists of five perspectives, which decompose the observer's view into various viewpoints:
• the organizational perspective, describing who is performing a work step;
• the functional perspective, describing what is done in a work step;
• the operational perspective, describing the instruments used in performing a work step;
• the spatial perspective, describing where a work step is performed;
• and the behavioral perspective, describing when a work step is performed.
Perspectives are extended by perspective attributes. They decompose perspectives further (e.g., indicating that a surgeon is performing a work step with his right hand, where both perspective attributes belong to the organizational perspective). More examples may be found in Table 2. The purpose of the content ontology is to determine the correct intervention-specific relations between perspective contents, e.g., for suctioning (functional perspective) only a suction tube (operational perspective) may be used. The development of the content ontology is based on expert knowledge.
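The structure described above can be sketched as a small data model. This is only an illustration under stated assumptions: the class names, field names, and the toy `ALLOWED_INSTRUMENTS` table are hypothetical and are not taken from the Surgical Workflow Editor, whose actual schema is not given in the text. The behavioral perspective ("when") is represented here simply by start and stop times.

```python
from dataclasses import dataclass, field
from typing import Dict, Union

# Hypothetical data model; names are illustrative, not the editor's real schema.


@dataclass
class Activity:
    """A manual work step; start/stop times stand in for the behavioral perspective."""
    start_s: float
    stop_s: float
    # perspective name -> {perspective attribute -> value}
    perspectives: Dict[str, Dict[str, str]] = field(default_factory=dict)


@dataclass
class StateTransition:
    time_s: float
    description: str


@dataclass
class Event:
    time_s: float
    description: str


# The three flow-object types named in the structural ontology.
FlowObject = Union[Activity, StateTransition, Event]

# Toy content ontology (assumption): instruments permitted for each action.
ALLOWED_INSTRUMENTS = {"suction": {"suction tube"}}


def is_consistent(activity: Activity) -> bool:
    """Check the functional/operational relation against the toy content ontology."""
    action = activity.perspectives.get("functional", {}).get("action")
    instrument = activity.perspectives.get("operational", {}).get("instrument")
    allowed = ALLOWED_INSTRUMENTS.get(action)
    # Actions without a rule in the toy ontology are not restricted.
    return allowed is None or instrument in allowed


suctioning = Activity(
    start_s=12.0,
    stop_s=20.5,
    perspectives={
        "organizational": {"actor": "surgeon", "body part": "right hand"},
        "functional": {"action": "suction"},
        "operational": {"instrument": "suction tube"},
    },
)
print(is_consistent(suctioning))
```

Keeping the content rules in a separate table mirrors the split between structural and content ontology: the classes fix how information is represented, while the table carries the intervention-specific relations.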

B. Experimental Setup
The validation procedure consisted of recording the simulated SP and comparing the resulting SPM to a reference afterward. The main steps for the experiments are shown in Table 1. In preparation for defining the Gold Standards for the study, flow object patterns of the structural ontology and work step information of the content ontology were used to construct FESS-specific terminology. This was composed of flow object patterns for 41 different activities, which represented surgical work steps, three state transitions, and three events. Pattern examples are shown in Table 2.
The patterns of the Gold Standard terminology were used to design three different Gold Standards as simulation scripts that served as references for assessing the accuracy of the simulations. First, the prototype Gold Standard SPM was generated. It contained a typical sequence of work steps with predefined timestamps. From this, two more simulation scripts, the second and the third Gold Standard, were derived by adding noise. The noise additions included modifying the treatment order of nasal cavities, increasing work speed, and switching the surgeon and assistant roles temporarily. The created simulation scripts were checked by two ENT surgeons for clinical realism. Each of the simulations was 21 minutes in length and was limited to 60 to 90 activities.
The three different Gold Standards were spoken and recorded as audio files, containing detailed instructions for the work steps to be performed by the actors. One simulation for each Gold Standard was performed without observers, recorded with multiple video cameras, synchronized, and cut as a video representation of the simulation for later use in video observations. After these protocols of the Bronze Standards were coded in XML format, they served as reference SPMs for validation of the simulations against the Gold Standard simulation scripts and for validation of the video observations by medical and technical observers.
[...] of one educational session day for the uniform training of the observers and three data acquisition session days for each observer group. The educational session introduced the purpose of data acquisition for SPMs, the Surgical Workflow Editor software, the surgical objectives of FESS procedures, the typical intervention course, and the content ontology to the observers to establish a common context of use. The objective of this session was to simulate the situation for observing real Surgical Processes, where the observer needs to understand the procedure in depth before he or she begins to record data. After each observation, the observers performed a workload assessment, the Task Load Index (TLX) test (20) of the National Aeronautics and Space Administration (NASA), to describe their subjectively perceived workload, and then continued acquiring data with the respective other data acquisition strategy (video or live observation). Additionally, the observers were required to pass a knowledge test twice per data acquisition day.

C. Post-Processing
Before analysis, post-processing was required to link each SPM to its reference. Post-processing started with the manual association of each flow object of an observer protocol to its corresponding reference flow object in a Bronze Standard protocol. By performing this association between flow objects, registration matrices of the protocols were created.
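As a rough illustration of this registration step, the sketch below builds a binary matrix from manually assigned links between an observer protocol and a reference protocol. The function names and the dictionary-based input format are assumptions made for this example; the text does not specify how the registration matrices were actually stored.

```python
from typing import Dict, List


def registration_matrix(
    n_observed: int, n_reference: int, links: Dict[int, List[int]]
) -> List[List[int]]:
    """Build a binary matrix: entry [i][j] is 1 if observed flow object i
    was manually associated with reference flow object j."""
    matrix = [[0] * n_reference for _ in range(n_observed)]
    for obs_idx, ref_indices in links.items():
        for ref_idx in ref_indices:
            matrix[obs_idx][ref_idx] = 1
    return matrix


def unmatched_reference(matrix: List[List[int]]) -> List[int]:
    """Indices of reference flow objects that no observed object was linked to."""
    n_reference = len(matrix[0]) if matrix else 0
    return [j for j in range(n_reference) if not any(row[j] for row in matrix)]


# Hypothetical protocol: observed object 1 covers two reference objects,
# and reference object 3 was not recorded by the observer at all.
links = {0: [0], 1: [1, 2]}
matrix = registration_matrix(n_observed=3, n_reference=4, links=links)
for row in matrix:
    print(row)
print("unmatched reference objects:", unmatched_reference(matrix))
```

A matrix form keeps one-to-one, one-to-many, and missing associations in a single structure, which is convenient for later granularity checks.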

D. Analysis
Preliminary identification of factors that may influence such a validation is required. Inspired by Shah and Darzi (21), we classified the influence factors for SPs by distinguishing surgeon-specific factors, technology-specific factors, and patient-specific factors (see Table 3 for an overview of used symbols). Generally, we consider a surgical treatment to be a Surgical Process, which is a function of the outlined factors.
Technically, a Surgical Process is recorded by a measurement system, which is influenced by measurement system factors. The measurement system factors therefore influence the representation of a Surgical Process by a Surgical Process Model.
Surgeon-specific factors that influence a surgical process are mainly the human factors of surgeons (21) and the staff in the OR. Two actors performed the simulations of our study: one played the role of the surgeon, and the other played a combined role of assistant and scrub nurse. The surgeon-specific factors were not considered separately because the actors were directed to follow the work steps of the audio representations of the Gold Standards closely.
Surgical Processes vary due to the use of different surgical tools, instruments, and devices. The technology factors were also considered as extraneous variables, not separated, and constant for the study due to the predefinition of instrument names, usage times, and order by the simulation scripts.
We introduced the patient-specific factor group to indicate the patient's current situation, his or her history or future, and his/her specific anatomical and pathological circumstances. We considered the patient-specific factors group as an extraneous variable and constant because the simulations were performed on 3D-Rapid Prototyping models, which all use the same template.
For the study, we focused on data acquisition by human observers, supported by the Surgical Workflow Editor. We classified the measurement system into influence factors: the structural ontology, the content ontology, and the Surgical Workflow Editor as observation support software. For the observer, we opted for the observation workload and the knowledge level of the observer as factors. We considered [...] as extraneous variables, assuming them to be constant.
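The factor model of this section can be summarized compactly. The symbols below are placeholders introduced only for this illustration; the notation actually used in the study is listed in Table 3 and is not reproduced here. Assume $s$, $t$, and $p$ denote the surgeon-specific, technology-specific, and patient-specific factors, and $m$ denotes the measurement system factors:

```latex
\begin{align}
  \mathit{SP}  &= f(s, t, p) \\
  \mathit{SPM} &= g(\mathit{SP}, m)
\end{align}
```

Under this reading, holding $s$, $t$, and $p$ constant through scripted simulations on identical models means that differences between the recorded SPMs can be attributed to the measurement system factors $m$.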

Independent Variables
The focus of this study was the validation of accuracy differences in SPMs resulting from different data acquisition strategies and observer populations [...]

Dependent Variables
We defined six different metrics for validation within the context of Surgical Process Modeling. The six metrics were designed to cover the facets that characterized the quality of data acquisition for SPM and were complementary to each other. For an overview of the computational order of the validation metrics, the reader is referred to Figure 2.
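To make the three validation criteria named in the study (granularity, content accuracy, temporal accuracy) more concrete, the following sketch computes one simple measure for each. These are simplified stand-ins chosen for illustration, not the six metrics defined in the study, whose exact formulas are not given in this text. All function names and the input formats are assumptions.

```python
from statistics import mean
from typing import Dict, List, Tuple


def correct_granularity_ratio(matrix: List[List[int]]) -> float:
    """Share of reference objects matched one-to-one by an observed object.
    matrix[i][j] == 1 means observed object i is linked to reference object j."""
    if not matrix or not matrix[0]:
        raise ValueError("matrix must be non-empty")
    n_reference = len(matrix[0])
    row_sums = [sum(row) for row in matrix]
    correct = 0
    for j in range(n_reference):
        linked_rows = [i for i, row in enumerate(matrix) if row[j]]
        # One-to-one: exactly one observed object, which links to nothing else.
        if len(linked_rows) == 1 and row_sums[linked_rows[0]] == 1:
            correct += 1
    return correct / n_reference


def content_accuracy(observed: Dict[str, str], reference: Dict[str, str]) -> float:
    """Share of reference entries that the observer reproduced exactly."""
    if not reference:
        raise ValueError("reference must not be empty")
    hits = sum(1 for key, value in reference.items() if observed.get(key) == value)
    return hits / len(reference)


def mean_abs_time_error(pairs: List[Tuple[float, float]]) -> float:
    """Mean absolute difference between observed and reference timestamps (s)."""
    return mean(abs(obs - ref) for obs, ref in pairs)


# Hypothetical example data.
reg = [[1, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
ref_activity = {"actor": "surgeon", "action": "suction",
                "instrument": "suction tube", "structure": "left nasal cavity"}
obs_activity = dict(ref_activity, structure="right nasal cavity")
timestamps = [(10.5, 10.0), (21.0, 23.0), (30.0, 30.0)]

print(correct_granularity_ratio(reg))
print(content_accuracy(obs_activity, ref_activity))
print(mean_abs_time_error(timestamps))
```

Separating the three measures reflects the idea that the metrics are complementary: an observer can record the right content at the wrong granularity, or the right structure with timing errors.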

IV. RESULTS
Detailed results for structural outliers are presented in Table 4. Medical students recorded granularity correctly for 92.3% (±5.7%) of all activities in the reference in live observations and for 92.5% (±5.2%) in video observations, as opposed to 86.6% (±6.8%) in live observations and 91.2% (±6.7%) in video observations for technical students. The mode of data acquisition was significant. Video observations were more accurate in terms of correct granularity. Missing activities and activities with decreased granularity were more prevalent in the live observations. The observer population also had a significant influence on structural outliers. For instance, medical student observers were more likely than technical students to record granularity correctly.
The overall content accuracy for activities was 91.5% (±5.4%) in live and 91.5% (±5.3%) in video observations by medical observers. Content accuracy for activities was 88.9% (±2.6%) for live and 87.4% (±8.9%) for video observations by technical students (cp. Table 5). The data acquisition type had no significant influence on content accuracy for activities, but video observations produced significantly lower content accuracy for events.
The mean absolute value for temporal accuracy was less than 2 s for all factors. The data acquisition type had only a weakly significant influence on temporal accuracy (cp. Table 6). The observer population had a significant influence on temporal accuracy.
Data acquisition from videos required 80% more time than data acquisition from live observations. No significant differences were found in completion time between medical and technical observers. Nearly all workload criteria, and also the estimation of one's own performance, were rated higher for live observations (cp. Table 7). All workload criteria were rated more demanding by the technical observer population.
The Gold Standards had a significant influence only on the number of structural outliers. Medical students scored 94.1% correct answers on the knowledge tests, while technical students scored 78.3%.

V. DISCUSSION
Former studies validated observations based on inter-observer agreements (23; 24) and used correlations as indirect metrics to quantify the agreements. For valid observations, a threshold of inter-observer agreement is reported (24). Our results were calculated based on direct comparison of observation results with the observed process as a reference.
We found that observers generally record accurately, robustly, and reproducibly. The accuracy of data acquisition for live or video observation was comparable.
The results for structural outliers give a measurement for the assessment of the granularity of an SPM.
Nearly all of the activities were observed with correct granularity. In contrast to the observer population, the influence of the data acquisition type had low significance. We may conclude that differences between video and live observations of activities regarding the validation criterion of structural outliers are not statistically significant.
The observations for state transitions and events were unacceptable. Seemingly, the concentration of the observers was focused on the interventional site and on the monitor displaying the endoscope view, not on the monitor displaying the state transitions and the events. This might be compensated by introducing acoustic signals that highlight them for the observers or perhaps even for the surgeons themselves in the operating room.
Content accuracy showed no significant differences between the data acquisition strategies. Thus, we conclude that live and video observations may be considered similar regarding the validation criterion of content accuracy. The medical observers recorded the activity content significantly better than the technical observers. Low accuracy occurred mainly because students could not properly assess the spatial perspective. None of the perspectives showed a significant difference by data acquisition strategy.
However, there is still work to do to develop a method for direct global content accuracy comparison that accounts for the positive and negative variation in granularity.
The completion time was far longer when recording from videos than from live simulations. This result is especially interesting when considering the comparable outcomes of live and video observations for granularity, content and temporal accuracy.
The small increases of the measured ratios of the knowledge tests during the data acquisition sessions showed the effectiveness of the training sessions. We trained the technical observers in a similar manner to the medical observers, but they were not able to attain the same level of knowledge. Values for all workload criteria were lower for video observations. Technical observers rated all workload criteria more demanding than medical observers. This may have influenced the lower granularity, content accuracy, and temporal accuracy of the technical observers (compared to the medical observers).
The validity of the simulation was checked by comparing the Bronze Standards to the Gold Standards. In the context of the study, the Gold Standards were held as the objective and unequivocal models that were, by definition, the simulation scripts. The Bronze Standards were viewed as the best results that the observers could achieve. The simulation validation was used to cross-check the validity of the simulations. For instance, the mean ratio of correct granularity of the Bronze Standards was [...], and the mean content accuracy was [...]. Thus, the actors introduced only very few simulation errors.

Disadvantages
• not all information for the SPM can be captured on video
• field of view can be blocked by intervention participants
• high costs of time for data acquisition
• loss of information due to distraction or increased workload of the observer
• limited temporal resolution

B. Limitations of the Present Study
Limitations to our work include: • The observations were based on simulated Surgical Processes. Of course, simulations are not 100% realistic. Ideally, the study would have used real surgical cases, but that would have prevented control of many factors that could affect results.
• The validation metrics used for assessing the quality of data acquisition for Surgical Process Models need to be validated in additional studies.

C. Implications for Future Work
In this study, we proposed an innovative experimental design for the validation of knowledge acquisition

VI. CONCLUSIONS
The results of this study can provide useful guidance for the design of other studies to acquire knowledge for SPMs. We demonstrated the validity of video as well as live observations for modeling SPMs and that trained human observers generally record accurately, robustly, and reproducibly. We also outlined the areas where human observations were less accurate; future work should concentrate on these areas. Live observations of state transitions and events should be supported by a technical sensor system with intra- or post-observation synchronization to the observer protocol or an acoustic signal that draws the attention of the observer to the displaying device. For routine use, we recommend live observations due to their relative speed, flexibility, and effectiveness. If high precision is needed or SPM parameters, such as the ontologies used, are altered during the study, video observations are preferable. Trained medical students can be highly accurate observers.
This study also provided an estimate of the expected accuracy of modeling surgical processes by observation. We identified influence factors that can serve as basis for designing similar studies, in which, for example, the work of surgeons with varying levels of experience or the effect of the use of different surgical instruments might be compared. Our validation metrics can be applied to studies with comparable reference standards, but producing such references is a significant challenge.
Modeling surgical processes is undoubtedly a challenge for the observers. Special advance training is required, for example, for live observations in the operating room. The study setup, albeit in a narrower context, as well as the validation metrics, can be used to benchmark the level of observers in training. For instance, if it were important for the observers to achieve a certain degree of content accuracy before they can participate in clinical studies, the methods used in this study could be used to measure their proficiency.