Abstract

A key aspect of the practice of anaesthesia is the ability to perform practical procedures efficiently and safely. Decreased working hours during training, an increasing focus on patient safety, and greater accountability have resulted in a paradigm shift in medical education. The resulting international trend towards competency-based training demands robust methods of evaluation of all domains of learning. The assessment of procedural skills in anaesthesia is poor compared with other domains of learning and has fallen behind surgical fields. Logbooks and procedure lists are best suited to providing information regarding likely opportunities within training programmes. Retrospective global scoring and direct observation without specific criteria are unreliable. The current best evidence for a gold standard for assessment of procedural skills in anaesthesia consists of a combination of previously validated checklists and global rating scales, used prospectively by a trained observer, for a procedure performed in an actual patient. Future research should include core assessment parameters to ensure methodological rigour and facilitate robust comparisons with other studies: (i) reliability, (ii) validity, (iii) feasibility, (iv) cost-effectiveness, and (v) comprehensiveness with varying levels of difficulty. Simulation may become a key part of the future of formative and summative skills assessment in anaesthesia; however, research is required to develop and test simulators that are realistic enough to be suitable for use in high-stakes evaluation.

A key aspect of the practice of anaesthesia is the ability to perform practical procedures efficiently and safely. Gaba and colleagues30 differentiated between technical performance, the ‘adequacy of actions taken from a medical and technical perspective’, and non-technical performance, ‘decision-making and team interaction processes’. This broad definition of technical skills includes many different processes, including the recall of factual information, diagnosis, and performing practical procedures. However, for the purposes of this review, we will distinguish between Gaba's broad definition of ‘technical skill’ and a more focused definition of ‘procedural skill’, by which we mean the performance of a practical procedure.

The assessment of procedural skill in anaesthesia is generally given less importance than the assessment of knowledge and judgement-based skills.84,93 This is partly because there has been no universally accepted and comprehensive way to assess procedural skill. It is notable that non-technical skills such as communication constitute another domain that has also been assessed informally, although a full discussion of the evaluation of non-technical skills26 is outside the scope of this review. There has been an international trend towards decreasing working hours for doctors in training and reduced trainee exposure to procedural skills. This has led some in the medical profession to question whether there is time for procedural skills to be adequately learned during current training programmes.14 There is a perceived need for greater accountability to the public and government and an increasing emphasis on patient safety.48 The consequences of suboptimal skills include death and permanent brain damage, emphasizing the importance of a robust system for ensuring competence in procedural skills in anaesthesia.13 These factors have led to a paradigm shift in postgraduate medical education from systems based on completing accredited posts for a specified amount of time to ‘competency-based’ curricula,72–74,96 which demand a focused and rigorous method of evaluating procedural skill.86

Surgical outcome relies on sound procedural skills,58 perhaps even more than in anaesthesia, and research in the assessment of procedural skills has been pioneered in surgery.69 We will therefore also discuss innovations in the surgical field that may be useful for anaesthesia in the future. This review will discuss the acquisition of expertise in procedural skills and general principles of the educational theory behind assessment, examine the literature on the different techniques that can be used for the assessment of procedural skills in anaesthesia, and make relevant extrapolations from the literature in other fields of medicine. Finally, we will discuss the context of procedural skills assessment, including the use of simulation.

The acquisition of expertise in procedural skills

There are three stages in the acquisition of procedural skills: cognition, integration, and automation.52 Cognition includes developing an understanding of the task and perceptual awareness; it is assisted by a clear description and demonstrations of the task. In the integration stage, the knowledge from the cognition phase is incorporated into the learning of the motor skills for that task. Ultimately, the task becomes automatic and even subconscious. For an expert, it may be difficult to break a task down into its component parts in order to teach a novice.

The acquisition of competence or expertise in a procedural skill requires experience through a variable number of attempts depending on the skill, the quality of teaching, and the aptitude of the individual. In anaesthesia, skills are typically initially learned on relatively straightforward cases. When competence is achieved in those situations, trainees are exposed to a broader range of normal and pathological variations required for the development of true expertise. Retention of motor skills seems to be most dependent on the degree to which the skill was perfected.37

Which skills should be assessed?

A full discussion of which skills should be assessed is beyond the scope of this review. One viewpoint is that technical skills should be assessed if they are either commonly performed or potentially life-saving.93 However, determining the core skills for life-saving manoeuvres can be controversial, as demonstrated by a recent debate on whether competence in cricothyrotomy can reasonably be expected.11,39 Core skills also change as medical technology and knowledge develop.42

General principles

A summative assessment is made at the end of a period of training. Summative assessments usually assign either a grade or a pass/fail result and have been described as an ‘assessment of learning’.32 Summative evaluations may: (i) be part of progress within a competency-based training scheme; (ii) be required before being allowed more significant levels of responsibility; or (iii) form part of certification or revalidation of medical licensing. Summative assessments of procedural skills within anaesthesia training programmes have historically been performed through retrospective subjective feedback from supervising faculty without specific criteria, self-reported procedure logs, or both.

Formative assessments are used to aid learning and have been described as ‘assessment for learning’. In order to be useful, feedback from formative assessment needs to occur in a timely manner, so that it can influence a trainee's progress. Traditional thinking has been that feedback should be given as soon as possible, sometimes concurrently during the performance of the procedure.87 However, there is recent evidence that feedback after completion of a task is more effective than concurrent feedback, especially for long-term retention of skills.77,97 Formative assessment is usually undertaken during supervised clinical work and so its effectiveness is subject to attributes of the supervisor such as willingness to teach and interpersonal skills,45 regardless of the assessment tool used.

Criterion-referenced assessment is when the basis for comparison is a well-defined list of criteria. Norm-referenced assessment is when trainees are compared with their peers. Ipsative assessment is when a trainee's performance is compared with their own over a period of time.

The quality of a method of assessment is described by its reliability and validity. Feasibility relates to its usefulness in the ‘real world’. Other important factors in assessing procedural skills are whether they comprehensively test every aspect of the skill and whether feedback is suitably timely to promote further learning.84 Comparison of these factors for procedural skills testing is described in Table 1.

Table 1

Reliability, validity, and feasibility of techniques to assess technical skills

Test | Reliability | Validity | Feasibility | Comprehensiveness
Psychometric testing | Automated and objective | Limited correlation with early performance in fibreoptic endoscopy | Relatively expensive; requires a degree of expertise to collect, process, and interpret the results | Not comprehensive
Procedural lists | Self-reported and subject to omissions and inaccuracies | No guarantee that procedures were performed correctly | The easiest form of evaluation | Not comprehensive
Procedural lists with prescribed minimum number | Self-reported and subject to omissions and inaccuracies | No guarantee that procedures were performed correctly | Very easy to use | Not comprehensive
Cusum | If self-reported, subject to omissions and inaccuracies | If self-reported, no guarantee that procedures were performed correctly | Requires esoteric statistical analysis; usually depends on self-reporting | Depends on how success and failure are defined and identified
Direct observation without criteria | Poor reliability | Has face validity | Very easy to use | Potentially comprehensive
Checklists | Excellent reliability with trained observers, especially within the context of an educational study | Construct validity established in anaesthesiology for lumbar epidurals and interscalene blocks; intrinsic content validity | Easy to use; requires some training for optimal reliability | Potentially comprehensive: depends on content of checklist
Global rating scales | Excellent reliability with trained observers, especially within the context of an educational study | Construct validity established in anaesthesiology for lumbar epidurals and interscalene blocks | Easy to use; requires some training for optimal reliability | Potentially comprehensive: depends on content of global rating scale
Motion analysis | Objective, with a numerical output, and so not subject to differences in the opinion of the observer | Construct validity established for open general surgery, laparoscopic surgery, and microsurgery, but as yet untested in anaesthesia | Relatively expensive; requires a degree of expertise to collect, process, and interpret the results | Not comprehensive: only evaluates motor skill
Simulation | Depends on assessment tool used with the simulation | Highly variable degrees of face validity between simulations; otherwise, validity is largely unproven | Cost is highly variable depending on the type and complexity of the simulation | Depends on assessment tool and the nature of the simulation
Multi-station bench testing (e.g. OSATS) | Excellent reliability with trained observers, especially within the context of an educational study | Construct validity established for open general surgery and laparoscopic surgery, but as yet no comparable assessment tool in anaesthesia | Relatively expensive; time-consuming for faculty; requires co-ordination at a central site | Potentially comprehensive

Reliability

Reliability refers to the reproducibility of a test. In education, this may refer to inter-rater agreement or test–retest agreement; these agreements are often described as ‘external reliability’. There are several statistical approaches that can be used to describe reliability (a computational sketch of two of these statistics follows the list):

  1. Pearson's product-moment correlation coefficient has been used to describe inter-rater agreement. An r >0.75 indicates excellent agreement.64 However, this method has been criticized as not accounting for bias.43 For instance, if one examiner consistently uses the low marks in a scale and another consistently uses the higher marks, they could still have a high correlation coefficient if they rank subjects similarly. This bias would be particularly significant if the assessment tool has a set pass mark.

  2. Intraclass correlation coefficient (ICC). The ICC is also used to describe inter-rater agreement. It accounts for the agreement that would be seen by chance and is defined as the ratio of the between-subjects variance to the total variance (between-subjects variance plus error variance). An ICC of 0.8 means that 80% of the variance among scores can be attributed to true variance among subjects.90 Cohen's κ coefficient is a type of ICC61 that can only be used when there are two raters. A κ >0.80 has been described as indicating near-perfect agreement; 0.61–0.80, substantial agreement; 0.41–0.60, moderate agreement; 0.21–0.40, fair agreement; 0.00–0.20, slight agreement; and <0.00, poor agreement.54 However, it should be noted that this categorization is not universally accepted, and in practice the degree of acceptable agreement depends on the circumstances. For instance, a ‘high stakes’ licensing exam requires a particularly reliable assessment tool.3

  3. Internal consistency is sometimes described as ‘internal’ reliability, in contrast to the ‘external’ reliability of inter-rater or test–retest agreement. Internal consistency reflects not whether a subject performs similarly in each part of a test, but whether different subjects tend to do well or badly on the same parts of the test. It is commonly quantified with Cronbach's α, which gives a value from 0.0 to 1.0: by convention, 0.0–0.5 can be regarded as imprecise, 0.5–0.8 as moderately reliable, and 0.8–1.0 as reliable enough for high-stakes purposes such as certification, although again these cut-offs are arbitrary.47,68 Internal consistency is a commonly used measure of reliability as it describes the reproducibility of a test. However, the consequences of poor internal reliability are less problematic than those of poor external reliability: a test where different candidates fail on different questions could be due to differences in clinical experience or teaching, whereas a test with poor inter-rater reliability could be considered intrinsically unfair to the subjects.
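To make the statistics above concrete, the following is a minimal Python sketch, using invented data and written for this discussion rather than taken from the review, of Cronbach's α for internal consistency and Cohen's κ for agreement between two raters.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha. scores: (subjects x items) array of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                              # number of items
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

def cohens_kappa(rater1, rater2):
    """Cohen's kappa: two-rater agreement corrected for chance."""
    r1, r2 = np.asarray(rater1), np.asarray(rater2)
    categories = np.union1d(r1, r2)
    observed = np.mean(r1 == r2)                     # observed agreement
    expected = sum(np.mean(r1 == c) * np.mean(r2 == c)  # chance agreement
                   for c in categories)
    return (observed - expected) / (1 - expected)
```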

Validity

Validity describes whether the test is measuring what it sets out to measure.

  1. Face validity refers to a general impression of whether the evaluation seems appropriate. For example, evaluating the performance of epidural anaesthesia on a model consisting of a banana has little face validity compared with direct observation of the placement of an epidural in a patient. Face validity is perhaps best assessed by expert opinion, although good face validity, as judged by the subjects of the assessment, improves ‘buy-in’ to the evaluation.

  2. Content validity refers to whether an assessment tests the content either of what was being taught or appropriate content as defined by a group of experts.

  3. Concurrent validity establishes validity based on agreement with another established valid measure.

  4. Construct validity is used when there is no established gold standard for comparison. A construct is a concept that is to some extent abstract. For instance, although we can all agree that there is such a thing as ‘expertise’, it is more difficult to precisely define or test that construct. Instead, we can test a surrogate outcome, such as experience, that is easier to quantify and that we expect to be associated with the construct of expertise. An educational evaluation is therefore considered valid if it can differentiate between groups with different levels of experience, and it is increasingly valid if the groups it can distinguish are more similar in experience; a simple illustration of this approach is sketched after this list. Reznick comments that ‘validity cannot be proven in any one experiment. Rather, over time and experimentation one accrues evidence for the validity of a test’.68 This is particularly true for construct validity, which only examines a surrogate measure.

  5. Predictive validity is the ability of a test to predict something that happens after the test, such as a clinical outcome or the future test results of a subject. Although such outcome data are the most useful demonstration of validity, predictive validity is generally the most difficult to establish.
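As an illustration of construct validity testing as described above, the following minimal Python sketch uses entirely invented scores: an assessment tool shows evidence of construct validity if it separates groups whose experience is known to differ. The use of a Mann-Whitney U test is an assumption chosen because rating-scale totals are ordinal; it is not a prescription from the literature discussed in this review.

```python
# Hypothetical construct-validity check: do novice and expert scores differ?
from scipy.stats import mannwhitneyu

novice_scores = [14, 17, 15, 19, 16, 18]   # invented GRS totals for novices
expert_scores = [28, 31, 27, 33, 30, 29]   # invented GRS totals for experts

stat, p_value = mannwhitneyu(novice_scores, expert_scores,
                             alternative="two-sided")
print(f"U = {stat}, p = {p_value:.4f}")
# A small p-value is one piece of evidence for construct validity; as noted
# above, validity accrues over repeated experiments rather than a single test.
```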

Techniques for the assessment of procedural skills

Psychometric and aptitude testing

Psychometric testing has been found to be of limited value in predicting subsequent procedural performance in the surgical field. A statistically significant but modest correlation was found between the performance of a Z-plasty procedure and scores in tests that assess the ability to rotate 2D and 3D figures mentally, but not with less complex tests that assess the recognition of simple shapes.92 Laparoscopic surgery requires surgeons to infer the shape of 3D structures from 2D screens. The Pictorial Surface Orientation (PicSOr) is a computer-based test of depth perception.31 Gallagher and colleagues31 compared PicSOr scores with performance on simulated laparoscopic cutting tasks in both novices and expert laparoscopic surgeons and found a modest correlation between the two.

The MICROcomputerised Personnel Aptitude Tester (MICROPAT) measures psychomotor ability and has been used in anaesthesia to compare performance in adaptive pursuit-tracking tasks with subsequent performance in fibreoptic nasotracheal endoscopy. Pursuit tracking was correlated with faster times to completion of nasotracheal endoscopy, accounting for approximately one-third of the ability early in the learning curve.17 The MICROPAT has also been investigated as a method of predicting obstetric epidural failure rates but was not correlated with failure rates for either the first 25 or the first 50 epidurals, suggesting a limited application for this evaluation.16

As psychometric tests have only been shown to have moderate correlations with performance in the early stages of technical skill acquisition, it remains to be seen whether they have a role in medical recruitment or selection for any speciality. Indeed, evidence from fields of expertise outside medicine suggests that large numbers of hours of deliberate practice are more important to the acquisition of motor skills than innate ability.21 In the context of procedural skills in anaesthesia, psychometric testing is currently essentially a research tool and largely unproven.

Procedure lists

Assessment of technical skill has historically been by a combination of a subjective impression from an educational supervisor and logbooks or procedural lists.96 Self-reported procedure lists are a common form of assessment of technical skills, mainly because of their high feasibility. Although performing a certain number of procedures is clearly necessary to provide the opportunity to progress through the stages of acquisition of technical skills, the actual number is highly variable between individuals.15,20 There are clear limitations in the value of procedure lists, especially if used for summative assessment: there is no guarantee that a task was performed correctly, and trainee anaesthetists can consistently repeat mistakes despite considering their own performance to be acceptable. Performing a procedure badly a large number of times is of little educational value and also puts patients at risk.93 In conclusion, procedural lists are most useful for assessing the opportunities provided by training posts and for guiding programme directors, rather than for assessing individuals.8

Cumulative sum analysis

Cumulative sum (Cusum) analysis was originally developed in industry as a method of quality control. It is a statistical method that looks at the outcome rather than the process of performing procedural skills. Cusum plots a graph of the subject's performance over time based on predetermined criteria for success or failure. A value for Cusum is plotted on the y-axis and the number of attempts on the x-axis (Appendix). Failures move the plot up the y-axis, and successes move it down, as the subject progresses through increasing numbers of attempts. When the Cusum score decreases below a level based on a predetermined ‘acceptable failure rate’, the subject can be considered competent with statistical significance. The distance that a trainee is above a predetermined line is an indication of how far they are from achieving competency as defined by Cusum.46 Authors have also described other useful endpoints, such as a change in the curve that denotes either an improvement or worsening in performance.46,98

Cusum analysis is an effective objective tool to define learning curves for technical skills. Learning curves can be constructed for individuals or summated to provide learning curves for a population. Cusum can be used to provide an estimate of the number of cases required to achieve competency98 and demonstrates wide variation in that number between individuals.62 For example, subjects required between nine and 88 attempts to become competent at tracheal intubation.20 It can be used to identify when a change in the training process is indicated,98 and poor technique may be corrected before demoralization.46 Cusum has been used to alter the schedule of training rotations, change curricula, and initiate mentoring programmes.98 As such, it can be used not only to assess the competency of individuals in procedural skills but also as an assessment of a training programme and its ability to teach those skills. Another potential application is as a continuous audit of quality of practice for experienced clinicians, although it is more commonly used this way in surgery, where complications that constitute negative endpoints are more common.9

A potential disadvantage of the Cusum method is that it often relies on self-reporting, which may be inaccurate. Direct observation of all procedures performed by trainees is unlikely to be possible, as a considerable amount of work may be performed while on call. Cusum is only as objective as the predefined success/failure endpoint, and these definitions may vary widely.20,51,98 Acceptable rates of success can be determined by institutional rates or expert consensus46,62,98 and also depend on the definition of success. As training in anaesthesia is based on the principle of gradually increasing responsibility and reducing supervision, an increase in responsibility may result in more difficult cases and therefore a deterioration of the Cusum curve, despite no deterioration in skill. The issue of accounting for the level of difficulty posed to the subject can also be problematic for other methods of assessing technical skill and is further discussed in the simulation section.

A final disadvantage of Cusum is that a great number of attempts may be necessary to prove statistical significance,7,78 and this could be unfair to trainees if their progression through a training rotation depends on Cusum-defined competency.

Direct observation without criteria

Direct observation by a consultant is traditionally used to assess procedural skills and is feasible in anaesthesia because of the high degree of supervision in cases performed by trainees.84 Despite being feasible, assessments without specific criteria result in poor reliability and validity. To overcome these problems, direct observation with specific criteria has been developed.

Direct observation with criteria

Checklists

Binary content checklists can be used as a way of grading performance during direct observation. Checklists break a task down into its component parts and assign a dichotomous pass/fail outcome to each point. A new checklist needs to be designed and validated for each procedural skill that is to be assessed. Checklists can be constructed by surveying experts,64 although a different group of experts may not agree on each point of the checklist. For instance, different groups have published checklists for epidural anaesthesia with 14, 27, and 61 items.28,71,80 A systematic review and content analysis of checklists for procedural skills assessment can be found elsewhere.58 Checklists have also been designed with outcomes of ‘not performed’, ‘performed poorly’, and ‘performed well’ rather than a binary pass/fail outcome, allowing them to be more qualitative28 at the cost of some objectivity. A potential problem with checklists is that if all stages are weighted equally regardless of clinical importance, then a trainee may be able to obtain a high score despite omitting important stages. To prevent this, certain stages can be marked as resulting in an automatic fail if not completed, or an overall pass/fail option can be added to the scoring system.
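As a toy illustration of the weighting problem and the ‘automatic fail’ solution described above, the Python sketch below uses hypothetical checklist items and an arbitrary pass mark; it is not a validated checklist. It awards one point per completed item but fails the candidate outright if any critical step is omitted.

```python
# Hypothetical binary checklist with designated critical ('automatic fail') steps.
CHECKLIST = [
    {"item": "Confirms patient identity and consent", "critical": True},
    {"item": "Performs hand hygiene and aseptic preparation", "critical": True},
    {"item": "Assembles and checks equipment before starting", "critical": False},
    {"item": "Identifies anatomical landmarks correctly", "critical": False},
    {"item": "Disposes of sharps safely", "critical": False},
]
PASS_MARK = 4  # arbitrary threshold for this illustration

def score_checklist(performed):
    """performed: dict mapping item text -> bool (completed or not).
    Returns (score, passed)."""
    score = sum(bool(performed.get(entry["item"])) for entry in CHECKLIST)
    no_critical_omissions = all(performed.get(entry["item"])
                                for entry in CHECKLIST if entry["critical"])
    return score, (score >= PASS_MARK) and no_critical_omissions
```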

Checklists have been found to have excellent reliability in the assessment of epidural anaesthesia28 and good reliability for the assessment of interscalene brachial plexus blocks when used by trained assessors.64 An advantage of checklists is that they have intrinsic content validity if they are constructed well, for instance by ensuring that the checklist examines what is taught at that centre. Construct validity has been established for checklists in the assessment of epidural anaesthesia28 and interscalene brachial plexus blocks.64 Predictive validity has not been studied for checklists in anaesthesia.

Global rating scales

Global rating scales (GRSs) differ from checklists in that they use a Likert scale rather than a dichotomous outcome. As a GRS has a gradation of response in each category, it is less objective than a checklist, although this allows the assessment to be more qualitative.90 GRSs can be used to assess many different skills, and they are the most objective way of assessing aspects of performance such as professionalism and interpersonal skills. When used to assess procedural skill, a GRS may either describe an overall impression of the quality of performance or provide a Likert scale for a number of different domains within an overall performance.64,70 A GRS can be used prospectively or retrospectively, although, as with other forms of assessment, there is evidence that reliability is poor if it is used retrospectively.85 Potential pitfalls with GRSs include the ‘halo effect’, when good or bad performance in one domain unduly influences the grading of performance in other domains. This may be partly due to lack of training of assessors41,85 and can make a GRS seem falsely internally consistent.47 Another common problem with GRSs is self-imposed scale limitation. Assessors commonly restrict themselves to the high end of the scale: in one study, 95.6% of scores on a nine-point scale were between 6 and 9.85 This may reflect either a lack of assessor training or assessors' unwillingness to fail a trainee, knowing the potentially serious consequences. An alternative explanation for scale limitation is that most trainees are of a high standard. However, to be useful, a GRS must be designed to differentiate grades of quality beyond distinguishing between outstanding and failing trainees.

A GRS developed for the assessment of procedural skills in surgery at the University of Toronto (Table 2)55,68 has repeatedly been found to have construct validity for the assessment of procedural skills in both surgery55,68 and anaesthesia, differentiating between junior and senior trainees performing an interscalene block64 and discriminating between various levels of experience at performing epidural anaesthesia.28 It has also been found to have good reliability for the assessment of orotracheal fibreoptic intubation,63 epidural anaesthesia,28 and interscalene brachial plexus blocks.64

Table 2

GRS for procedural skills in anaesthesia

 
Domain | Low anchor | Middle anchor | High anchor
Preparation for procedure | Did not organize equipment well; has to stop procedure frequently to prepare equipment | Equipment generally organized; occasionally has to stop and prepare items | All equipment neatly organized, prepared, and ready for use
Respect for tissue | Frequently used unnecessary force on tissue or caused damage | Careful handling of tissue but occasionally caused inadvertent damage | Consistently handled tissues appropriately with minimal damage
Time and motion | Many unnecessary moves | Efficient time/motion but some unnecessary moves | Clear economy of movement and maximum efficiency
Instrument handling | Repeatedly makes tentative or awkward moves with instruments | Competent use of instruments but occasionally appeared stiff or awkward | Fluid moves with instruments and no awkwardness
Flow of procedure | Frequently stopped procedure and seemed unsure of next move | Demonstrated some forward planning with reasonable progression of procedure | Obviously planned course of procedure with effortless flow from one move to the next
Use of assistants | Consistently placed assistants poorly or failed to use assistants | Appropriate use of assistants most of the time | Strategically used assistants to the best advantage at all times
Knowledge of procedure | Deficient knowledge | Knew all important steps of procedure | Demonstrated familiarity with all aspects of procedure
Overall performance | Very poor | Competent | Clearly superior
 

An advantage of GRSs is that they are not confined to one procedure but can be used for different procedural skills. However, some domains may be particularly useful for certain kinds of procedure: for instance, the domain ‘depth perception’ has been added for laparoscopy but is unlikely to be useful in anaesthesia, while ‘autonomy’ is likely to be useful for procedural skills assessment in any speciality.90

GRSs are currently being used for in-training assessment of procedural skills in the UK Foundation Programme, which covers the first two postgraduate years in all specialities. One mandatory competence assessment tool is the Direct Observation of Procedural Skills (DOPS),10 a specific six-point, 11-domain GRS used to assess performance in procedural skills (Table 3). It is notable that DOPS focuses on the context of the procedural skill: nine of the domains describe pre- and post-procedure care and non-technical skills, and the actual assessment of the procedural skill is limited to a single domain. DOPS was developed by the Royal College of Physicians UK and is currently being studied as a pilot system. Foundation year trainees need to undertake six DOPS assessments each year from an approved list.66 The trainee chooses the timing, the procedure, and the assessor, who may be a more senior doctor or a nurse but is expected to have had some training in the use of DOPS.

Table 3

The DOPS domains

1. Demonstrates understanding of indications, relevant anatomy, technique of procedure
2. Obtains informed consent
3. Demonstrates appropriate preparation pre-procedure
4. Demonstrates situation awareness
5. Aseptic technique
6. Technical ability
7. Seeks help where appropriate
8. Post-procedure management
9. Communication skills
10. Consideration of patient
11. Overall ability to perform procedure

McKinley and colleagues argue for a holistic evaluation of procedural skills for summative purposes. Their Leicester Clinical Procedure Assessment Tool (LCAT)57 was developed in response to a systematic review58 of checklists and GRS for the evaluation of procedural skills that found that teamwork competencies and humanistic competencies such as safety and infection prevention were omitted from the majority of assessment tools. The LCAT has been shown to be reliable and has thoroughly demonstrated content and face validity, but construct and predictive validity have not yet been investigated.57

It is not clear whether construct validity as tested within an educational research trial is necessarily generalizable to ‘real world’ practice using in-training assessments by untrained raters. Potential impediments to achieving the same degree of reliability in the ‘real world’ as in research involving direct observation of a procedure on actual patients include an imprecise GRS; patient variability, resulting in heterogeneous levels of difficulty; lack of training of the assessor, resulting in differing levels of expectation among staff; and the degree to which the trainee is acting independently.55

Comparisons between GRSs and checklists

Both checklists and GRSs have been found to have good reliability.33,60 Checklists have been challenged as being able to distinguish novice from expert performance but failing to differentiate between higher levels of performance, rewarding thoroughness rather than expertise. This may be particularly true for non-technical skills: Hodges and colleagues40 compared checklists and GRSs in psychiatry OSCE stations and found that experts scored better than trainees or medical students when assessed with a GRS but worse when assessed with checklists. The authors concluded that checklists might penalize experts who take shortcuts that they have learned as part of their expertise and that ‘an instrument that is valid at one level of training may not be valid at another’.40 Similarly, for procedural skills assessment, a comparison of checklists and GRSs in the performance of simulated surgical procedures by trainees found that the GRS had better construct validity than the checklist, although both instruments demonstrated construct validity and good reliability.67

Other authors have suggested that checklists may be better suited to assessing procedural skills than other domains of learning, as procedural skills tend to be sequential and predictable.53 It has also been suggested that procedure-specific scales may provide an additional degree of formative feedback compared with generic scales.1 A good example is in ultrasound-guided regional anaesthesia, where radiological visualization of the nerve is a key skill that is not accounted for by generic scales.79 It is notable that a GRS generally gathers different information from a checklist. Friedman and colleagues29 have demonstrated the value of a checklist in identifying poor aseptic technique despite good performance in other aspects of procedural skills. This suggests that a combination of a checklist and a GRS may be advantageous when a comprehensive evaluation is required, for instance when testing an intervention in education research.

A dilemma for both GRSs and checklists is determining cut-off scores for what can be considered competent or not competent when they are used for summative evaluation. One solution is to examine a population of expert anaesthetists and define proficiency from their scores. Using a normative marking scheme with a fixed proportion of subjects passing and failing has the limitation that it would fail to identify an unusually good or bad cohort of trainees.
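One way to operationalize the expert-referenced approach described above is sketched below in Python; the scores and the ‘mean minus one standard deviation’ cut are invented conventions for illustration, not standards proposed by this review.

```python
# Criterion-referencing a pass mark against an expert cohort (invented data).
import statistics

expert_scores = [31, 28, 33, 30, 29, 32]    # hypothetical expert GRS totals
cut_score = (statistics.mean(expert_scores)
             - statistics.stdev(expert_scores))

def is_competent(trainee_score):
    """Passes if no more than one standard deviation below the expert mean."""
    return trainee_score >= cut_score
```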

Other instruments

Global Operative Assessment of Laparoscopic Skills (GOALS) is an assessment tool that combines a GRS and a checklist with visual analogue scales (VAS). The two 10 cm VAS are anchored at each end with specific descriptors and refer to (i) overall competence and (ii) the observed difficulty of the procedure. GOALS was found to be more reliable than a GRS or checklist alone, with inter-rater reliability suitable for a high-stakes examination. The VAS for competence was found to have construct validity in the assessment of performance of laparoscopic cholecystectomy where a checklist did not.90 The use of a VAS for difficulty when directly observing procedural skills has not yet been explored in anaesthesia, and future research in this area is warranted. Another alternative to checklists and GRSs is to measure the number of predetermined errors.79

Motion analysis

The Imperial College Surgical Assessment Device (ICSAD) is a motion analysis device originally designed for the investigation of hand movements in surgeons. It provides an objective measure of technical ability that has been validated in various surgical fields and has begun to be used in anaesthesia.94 It uses an electromagnetic tracking system (Isotrak II; Polhemus Inc., Colchester, VT, USA) consisting of an electromagnetic field generator and two 10 mm sensors that are attached to the dorsum of each hand. Robotic Video and Motion Analysis Software retrieves time-stamped Cartesian coordinates and defines hand movements by changes in velocity. It processes this information to produce values for the total distance moved by each hand, the number of movements, the total time, and the average hand velocity. A Gaussian filter is used to eliminate background noise so that only meaningful actions register as movements: the threshold that defines a movement is adjustable, but once it is set, the ICSAD is entirely objective.
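The kind of processing described above can be sketched briefly. The Python code below is an illustration of velocity-based movement detection and economy-of-movement metrics under stated assumptions (a fixed sampling interval and arbitrary smoothing and threshold values); it is not the ICSAD's actual software.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def motion_metrics(positions, dt, velocity_threshold=0.05, smooth_sigma=2.0):
    """positions: (n, 3) array of hand coordinates in metres sampled every dt
    seconds. velocity_threshold (m/s) and smooth_sigma are tunable values that
    stand in for the adjustable movement cut-off described above."""
    positions = np.asarray(positions, dtype=float)
    positions = gaussian_filter1d(positions, smooth_sigma, axis=0)  # denoise
    step_lengths = np.linalg.norm(np.diff(positions, axis=0), axis=1)
    velocity = step_lengths / dt

    moving = velocity > velocity_threshold
    # Count one 'movement' per rising edge of the moving/not-moving signal.
    n_movements = int(moving[0]) + int(np.count_nonzero(moving[1:] & ~moving[:-1]))

    total_time = len(step_lengths) * dt
    return {
        "total_distance_m": float(step_lengths.sum()),
        "n_movements": n_movements,
        "total_time_s": total_time,
        "average_velocity_m_s": float(step_lengths.sum() / total_time),
    }
```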

Different aspects of economy of movement have been validated as assessment tools for different procedures. Reduced time taken and fewer total movements have been demonstrated to be associated with expertise in open,5,19 laparoscopic,83 and microsurgery.75 A reduced number of movements has also been found to be correlated with less anastomotic leakage in a simulated arterial graft model, providing some evidence of predictive validity.18 Reduced total path length is associated with expertise in laparoscopic59,81,83 and microsurgery75 but not in open surgery.5,19

In anaesthesia, the ICSAD has the potential to objectively assess the development of expertise in technical skills, but it has not yet been validated for specific anaesthesia procedures; this should be an area of future research. It may also be of use in formative assessment, although the data output is not intuitively interpretable and would perhaps be best compared with previous performance or that of peers. A limitation of the ICSAD is that it can only ever assess process rather than outcome; for instance, it gives no information on whether the procedure was performed well. The data are likely to be more useful if triangulated with a checklist or GRS.

Contexts of procedural skills assessment

Procedural skills have historically been both taught and evaluated on patients. In recent years, advances in simulation technology have enabled the evaluation of procedural skills to be taken out of the clinical context into the simulation laboratory. This has created the possibility of standardized assessment of procedural skills.

Simulation

Simulation includes the use of manikins, human cadavers, animals, virtual reality, and standardized patients. ‘Part-task’ trainers (PTT) simulate a particular anatomical area or procedure, in contrast to full-patient simulation with a computer-enhanced manikin. PTT are generally less expensive than full-patient computerized manikin-based simulators, especially when the costs of the technician and actors required are taken into account, but full-patient systems can better simulate the whole clinical environment in order to recreate situations where procedures must be completed quickly to prevent further physiological deterioration.88 Hybrid simulation involves more than one type of simulation and can be used to put procedural skills in context. An example is having an actor play the part of a patient to assess consent and communication skills, with an attached manikin arm to assess the placement of an i.v. cannula.49,50

The fidelity of a simulation refers to its similarity to an actual task or patient. Although simulation will never be exactly the same as clinical experience, there are a number of advantages in assessing procedural skills in a simulated rather than a clinical environment. Some procedures, such as emergency cricothyrotomy, are uncommon enough that most anaesthetists will not perform them during their training. Assessing procedural skills using a simulator prevents potential harm to patients and has been described as an ethical imperative,99 although ethical issues remain with the use of animals or human cadavers. In-training assessment has also been challenged as problematic because of power relations between the trainee and the trainer,65 potential conflicts of interest between the consultant as a teacher and as an evaluator,95 and the trivialization of evaluation and the development of a ‘tick box’ mentality.6 As there are problems with both standardized simulated evaluations and in-training assessment, a combination of both may offer the best solution.

Predictive validity has been demonstrated for the Human Patient Simulator (METI, Sarasota, FL, USA) for teaching tracheal intubation: 10 h of deliberate practice was found to be as effective as 15 intubations in the operating theatre.36 The AirSim (TruCorp, Belfast, Northern Ireland) has content validity because it is anatomically correct, having been designed from a 3D model based on spiral CT scans of the human airway.23 The extubated anaesthetized sheep model also has content validity for the ‘can't intubate, can't ventilate’ scenario: it has secretions and a mobile larynx, it will bleed if a surgical airway is attempted, it can develop s.c. emphysema, it has a cricothyroid membrane, and it desaturates after extubation, producing a sense of responsibility to regain the airway. There are, however, both cost and ethical disadvantages to this model.38 Construct validity has been established for some simulations in anaesthesia, including the Bill 1 airway simulator (VBM Medizintechnik GmbH, Sulz, Germany) for cricothyrotomy82 and a virtual reality flexible bronchoscopy simulator (Immersion Medical, Gaithersburg, MD, USA).15 However, studies on the validity of anaesthesia simulators for procedural skills have generally limited themselves to discussions of face validity, rated by the subjective impression of realism reported by users.89 Face validity depends largely on the simulation's fidelity. Although there is some evidence that low-fidelity PTT are as effective as high-fidelity simulations in the teaching of anaesthesia procedural skills,12,56 this has not been investigated for evaluation.

Surgical specialities are considerably ahead of anaesthesia in the development and validation of PTT. One example is the McGill Inanimate System for Training and Evaluation of Laparoscopic Skills (MISTELS), a non-anatomical simulator that has been assessed at multiple institutions.27 The MISTELS simulation and its metrics have been found to be reliable, with excellent internal consistency and inter-rater and test–retest agreement,27,91 and to have construct validity27 and concurrent validity.24,27

There is good evidence that simulation-based learning with PTT is effective for teaching procedural skills in surgery. Virtual reality PTT, both non-anatomical25,35 and anatomical,2 have been demonstrated to have predictive validity for the teaching of laparoscopic cholecystectomy and to reduce error when operating on patients. Virtual reality PTT have also been used for high-stakes evaluation in surgery.76 In anaesthesia, predictive validity has only been demonstrated for the teaching of tracheal intubation.36 However, procedural skills are as vital to anaesthesia as out-of-theatre decision-making is to the surgical specialities, and it could be argued that PTT for anaesthesia have been neglected compared with full-patient manikin simulators. As an example, it is notable that there is no commercially available airway PTT capable of simulating a comprehensive range of pathology that can cause a difficult airway.

The success of outcome-driven simulation-based learning suggests that using simulation for evaluation before independent practice has the potential to have a positive impact on patient safety. At present, however, no research has demonstrated predictive validity for any evaluation using any simulation. A priority for future patient safety research should be to demonstrate that a competent performance on a simulator translates into competence in actual patients.

Multiple station examinations of procedural skills

The Objective Structured Assessment of Technical Skills (OSATS) was developed as an objective assessment of procedural skill outside the operating theatre68 and is similar to an OSCE. Candidates perform a series of standardized surgical skills on bench models. At each time-limited station, candidates are examined by the direct observation of experts, and technical skills are assessed using both a generic GRS and a task-specific checklist. OSATS has been shown to be a reliable measure of technical skill.4,33,55,68 Construct validity33,68 and concurrent validity have been established for OSATS, including the correlation of surgical faculty rankings with OSATS scores for senior surgical trainees.22 However, organizing and running an OSATS examination is labour-intensive and relatively expensive.34 An OSATS examination format has not yet been used in anaesthesia but has the potential to standardize procedural skills assessment if suitable simulations can be validated, as discussed above.

Conclusions

Procedural skills in anaesthesia are assessed poorly compared with other domains of learning. This domain has undergone detailed investigation in surgery because of the high importance of procedural skill to surgical patient outcome, and anaesthesia has the opportunity to learn from recent advances in the surgical fields.

It has been argued that evaluation drives learning.58 Current evaluations in anaesthesia tend to focus on broadly defined technical skills but neglect the details of procedural skills (and non-technical skills). Improving the evaluation of procedural skills has the potential to promote excellence in a neglected domain of learning.

Research into the assessment of technical skills in anaesthesia has been conducted with heterogeneous methodologies, which makes comparison difficult. Future studies should include core elements that ensure methodological rigour and facilitate robust comparisons between trials. Such studies should be able to demonstrate an assessment's: (i) validity, that it measures what it purports to measure; (ii) reliability; (iii) feasibility, including cost-effectiveness; and (iv) comprehensiveness, allowing for various levels of difficulty. The demonstration of predictive validity for the evaluation of procedural skills should be considered an achievable priority.

This review has presented a diverse array of methods, each with its own advantages and disadvantages. A key question is ‘how should a training programme assess the procedural skills of trainees in practice?’ First, several methods can be excluded. Logbooks and procedure lists are best suited to providing information regarding likely opportunities within training programmes, and there is little evidence to promote the use of psychometric or aptitude testing in anaesthesia. Cusum analysis has the potential to provide a robust statistical measure of procedural competence but relies on either self-reported performance or repeated direct observations and can require large numbers of performances to demonstrate competency. New technology such as motion analysis may have a role in focusing on manual dexterity during technical tasks but requires further validation before being used to assess procedural skills in anaesthesia.

Currently, the best evidence for a valid, reliable, feasible, and comprehensive tool to assess procedural skill in anaesthesia lies in the use of checklists and GRSs. There is good evidence for the use of a combination of a checklist and a GRS in medical education research, and this combination of tools could be considered the ‘gold standard’ in that setting. When choosing an assessment tool for procedural skills, the key question is the purpose of the evaluation. For formative assessment, checklists and GRSs provide in-depth information to promote learning. Most research has focused on the measurement of performance; however, summative evaluation requires not only the measurement of performance but also the setting of standards and the designation of a trainee as competent or not competent. Subjective judgements of competence as a binary outcome are not psychometrically robust enough for high-stakes purposes; more reliable options include a GRS that includes competence as a behavioural descriptor, or calibrating the scores of an assessment tool against an expert group.

There is a need to investigate available assessment tools in real-time in-training evaluations using large numbers of both trainees and raters. Directorate or faculty development is key: trained staff are more reliable in summative assessment and more likely to give feedback in formative assessment.65 As Kelly and colleagues44 have stated: ‘a macroview is needed on the purposes and systems for assessment, before becoming involved in the details of developing particular instruments, in particular, although a mediocre assessment within a good system can be manipulated to be “good enough”, a good instrument within a poor system is likely to perform poorly’.

Patient simulation is a growth area in anaesthesia training programmes; however, simulation for evaluation remains controversial. Research is required to develop and test simulators that are realistic enough, and have a suitable range of difficulty, to be used for high-stakes evaluation. Simulation and multi-station assessments similar to OSATS may become a key part of the future of procedural skills assessment in anaesthesia if suitable part-task simulators can be validated. However, at the present time, there is not enough evidence to recommend that anaesthetic trainees be evaluated in procedural skills using simulators, especially when these skills can be reliably assessed by direct observation of performance on patients.

Appendix

Formulae for calculating the variables for Cusum analysis:46

P = ln(p1/p0)

Q = ln[(1 − p0)/(1 − p1)]

a = ln[(1 − β)/α]

b = ln[(1 − α)/β]

where h0 is the lower boundary limit = −b/(P + Q); h1, the upper boundary limit = a/(P + Q); s = Q/(P + Q); α, the risk of a type I error; β, the risk of a type II error; p0, the acceptable failure rate; and p1, the unacceptable failure rate.
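The variables above can be turned into a working Cusum trajectory with a few lines of Python. The failure rates and error risks chosen below are illustrative assumptions only; in practice, they would be set by institutional rates or expert consensus, as discussed in the main text.

```python
import math

p0, p1 = 0.05, 0.10        # acceptable and unacceptable failure rates (assumed)
alpha, beta = 0.10, 0.10   # risks of type I and type II error (assumed)

P = math.log(p1 / p0)
Q = math.log((1 - p0) / (1 - p1))
a = math.log((1 - beta) / alpha)
b = math.log((1 - alpha) / beta)

s = Q / (P + Q)            # amount subtracted after each success
h0 = -b / (P + Q)          # lower boundary: competence demonstrated
h1 = a / (P + Q)           # upper boundary: unacceptable failure rate

def cusum_path(outcomes):
    """outcomes: iterable of booleans (True = success). Returns the running
    Cusum score after each attempt: failures add 1 - s, successes subtract s."""
    score, path = 0.0, []
    for success in outcomes:
        score += -s if success else (1 - s)
        path.append(score)
    return path

# Competence, at the chosen error risks, can be inferred once the running
# score falls below h0; crossing h1 signals an unacceptable failure rate.
```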

References

1 Aggarwal R, Grantcharov T, Moorthy K, Milland T, Darzi A. Toward feasible, valid, and reliable video-based assessments of technical surgical skills in the operating room. Ann Surg 2008; 247: 372
2 Ahlberg G, Enochsson L, Gallagher AG, et al. Proficiency-based virtual reality training significantly reduces the error rate for residents during their first 10 laparoscopic cholecystectomies. Am J Surg 2007; 193: 797–804
3 Altman DG. Practical Statistics for Medical Research. London: Chapman Hall/CRC Press, 1991
4 Ault G, Reznick R, MacRae H, et al. Exporting a technical skills evaluation technology to other sites. Am J Surg 2001; 182: 254–6
5 Bann SD, Khan MS, Darzi AW. Measurement of surgical dexterity using motion analysis of simple bench tasks. World J Surg 2003; 27: 390–4
6 Bisson DL, Hyde JP, Mears JE. Assessing practical skills in obstetrics and gynaecology: educational issues and practical implications. Obstet Gynaecol 2006; 8: 107
7 Bolsin S, Colson M. The use of the Cusum technique in the assessment of trainee competence in new procedures. Int J Qual Health Care 2000; 12: 433–8
8 Bould MD, Crabtree NA. Are logbooks of training in anaesthesia a valuable exercise? Br J Hosp Med 2008; 69: 236
9 Bowles TA, Watters DA. Time to Cusum: simplified reporting of outcomes in colorectal surgery. ANZ J Surg 2007; 77: 587
10 Carr S. The Foundation Programme assessment tools: an opportunity to enhance feedback to trainees? Postgrad Med J 2006; 82: 576–9
11 Chambers WA. Difficult airways—difficult decisions: guidelines for publication? Anaesthesia 2004; 59: 631–3
12 Chandra DB, Savoldelli GL, Joo HS, Weiss ID, Naik VN. Fiberoptic oral intubation: the effect of model fidelity on training for transfer to patient care. Anesthesiology 2008; 109: 1007–13
13 Cheney FW, Posner KL, Lee LA, Caplan RA, Domino KB. Trends in anesthesia-related death and brain damage: a closed claims analysis. Anesthesiology 2006; 105: 1081–6
14 Cooper GM, McClure JH. Anaesthesia chapter from Saving mothers' lives; reviewing maternal deaths to make pregnancy safer. Br J Anaesth 2008; 100: 17–22
15 Crawford SW, Colt HG. Virtual reality and written assessments are of potential value to determine knowledge and skill in flexible bronchoscopy. Respiration 2004; 71: 269–75
16 Dashfield AK, Coghill JC, Langton JA. Correlating obstetric epidural anaesthesia performance and psychomotor aptitude. Anaesthesia 2000; 55: 744–9
17 Dashfield AK, Smith JE. Correlating fibreoptic nasotracheal endoscopy performance and psychomotor aptitude. Br J Anaesth 1998; 81: 687–91
18 Datta V, Chang A, Mackay S, Darzi A. The relationship between motion analysis and surgical technical assessments. Am J Surg 2002; 184: 70–3
19 Datta V, Mackay S, Mandalia M, Darzi A. The use of electromagnetic motion tracking analysis to objectively measure open surgical skill in the laboratory-based model. J Am Coll Surg 2001; 193: 479–85
20 de Oliveira Filho GR. The construction of learning curves for basic skills in anesthetic procedures: an application for the cumulative sum method. Anesth Analg 2002; 95: 411–6
21 Ericsson KA, Charness N. Expert performance. Am Psychol 1994; 49: 725–47
22 Faulkner H, Regehr G, Martin J, Reznick R. Validation of an objective structured assessment of technical skill for surgical residents. Acad Med 1996; 71: 1363–5
23 Fee JPH, Murray JM, McBride A, Edgar T. A realistic manikin for airway training. Anaesthesia 2003; 58: 509–10
24 Feldman LS, Hagarty SE, Ghitulescu G, Stanbridge D, Fried GM. Relationship between objective assessment of technical skills and subjective in-training evaluations in surgical residents. J Am Coll Surg 2004; 198: 105–10
25 Feldman LS, Sherman V, Fried GM. Using simulators to assess laparoscopic competence: ready for widespread use? Surgery 2004; 135: 28–42
26 Fletcher G, Flin R, McGeorge P, et al. Anaesthetists' Non-Technical Skills (ANTS): evaluation of a behavioural marker system. Br J Anaesth 2003; 90: 580–8
27 Fried GM, Feldman LS, Vassiliou MC, et al. Proving the value of simulation in laparoscopic surgery. Ann Surg 2004; 240: 518–25
28 Friedman Z, Katznelson R, Devito I, Siddiqui M, Chan V. Objective assessment of manual skills and proficiency in performing epidural anesthesia—video-assisted validation. Reg Anesth Pain Med 2006; 31: 304–10
29 Friedman Z, Siddiqui N, Katznelson R, Devito I, Davies S. Experience is not enough: repeated breaches in epidural anesthesia aseptic technique by novice operators despite improved skill. Anesthesiology 2008; 108: 914–20
30 Gaba DM, Howard SK, Flanagan B, Smith BE, Fish KJ, Botney R. Assessment of clinical performance during simulated crises using both technical and behavioral ratings. Anesthesiology 1998; 89: 8–18
31 Gallagher AG, Cowie R, Crothers I, Jordan-Black JA, Satava RM. PicSOr: an objective test of perceptual skill that predicts laparoscopic technical skill in three initial studies of laparoscopic performance. Surg Endosc 2003; 17: 1468–71
32 Gardner J. Assessment and Learning. London: Sage Publications, 2006
33 Goff BA, Lentz GM, Lee D, Houmard B, Mandel LS. Development of an objective structured assessment of technical skills for obstetric and gynecology residents. Obstet Gynecol 2000; 96: 146–50
34 Goff BA, Nielsen PE, Lentz GM, et al. Surgical skills assessment: a blinded examination of obstetrics and gynecology residents. Am J Obstet Gynecol 2002; 186: 613–7
35 Grantcharov TP, Kristiansen VB, Bendix J, Bardram L, Rosenberg J, Funch-Jensen P. Randomized clinical trial of virtual reality simulation for laparoscopic skills training. Br J Surg 2004; 91: 146–50
36 Hall RE, Plant JR, Bands CJ, Wall AR, Kang J, Hall CA. Human patient simulation is effective for teaching paramedic students endotracheal intubation. Acad Emerg Med 2005; 12: 850–5
37 Hamdorf JM, Hall JC. Acquiring surgical skills. Br J Surg 2000; 87: 28–37
38 Heard A, Eakins P. Emergency surgical airway access using a sheep model. Anaesthesia 2005; 60: 833–4
39 Henderson J, Popat M, Latto P, Pearce A. Difficult Airway Society guidelines. Anaesthesia 2004; 59: 1242–3
40 Hodges B, Regehr G, McNaughton N, Tiberius R, Hanson M. OSCE checklists do not capture increasing levels of expertise. Acad Med 1999; 74: 1129–34
41 Holmboe ES, Hawkins RE. Methods for evaluating the clinical competence of residents in internal medicine: a review. Ann Intern Med 1998; 129: 42–8
42 Hopkins PM. Ultrasound guidance as a gold standard in regional anaesthesia. Br J Anaesth 2007; 98: 299–301
43 Hunt RJ. Percent agreement, Pearson's correlation, and kappa as measures of inter-examiner reliability. J Dent Res 1986; 65: 128–30
44 Kelly A, Canter R. A new curriculum for surgical training within the United Kingdom: context and model. J Surg Educ 2007; 64: 10–9
45 Kelly SP, Shapiro N, Woodruff M, Corrigan K, Sanchez LD, Wolfe RE. The effects of clinical workload on teaching in the emergency department. Acad Emerg Med 2007; 14: 526–31
46 Kestin IG. A statistical approach to measuring the competence of anaesthetic trainees at practical procedures. Br J Anaesth 1995; 75: 805–9
47 Keynan A, Friedman M, Benbassat J. Reliability of global rating scales in the assessment of clinical competence of medical students. Med Educ 1987; 21: 477–81
48 Kmietowicz Z. Make patient safety part of everyday routines, says watchdog. Br Med J 2008; 336: 294–5
49 Kneebone R, Kidd J, Nestel D, Asvall S, Paraskeva P, Darzi A. An innovative model for teaching and learning clinical procedures. Med Educ 2002; 36: 628–34
50 Kneebone R, Nestel D, Yadollahi F, et al. Assessing procedural skills in context: exploring the feasibility of an Integrated Procedural Performance Instrument (IPPI). Med Educ 2006; 40: 1105–14
51 Konrad C. Learning manual skills in anesthesiology: is there a recommended number of cases for anesthetic procedures? Anesth Analg 1998; 86: 635–9
52 Kopta JA. The development of motor skills in orthopaedic education. Clin Orthop Relat Res 1971; 75: 80–5
53 Lammers RL, Davenport M, Korley F, et al. Teaching and assessing procedural skills using simulation: metrics and methodology. Acad Emerg Med 2008; 15: 1079–87
54 Landis JR, Koch GG. An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics 1977; 33: 363–74
55 Martin JA, Regehr G, Reznick R, et al. Objective structured assessment of technical skill (OSATS) for surgical residents. Br J Surg 1997; 84: 273–8
56 Matsumoto ED, Hamstra SJ, Radomski SB, Cusimano MD. The effect of bench model fidelity on endourological skills: a randomized controlled study. J Urol 2002; 167: 1243–7
57 McKinley RK, Strand J, Gray T, Schuwirth L, Alun-Jones T, Miller H. Development of a tool to support holistic generic assessment of clinical procedure skills. Med Educ 2008; 42: 619–27
58 McKinley RK, Strand J, Ward L, Gray T, Alun-Jones T, Miller H. Checklists for assessment and certification of clinical procedural skills omit essential competencies: a systematic review. Med Educ 2008; 42: 338–49
59 Moorthy K, Munz Y, Dosis A, Bello F, Chang A, Darzi A. Bimodal assessment of laparoscopic suturing skills: construct and concurrent validity. Surg Endosc 2004; 18: 1608–12
60 Morgan PJ, Cleave-Hogg D, Guest CB. A comparison of global ratings and checklist scores from an undergraduate assessment using an anesthesia simulator. Acad Med 2001; 76: 1053–5
61 Muller R, Buttner P. A critical discussion of intraclass correlation coefficients. Stat Med 1994; 13: 2465–76
62 Naik VN, Devito I, Halpern SH. Cusum analysis is a useful tool to assess resident proficiency at insertion of labour epidurals. Can J Anaesth 2003; 50: 694–8
63 Naik VN, Matsumoto ED, Houston PL, et al. Fiberoptic orotracheal intubation on anesthetized patients: do manipulation skills learned on a simple model transfer into the operating room? Anesthesiology 2001; 95: 343–8
64 Naik VN, Perlas A, Chandra DB, Chung DY, Chan VW. An assessment tool for brachial plexus regional anesthesia performance: establishing construct validity and reliability. Reg Anesth Pain Med 2007; 32: 41–5
65 Norcini J. Workplace-based assessment as an educational tool: AMEE Guide No. 31. Med Teach 2007; 29: 855–71
66 Norcini JJ, McKinley DW. Assessment methods in medical education. Teach Teach Educ 2007; 23: 239–50
67 Regehr G, MacRae H, Reznick RK, Szalay D. Comparing the psychometric properties of checklists and global rating scales for assessing performance on an OSCE-format examination. Acad Med 1998; 73: 993–7
68 Reznick R, Regehr G, MacRae H, Martin J, McCulloch W. Testing technical skill via an innovative ‘bench station’ examination. Am J Surg 1997; 173: 226–30
69 Reznick RK, MacRae H. Teaching surgical skills—changes in the wind. N Engl J Med 2006; 355: 2664
70 Ringsted C, Ostergaard D, Ravn L, Pedersen JA, Berlac PA, van der Vleuten CPM. A feasibility study comparing checklists and global rating forms to assess resident performance in clinical skills. Med Teach 2003; 25: 654–8
71 Ringsted C, Ostergaard D, Scherpbier A. Embracing the new paradigm of assessment in residency training: an assessment programme for first-year residency training in anaesthesiology. Med Teach 2003; 25: 54–62
72 Royal College of Anaesthetists. CCT in Anaesthesia III: Competency Based Intermediate Level (Years 3 and 4) Training and Assessment. A Manual for Trainees and Trainers. London: Royal College of Anaesthetists, 2007
73 Royal College of Anaesthetists. CCT in Anaesthesia II: Competency Based Basic Level (ST Years 1 and 2) Training and Assessment. A Manual for Trainees and Trainers. London: Royal College of Anaesthetists, 2007
74 Royal College of Anaesthetists. CCT in Anaesthesia IV: Competency Based Higher and Advanced Level (Years 5, 6 and 7) Training and Assessment. A Manual for Trainees and Trainers. London: Royal College of Anaesthetists, 2007
75 Saleh GM, Voyatzis G, Hance J, Ratnasothy J, Darzi A. Evaluating surgical dexterity during corneal suturing. Arch Ophthalmol 2006; 124: 1263–6
76 Salgado J, Grantcharov TP, Papasavas PK, Gagne DJ, Caushaj PF. Technical skills assessment as part of the selection process for a fellowship in minimally invasive surgery. Surg Endosc 2009; 23: 641–4
77 Schmidt RA, Bjork RA. New conceptualizations of practice: common principles in three paradigms suggest new concepts for training. Psychol Sci 1992; 3: 207–17
78 Schuepfer G, Johr M. Generating a learning curve for penile block in neonates, infants and children: an empirical evaluation of technical skills in novice and experienced anaesthetists. Paediatr Anaesth 2004; 14: 574–8
79 Sites BD, Spence BC, Gallagher JD, Wiley CW, Bertrand ML, Blike GT. Characterizing novice behavior associated with learning ultrasound-guided peripheral regional anesthesia. Reg Anesth Pain Med 2007; 32: 107–15
80 Sivarajan M, Miller E, Hardy C, et al. Objective evaluation of clinical performance and correlation with knowledge. Anesth Analg 1984; 63: 603–7
81 Smith SG, Torkington J, Brown TJ, Taffinder NJ, Darzi A. Motion analysis. Surg Endosc 2002; 16: 640–5
82 Sulaiman L, Tighe SQM, Nelson RA. Surgical vs wire-guided cricothyroidotomy: a randomised crossover study of cuffed and uncuffed tracheal tube insertion. Anaesthesia 2006; 61: 565–70
83 Taffinder NJ, McManus IC, Gul Y, Russell RC, Darzi A. Effect of sleep deprivation on surgeons' dexterity on laparoscopy simulator. Lancet 1998; 352: 1191
84 Tetzlaff JE. Assessment of competency in anesthesiology. Anesthesiology 2007; 106: 812–25
85 Thompson WG, Lipkin M Jr, Gilbert DA, Guzzo RA, Roberson L. Evaluating evaluation: assessment of the American Board of Internal Medicine Resident Evaluation Form. J Gen Intern Med 1990; 5: 214–7
86 Tooke J. Aspiring to Excellence. Findings and Recommendations of the Independent Inquiry into Modernising Medical Careers. Chiswick: Aldridge Press, 2008
87 Turnbull J, Gray J, MacFadyen J. Improving in-training evaluation programs. J Gen Intern Med 1998; 13: 317–23
88 Vadodaria BS, Gandhi SD, McIndoe AK. Comparison of four different emergency airway access equipment sets on a human patient simulator. Anaesthesia 2004; 59: 73–9
89 Varaday SS, Yentis SM, Clarke S. A homemade model for training in cricothyrotomy. Anaesthesia 2004; 59: 1012–5
90 Vassiliou MC, Feldman LS, Andrew CG, et al. A global assessment tool for evaluation of intraoperative laparoscopic skills. Am J Surg 2005; 190: 107–13
91 Vassiliou MC, Ghitulescu GA, Feldman LS, et al. The MISTELS program to measure technical skill in laparoscopic surgery: evidence for reliability. Surg Endosc 2006; 20: 744–7
92 Wanzel KR, Hamstra SJ, Anastakis DJ, Matsumoto ED, Cusimano MD. Effect of visual-spatial ability on learning of spatially-complex surgical skills. Lancet 2002; 359: 230–1
93 Watts J, Feldman WB. Assessment of technical skills. In: Neufeld VR, Norman GR, eds. Assessing Clinical Competence. New York: Springer, 1985; 259–74
94 Weiss ID, Naik VN, Savoldelli G, Chandra DB, Joo HS, LeBlanc V. Sleep deprivation and anesthesiologists' technical skills (Abstract). Canadian Anesthesiologists' Society Annual Meeting, Calgary, 2007
95 Wilkinson TJ, Wade WB. Problems with using a supervisor's report as a form of summative assessment. Postgrad Med J 2007; 83: 504
96 Wragg A, Wade W, Fuller G, Cowan G, Mills P. Assessing the performance of specialist registrars. Clin Med 2003; 3: 131–4
97 Xeroulis GJ, Park J, Moulton CA, Reznick RK, LeBlanc V, Dubrowski A. Teaching suturing and knot-tying skills to medical students: a randomized controlled study comparing computer-based video instruction and (concurrent and summary) expert feedback. Surgery 2007; 141: 442–9
98 Young A, Miller JP, Azarow K. Establishing learning curves for surgical residents using Cumulative Summation (CUSUM) analysis. Curr Surg 2005; 62: 330–4
99 Ziv A, Wolpe PR, Small SD, Glick S. Simulation-based medical education: an ethical imperative. Acad Med 2003; 78: 783–8
