The role of experience level in radiographic evaluation of femoroacetabular impingement and acetabular dysplasia

Accurate radiographic interpretation is essential for properly diagnosing the etiology of pre-arthritic hip pain such as femoroacetabular impingement (FAI) and acetabular dysplasia (AD); however, radiographic interpretation can be significantly influenced by the observer’s experience level. This study assesses the accuracy and inter- and intraobserver reliability in the radiographic evaluation of FAI and AD based on experience level. Fifty-five patients diagnosed with FAI, AD or normal hip morphology were identified from the principal investigator’s institutional database. Four observers performed an independent and blinded radiographic review, assessing 14 radiographic parameters and an interpretation of a final diagnosis. A second radiographic evaluation of 20 preselected cases was completed 6 weeks after the initial reading to assess intraobserver reliability. Inter- and intraobserver reliability was determined using Cohen’s Kappa Coefficient (κ) and intraclass correlation coefficient (ICC) for continuous parameters in a four-rater design. Interobserver reliability was highest across experience levels for lateral centre edge angle (ICC = 0.92) and alpha angle (ICC = 0.90) and lowest (κ < 0.3, ICC < 0.3) for joint congruency and detection of herniation pits. Intraobserver reliability was highest for acetabular depth (κ = 0.89) and alpha angle (ICC = 0.80) and lowest for head–neck offset ratio and Tönnis grade. Final diagnosis was consistent with the original blinded clinical diagnosis 75–84% of the time across four experience levels. The attending orthopaedic hip surgeon demonstrated greatest diagnostic sensitivity but lowest specificity for making an accurate radiographic diagnosis. Subjective parameters must be redefined, and objective parameters must be further developed to improve the reliability of accurately diagnosing FAI or AD.


INTRODUCTION
Pre-arthritic hip conditions such as femoroacetabular impingement (FAI) and acetabular dysplasia (AD) have been recognized to predispose patients to early hip osteoarthritis (OA) [1–6]. A delay in diagnosis could preclude these patients from pursuing timely joint-preserving surgery and instead result in worsening joint degeneration requiring replacement [7]. Unfortunately, FAI and AD often have overlapping symptoms, making an accurate diagnosis challenging. Therefore, radiographs remain the cornerstone in the evaluation, diagnosis and management of patients with these pre-arthritic hip conditions [8,9]. The ability to accurately and consistently interpret the radiographs in this particular patient population is of great importance.
There are a number of radiographic parameters used to measure and describe AD as well as FAI; however, it is unclear which of these parameters are the most useful for an accurate diagnosis [8]. Several studies to date have examined the interobserver and intraobserver reliability of individual radiographic measurements such as alpha angle, head sphericity, head-neck offset and acetabular index [10–15]. However, the majority of these publications have focused exclusively on measuring reliability among orthopaedic hip specialists [16]. Since patients are often seen by physicians of varying levels of experience, dependable radiographic parameters to assess FAI and AD should yield the same assessment after multiple readings by different observers regardless of their experience level. Additionally, to our knowledge, no study has examined whether increasing surgical experience leads to a greater tendency to 'over-read' radiographs, thereby resulting in a greater number of pathologic diagnoses and potential recommendations for further work-up or surgical intervention [17].
The objective of this study was to determine the interobserver and intraobserver reliability for the radiographic evaluation and diagnosis of FAI and AD between physicians with varying levels of experience. Additionally, we were interested in analysing the final radiographic diagnosis to investigate whether there was a tendency for more experienced surgeons to read radiographs as abnormal. We hypothesized that (i) interobserver reliability would be poor across experience levels, (ii) intraobserver reliability would be better among more experienced readers and (iii) more experienced surgeon readers would categorize a greater number of radiographs as abnormal.

MATERIALS AND METHODS
After Institutional Review Board approval, a statistician from the authors' institution determined the number of cases needed in each group to assess interobserver and intraobserver reliability with statistical significance for each radiographic parameter and the final diagnosis.
Fifty-five patients were then identified from the principal investigator's institutional patient database. ICD-9 codes were initially used to identify patients with FAI, AD and normal hip morphology. Anteroposterior (AP) and elongated neck lateral radiographs for 20 patients with FAI (cam or pincer morphology), 20 patients with AD and a control group of 15 patients without abnormal hip morphology were selected. All FAI and AD subjects were older than 18 years, with a diagnosis confirmed by clinical, radiographic and surgical findings. Controls were required to have been evaluated clinically and radiographically by their surgeon and found to have no evidence of hip disease or previous hip surgery. Finally, all radiographs were standardized by evaluating for excessive rotation and tilt using the gender-neutral symphysis-sacrococcygeal joint distance as described by Siebenrock et al. [18]. Patients with excessively tilted or rotated AP radiographs were excluded.
De-identified AP and elongated neck lateral radiographs were taken from our institution's picture archiving and communication system (PACS) for each study patient and individually stored on a compact disc. All images retained the capacity to be manipulated as needed by each reader with use of the bundled PACS measurement tools. Case order was randomly assigned for each of the three diagnoses.
Four observers from the same institution (an attending orthopaedic hip surgeon, an attending musculoskeletal radiologist, an orthopaedic sports fellow and a third-year orthopaedic surgery resident) performed a blinded radiographic review of all 55 patients. Observers assessed 14 radiographic parameters and made a final diagnosis of FAI, AD or normal.
The six objective radiographic parameters for this study were acetabular inclination, alpha angle (lateral), head-neck offset, head-neck offset ratio, head-neck offset ratio (lateral) and lateral centre edge angle. The eight subjective radiographic features were acetabular depth, acetabular version, detection of herniation pits, head sphericity, head sphericity (lateral), joint congruency, position of head centre and Tönnis grade. A second radiographic evaluation of 20 randomly selected cases was completed 6 weeks after the initial reading to assess intraobserver reliability.

STATISTICS
Intraobserver and interobserver reliability was determined using Cohen's Kappa coefficient (κ) and intraclass correlation coefficient (ICC) as well as the corresponding 95% confidence intervals for continuous parameters in a four-rater design. For intraobserver reliability, agreement was calculated for each observer separately. The pooled estimates of the intraobserver Kappa statistics/ICC, the corresponding asymptotic standard errors and the 95% confidence intervals were also determined. All statistical analysis was performed using SAS 9.2 for Windows (SAS, Cary, NC).
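As a minimal sketch of the kappa statistic named above, the following computes Cohen's kappa for one reader's first and second readings of categorical diagnoses. The function name `cohens_kappa` and the ten example labels are hypothetical illustrations, not study data, and the study itself used SAS rather than this code.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(ratings_a)
    # Observed agreement: fraction of cases given the same label both times.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from the marginal label frequencies of each reading.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both readings used one identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical first and second readings of 10 cases (NOT study data).
first_read = ["FAI", "FAI", "AD", "AD", "normal", "FAI", "AD", "normal", "FAI", "AD"]
second_read = ["FAI", "AD", "AD", "AD", "normal", "FAI", "AD", "FAI", "FAI", "AD"]

kappa = cohens_kappa(first_read, second_read)
print(round(kappa, 3))
```

Here 8 of 10 readings agree (observed agreement 0.80), while the label frequencies alone would produce agreement of 0.38 by chance, so kappa is well below the raw agreement rate.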

RESULTS
Interobserver reliability was highest across all experience levels for lateral centre edge angle and alpha angle (ICC = 0.92 and ICC = 0.90, respectively) and lowest for acetabular depth, joint congruency, Tönnis grade and detection of herniation pits. The agreement between all readers for final diagnosis was moderately reliable (ICC = 0.68; range 0.60-0.75) (Table I) [19]. Of all 14 radiographic parameters, seven demonstrated poor interobserver reliability with kappa or ICC values less than or equal to 0.40 (Table I) [18]. Eighty-six percent (6 of 7) of these unreliable parameters were considered to be subjective.
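The text does not state which form of the ICC was used. As a sketch only, assuming the two-way random-effects, absolute-agreement, single-rater form (often written ICC(2,1)), agreement among four readers on a continuous measurement could be computed as follows. The function name `icc_2_1` and the angle values are hypothetical and are not study data.

```python
def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `scores` is a list of rows, one per subject; each row holds one
    measurement per rater, with raters in the same order in every row.
    """
    n = len(scores)      # number of subjects
    k = len(scores[0])   # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need a rectangular table of >= 2 subjects x 2 raters")
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    # Two-way ANOVA sums of squares: subjects, raters, residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical lateral centre edge angles in degrees (NOT study data):
# 5 hips (rows) measured by 4 readers (columns).
angles = [
    [18, 20, 19, 21],
    [32, 30, 33, 31],
    [25, 27, 26, 24],
    [40, 38, 41, 39],
    [15, 14, 17, 16],
]
print(round(icc_2_1(angles), 2))
```

Because the hips in this made-up table differ from one another far more than the readers differ on any single hip, the resulting ICC is high. A different ICC form (for example, consistency rather than absolute agreement) would give a different value from the same table.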
Intraobserver reliability was highest for acetabular depth and alpha angle (κ = 0.89 and ICC = 0.80, respectively). The head-neck offset ratio (ICC = 0.38) had the poorest reliability amongst the individual readers (Table II). For the final radiographic diagnosis, intraobserver reliability ranged from moderate to excellent across the various readers (κ = 0.54-1.0). The pooled intraobserver reliability for all readers was considered to be excellent (ICC = 0.79) (Table II) [18].
Finally, we found that surgical experience had an effect on the final radiographic diagnosis. The attending orthopaedic hip surgeon demonstrated the highest sensitivity (90%) but lowest specificity (53.3%) in making an accurate diagnosis. Of his incorrect diagnoses, 63.6% (7 of 11) were because of 'over-calling' the case by identifying a pathologic condition in a normal radiograph. The orthopaedic sports fellow had a slightly lower diagnostic sensitivity of 82.5% and the junior orthopaedic resident demonstrated the lowest sensitivity (75%) but highest specificity (86.6%). Of the resident's incorrect diagnoses, only 16.6% (2 of 12) were due to falsely identifying a pathologic condition. The difference in the rate of false positives between the attending orthopaedic surgeon and resident readers was statistically significant (P = 0.036).
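The text does not name the test behind the reported P value. As an illustrative sketch, assuming a two-sided Fisher's exact test on the 2x2 table of error types (false positives versus other errors, 7 of 11 for the attending and 2 of 12 for the resident), the calculation would look like this. The function name `fisher_exact_two_sided` is hypothetical, and this is not confirmed to be the authors' method.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum every possible table that is no more likely than the observed one
    # (small tolerance guards against floating-point ties).
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


# Error counts reported in the text:
# attending: 7 false positives, 4 other errors (11 incorrect diagnoses)
# resident:  2 false positives, 10 other errors (12 incorrect diagnoses)
p_value = fisher_exact_two_sided(7, 4, 2, 10)
print(round(p_value, 3))  # 0.036
```

Under this assumption the result rounds to the reported P = 0.036, which is consistent with, but does not prove, the use of Fisher's exact test on these counts.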

DISCUSSION AND CONCLUSION
Radiographs remain the cornerstone in the diagnosis and management of pre-arthritic hip conditions such as FAI and AD. Accurate radiographic evaluation is of paramount importance as these conditions can often have overlapping clinical symptoms. However, radiographic interpretation of subtle abnormal hip morphology can be difficult, particularly to the untrained eye. In this study, we examined how experience level affects intraobserver and interobserver reliability. Based on our statistical analysis of a four-rater system composed of physicians with a broad spectrum of clinical experience, we found a relatively low level of interobserver and intraobserver reliability between readers, especially for subjective parameters. Thus, many of the standard radiographic measurements used on AP pelvis and lateral hip views to diagnose FAI or AD were not reproducible.
We found interobserver reliability to be highest among objective parameters such as the lateral centre edge angle and alpha angle. Poor interobserver reliability values were observed with predominantly subjective measures such as joint congruency, Tönnis grade and detection of herniation pits. In fact, of the radiographic markers that were found to have poor interobserver reliability, six of the seven were subjective parameters. Our findings are similar to those of two other studies that have also looked at the reliability of pre-arthritic radiographic hip parameters [16,20]. Carlisle et al. [16] assessed reliability using observers with a narrower range of surgical hip experience: two orthopaedic residents, one orthopaedic adult reconstruction fellow, one sports orthopaedic attending without interest in the hip and two attending musculoskeletal physiatrists. Similar to our findings, they observed that the centre edge angle was the most reliable interobserver measurement (ICC = 0.64) and that, in general, the more objective radiographic parameters were more reliable [16]. Another study by Clohisy et al. examined the reliability of the same 14 radiographic parameters as our study. In contrast, they employed a rater system composed of only hip specialists: one orthopaedic hip fellow and five orthopaedic attendings, all with extensive hip experience. They found the highest interobserver reliability with acetabular inclination, position of head centre and Tönnis grade [20]. Interestingly, all three of these parameters had kappa values lower than 0.5 in our study.
Intraobserver reliability was highest in our study for acetabular depth, alpha angle, acetabular inclination and lateral centre edge angle. Additionally, we observed that intraobserver reliability was generally higher than interobserver values. These findings support similar observations made in prior publications. Carlisle et al. [16] found excellent intraobserver reliability for all radiographic parameters including lateral centre edge angle, Tönnis angle, head-neck offset with a frog-leg and cross-table lateral as well as the alpha angle with a frog-leg and cross-table lateral. Only the Tönnis grade had poor reliability [16]. In contrast, Clohisy et al. found that only acetabular inclination, position of femoral head and acetabular depth had intraobserver reliability kappa values >0.60 [20]. We had similar findings of high reliability with two of these three measurements (acetabular inclination and depth). Taking our results together with those of the two aforementioned studies, we could not find a single radiographic parameter that demonstrated consistently excellent interobserver and intraobserver reliability across all studies.
One unique aspect of our investigation was the finding that greater surgical hip experience increased the likelihood of making a pathologic radiographic diagnosis. Our senior hip surgeon demonstrated the highest diagnostic sensitivity among all readers. In other words, he correctly diagnosed a radiograph as abnormal more often (90.0%) than the orthopaedic sports fellow (82.5%) or resident (75.0%). However, he also demonstrated the lowest diagnostic specificity, meaning he was the most likely to diagnose a normal radiograph as abnormal. Being a high-sensitivity but low-specificity evaluator can be clinically beneficial, as it serves as an effective screening tool in identifying the greatest number of individuals with a pathologic condition so that further diagnostic workup (e.g. CT and MRI) or treatment (physical therapy, diagnostic cortisone injection, etc.) can be initiated. This also has important implications for the increasing number of hip arthroscopies being performed every year [17]. A tendency to overdiagnose radiographs may contribute to the increasing incidence of hip arthroscopy procedures. Clear surgical indications should depend on radiographs in conjunction with advanced imaging (CT and MRI), physical exam findings and hip injections. Advanced imaging is especially important in the assessment of FAI and AD as modalities such as three-dimensional (3D) CT have been shown to be as accurate as radiographs for diagnosing and characterizing hip pathologies such as FAI and AD [21,22,23]. Advances in imaging will continue to improve our diagnostic ability in evaluating the pre-arthritic hip; however, the cornerstone of FAI and AD diagnosis and evaluation of treatment is properly performed radiographs.
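The sensitivity and specificity figures discussed at the start of this paragraph can be recomputed from simple counts. The counts below are not reported directly in the text; they are inferred, as an assumption, from the cohort sizes (40 abnormal hips: 20 FAI plus 20 AD; 15 normal controls) and the reported percentages and error tallies.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of abnormal hips that the reader called abnormal."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of normal hips that the reader called normal."""
    return true_neg / (true_neg + false_pos)


# Counts INFERRED from the reported percentages (assumption, not source data):
# 40 abnormal hips and 15 normal hips per reader.
readers = {
    "attending": {"tp": 36, "fn": 4, "tn": 8, "fp": 7},
    "resident": {"tp": 30, "fn": 10, "tn": 13, "fp": 2},
}

for name, counts in readers.items():
    sens = 100 * sensitivity(counts["tp"], counts["fn"])
    spec = 100 * specificity(counts["tn"], counts["fp"])
    print(f"{name}: sensitivity {sens:.1f}%, specificity {spec:.1f}%")
```

These inferred counts reproduce the reported 90% and 53.3% for the attending and 75% for the resident. The resident's specificity of 13/15 rounds to 86.7%, which the text gives as 86.6%. The false-positive counts (7 and 2) match the 'over-called' cases reported in the Results.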
Our study was not without limitations. Since only radiographs were evaluated, observers could not use clinical information such as patient history or physical examination findings to help corroborate any abnormalities seen on radiographs. This may have limited the readers' ability to determine the final diagnosis, and it emphasizes the importance of a complete clinical work-up and the need to use other imaging modalities to confirm radiographic findings before proceeding with treatment. Additionally, although an attempt was made to standardize the AP pelvis radiographs by recognizing cases with extreme tilt or rotation, the images were not corrected using computer-assisted methods that have been described in the literature [14,24]. Because of this, there is a possibility that subtle rotation and tilt of the pelvis on AP radiographs inappropriately influenced the physician's interpretation of radiographic features such as acetabular retroversion.
In conclusion, we found that objective radiographic measurements such as lateral centre edge angle and alpha angle had stronger interobserver and intraobserver reliability than more subjective measurements such as Tönnis grade. We believe that radiographic evaluation should ideally allow different observers to reach the same conclusion on repeated readings, regardless of skill set or specialty. Therefore, we believe that subjective radiographic parameters need to be redefined and objective parameters need to be further developed to improve the reliability of accurately diagnosing FAI or AD. Additionally, we found that surgical experience not only increased the sensitivity of diagnosing a pathologic condition on radiographs but also led to a higher rate of 'over-reading' the radiograph and making a falsely positive diagnosis.

SOURCE OF FUNDING
No external funding source.