The reproducibility of cervical histology diagnoses is critical for efficient screening and to evaluate the effectiveness of new technologies. The vast majority of cervical intraepithelial neoplasia (CIN) diagnoses reported in the New Technologies for Cervical Cancer study were blindly reviewed by 2 independent pathologists. Only H&E-stained slides were used for the review. The reviewers were asked to reclassify cases using the following categories: normal, CIN 1, CIN 2, CIN 3, and squamous and glandular invasive cancer. We reviewed 1,003 cases. The interobserver agreement was 0.36 (95% confidence interval [CI], 0.32–0.40) with an unweighted κ and 0.54 with a weighted κ (95% CI, 0.50–0.58). The κ values from dichotomous classifications with the threshold at CIN 2 were 0.69 (95% CI, 0.64–0.73) and 0.57 (95% CI, 0.51–0.63) with the threshold at CIN 3. The CIN 2 diagnosis had the lowest class-specific agreement, with fewer than 50% of cases confirmed by the panel members, which supports the view that CIN 2 is not a well-defined stage in the pathogenesis of cervical neoplasia.
The reproducibility of cervical cytology and histology diagnoses is critical for implementing an effective screening program. The correct classification of precancerous cervical lesions is a matter of debate,1–4 and histology reports are generally used as the “gold standard” to measure the impact of new screening and vaccine technologies.5
Difficulties in the interpretation of these reports may arise for 2 reasons. First, the morphologic features of precancerous cervical lesions can be seen as a continuum.4 Evidence indicates that CIN 1 and koilocytes are related to the lytic cycle of human papillomavirus (HPV) infection and that CIN 3 is the true cervical cancer precursor, more commonly associated with virus integration in the host DNA,3,6 while the meaning of CIN 2 diagnosis is less clear.2,7 Second, some nonneoplastic abnormalities such as atrophy, immature squamous metaplasia, and atypia related to intense inflammation can mimic CIN,8,9 especially when histologic slides are not optimal and not clearly oriented.
In this study, we evaluated the reproducibility of CIN diagnoses in a group of well-trained observers, using a consensus method in the context of a multicenter randomized trial: the New Technologies in Cervical Cancer (NTCC) study.10–12
Material and Methods
Some 95,000 women aged 25 to 60 years were enrolled in the NTCC trial. Women were randomly assigned to a conventional Papanicolaou test or to HPV testing with or without liquid-based cytology. Women in the conventional arm were managed according to standard screening protocols. The first and second phases in the experimental arm were managed differently. In phase 1, all women 35 years or older who were HPV+ or had a cytology result of atypical squamous cells of undetermined significance (ASCUS) or worse (ASCUS+) were referred for colposcopy. HPV+ women younger than 35 years were referred for colposcopy if the cytology result was also ASCUS+; if the cytology result was normal, they were recalled after 1 year to repeat both tests and were then referred for colposcopy if either the HPV or the cytology result was positive. In phase 2, women randomized to the experimental arm received only the HPV test, and all HPV+ women were referred for colposcopy, regardless of age. Colposcopy-guided biopsies were performed on 60% of the women referred for colposcopy, with no difference between arms.10–12
All participating pathology units were the reference centers for the local cervical cancer screening program. This implies that they had extensive experience in cervical pathology, and about one third to one half of their activity was dedicated to cervical pathology. Only 1 pathologist (in Turin) worked exclusively with cervical diseases.
Case Selection and Slide Review
Because the end point of the NTCC study is CIN 2 or worse (2+), to obtain a blind review of the end point, all CIN cases were reviewed. All histologic specimens diagnosed as CIN 1, 2, or 3 in phase 2 of the NTCC study were blindly reviewed according to the protocol already published.13 Furthermore, all diagnoses of CIN 2 and 3 and 42% of CIN 1 diagnoses obtained in phase 1 were reviewed according to the same protocol (the remaining 58% of the CIN 1 diagnoses were reviewed according to a different protocol and were not suitable for this analysis). Given the scope of the review, negative and invasive cancer diagnoses were not reviewed because the probability of a significant change in the study end point, ie, from negative to CIN 2+ or from invasive cancer to CIN 1 or less severe, was a priori considered very low.
Briefly, 4 workshops were conducted, and 1 pathologist from each center participated. Each case was randomly assigned for review to 2 pathologists from a pool of 9 (1 per center), excluding the pathologist who made the original diagnosis. Reviewers were unaware of the original diagnosis and of the cytology and HPV test results. For reevaluation, only H&E-stained slides were used. Each reviewer was asked to formulate a single diagnosis for each patient, corresponding to the worst pathologic finding in all of the patient’s biopsy specimens, according to the following 5 categories: negative or inflammation, CIN 1 including HPV condyloma, CIN 2, CIN 3, or invasive carcinoma (squamous and adenocarcinoma). Cases with large discrepancies (normal or CIN 1 vs CIN 2 or 3) between the 2 reviewers or with the original diagnosis and all cases with a diagnosis of invasive cancer were reviewed and discussed further by all pathologists involved, using a multiheaded microscope. When the panel could not come to a unanimous verdict, the final diagnosis was based on the majority.
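The assignment and escalation rules described above can be sketched in a few lines of Python. This is only an illustration of the stated protocol, not the procedure actually used in the workshops: the pathologist labels, the `GRADES` mapping, and the function names `assign_reviewers` and `needs_panel` are hypothetical, and the sketch assumes the original diagnosis was made by one of the 9 pool members.

```python
import random

# Hypothetical labels for the pool of 9 reviewing pathologists (1 per center).
POOL = [f"pathologist_{i}" for i in range(1, 10)]

# Ordinal coding of the 5 review categories (an assumption of this sketch).
GRADES = {"negative": 0, "CIN 1": 1, "CIN 2": 2, "CIN 3": 3, "invasive": 4}


def assign_reviewers(original_pathologist, rng):
    """Draw 2 distinct reviewers at random, excluding the original diagnostician."""
    eligible = [p for p in POOL if p != original_pathologist]
    return tuple(rng.sample(eligible, 2))


def needs_panel(original, review_a, review_b):
    """Return True when a case should go to the full panel.

    A case is escalated if any diagnosis is invasive cancer, or if the three
    diagnoses (original plus the 2 reviews) straddle the CIN 2 threshold,
    ie, at least one is normal or CIN 1 and at least one is CIN 2 or CIN 3.
    """
    diagnoses = [original, review_a, review_b]
    if "invasive" in diagnoses:
        return True
    has_low = any(GRADES[d] <= 1 for d in diagnoses)
    has_high = any(GRADES[d] in (2, 3) for d in diagnoses)
    return has_low and has_high


if __name__ == "__main__":
    rng = random.Random(2024)  # fixed seed only to make the demo repeatable
    print(assign_reviewers("pathologist_3", rng))
    print(needs_panel("CIN 1", "CIN 1", "CIN 2"))  # discrepancy across CIN 2
```

A CIN 2 vs CIN 3 disagreement is not escalated under this reading of the rule, because both diagnoses lie on the same side of the study end point.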
For the purpose of this study, only the diagnoses made by the 2 reviewers were used to calculate interobserver agreement. This is different from the NTCC study end point, which was based on the final diagnosis after panel review.
The κ values were calculated only from the ratings of each individual pathologist for each case reviewed. Cases classified by at least 1 of the 2 reviewers as inadequate, CIN not otherwise specified, or glandular dysplasia were excluded from the κ calculations.
We report the results of the following: (1) overall κ values on a 5-category scale (normal, CIN 1, CIN 2, CIN 3, invasive cancer), weighted and unweighted, for all cases and by age group (younger than 40 years and 40 years or older); by recruiting center, ie, the center that performed the biopsy and prepared the slides; and by the cytology or HPV result that led to referral of the woman for colposcopy; (2) overall κ values for dichotomous classifications with thresholds at CIN 2 and CIN 3 and specific κ values for CIN 1, CIN 2, and CIN 3 separately; (3) κ values comparing each individual center with all the other centers, for dichotomous classifications with thresholds at CIN 2 and CIN 3 and specific κ values for CIN 1, CIN 2, and CIN 3; and (4) one-to-one comparisons between centers, only for dichotomous classifications with thresholds at CIN 2 and CIN 3.
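The dichotomous and class-specific analyses listed above all start from the same 5 × 5 table of paired ratings, collapsed in different ways. A minimal sketch of that collapsing step follows; the function name `collapse` and the 0 to 4 category coding (0 = normal, 1 = CIN 1, 2 = CIN 2, 3 = CIN 3, 4 = invasive cancer) are assumptions made for illustration.

```python
def collapse(table, positive):
    """Collapse a k x k table of paired ratings into a 2 x 2 table.

    table[i][j] is the number of cases rated i by the first reviewer and j by
    the second. Categories in `positive` form the "positive" class (index 1 of
    the result); all other categories form the "negative" class (index 0).
    """
    k = len(table)
    out = [[0, 0], [0, 0]]
    for i in range(k):
        for j in range(k):
            out[int(i in positive)][int(j in positive)] += table[i][j]
    return out


# Category sets for the analyses described in the text (0-4 coding assumed).
CIN2_PLUS = {2, 3, 4}   # threshold at CIN 2
CIN3_PLUS = {3, 4}      # threshold at CIN 3
CIN2_ONLY = {2}         # class-specific comparison: CIN 2 vs everything else

if __name__ == "__main__":
    # Purely illustrative counts, not study data.
    demo = [[1] * 5 for _ in range(5)]
    print(collapse(demo, CIN2_PLUS))
    print(collapse(demo, CIN2_ONLY))
```

The κ for each analysis is then computed on the resulting 2 × 2 table; the class-specific κ for CIN 1 or CIN 3 uses `{1}` or `{3}` in place of `CIN2_ONLY`.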
For weighted κ values, we used squared weights according to the following formula: $1 - ([i - j]/[k - 1])^2$, where i and j index the rows and columns of the ratings by the 2 reviewers and k is the maximum number of possible ratings.14
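As an illustration of how these weights enter the calculation, the following sketch computes Cohen's κ from a square table of counts, either unweighted or with the squared weights defined above. The function name `cohen_kappa` and the example counts are hypothetical; the study's own software is not specified in the text.

```python
def cohen_kappa(table, weighted=False):
    """Cohen's kappa for a k x k table of counts (rows: reviewer 1, columns: reviewer 2).

    With weighted=True, agreement weights are 1 - ((i - j) / (k - 1)) ** 2,
    so adjacent categories count as partial agreement. With weighted=False,
    only exact matches on the diagonal count as agreement.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        if weighted:
            return 1.0 - ((i - j) / (k - 1)) ** 2
        return 1.0 if i == j else 0.0

    cells = [(i, j) for i in range(k) for j in range(k)]
    # Observed weighted agreement.
    p_obs = sum(weight(i, j) * table[i][j] for i, j in cells) / n
    # Agreement expected by chance from the marginal totals.
    p_exp = sum(weight(i, j) * row_tot[i] * col_tot[j] for i, j in cells) / n**2
    return (p_obs - p_exp) / (1.0 - p_exp)


if __name__ == "__main__":
    # Illustrative 3-category counts, not study data.
    demo = [[10, 2, 0], [2, 10, 2], [0, 2, 10]]
    print(round(cohen_kappa(demo), 3))
    print(round(cohen_kappa(demo, weighted=True), 3))
```

When all disagreements fall in adjacent categories, as in the illustrative table, the weighted κ exceeds the unweighted κ, which is the pattern expected for an ordinal scale such as the CIN grades.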
To test the symmetry of 2 × 2 tables, the McNemar test was used.15 This statistic tests whether there are systematic “calibration” differences between centers.
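The McNemar test uses only the 2 discordant cells of the 2 × 2 table: cases that center A called positive and center B negative, and the reverse. The text does not state whether the exact or the χ²-approximated version was used, so the sketch below assumes the exact binomial form; the function name `mcnemar_exact` is hypothetical.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar P value from the discordant counts b and c.

    Under the null hypothesis of symmetry, each discordant case is equally
    likely to fall in either off-diagonal cell, so min(b, c) follows a
    Binomial(b + c, 0.5) distribution.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant cases, no evidence of asymmetry
    smaller = min(b, c)
    tail = sum(comb(n, i) for i in range(smaller + 1)) / 2**n
    return min(1.0, 2.0 * tail)


if __name__ == "__main__":
    # Illustrative counts, not study data.
    print(mcnemar_exact(5, 5))    # balanced disagreement
    print(mcnemar_exact(0, 10))   # one center is systematically more severe
```

A small P value indicates that one center tends to grade more severely than the other, which is the "calibration" difference referred to above.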
Overall κ values were computed by means of precision-weighted κ values of all intercenter comparisons.14 Because comparing every combination of centers was not feasible in the subgroup analyses (by the center that performed the biopsy and prepared the slides and by cytology or HPV result that led to referral of the woman for colposcopy), we computed the κ by building a unique contingency table adopting a fixed order of reviewers (ie, Bologna was always first, Florence was always first except when paired with Bologna, Imola was always first except when paired with Bologna or Florence, etc; Viterbo was always second). Heterogeneity between κ values was tested by using the χ² test computed as the precision-weighted sum of squared differences between each κ value and the overall κ value.
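A minimal sketch of precision-weighted pooling and the associated heterogeneity statistic is given below, assuming the usual inverse-variance form in which each κ is weighted by the reciprocal of its squared standard error. The function name `pool_kappas` and the input values are hypothetical, and the exact estimator used in the study is the one described in reference 14.

```python
def pool_kappas(kappas, standard_errors):
    """Inverse-variance pooling of kappa values with a heterogeneity statistic.

    Returns (pooled kappa, SE of pooled kappa, Q, degrees of freedom).
    Q is the precision-weighted sum of squared deviations from the pooled
    value; under homogeneity it is compared with a chi-square distribution
    with len(kappas) - 1 degrees of freedom.
    """
    weights = [1.0 / se**2 for se in standard_errors]
    total = sum(weights)
    pooled = sum(w * k for w, k in zip(weights, kappas)) / total
    pooled_se = (1.0 / total) ** 0.5
    q = sum(w * (k - pooled) ** 2 for w, k in zip(weights, kappas))
    return pooled, pooled_se, q, len(kappas) - 1


if __name__ == "__main__":
    # Illustrative values, not study data.
    pooled, se, q, df = pool_kappas([0.5, 0.7], [0.1, 0.1])
    print(pooled, se, q, df)
```

Weighting by precision lets comparisons based on many shared cases dominate the overall estimate, while pairs of centers that reviewed few cases in common contribute little.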
To test the heterogeneity between recruiting centers or between pairs of reviewers, a random effect logistic model was built using agreement in the 5-group classification as a binary outcome variable and the center or the pair as a panel variable. By using univariate logistic models to account for variability between centers, we also tested the effect on agreement of age, presence of a CIN 2 diagnosis, and precolposcopy cytology and HPV results.
Results
During the 2 phases of the study, 1,496 CIN cases were identified. Of these cases, 85 were unavailable for review, and 344 CIN 1 cases from the first phase were reevaluated by only 1 reviewer and were excluded from this analysis. Cases originally diagnosed as negative or inflammatory were not included in the review. In addition, 12 cases were excluded because 1 of the 2 diagnoses was unclear or missing. In total, 1,003 cases were included in the analysis.
Table 1 presents the overall agreement. One or both reviewers classified 243 cases as negative, all reflecting a downgrade from the original diagnosis, at least from CIN 1. The same was true for invasive cancer. Consequently, all cases classified as cancer here represent an upgrade of cases originally classified as CIN.
The strongest source of disagreement was the threshold between normal and CIN 1. (Only 30% of cases with a negative diagnosis were agreed on by the 2 reviewers.) Agreement was higher for CIN 1 and CIN 3 than for CIN 2. Both reviewers agreed on a CIN 2 diagnosis in fewer than half of the cases.
Ten cases were classified as carcinoma (1 frankly invasive squamous carcinoma, 7 microinvasive squamous carcinoma, and 2 in situ or intramucosal adenocarcinoma) by at least 1 reviewer, but there was total agreement only for 1 case (the frankly invasive case). All of these cases were discussed in plenary sessions, and only 5 microinvasive squamous cancers and no adenocarcinomas were confirmed.
The κ values calculated on the 5-grade classification (normal, CIN 1, CIN 2, CIN 3, or squamous cancer) were 0.34 (95% confidence interval [CI], 0.31–0.37) when unweighted and 0.65 when weighted (95% CI, 0.61–0.69). The κ values did not differ greatly according to age: 0.61 (95% CI, 0.54–0.67) and 0.62 (95% CI, 0.56–0.67) in women younger than 40 years and in women 40 years or older, respectively. There were no significant differences between recruitment centers (test for ρ = 0; P = 1; Table 2). The κ values were slightly higher when the HPV result was positive and the cytology result was more severe than ASCUS, but none of these differences was statistically significant (Table 3). Table 4 shows the results of a logistic model with the 5-group agreement as a binary outcome and the pair of reviewers as a panel variable: only the presence of CIN 2 as 1 of the 2 diagnoses had a statistically significant negative effect on agreement, while age, precolposcopy cytology result, and HPV result did not have a significant effect.
The overall κ values for dichotomous classifications were 0.65 (95% CI, 0.59–0.71) and 0.52 (95% CI, 0.46–0.58) with the thresholds at CIN 2 and CIN 3, respectively (Table 5 and Table 6). There was a statistically significant variation at both thresholds between the κ values of each center compared with the others: for CIN 2+, these κ values ranged from 0.51 to 0.76. One-to-one κ values between centers showed a wider range between point estimates, from 0.2 to 1 (Table 5), but the observed variability was less than what is expected by random fluctuations (test for ρ = 0; P = 1). The κ values for CIN 3 or worse at a single center compared with the others ranged from 0.47 to 0.67, while one-to-one κ values ranged from 0.23 to 0.81 (Table 6). In this case also, the observed variability was less than what is expected by random fluctuations (test for ρ = 0; P = 1). The McNemar test was significant in only 7 of 72 one-to-one comparisons.
When considering class-specific κ values (Table 7), CIN 2 and CIN 1 showed the lowest values: for CIN 2, the overall κ was 0.31 (95% CI, 0.25–0.37), and the κ values for individual vs all other centers ranged from 0.08 to 0.38. For CIN 1, the overall κ was 0.33 (95% CI, 0.27–0.39; κ values for individual vs all other centers, 0.20 to 0.49). The overall class-specific κ for CIN 3 was 0.50 (95% CI, 0.44–0.56; κ values for individual vs all other centers, 0.32 to 0.60).
Discussion
In this study, more than 1,000 cases (approximately 3,000 slides) were reviewed by at least 2 reviewers. The 9 centers participating in the NTCC trial all operate in well-established Italian screening programs. Consequently, results can be extended to most screening programs currently operating in Italy.
One of the major sources of variability was the marked tendency of the reviewers to downgrade the original interpretations from CIN 1 to negative. Although CIN 1 is frequently overinterpreted in cervical pathology practice, in this study, as in the ALTS,16 most of the CIN 1 cases downgraded at review were HPV DNA–positive, reflecting problems in establishing specific criteria for recognizing the morphologic features of the HPV cytopathic effect.
Agreement on invasive (microinvasive and frankly invasive) cancers was low. However, as specified, the studied cases represent an equivocal group because specimens originally diagnosed as invasive, the most typical, were not included in the review. The most reasonable explanation for squamous cancers was the uncertainty of the basal membrane’s integrity using only H&E-stained slides. Lack of desmoplasia and the possibility of a cut artifact identified during the consensus discussion allowed us to exclude microinvasion in some cases.
There are a variety of reasons for the low consistency of adenocarcinoma diagnoses. Very few glands with atypia were present in the 2 cases. While one reviewer judged this aspect as sufficient evidence to diagnose adenocarcinoma even in small endocervical biopsy specimens, in consensus sessions, this aspect was considered insufficient to make this diagnosis.
We found an overall κ value of 0.65 for CIN 2+ and of 0.52 for CIN 3+. These figures are consistent with those of similar studies. Stoler and Schiffman16 found a value of 0.68, but their study also reviewed negative biopsy specimens. Carreon et al7 found a κ value of 0.7 between 2 reviewers from the same center and values from 0.42 to 0.46 between reviewers from different centers. Malpica et al17 found a κ value of 0.75 for CIN 2+ and of 0.71 for CIN 3+. Crum et al18 reported slightly better κ values, at least 0.77, among 3 pathologists who classified histology as negative or low- or high-grade squamous intraepithelial lesion.
Our results do not confirm those of Carreon and colleagues,7 who reported higher agreement for older (older than 40 years) than for younger women. We must consider that in that study, the lower agreement regarding younger women was associated with a higher proportion of CIN 1. In our study, not all CIN 1 cases were included in the analysis, and, consequently, we have less statistical power to detect a similar effect. More surprisingly, agreement was similar regardless of recruitment center, ie, the center that performed the biopsy and prepared the slides. A weak, nonsignificant but suggestive association was found with cytology results: the more severe the cytology result, the greater the agreement in histology. The same occurred with the HPV results: agreement was greater in HPV+ cases. These findings are consistent with our previous results that indicated a higher positive predictive value for histology when the test before the colposcopy had a higher positive predictive value.13
We observed highly significant heterogeneity between the κ values of individual centers in classifying CIN 2 and, to a lesser extent, CIN 1. The poor reproducibility of CIN 2 might reflect that this grade of neoplasia does not correspond to a well-defined phase of the pathogenetic pathway of infection and transformation.2,6 Similarly, Carreon et al7 found that only 30% and 12% of CIN 2 diagnoses were confirmed by external reviewers, and Cai et al19 observed that about 80% of the final diagnoses of CIN 2 were not unanimous, compared with 52% of CIN 3 diagnoses (class-specific κ, 0.38 for CIN 2 and 0.72 for CIN 3).
Nevertheless, in our study, the highest agreement on a dichotomous scale was obtained when including CIN 2 or worse in one group and less severe than CIN 2 in the other. Reproducibility was better in this classification than in a CIN 3 or worse vs less severe than CIN 3 classification, suggesting that uncertainty was greater between CIN 2 and CIN 3 than between CIN 1 and CIN 2.
Reproducibility is one element to judge a classification. Validity is another, in this case meaning whether CIN 2 corresponds to a separate entity, representing a specific stage of the clinical development from HPV infection to invasive cancer. Our study does not provide direct information on this subject, although other authors have suggested a lack of correspondence to a specific entity as an explanation for poor reproducibility.2,6 A third element is usefulness in clinical practice. Although the current standard is to treat all lesions classified as CIN 2 or worse, data suggest an extremely low probability of progression in some groups of lesions, like those detected by primary screening with HPV DNA tests in younger women.10 Future studies could suggest that a more conservative approach is appropriate in some subgroups. Introducing a dichotomous classification would reduce the flexibility of the therapeutic approach. Furthermore, shifting to a dichotomous classification at this time will probably push most pathologists toward including all CIN 2 diagnoses in the higher class for the sake of prudence, creating a strong barrier to implementing more conservative protocols for all or part of what we now call CIN 2.
Usually a diagnosis is formulated by applying well-established morphologic criteria, but it can also be influenced by previous cytologic diagnoses and HPV DNA test results. A diagnosis is formulated under enormous pressure because of the risks of undertreatment and of missing cancer. In our study, reviewers were blinded to these data and could judge more freely, without diagnostic responsibility. However, reviewing many cases in a short period is tedious and tiring.
The use of alternative, more objective methods as an adjunct to histology allows resolving some uncertain cases. Immunostaining by antibodies against cell cycle–related antigens has been used to support the interpretation of cervical biopsy specimens. The most extensively used are MIB-1 and p16.20–23 Image analysis methods have also been used in an attempt to bring more objectivity to the classification of CIN.24,25 This study was performed without the help of ancillary techniques, mainly because we wanted to avoid causing an artifactual correlation between screening or triage tests and histology, which was the NTCC study end point.
Although CIN 2 as a single, intermediate category is less reproducible than CIN 3, a dichotomous classification comparing less severe than CIN 2 to CIN 2 or worse is more reproducible than one that compares less severe than CIN 3 to CIN 3 or worse.
We thank all staff who assisted in running the study, Margaret Becker for editing the text, and the thousands of women who participated in the study.