The histopathologic interpretation of cervical intraepithelial neoplasia (CIN) is subject to a high level of interobserver variability and a substantial number of false-positive and false-negative results. We assessed the impact of the conjunctive interpretation of p16INK4a-immunostained slides on the accuracy of community-based pathologists in diagnosing high-grade CIN (CIN 2 and CIN 3) in biopsy specimens. Twelve pathologists rendered independent diagnoses on a set of 500 H&E-stained cervical punch and conization specimens. Results were compared with a dichotomized “gold standard” established by consensus of 3 gynecopathology experts. When p16INK4a-immunostained slides were added and interpreted conjunctively with the H&E-stained slides, diagnostic accuracy for the detection of high-grade CIN increased significantly (P = .0004). Sensitivity for high-grade CIN increased by 13%, cutting the rate of false-negative results in half. Agreement among community-based pathologists in diagnosing high-grade CIN improved significantly (mean κ increased from 0.566 to 0.749; P < .0001). Reproducibility of p16INK4a stain interpretation was excellent (κ = 0.899). Our results show that conjunctive interpretation of p16INK4a-stained slides can significantly improve the routine interpretation of cervical histopathology.
The results of the histopathologic interpretation of cervical biopsy specimens guide the subsequent management of women who have been screened for cytologic abnormalities and referred for colposcopy for further diagnostic follow-up. In most countries with established screening protocols, the histologic classification of a lesion as cervical intraepithelial neoplasia (CIN) grade 2 or 3 (CIN 2 or CIN 3) leads to excisional or ablative treatment to remove the abnormal tissue and prevent potential progression to invasive cancer. Accurate interpretation of cervical biopsy specimens is therefore essential to distinguish low-grade CIN (CIN 1) from high-grade CIN (CIN 2 and CIN 3), avoiding overtreatment of false-positive cases and undertreatment of false-negative cases. This distinction is clinically relevant because, on the one hand, many low-grade lesions have the potential to regress spontaneously (reviewed by Schiffman et al1) and, on the other hand, ablative therapy has been shown to have a potentially negative impact on the reproductive outcome of women.2
Various studies have shown substantial variation between and within observers in the interpretation of CIN on H&E-stained tissue sections.3–9 κ coefficients, which measure the level of agreement among observers corrected for chance, typically fall in the range of 0.45 to 0.50,9,10 indicating moderate agreement, although earlier studies reported κ values as low as 0.20, indicating poor agreement.4 To provide realistic estimates of the reproducibility of histologic interpretations of cervical specimens, analyses were performed in the course of the ASCUS/LSIL Triage Study comparing the diagnoses of the original clinical center pathologists with the results of a quality control review.9 Reproducibility of histopathologic interpretations of biopsy specimens was moderate (κ < 0.5). Reproducibility was substantially lower for punch biopsy specimens than for loop electrode excision procedure specimens, and variability was most prominent in the low-grade abnormalities.
There is continued interest in evaluating adjunctive methods to enhance the diagnostic accuracy of the morphologic interpretation of cervical tissue specimens. Overexpression of the cyclin-dependent kinase inhibitor p16INK4a (p16) has been shown to be strongly associated with the onset of transforming infections with high-risk (HR) human papillomaviruses (HPV) (reviewed by Mulvany et al11 and Wentzensen and von Knebel Doeberitz12). Inactivation of the cell-cycle regulatory retinoblastoma protein (pRb) by HR HPV E7 oncoproteins results in overexpression of the p16 protein, because the negative feedback control of p16 gene transcription by functional pRb is abrogated (reviewed by Wentzensen and von Knebel Doeberitz12). Thus, overexpression of p16 may serve as a surrogate marker for the inactivation of pRb, which is a key event in the HR HPV–mediated carcinogenesis of the cervical epithelium.
Numerous studies have investigated the correlation between p16 overexpression and the presence of CIN (reviewed by Mulvany et al,11 Cuschieri and Wentzensen,13 and Tsoumpou et al14). Despite a considerable amount of variation in immunohistochemical assays and microscopic interpretation criteria used, p16 overexpression has been demonstrated for the vast majority of CIN 2 and CIN 3 cases and for highly variable proportions of squamous lesions classified as CIN 1 on H&E-stained tissue sections (reviewed by Tsoumpou et al14).
The potential role and clinical value of immunohistochemical detection of p16 overexpression in cervical biopsy specimens have been investigated in various research studies, focusing mainly on its effect on the reliable diagnosis of high-grade CIN10,15–17 or on the prognostic and predictive value of p16 overexpression in CIN 1 lesions.18–22 Several analyses evaluated the effect of p16 stain interpretation on the interrater reliability of pathologists diagnosing CIN. An initial study showed that the assessment of p16-stained cervical tissue specimens achieved a substantially higher degree of agreement between pathologists than conventional H&E-stained slide interpretation.23 More recent studies compared the interobserver agreement of pathologists diagnosing high-grade CIN on H&E-stained slides in conjunction with p16-stained slides prepared from consecutive tissue sections with their agreement when diagnosing CIN on H&E-stained slides only. Sayed and colleagues16 reported substantially improved interobserver agreement between pathologists in diagnosing cervical biopsy specimens from adolescents (κ values increased from 0.36–0.39 to 0.44–0.53). In another study, conjunctive interpretation of p16 stains in addition to H&E-stained slides resulted in a significant increase in interobserver agreement among 6 pathologists in diagnosing squamous lesions, with κ values improving from 0.49 to 0.64 for punch biopsy specimens and from 0.63 to 0.70 for cone biopsy specimens (P < .001).10 However, although these studies confirmed that p16 stain interpretation improves interrater reliability, they were not statistically powered to demonstrate improved diagnostic accuracy for CIN 2 and CIN 3.
In the present study, we evaluated the impact of the conjunctive interpretation of p16-immunostained slides on the accuracy of 12 community-based pathologists, collectively performing 12,000 readings, in diagnosing high-grade CIN in cervical tissue specimens, using standardized immunostaining and microscopic interpretation protocols. For the study, 500 cervical tissue specimens, including comparable numbers of punch and cone biopsy specimens, were assessed, and the results of both methods (H&E vs H&E and p16) were compared with a “gold standard” established as the adjudicated consensus diagnoses of 3 expert gynecopathologists (C.B., J.O., and D.S.) on H&E-stained slides only. Besides the effect on diagnostic accuracy, we evaluated the level of agreement among the community-based pathologists in diagnosing high-grade CIN with both methods, as well as the reliability of the p16 staining pattern interpretation itself.
Materials and Methods
Case Selection and Preparation of Tissue Slides
The surgical pathology files of 2 anatomic pathology laboratories (Laboratoire Cerba, Cergy Pontoise, France, and Institute of Pathology, Mannheim, Germany) were searched for archived cervical tissue specimens using free-text search terms corresponding to the diagnostic categories negative for dysplasia, CIN 1, CIN 2, and CIN 3. The search results were used to retrieve consecutive cases from the respective tissue specimen archives, stratified by diagnostic category and specimen type, going backward from January 1, 2007. For cases with more than 1 tissue block, the block with the highest histologic grade of lesion was included. Each of the 2 pathology laboratories was asked to select a total of 275 cases, including at least 100 cases previously categorized as negative for dysplasia, 50 cases of CIN 1, 50 cases of CIN 2, and 50 cases of CIN 3. The search lists remained at the 2 pathology laboratories and were not made available to the study personnel. All slides were numbered at the pathology sites in a way that did not allow the identity of the cases to be traced. None of these cases has been included in any previously reported diagnostic study of cervical biopsies.
Tissue blocks were cut at the pathology laboratory sites. All H&E staining was performed at one of the pathology laboratories that provided the tissue specimens, using routine procedures. Immunohistochemical staining was performed in a central histopathology laboratory. Cases in which the H&E-stained and immunostained slides did not show the same tissue areas for technical reasons (eg, loss of tissue material during the heat-induced epitope retrieval procedure) were withdrawn from the study and documented. Consecutive cases within the diagnostic categories were selected to reach the target numbers of 250 punch biopsy specimens and 250 cone biopsy specimens. Because more cone biopsy specimens than anticipated were excluded for technical reasons, only 246 cone biopsy cases could be included rather than the target of 250. A total of 254 punch biopsy cases were therefore included to reach the target total of 500 cervical specimens.
All H&E-stained slides and immunostained slides were labeled with a study identifier at the sponsor’s study site using a computer-generated randomization list to blind all reviewers to the origin of the cases.
Immunostaining of cervical biopsy specimens for p16INK4a was performed using the CINtec Histology Kit (REF 9511, mtm laboratories, Heidelberg, Germany) according to the manufacturer's instructions. In brief, antigen retrieval was performed for 10 minutes at 95 to 99°C in a water bath. After blocking endogenous peroxidase activity, the slides were incubated for 30 minutes with the p16INK4a antibody (clone E6H4) or with the Negative Reagent Control (isotype control antibody), both included in the CINtec Histology Kit. Secondary antibody reagent (polymer-based goat anti-mouse antibody fragment conjugated to horseradish peroxidase) was applied for 30 minutes. After the chromogenic visualization step using the 3,3′-diaminobenzidine chromogen, slides were counterstained with hematoxylin and coverslipped. For each staining run, a positive control slide containing tissue sections from a cervical biopsy with known positive immunoreactivity for p16INK4a was used to validate the staining procedure.
Establishment of Gold Standard Diagnoses
The H&E-stained slides were independently reviewed by 3 expert gynecopathologists (C.B., J.O., and D.S.). Slides were shipped to the expert gynecopathologists for review in their offices. A diagnosis was established for each case without knowledge of additional patient information, and the result was assigned to 1 of the following 5 diagnostic categories: negative for dysplasia, CIN 1, CIN 2, CIN 3, and invasive cancer. Only the most severe diagnostic finding for each case was to be reported; thus, only a single category could be chosen per case. For the category negative for dysplasia, the following subclassification options were provided: cervicitis, regenerative changes/tissue repair, reserve cell hyperplasia/immature metaplasia, atrophy, transitional cell metaplasia, squamous papilloma, atypical immature metaplasia, and “without findings.” Multiple answers were allowed for these subcategories, unless “without findings” was noted.
Every case with a discrepant result among the expert gynecopathologists’ slide interpretations with respect to the 5 aforementioned diagnostic categories was subjected to an adjudication meeting. A 2-of-3 majority consensus diagnosis was established during a joint review of the H&E-stained slides. The diagnoses for all cases with complete agreement and the majority diagnoses for the adjudicated cases were used as the gold standard diagnoses for the study.
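The adjudication rule can be sketched as a small triage function. This is a minimal illustration, not software used in the study: the category labels and the function name `triage_expert_reads` are hypothetical. Unanimous reads are accepted directly; any discrepancy flags the case for joint review, where the 2-of-3 majority diagnosis was actually established. For cases with 3 different reads, a majority only emerges during that review, so the sketch returns no provisional diagnosis for them.

```python
from collections import Counter

# Hypothetical labels for the 5 diagnostic categories used by the experts
CATEGORIES = ("negative", "CIN 1", "CIN 2", "CIN 3", "invasive")


def triage_expert_reads(reads):
    """Return (diagnosis, needs_adjudication) for one case read by 3 experts.

    Unanimous reads are accepted as the gold standard diagnosis. Otherwise the
    case is flagged for joint review; a provisional 2-of-3 majority is returned
    if one exists, else None (all 3 reads differ).
    """
    if len(reads) != 3:
        raise ValueError("expected exactly 3 expert reads")
    for read in reads:
        if read not in CATEGORIES:
            raise ValueError(f"unknown category: {read}")
    category, count = Counter(reads).most_common(1)[0]
    if count == 3:
        return category, False
    if count == 2:
        return category, True
    return None, True
```

In the study, 253 cases fell into the unanimous branch and 247 (155 with a 2-of-3 split, 92 with 3 different reads) into the adjudication branch.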
Information about glandular lesions was recorded in addition to the evaluation of the squamous lesions but was not intended to be analyzed in the course of this study owing to the low expected frequency in the specimen cohort included in this study.
Community-Based Pathologist Slide Readings
Community-based certified pathologists from 4 European countries (France, Germany, Italy, and Spain) who diagnose cervical histology specimens on a regular basis were invited to participate in the study based on their interest. The same diagnostic categories used for establishing the gold standard diagnoses were applied.
Slide readings of the study specimens by the community-based pathologists were performed in a series of slide reading sessions. In the first round of reading sessions, the panel pathologists established independent diagnoses on the H&E-stained slides for all 500 study specimens. After a washout period of at least 4 weeks, the community-based pathologists reassessed the same set of H&E-stained slides in conjunction with the matched slides stained with the CINtec Histology Kit, ie, a slide stained for p16INK4a and a slide stained with the Negative Reagent Control provided with the kit. For this second review, the cases were renumbered using a randomization list. The community-based pathologists were blinded to their original diagnoses and to the gold standard diagnoses.
For the interpretation of p16 immunostaining, the following classification criteria were used. The rating “positive” was assigned if the p16-stained slide showed continuous staining of cells of the basal and parabasal cell layers of the cervical squamous epithelium, with or without staining of cells of the superficial cell layers (“diffuse staining pattern”; Image 1B and Image 1D). The rating “negative” was assigned if the p16-stained slide showed staining of isolated cells or small cell clusters, ie, noncontinuous staining, in particular no continuous staining of the basal and parabasal cells (“focal staining pattern”; Image 1F and Image 1H), or no immunoreactivity in the squamous epithelium (Image 1J). Image 1A, Image 1C, Image 1E, Image 1G, and Image 1I show the corresponding H&E-stained sections for these p16-stained slides.
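The 3-category rating and its dichotomization can be written as a decision rule. The sketch below assumes the microscopic findings have already been reduced to 2 hypothetical yes/no observations (`any_staining`, `continuous_basal_parabasal`); in practice these are visual judgments by the pathologist, and all names here are illustrative only.

```python
from enum import Enum


class P16Pattern(Enum):
    DIFFUSE = "diffuse"  # continuous basal/parabasal staining, with or without superficial layers
    FOCAL = "focal"      # isolated cells or small clusters, noncontinuous
    NONE = "none"        # no immunoreactivity in the squamous epithelium


def rate_p16(any_staining: bool, continuous_basal_parabasal: bool) -> P16Pattern:
    """Map 2 hypothetical yes/no observations to the 3-category rating."""
    if continuous_basal_parabasal:
        if not any_staining:
            raise ValueError("continuous basal/parabasal staining implies some staining")
        return P16Pattern.DIFFUSE
    return P16Pattern.FOCAL if any_staining else P16Pattern.NONE


def is_p16_positive(pattern: P16Pattern) -> bool:
    """Dichotomize: only the diffuse pattern is rated positive."""
    return pattern is P16Pattern.DIFFUSE
```

Keeping the 3 categories before dichotomizing preserves the focal vs none distinction, which is where most residual reader disagreement arose, even though only diffuse vs not diffuse enters the positive/negative call.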
The community-based pathologists received structured training of approximately 30 minutes before the interpretation of p16-immunostained slides.
The experimental hypothesis was that interpreting H&E-stained slides in conjunction with p16-stained slides improves diagnostic accuracy for high-grade CIN compared with H&E-stained slides alone. A multireader, multicase (MRMC) design24,25 was applied, with paired reading of samples by both methods (paired reader, paired cervical specimen). The mixed-effects analysis of variance model for jackknife pseudovalues, commonly known as the Dorfman-Berbaum-Metz (DBM) method,24 was used to test for effects of the diagnostic test on accuracy. Sample size (number of readers and number of slides) was calculated as described by Hillis and Berbaum26 using an α level of .05 and 80% power, with the treatment effect and the parameters describing treatment × reader, treatment × case, and treatment × case × reader interactions estimated on the basis of a pilot study. All calculations were performed with SAS software, version 9.1 (SAS Institute, Cary, NC). The DBM MRMC software, version 2.2,24 (available at http://perception.radiology.uiowa.edu) was used to validate the internal SAS program for calculation of the main study effects.
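The core of the DBM approach is to replace each reader's figure of merit with case-level jackknife pseudovalues, which then enter the mixed-effects analysis of variance. The sketch below computes pseudovalues for one reader and one reading method, assuming the figure of merit is the empirical (Mann-Whitney) AUC and that diagnoses are coded as ordinal scores (eg, 0 to 4 for negative through invasive cancer). Both choices are assumptions for illustration; the ANOVA step and the DBM MRMC software are not reproduced.

```python
def empirical_auc(scores, truth):
    """Empirical AUC: probability that a diseased case scores higher than a
    non-diseased case, counting ties as one half (Mann-Whitney statistic)."""
    pos = [s for s, t in zip(scores, truth) if t]
    neg = [s for s, t in zip(scores, truth) if not t]
    if not pos or not neg:
        raise ValueError("need at least one diseased and one non-diseased case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def jackknife_pseudovalues(scores, truth, fom=empirical_auc):
    """Case-deletion jackknife pseudovalues: Y_k = n * theta - (n - 1) * theta_(-k)."""
    scores, truth = list(scores), list(truth)
    n = len(scores)
    if n != len(truth):
        raise ValueError("scores and truth must have the same length")
    theta_all = fom(scores, truth)
    pseudo = []
    for k in range(n):
        # Recompute the figure of merit with case k left out
        theta_minus_k = fom(scores[:k] + scores[k + 1:], truth[:k] + truth[k + 1:])
        pseudo.append(n * theta_all - (n - 1) * theta_minus_k)
    return pseudo


# Hypothetical ordinal ratings for 6 cases and their dichotomized gold standard
ratings = [0, 2, 3, 2, 4, 1]
gold_cin2plus = [False, False, True, True, True, False]
print(empirical_auc(ratings, gold_cin2plus))
print(jackknife_pseudovalues(ratings, gold_cin2plus))
```

Scoring the ordinal categories rather than only the binary call lets the AUC reflect how far apart readers place diseased and non-diseased cases, not just which side of the CIN 2 threshold they fall on.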
Interrater reliability was assessed by estimating agreement in the identification of high-grade CIN with the κ statistic and comparing it between the 2 methods; the experimental hypothesis was improved agreement. κ coefficients measure the proportion of agreement among observers adjusted for the agreement that would be expected by chance alone. Z-transformed κ values were analyzed by analysis of variance, with the diagnostic test as the fixed factor and pairs of readers as the random factor, to test for effects of the new diagnostic test.
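For 2 readers making dichotomous calls (high-grade CIN vs <CIN 2), Cohen's κ compares observed agreement with the agreement expected from each reader's marginal rates. The sketch below computes κ for one pair and averages it over all reader pairs. It assumes pairwise Cohen's κ as the agreement measure; the function names are hypothetical, and the z-transformation and ANOVA used in the study are not reproduced.

```python
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two readers' calls on the same cases, in the same order."""
    a, b = list(a), list(b)
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each reader's marginal frequency of every label
    p_exp = sum((a.count(c) / n) * (b.count(c) / n) for c in set(a) | set(b))
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_obs - p_exp) / (1.0 - p_exp)


def mean_pairwise_kappa(readings):
    """Mean Cohen's kappa over all reader pairs; readings maps reader -> calls."""
    pairs = list(combinations(sorted(readings), 2))
    if not pairs:
        raise ValueError("need at least two readers")
    return sum(cohen_kappa(readings[r], readings[s]) for r, s in pairs) / len(pairs)
```

With 12 readers this averages over 66 pairs, which is why reader pairs, rather than individual readers, form the random factor in the comparison between methods.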
For the assessment of agreement on the p16 staining pattern interpretation, κ statistics were used in a descriptive manner.
Results
Gold Standard Diagnoses
Of the 500 cervical biopsy cases individually reviewed on H&E-stained slides only, all 3 expert gynecopathologists agreed on the same diagnostic category for 253 specimens. The remaining 247 cases, in which 1 expert disagreed with the other 2 (n = 155) or all 3 experts reported different results (n = 92), were subjected to a joint microscopic review for adjudication. A 2-of-3 majority consensus diagnosis was achieved for all of these cases.
For 482 of these 500 cases, complete data sets of H&E slide readings and H&E plus p16 slide readings by the community-based panel pathologists were available; these cases were used to analyze the effect of conjunctive p16 slide interpretation on diagnostic accuracy. The frequencies of the diagnostic categories for these 482 cases, as defined by the gold standard diagnoses on H&E-stained slides only, are shown in Table 1. Of the 482 cases, 194 (40.2%) were categorized as negative for dysplasia, 96 (19.9%) as CIN 1, 69 (14.3%) as CIN 2, and 123 (25.5%) as CIN 3 by adjudicated consensus diagnosis. No tissue specimen was classified as invasive cancer. When the categories were dichotomized, grouping CIN 2 and CIN 3 lesions as disease-positive cases (high-grade CIN) and CIN 1 lesions and cervical squamous tissues called negative for dysplasia as disease-negative cases (<CIN 2), 192 of the 482 cases (39.8%) were high-grade CIN.
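The dichotomization can be checked directly against the gold standard counts reported in this paragraph. The mapping below is a minimal sketch; the dictionary keys and the function name `is_high_grade` are hypothetical labels chosen for illustration.

```python
# Gold standard category counts for the 482 analyzed cases, as reported in the text
GOLD_COUNTS = {"negative": 194, "CIN 1": 96, "CIN 2": 69, "CIN 3": 123, "invasive": 0}

# Disease-positive (high-grade CIN, CIN 2+) categories
HIGH_GRADE = {"CIN 2", "CIN 3", "invasive"}


def is_high_grade(category):
    """Dichotomize a diagnostic category: True for CIN 2+, False for <CIN 2."""
    if category not in GOLD_COUNTS:
        raise ValueError(f"unknown category: {category}")
    return category in HIGH_GRADE


total = sum(GOLD_COUNTS.values())
high_grade = sum(n for cat, n in GOLD_COUNTS.items() if is_high_grade(cat))
print(total, high_grade, f"{100 * high_grade / total:.1f}%")  # 482 192 39.8%
```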
There was a similar distribution of cases for the various diagnostic categories ranging from negative for dysplasia to CIN 3 for the 2 different types of cervical biopsy specimens, ie, cervical punch and cone biopsies (Table 1).
Accuracy of Community-Based Pathologists for Diagnosing CIN 2+ on H&E-Stained Slides
For the primary study goal, the accuracy of community-based pathologists in diagnosing high-grade CIN was analyzed by comparing their diagnoses using the same dichotomous categorization into CIN 2 and higher-grade lesions (CIN 2+, including CIN 2, CIN 3, and invasive cancer) and disease-negative cases (cases categorized as negative for dysplasia or CIN 1). Table 2 and Table 3 compare the interpretations of the H&E-stained tissue specimens by each of the 12 community-based pathologists with the gold standard diagnoses. The mean true-positive fraction (sensitivity) of the community-based pathologists’ diagnoses of high-grade CIN on H&E-stained slides only was 0.77, ranging from 0.42 to 0.94 for individual pathologists. The mean false-positive fraction (1 – specificity) on H&E-stained slides was 0.11, ranging from 0.00 to 0.23 for individual reviewers.
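Per-reader sensitivity (true-positive fraction) and false-positive fraction follow from cross-tabulating each reader's dichotomized calls against the dichotomized gold standard; the values reported here are means of these per-reader fractions across the 12 pathologists. A minimal sketch, with hypothetical function names; booleans stand for the CIN 2+ call.

```python
def tpf_fpf(calls, gold):
    """Return (true-positive fraction, false-positive fraction) for one reader.

    calls, gold: equal-length sequences of booleans, True meaning CIN 2+.
    """
    calls, gold = list(calls), list(gold)
    if len(calls) != len(gold):
        raise ValueError("calls and gold must have the same length")
    tp = sum(c and g for c, g in zip(calls, gold))
    fn = sum((not c) and g for c, g in zip(calls, gold))
    fp = sum(c and (not g) for c, g in zip(calls, gold))
    tn = sum((not c) and (not g) for c, g in zip(calls, gold))
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need both diseased and non-diseased cases")
    return tp / (tp + fn), fp / (fp + tn)


def mean_fractions(readers, gold):
    """Average per-reader TPF and FPF; readers maps reader -> list of calls."""
    pairs = [tpf_fpf(calls, gold) for calls in readers.values()]
    mean_tpf = sum(p[0] for p in pairs) / len(pairs)
    mean_fpf = sum(p[1] for p in pairs) / len(pairs)
    return mean_tpf, mean_fpf
```

Averaging per-reader fractions, rather than pooling all readings into one table, keeps each pathologist's operating point visible, which matches how the ranges across individual pathologists are reported.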
Increase of Diagnostic Accuracy by Conjunctive p16 Slide Reading
Tables 2 and 3 show the results of the community-based pathologists’ readings on H&E- plus p16-stained slides compared with the gold standard diagnoses. Overall, diagnostic accuracy for high-grade CIN improved significantly when p16-immunostained slides were added to the interpretation of conventional H&E-stained slides (Table 2). The mean true-positive fraction (sensitivity) of the community-based pathologists’ diagnoses of high-grade CIN increased from 0.77 for H&E-stained slide reading to 0.87 for H&E- plus p16-stained slide reading (ie, a relative increase in sensitivity of 13%), ranging from 0.68 to 0.95 for individual pathologists. For cases diagnosed as CIN 3 by the adjudicated gold standard, conjunctive reading of H&E- and p16-stained slides raised the community-based pathologists’ mean sensitivity for classifying them as high-grade CIN to 95%, an increase of 11% compared with H&E-stained slide interpretation only.
The relative improvement for community-based pathologists’ agreement with the gold standard diagnoses was highest in the CIN 2 category (447 concordant results for H&E- plus p16-stained slide interpretation compared with 340 concordant results for H&E-stained slides only; an increase of 31.5%), followed by CIN 1 cases (630 for H&E plus p16 vs 555 for H&E only, an increase of 13.5%) and CIN 3 cases (1,002 vs 926, an increase of 8.2%) (Table 3).
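The relative increases in this paragraph are percentage changes of the concordant-reading counts. The check below reproduces them from the reported numbers; the dictionary and function names are illustrative.

```python
# Concordant readings with the gold standard (H&E only, H&E plus p16), as reported in the text
CONCORDANT = {
    "CIN 2": (340, 447),
    "CIN 1": (555, 630),
    "CIN 3": (926, 1002),
}


def relative_increase(before, after):
    """Percentage change from `before` to `after`."""
    return 100.0 * (after - before) / before


for category, (he_only, he_p16) in CONCORDANT.items():
    print(f"{category}: +{relative_increase(he_only, he_p16):.1f}%")
# prints CIN 2: +31.5%, CIN 1: +13.5%, CIN 3: +8.2% (one per line)
```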
The mean false-positive fraction (1 – specificity) increased only slightly, from 0.11 (H&E) to 0.12 (H&E plus p16), ranging from 0.04 to 0.27 for individual reviewers on H&E- plus p16-stained slides. Mean area under the curve (AUC) values for the correct identification of disease cases (high-grade CIN) were 0.877 for the interpretation of H&E-stained slides alone and 0.925 for the conjunctive interpretation of H&E- and p16-stained slides. The difference in diagnostic accuracy between the 2 methods as measured by AUC values was 0.048 (95% confidence interval [CI], 0.024–0.072; P = .0004). Table 4 lists the mean values for both methods for all pathologists and the results for each individual pathologist.
A statistically significant effect of the conjunctive p16-stained slide interpretation on the improvement of diagnostic accuracy was observed for both types of tissue specimens, ie, punch and cone biopsy specimens, with a higher effect size for the group of cone biopsy specimens compared with the group of punch biopsy specimens analyzed in this study. Mean AUC values increased from 0.896 to 0.930 (difference, 0.034; P = .0057) for punch biopsy specimens and from 0.851 to 0.918 (difference, 0.067; P = .0002) for cone biopsy specimens (Table 4). Figure 1 shows the sensitivity-specificity profiles for all 12 community-based pathologists when reading by both methods.
There was considerable variation among the 12 panel pathologists in the histologic assessment of H&E-stained slides. Results for all cases with complete readings (n = 482) are shown in Table 5. With the dichotomous categories of high-grade CIN vs CIN 1 lesions and cervical tissue specimens classified as negative for dysplasia, the mean weighted κ coefficient was 0.566 (95% CI, 0.537–0.595). Agreement among community-based pathologists in categorizing lesions as high-grade CIN vs CIN 1 or negative for dysplasia improved significantly with H&E- plus p16-stained slides, with a κ coefficient of 0.749 (95% CI, 0.732–0.767), an increase of 32.3% (difference of κ coefficients between methods, 0.183; 95% CI, 0.157–0.210; P < .0001).
Reproducibility of p16 Stain Interpretation
The reproducibility of classifying the p16 immunostaining pattern as positive (diffuse immunostaining pattern, Images 1B and 1D) or negative (focal pattern, Images 1F and 1H; or no immunoreactivity, Image 1J) was assessed in the course of the slide review sessions by the community-based pathologists. The frequency of a diffuse p16 stain interpretation was similar for all 12 readers, ranging from 242 to 286 cases categorized as positive (Table 6). There was a high level of agreement among the pathologists in calling the staining pattern diffusely positive. The mean κ coefficient for agreement on the dichotomized categories of diffusely positive vs focal or no immunoreactivity was 0.899, indicating excellent agreement (SD, 0.032; median κ coefficient, 0.903). When the assessment of the p16 staining pattern was compared with the adjudicated gold standard diagnoses, a diffuse staining pattern was recorded in 94.5% (2,211/2,340) of the readings of cases diagnosed as CIN 2 or CIN 3 and in 97.6% (1,452/1,488) of the readings of cases diagnosed as CIN 3.
Discussion
This analysis shows that the accuracy of community-based pathologists in diagnosing high-grade CIN on cervical biopsy specimens can be significantly increased by the conjunctive interpretation of p16-immunostained slides along with routine H&E-stained slide interpretation. Sensitivity for high-grade CIN increased significantly: the number of missed high-grade CIN cases was reduced by 45%, and the number of readings in which a case classified as CIN 3 by the gold standard was missed (ie, called <CIN 2) by an individual community-based pathologist decreased by 60% (81 vs 201; Table 3). This gain in sensitivity was not associated with a relevant loss of specificity, which is important when a stain is used routinely in clinical practice.
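The 60% reduction and the 95% CIN 3 sensitivity reported in the Results are consistent with each other if each of the 123 gold-standard CIN 3 cases in the 482-case analysis set was read once by each of the 12 pathologists (1,476 readings). That denominator is an assumption made for this check, not a number stated in the text.

```python
# Assumption: 123 CIN 3 cases x 12 readers (denominator not stated in the text)
CIN3_READINGS = 123 * 12

# Readings of gold-standard CIN 3 cases called <CIN 2, as reported in the text
missed = {"H&E": 201, "H&E plus p16": 81}

reduction_pct = 100.0 * (missed["H&E"] - missed["H&E plus p16"]) / missed["H&E"]
sensitivity = {method: 1 - n / CIN3_READINGS for method, n in missed.items()}

print(f"reduction in missed CIN 3 readings: {reduction_pct:.1f}%")
for method, sens in sensitivity.items():
    print(f"CIN 3 sensitivity ({method}): {sens:.3f}")
```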
When looking into the exact agreement of the 12 readers for the individual diagnostic categories, the rates for agreement with the gold standard results were higher for the H&E plus p16 readings in all individual diagnostic categories of CIN (CIN 1 to CIN 3), with only a marginally lower agreement for cases categorized as negative for dysplasia (Table 3). The highest level of improvement of the community-based pathologists’ agreement with the gold standard diagnoses was achieved in the CIN 2 category, which is known to show only a low level of reproducibility when being assessed on H&E-stained slides. By adding p16-stained slides to the interpretation, the level of agreement between the community-based pathologists and the gold standard diagnoses increased by 31.5% (447 vs 340 concordant results; Table 3).
Previous studies assessing the effect of conjunctive p16 stain reading on the interpretation of cervical biopsy specimens focused predominantly on reliability between readers.10,16,27 In a similar but separate study involving a different set of specimens and fewer study pathologists, Horn and colleagues10 reported improved diagnostic accuracy for the correct identification of high-grade CIN for 4 of 6 pathologists; however, that study was not sufficiently powered to demonstrate statistical significance across all pathologists. In our study, 12 community-based pathologists were included, each reading 500 cervical biopsy specimens with both methods (ie, H&E stain only and H&E plus p16 stain) and providing a total of 12,000 individual reading results. Similar to the previous study, two thirds of the pathologists (8 of 12) improved the sensitivity of their diagnoses for CIN 2+, whereas the remaining 4 pathologists substantially increased their specificity for correctly identifying cervical tissue specimens categorized as negative for dysplasia or CIN 1 when p16-stained slides were added to the interpretation (Table 2). These observations further support the hypothesis that the conjunctive interpretation of p16-stained slides results in a more accurate diagnosis of cervical biopsy specimens; however, individual pathologists may benefit to varying degrees in the sensitivity or specificity of their diagnoses. As graphically summarized in Figure 1, the sensitivity-specificity profiles of the 12 community-based pathologists improved overall when p16 stains were added to the interpretation of cervical biopsy specimens.
As in many diagnostic studies in anatomic pathology, an objective measure of clinical truth is lacking. Therefore, gold standards have to be established that approximate the truth as closely as possible. In this study, we used the majority consensus diagnoses of 3 gynecopathology experts based on the interpretation of H&E-stained slides alone as the gold standard defining the diagnostic categories. To achieve the highest level of agreement, every case in which at least 1 of the 3 expert reviews had provided a discrepant result was subjected to a joint review by all 3 gynecopathologists. An adjudicated consensus diagnosis that relied not only on the morphologic interpretation of H&E-stained slides but also on the conjunctive interpretation of p16-stained slides might have yielded a reference standard closer to clinical truth. However, it is important to avoid incorporation bias, which occurs when the results of the test under evaluation are incorporated into the evidence used to establish the true diagnosis. Therefore, the reference standard was based on the interpretation of H&E-stained slides only, which may lead to a conservative estimate of the accuracy of the new test.25
Besides the increase in diagnostic accuracy, agreement among pathologists interpreting cervical biopsy specimens improved significantly when p16 stains were added to the slide interpretation. This is consistent with all previous reports.10,16,27 The mean κ coefficient, a measure of agreement corrected for chance, increased by 32% between the 2 methods using the dichotomous categorization of high-grade CIN vs less than CIN 2. Notably, the mean κ value for the H&E-only readings was 0.566 (Table 5), indicating an already moderate level of agreement among the 12 readers that was higher than in most prior reports.9,10,16,27 With p16, the level of agreement improved significantly (mean κ value increased to 0.749; P < .0001). This suggests that the effect size might be even larger in clinical practice settings, where the level of diagnostic agreement between pathologists on H&E-stained slides has, in part, been found to be substantially lower than in this study.3–8,28
The positive effect of adding p16-stained slides to the interpretation of H&E-stained cervical tissue sections on the diagnostic accuracy and agreement among observers was independently demonstrated for both types of cervical specimens, ie, punch and cone biopsy specimens (Tables 4 and 5). In a previous study, a higher level of agreement was reported for the initial H&E-based review of cone biopsy specimens,10 as well as a lower relative increase by adding p16-stained slides. However, this may have been due to a large proportion of CIN 3 lesions within the cone biopsy population analyzed in that study, for which it is known that the level of diagnostic agreement is higher than for the other diagnostic categories.9 The almost identical levels of agreement found in our study for both specimen types and methods based on a similar distribution of diagnostic categories suggest that the relative increase of agreement by the conjunctive p16 slide interpretation is independent of the type of cervical specimen analyzed.
Standardization of reagents and technical immunostaining protocols, as well as consistency in microscopic interpretation following clear and reproducible criteria for the rating of immunostaining patterns in tissue specimens, is important for the evaluation of new immunohistochemically based diagnostic methods and their implementation into routine clinical practice.29,30 For p16 immunostaining, a reagent set extensively validated for its use on cervical tissue specimens was used for this study. Similar to the general lack of standardization of immunohistochemical protocols, there is also a relatively high level of heterogeneity regarding the interpretation criteria described in the literature for the rating of p16 immunohistochemical stains (reviewed by Wentzensen and von Knebel Doeberitz12 and Ordi et al15).
In this study, we used well-defined but qualitative criteria for rating the p16 immunostaining pattern. Staining results were grouped into 3 categories: diffuse staining, focal immunoreactivity, and no immunoreactivity. This interpretation system has been used by several other authors10,15,22 and was further validated in a recent meta-analysis, as it reflects the pathogenetic process of high-risk HPV–mediated transformation of replication-competent cells in the cervical epithelium, which results in p16 overexpression involving the basal and parabasal cell layers (reviewed by Tsoumpou et al14). In addition, our study further supports the effectiveness of this p16 immunostaining rating system for improving diagnostic accuracy.
This study has shown that the agreement of community-based pathologists in rating the p16 immunostaining pattern as diffusely positive or negative (showing focal or no immunoreactivity) is very high (Table 6). The mean κ coefficient for agreement on these dichotomized categories was 0.899, indicating excellent agreement. This confirms data from a smaller study that suggested a high level of agreement among pathologists in rating p16 immunostaining patterns on cervical biopsy specimens.23 There was some variability in differentiating no immunoreactivity from a focal immunostaining pattern (data not shown), which is likely due to differences in reader-dependent thresholds for calling sporadic staining of single or few cells focally immunoreactive. The high level of agreement among pathologists in categorizing immunostained slides, such as the p16-stained slides in this study, as positive or negative is an important step toward implementing a conjunctive stain in diagnostic decision making. Notably, most of the community-based pathologists had not interpreted p16-stained cervical slides before participating in the study and underwent only a relatively short, structured training session before the immunohistochemical slide review sessions. This supports the relative ease with which p16 staining patterns can be identified and then implemented into clinical practice.
Various studies have found a very high correlation between a diffuse staining pattern and high-grade CIN lesions, with highly variable proportions of CIN 1 lesions showing diffusely positive p16 immunoreactivity (reviewed by Tsoumpou et al14). Interestingly, several studies have shown that diffusely immunoreactive CIN 1 lesions have a substantially higher risk of progression to high-grade CIN than CIN 1 lesions without a diffuse p16 expression pattern.18–22 This provides further evidence that the immunostain rating method used in this study generates clinically meaningful results, because it reflects high-risk HPV–mediated pathogenetic changes in replication-competent squamous cells of the cervix at the molecular level.
This well-controlled study provides additional strong evidence for the usefulness of conjunctive p16 immunohistochemical stain interpretation in improving the diagnostic accuracy of cervical biopsy interpretations, which may further optimize the diagnostic pathway leading to appropriate therapeutic management of women with cervical disease.
The clinical research assistance of H. Beckert and R. Schwarz (Heidelberg) is gratefully acknowledged.