Gabrijela Kocjan, MD, FRCPath, Ashish Chandra, MD, FRCPath, Paul A. Cross, FRCPath, Thomas Giles, MB, ChB, FRCPath, Sarah J. Johnson, MB BS, PhD, FRCPath, Timothy J. Stephenson, MD, FRCPath, Michael Roughton, MSc, David N. Poller, MD, FRCPath, The Interobserver Reproducibility of Thyroid Fine-Needle Aspiration Using the UK Royal College of Pathologists’ Classification System, American Journal of Clinical Pathology, Volume 135, Issue 6, June 2011, Pages 852–859, https://doi.org/10.1309/AJCPZ33MVMGZKEWU
Abstract
The overall interobserver reproducibility of thyroid fine-needle aspiration (FNA) has not been comprehensively assessed. A blinded 6-rater interobserver reproducibility study was conducted of 200 thyroid FNA cases using the UK System, which is similar to The Bethesda System for Reporting Thyroid Cytology: Thy1, nondiagnostic; Thy2, nonneoplastic; Thy3a, atypia, probably benign; Thy3f, follicular lesion; Thy4, suspicious of malignancy; and Thy5, malignant. There was good interobserver agreement for the Thy1 (κ = 0.69) and Thy5 (κ = 0.61), moderate agreement for Thy2 (κ = 0.55) and Thy3f (κ = 0.51), and poor agreement for Thy3a (κ = 0.11) and Thy4 (κ = 0.17) categories. Combining categories implying surgical management (Thy3f, Thy4, and Thy5) achieved good agreement (κ = 0.72), as did combining categories implying medical management (Thy1, Thy2, and Thy3a; κ = 0.72). The UK thyroid FNA terminology is a reproducible and clinically relevant system for thyroid FNA reporting. This study demonstrates that international efforts to harmonize and refine thyroid cytology classification systems can improve consistency in the clinical management of thyroid nodules.
Thyroid fine-needle aspiration (FNA) has been in use for many years and is now the mainstay of preoperative diagnosis of thyroid lesions.1 The thyroid gland is the organ most commonly sampled by FNA because FNA can make a real difference to patient management. From 70% to 80% of FNA specimens from the thyroid can be classified as benign or malignant with a negative predictive value for a benign diagnosis of 92% and a positive predictive value for a malignant diagnosis of up to 100% in some series.2
Although thyroid cytology is widely used as a first-line investigation to guide clinical management, until recently, there was no standardized terminology for FNA reporting.3 Consequently, to our knowledge, there are no studies in the literature that address the issue of interobserver reproducibility using κ statistics for the full range of thyroid FNA appearances seen in clinical practice using standardized reporting terminology.4–6 This study derived from the international efforts to standardize diagnostic terminology in thyroid FNA, which resulted in The Bethesda System for Reporting of Thyroid Cytology (TBSRTC), published in 2008.7 Following publication of TBSRTC, a working group of The Royal College of Pathologists (RCPath) updated the reporting system already in use in the United Kingdom since 2003,8–11 using criteria that are similar to those used in TBSRTC.12 The working group decided that it would be important to test the interobserver reproducibility of the UK RCPath system, as there are no published data on the interobserver reproducibility of TBSRTC.
The present study is a reproducibility study undertaken by 6 experienced cytopathologists (A.C., P.A.C., T.G., S.J.J., T.J.S., and D.N.P.), all members of the RCPath working group, using the updated UK RCPath classification.
Materials and Methods
Cases
The cases included in this study were selected by 1 author (G.K.) to represent a wide range of thyroid lesions from routine clinical practice (Table 1). All cases originated from the Department of Cellular Pathology, University College Hospital, London, England. Each circulated FNA thyroid case was represented by direct smears obtained in the ultrasound clinic by a team of experienced radiologists using ultrasound guidance, stained with May-Grünwald-Giemsa, and initially interpreted by 1 author (G.K.), whose results were excluded from the final statistical analysis to avoid bias. No Papanicolaou-stained slides and no liquid-based preparations were included. The slides were circulated among the 6 observers by post in batches of a minimum of 10 during a period of approximately 12 months. For practical purposes, only the 1 or 2 most representative slides from each case were circulated, with most cases comprising a single stained slide. The slides were selected to ensure that they were representative of diagnostic lesions or categories, similar to daily practice in which a junior pathologist prescreens and presents the “good” slide(s) first. The case selection for the reproducibility study did not reflect normal day-to-day case practice, as the aim of the study was to establish the interobserver agreement for all diagnostic categories of thyroid FNA.
Before the reproducibility study commenced, all participants were members of an RCPath working group that had defined the UK RCPath system9 and so had discussed at length the cytologic details of the UK System and TBSRTC. The 6 participants did not confer on cases over a multiheaded microscope or via teleconsultation, as one of the study aims was to provide reassurance that the UK RCPath nomenclature they had helped define works in practice.
Because this study was conceived as an audit of cytology performance and the results were unknown, institutional ethical committee permission was not required for its conduct. The 6 observers were given all relevant clinical information such as patient’s age, the site of the nodule(s), the ultrasound size and characteristics of the nodule(s), if available, and any other clinical information that was supplied, but they were blinded as to the initial cytologic diagnosis.
The primary aim of the study was to establish whether the UK System is robust enough to be used as a means of triage of patients for either surgical or nonsurgical management, while acknowledging that the management of surgical and nonsurgical patient treatment groups may vary according to local practice. The histologic results for the patients included in this study are incomplete, because the FNA cases selected were recent and not all patients requiring surgery had undergone surgery at the time of study completion.
The anonymized results were collated and analyzed by a professional statistician (M.R.). The κ statistic was used to assess the agreement between all 6 observers simultaneously. In this method of κ analysis, an agreement score is given for each component (whether observers agreed a case was in that category vs any other category) and an overall assessment of agreement is made. Values of the Cohen κ were interpreted as follows: 0 to 0.20, slight agreement; 0.21 to 0.40, fair agreement; 0.41 to 0.60, moderate agreement; 0.61 to 0.80, substantial agreement; and 0.81 to 1.00, almost perfect agreement.13 All analyses were performed using Stata 10 (StataCorp, College Station, TX).
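The per-category and overall κ analysis described above can be illustrated with a minimal sketch. It assumes a Fleiss-type multi-rater κ, which is a common generalization when more than 2 observers rate every case; the text does not state the exact formula applied in Stata, so this is an illustration and not the authors' computation. The function names and the toy ratings are hypothetical and are not study data.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Multi-rater kappa; `ratings` is a list of cases, each a list with one label per rater."""
    n_cases = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(case) for case in ratings]
    total = n_cases * n_raters
    # Share of all assignments that fell into each category.
    share = {c: sum(cnt[c] for cnt in counts) / total for c in categories}
    # Observed agreement: mean proportion of agreeing rater pairs per case.
    p_obs = sum(
        (sum(cnt[c] ** 2 for c in categories) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ) / n_cases
    # Agreement expected by chance, from the category shares.
    p_exp = sum(s ** 2 for s in share.values())
    if p_exp == 1.0:
        return float("nan")  # only one category was ever used; kappa is undefined
    return (p_obs - p_exp) / (1 - p_exp)


def category_kappa(ratings, category):
    """Kappa for one category vs any other category (the 'component' score)."""
    binary = [[label == category for label in case] for case in ratings]
    return fleiss_kappa(binary, [True, False])


def landis_koch(kappa):
    """Verbal label for a kappa value, using the bands quoted in the text."""
    if kappa < 0:
        return "below chance (outside the bands listed in the text)"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (1.00, "almost perfect")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


# Hypothetical toy data: 4 cases, 3 raters each (not taken from the study).
TOY_RATINGS = [
    ["Thy2", "Thy2", "Thy2"],
    ["Thy2", "Thy2", "Thy3f"],
    ["Thy3f", "Thy3f", "Thy3f"],
    ["Thy2", "Thy3f", "Thy3f"],
]

if __name__ == "__main__":
    overall = fleiss_kappa(TOY_RATINGS, ["Thy2", "Thy3f"])
    print(round(overall, 3), landis_koch(overall))
```

With only 2 categories in the toy data, the per-category κ for Thy2 equals the overall κ; with 6 categories, as in the study, the per-category values would generally differ from the overall value.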
Table 1. Observer Diagnoses of 200 Randomly Selected Thyroid Fine-Needle Aspiration Specimens Using the UK Royal College of Pathologists’ Thyroid FNA Classification and The Bethesda Thyroid FNA Classification
Cytologic Criteria
All slides were reported according to the UK RCPath Guidelines for the reporting of FNA of thyroid, which use criteria similar to those of TBSRTC. The comparison and similarity with TBSRTC are shown in Table 2.9,12 The main difference between the UK RCPath terminology and the terminology in TBSRTC is that the UK RCPath terminology assumes that all Thy3a, Thy3f, Thy4, and Thy5 aspirates (equivalent to TBSRTC III, IV, V, and VI, respectively) will be reviewed in a multidisciplinary clinical team meeting before a clinical decision is made about the required therapeutic action.9 Conversely, every TBSRTC category has its own implied therapeutic action.
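The category correspondence stated above can be written as a small lookup, sketched here for illustration. The mappings for Thy3a, Thy3f, Thy4, and Thy5 are taken from this paragraph, and Thy2 and Thy2c to TBSRTC II from the Discussion; the mapping of Thy1 and Thy1c to TBSRTC I is an assumption inferred from both being the nondiagnostic category. The names `RCPATH_TO_TBSRTC`, `to_tbsrtc`, and `needs_mdt_review` are hypothetical.

```python
# RCPath code -> TBSRTC category (Roman numeral).
RCPATH_TO_TBSRTC = {
    "Thy1": "I",    # assumption: both systems' nondiagnostic category
    "Thy1c": "I",   # assumption: cystic nondiagnostic subcategory
    "Thy2": "II",
    "Thy2c": "II",
    "Thy3a": "III",
    "Thy3f": "IV",
    "Thy4": "V",
    "Thy5": "VI",
}

# Categories the UK terminology expects to be reviewed at a
# multidisciplinary team meeting before a clinical decision is made.
MDT_REVIEW = {"Thy3a", "Thy3f", "Thy4", "Thy5"}


def to_tbsrtc(rcpath_code):
    """Translate an RCPath code into its TBSRTC equivalent."""
    return RCPATH_TO_TBSRTC[rcpath_code]


def needs_mdt_review(rcpath_code):
    """True when the category is one listed for multidisciplinary review."""
    return rcpath_code in MDT_REVIEW


if __name__ == "__main__":
    for code in ("Thy2c", "Thy3a", "Thy5"):
        print(code, to_tbsrtc(code), needs_mdt_review(code))
```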
Nondiagnostic for Cytologic Diagnosis—Thy1
The same cellularity criteria as in TBSRTC are used, ie, to be considered of adequate epithelial cellularity, samples from solid lesions should have at least 6 groups of thyroid follicular epithelial cells across all the submitted slides, each with at least 10 well-visualized epithelial cells. The reason for a nondiagnostic sample should be clearly stated in the cytology report. This category includes samples that are nondiagnostic most likely because of the operator or technique, eg, the sample consists entirely of blood or is so heavily bloodstained that the epithelial cells or colloid cannot be visualized, is acellular, has lower epithelial cellularity than the aforementioned criteria, or cannot be technically evaluated (eg, poorly spread, delayed air drying or fixation artifact, prominent crush artifact, or cells trapped in fibrin).
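As a worked illustration of the cellularity rule for solid lesions, the sketch below counts the follicular epithelial cell groups, pooled across all submitted slides, that contain at least 10 well-visualized cells and checks whether there are at least 6 of them. The function name and the input format (one cell count per group) are hypothetical, and the sketch does not cover cystic lesions or technically unevaluable samples.

```python
MIN_GROUPS = 6            # groups required across all submitted slides
MIN_CELLS_PER_GROUP = 10  # well-visualized epithelial cells per group


def meets_cellularity_criteria(cells_per_group):
    """Return True if enough groups reach the minimum cell count.

    `cells_per_group` holds one integer per follicular epithelial cell
    group, pooled over every slide submitted for the case.
    """
    qualifying = sum(1 for n in cells_per_group if n >= MIN_CELLS_PER_GROUP)
    return qualifying >= MIN_GROUPS


if __name__ == "__main__":
    print(meets_cellularity_criteria([12, 15, 10, 11, 30, 10]))  # six qualifying groups
    print(meets_cellularity_criteria([12, 15, 10, 11, 30, 9]))   # only five qualify
```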
The assessment of cytology from thyroid cysts can be particularly problematic. There is a recognized risk of nonrepresentative sampling, especially in cystic papillary thyroid carcinomas, and it is important not to offer false reassurance based on suboptimal epithelial cellularity. It is important for auditing results that any samples of insufficient epithelial cellularity that are cystic can be separated from samples that are nondiagnostic for the various aforementioned reasons. Careful assessment is needed, possibly with multidisciplinary team discussion. Included in this category are aspirates that are most likely to be related to a cystic lesion or are cystic lesion fluid specimens that do not reach the epithelial cell adequacy criteria and that contain mostly macrophages but without abundant colloid. Useful phrasing may be as follows: “The sample is in keeping with fluid from a cyst, but there are no epithelial cells or colloid to confirm the cyst type.” For these cases, the use of category Thy1c, in which “c” indicates a cystic lesion, is recommended.
Table 2. Comparison Table of the UK Royal College of Pathologists’ Thyroid FNA Classification and The Bethesda System for Reporting Thyroid Cytopathology*
Nonneoplastic—Thy2
Samples in this category should achieve the epithelial cellularity adequacy criteria described for Thy1 (samples from solid lesions should have at least 6 groups of thyroid follicular epithelial cells across all submitted slides, each group containing at least 10 well-visualized thyroid epithelial cells). This nonneoplastic category, therefore, includes normal thyroid tissue, thyroiditis, hyperplastic nodules, and colloid nodules. These samples contain abundant, easily identifiable colloid with cytologically bland follicular epithelial cells reaching the cellular adequacy criteria outlined, usually with the presence of macrophages. The specific diagnosis (such as lymphocytic thyroiditis or colloid nodule) should be stated in the report whenever possible.
Cyst fluid samples that have sufficient thyroid follicle cells to achieve the adequacy criteria, irrespective of any possible colloid and/or macrophage content, and cystic lesion specimens that consist predominantly of colloid and macrophages, even if too few follicular epithelial cells are present to meet the adequacy criteria outlined, can be considered “consistent with a colloid cyst” in the appropriate clinical setting. Samples with low cellularity could be reported along the following lines: “The sample is in keeping with fluid from a cystic colloid nodule, but there are no (or too few) epithelial cells for confirmation.” To allow audit, this particular category should be coded as Thy2c (“c” for cyst).
Neoplasm Possible—Thy3
The majority of the lesions in the Thy3 category are follicular neoplasms. Owing to the limitations of FNA cytology, the nature of these lesions cannot be determined solely by FNA cytology, and multidisciplinary discussion is needed to decide further management. This category, therefore, encompasses hyperplastic or other cellular but nonneoplastic nodules, as well as neoplasms, including follicular adenomas and follicular carcinomas. Follicular variants of papillary thyroid carcinoma without clear nuclear features of papillary thyroid cancer may fall into this category. This group is classed as Thy3f (“f” for follicular). Samples consisting almost exclusively of Hürthle cells are also included in this category, and the report should mention the cell type.
Samples that exhibit cytologic atypia or other features that raise the possibility of neoplasia, but that are insufficient to enable confident placing into any other category, should be classed as Thy3a (“a” for atypia). These samples should form only a small minority of Thy3 cases. Clinical scenarios when this category may be used include samples in which there is architectural atypia in the form of a mixed microfollicular and macrofollicular pattern (approximately equal proportions of each) or in which a definite distinction between a follicular neoplasm and hyperplastic nodule is difficult. Other examples include a specimen in which only sparse colloid is evident and a definite distinction between a follicular neoplasm and a hyperplastic nodule is difficult, sparsely cellular samples containing predominantly microfollicles, focal cytologic changes that are most probably benign but papillary carcinoma cannot be confidently excluded, and a compromised specimen (eg, obscured by blood or a poorly spread smear) in which some cells appear to be mildly abnormal but are not obviously from a follicular neoplasm or suspicious for or indicative of malignancy. Last, it should be recognized that cyst lining cells may appear atypical. The cytologic interpretation must be clearly stated in the conclusion of the report, which may mean listing the likely differential diagnoses.
Suspicious for Malignancy—Thy4
This category includes samples that are “suspicious” for malignancy but that do not allow confident diagnosis. This category includes specimens of low cellularity and mixed cell types (normal and malignant). The tumor type suspected should be clearly stated and is often a papillary carcinoma. This category should not be used for samples that exhibit mild atypia, which should be categorized as Thy3a. Cases of definite malignancy, but in which a specific diagnosis cannot be made (eg, lymphoma vs anaplastic carcinoma), belong in the Thy5 category.
Malignant—Thy5
These samples can be confidently diagnosed as malignant. The tumor type should be clearly stated, eg, papillary thyroid carcinoma, medullary thyroid carcinoma, anaplastic thyroid carcinoma, lymphoma, and other malignancy, including potentially nonthyroid or metastatic malignancy. Sometimes it may be possible to be confident of malignancy but not of the tumor type. This issue should be clearly stated and a differential diagnosis given, eg, between anaplastic carcinoma and lymphoma or anaplastic carcinoma and metastatic malignancy.
Results
The observer diagnoses using the RCPath classification are shown in the upper part of Table 1 and are “translated” into TBSRTC according to the key shown in the lower part of Table 1. For ease of comparison between categories, the RCPath categories Thy1 and Thy1c and Thy2 and Thy2c were merged into Thy1 and Thy2, respectively (Figure 1). Overall, there were 56 of 200 cases with complete agreement: 20 were nondiagnostic (Thy1), 26 were benign (Thy2), 7 were follicular neoplasms (Thy3f), and 3 were malignant (Thy5).
The κ statistic for the Thy1 category was 0.69, for Thy2 was 0.55, for Thy3a (neoplasm possible, atypia) was 0.11, for Thy3f (neoplasm possible, follicular) was 0.51, for Thy4 (suspicious of malignancy) was 0.17, and for Thy5 (malignant) was 0.61 (Figure 1). To assess the clinical impact of these results, the analysis was repeated based on clinical management implications. For this purpose, Thy3f, Thy4, and Thy5, all of which imply various degrees of surgical intervention, were combined, achieving a κ of 0.72 (good agreement). The same level of good agreement (κ = 0.72) was also achieved when the categories Thy1, Thy2, and Thy3a, implying conservative management (patient discharge, follow-up, or repeated FNA), were also combined. Combining the atypical (Thy3a) and follicular (Thy3f) categories achieved a κ of 0.47. Combining categories that are currently managed by hemithyroidectomy (Thy3f and Thy4), as opposed to total thyroidectomy (Thy5), achieved a κ of 0.55 (Figure 1).
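The effect of regrouping categories by management implication can be sketched as follows. The grouping follows the text (Thy3f, Thy4, and Thy5 as surgical; Thy1, Thy2, and Thy3a as conservative). For brevity the sketch counts cases with complete agreement instead of recomputing κ, and the toy ratings and function names are hypothetical, not study data.

```python
# Management grouping described in the text.
MANAGEMENT = {
    "Thy1": "conservative",
    "Thy2": "conservative",
    "Thy3a": "conservative",
    "Thy3f": "surgical",
    "Thy4": "surgical",
    "Thy5": "surgical",
}


def collapse(ratings, mapping):
    """Replace every rater's label with the group it maps to."""
    return [[mapping[label] for label in case] for case in ratings]


def complete_agreement(ratings):
    """Number of cases on which all raters gave the same label."""
    return sum(1 for case in ratings if len(set(case)) == 1)


# Hypothetical toy data: 4 cases, 3 raters each.
TOY_RATINGS = [
    ["Thy2", "Thy2", "Thy2"],     # unanimous before and after collapsing
    ["Thy3f", "Thy4", "Thy5"],    # disagreement on category, all surgical
    ["Thy3a", "Thy2", "Thy1"],    # disagreement on category, all conservative
    ["Thy3a", "Thy3f", "Thy3f"],  # disagreement that changes management
]

if __name__ == "__main__":
    print(complete_agreement(TOY_RATINGS))
    print(complete_agreement(collapse(TOY_RATINGS, MANAGEMENT)))
```

In this toy set, collapsing raises the number of unanimous cases from 1 to 3, which mirrors the direction of the improvement the study reports for the combined categories.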
Discussion
Interpretation of thyroid FNA is challenging because there is comparatively little difference in the morphologic features of the many nonneoplastic and neoplastic conditions of the thyroid and there is variability in FNA specimen preparation and interpretation. Thyroid FNA has traditionally been performed by various aspirators: endocrinologists, surgeons, radiologists, and cytopathologists, resulting in variable specimen quality. Before TBSRTC was introduced, reports were largely descriptive, with a multiplicity of category names, descriptive reports (no categories), or the use of surgical pathology terminology.
Redman et al3 studied perceptions of diagnostic terminology and cytopathologic reporting of thyroid FNA by pathologists and clinicians and found that practice among pathologists varied. For categories such as atypical, indeterminate, suspicious, and nondiagnostic, 27% of pathologists used 3 categories, 44% used 2 categories, and 27% used just 1 category.3 The interpretation of the cytology reports by clinicians also varied. The nondiagnostic category was followed by a repeated FNA in 98%, the suspicious category by surgery in 96%, the indeterminate category by repeated FNA in 58% and by surgery in 32%, and the atypical category by repeated FNA in 37% and by surgery in 52%.3
In a recent study of 742 interinstitutional referrals, thyroid FNA accounted for 23% of all major diagnostic disagreements.14 Despite this finding, there is scant published literature on the interobserver reproducibility of thyroid FNA, and the interobserver reproducibility of the full range of lesions seen in thyroid FNA has not been comprehensively assessed in a large study with multiple raters using a standardized, evidence-based terminology system for reporting.
Clary et al4 examined the interobserver agreement of 4 observers in 50 selected cases of follicular lesions or follicular neoplasms. The range of paired κ interobserver agreement for the 4 pathologists was fair to substantial (range, 0.199–0.617). The 2 pathologists with substantial diagnostic agreement (κ = 0.617) had the most experience and had worked together for the longest period,4 suggesting that awareness of diagnostic criteria increases interobserver diagnostic agreement.
Figure 1. Interobserver agreement for various UK Royal College of Pathologists individual and combined reporting categories. For an explanation of the categories, see Table 2.
Stelow et al5 studied the interobserver variability in thyroid FNA in 20 selected cases showing predominantly colloid and follicular cell groups, with cases of colloid nodule (non-neoplastic hyperplastic changes), follicular lesion (a lesion worrisome for a follicular patterned neoplasm, whether adenoma or carcinoma, but not having features of papillary carcinoma), and follicular neoplasm in which the features were thought to indicate a true neoplasm. They found relatively poor agreement in these 3 categories (κ = 0.35) when follicular lesion and follicular neoplasm were considered separate diagnoses, but when follicular lesion and follicular neoplasm were considered equivalent, the interobserver agreement increased to κ = 0.57. Agreement was further improved when the data were collapsed to treatment recommendations, ie, to operate or not (κ = 0.65).
Gerhard and da Cunha Santos6 examined the cytologic reproducibility of thyroid FNA between 2 observers in 91 cases using the Papanicolaou Society guidelines for FNA of thyroid nodules.15 These authors showed an overall level of interobserver agreement of κ = 0.71.6 According to Table 1 in their report, 73 (80%) of 91 diagnoses showed complete interobserver concordance. The 18 discordant diagnoses comprised a mixture of disagreements for cellular follicular lesions, follicular neoplasms, and papillary carcinomas.6
The most common method of assessing the degree of variability between observers is with κ statistics. This statistic measures the diagnostic agreement among several observers or repeated observations from the same group of observers, although it does not allow assessment of the way in which observers agree or disagree.16 Although interobserver agreement is desirable, it is not absolutely necessary for the interpretations of different observers to be relatively accurate. It is also possible, although unusual, to have good diagnostic reproducibility without necessarily good interobserver agreement.16
Variations in cytologic interpretation may have a number of causes. An individual observer may have a particular diagnostic bias, which in a study such as this might be because the observer did not strictly apply the written criteria as requested or because the criteria supplied were insufficiently detailed or descriptive to include all potential diagnostic possibilities. For example, in our study, the number of cases diagnosed as benign (Thy2 and Thy2c, equivalent to TBSRTC II) ranged from 52 to 93 cases of 200. Similarly, the number of malignant cases (Thy5, equivalent to TBSRTC VI) ranged from 8 to 23. The results of using TBSRTC prospectively have been published by various authors showing that it can be easily applied and that the results in terms of overall sensitivity and specificity seem excellent.17,18 The European Federation of Cytology Societies is considering adopting TBSRTC or adapting national classification systems to TBSRTC.19
In practice, the most useful achievement of this interobserver study is to reassure pathologists and clinicians that the Thy2 (benign) and Thy5 (malignant) FNA diagnoses are robust, showing moderate (κ = 0.55) and good (κ = 0.61) interobserver agreement, respectively. Benign thyroid lesions represent by far the most common clinical finding (76% of all thyroid FNA in the originating laboratory for this study),11 indicating that diagnostic triage involving FNA is reliable. In addition, when categories of nondiagnostic (Thy1), benign (Thy2), and atypical (Thy3a) were combined, the κ was 0.72, representing good agreement for lesions that are currently managed conservatively, ie, not needing surgical intervention.
The area identified as most difficult in TBSRTC is class III/atypia of unknown significance/follicular lesion of undetermined significance. The original report in 2008 following the Bethesda conference stated that this diagnosis should have a risk of malignancy of 5% to 10% and commented that this is “a heterogeneous category that includes cases in which the cytological findings are not convincingly benign, yet the degree of cellular or architectural atypia is not sufficient for an interpretation of ‘follicular neoplasm’ or ‘suspicious for malignancy.’”20 It stated that cases were to be placed in this category because of a compromised specimen (eg, low cellularity, poor fixation, or obscuring blood) and that, when used, class III should ideally represent fewer than 7% of all thyroid FNA interpretations. The TBSRTC group further refined this definition in later reports in November 200921 and in The Bethesda Slide Atlas, published in early 2010.22 Cibas and Ali21 refer to the definition for class III, stating “The heterogeneity of this category precludes outlining all scenarios for which an atypia of unknown significance (AUS) interpretation is appropriate,” and they then give a list of 9 situations in which they thought that this diagnosis was appropriate. In our study, category Thy3a, similar to or equivalent to TBSRTC III, constituted 4.5% to 9% of all diagnoses. We accept that there may be some case selection bias in this study because the combined proportion of Thy3 and Thy4 was 6% in the laboratory from which the study originated, and benign lesions are far more prevalent.11
In our study, the interobserver reproducibility of class Thy3a (TBSRTC III) was poor. Although this lack of agreement might have been anticipated, other possible contributing reasons include case selection bias and the lack of clarity around definitions for TBSRTC III, which were still evolving at the time the study commenced. In the earlier studies, indeterminate FNA diagnoses accounted for 13% of all cases and represented 41% of thyroid surgeries, 18% of which were for malignant neoplasms.2
In contrast with Thy3a, the category Thy3f (follicular lesion, equivalent to TBSRTC IV) achieved a moderate to good agreement and was used relatively frequently by all observers. Its robustness is particularly important because this category currently implies surgical management, usually a hemithyroidectomy.
The reproducibility of the category of aspirates suspicious for malignancy, Thy4 (equivalent to TBSRTC V), was poor. The diagnostic decision was partially hampered by the limited diagnostic material, represented by 1 or 2 slides only, whereas in real practice, this diagnosis would be based on the assessment of multiple slides and clinicopathologic discussion. The lack of reproducibility of Thy3a and Thy4 categories mimics other important cytologic and histologic distinctions. Pathologists frequently use diagnostic categories that have only fair to poor reproducibility; they do so because they believe certain distinctions are clinically important despite their less-than-ideal reproducibility.23
However, when the Thy3f, Thy4, and Thy5 categories were combined, the κ value of the combined group was 0.72, implying that follicular lesion, suspicious for malignancy, and malignant (TBSRTC IV, V, and VI combined) are also relatively robust diagnoses on which surgical management can be based. One suggested approach to post-FNA clinical management of the various RCPath Thy categories is shown in Figure 2 and is based on recommendations for TBSRTC (see Layfield et al24).
The lesson learned from this study of 200 cases of thyroid FNA reported according to the UK RCPath classification, closely resembling TBSRTC, is that the interobserver agreement for diagnosis of malignant and benign disease on thyroid FNA cytology is robust. This finding is reassuring for clinicians and diminishes the importance of the minor differences between various classifications relating mainly to the reporting of atypical, suspicious, and indeterminate categories. The use of standardized reporting and classification systems allows for clinically relevant distinction between lesions that need immediate intervention and lesions that can be followed up or disregarded. We look forward to the refinement of the categories of “atypical” and “suspicious” by means of ancillary techniques and believe that the current imperfect agreement in these categories should not deter colleagues from adopting the RCPath or TBSRTC thyroid classifications or their equivalents. It is likely that further improvements in thyroid FNA classification will evolve with the use of molecular testing in parallel with conventional light microscopic assessment of aspirated material.25
Figure 2. Clinical management implications following from the 6 main categories of thyroid cytology reporting as described by The Bethesda System for Reporting Thyroid Cytology and adapted to the UK Royal College of Pathologists classification. Modified from Layfield et al.24 FNA, fine-needle aspiration.
References

