Standardized Interpretation of Chest Radiographs in Cases of Pediatric Pneumonia From the PERCH Study

Abstract

Background. Chest radiographs (CXRs) are a valuable diagnostic tool in epidemiologic studies of pneumonia. The World Health Organization (WHO) methodology for the interpretation of pediatric CXRs has not been evaluated beyond its intended application as an endpoint measure for bacterial vaccine trials.

Methods. The Pneumonia Etiology Research for Child Health (PERCH) study enrolled children aged 1–59 months hospitalized with WHO-defined severe and very severe pneumonia from 7 low- and middle-income countries. An interpretation process categorized each CXR into 1 of 5 conclusions: consolidation, other infiltrate, both consolidation and other infiltrate, normal, or uninterpretable. Two members of a 14-person reading panel, who had undertaken training and standardization in CXR interpretation, interpreted each CXR. Two members of an arbitration panel provided additional independent reviews of CXRs with discordant interpretations at the primary reading, blinded to previous reports. Further discordance was resolved with consensus discussion.

Results. A total of 4172 CXRs were obtained from 4232 cases. Observed agreement for detecting consolidation (with or without other infiltrate) between primary readers was 78% (κ = 0.50) and between arbitrators was 84% (κ = 0.61); agreement for primary readers and arbitrators across 5 conclusion categories was 43.5% (κ = 0.25) and 48.5% (κ = 0.32), respectively. Disagreement was most frequent between conclusions of other infiltrate and normal for both the reading panel and the arbitration panel (32% and 30% of discordant CXRs, respectively).

Conclusions. Agreement was similar to that of previous evaluations using the WHO methodology for detecting consolidation, but poor for other infiltrates despite attempts at a rigorous standardization process.

The chest radiograph (CXR) is a valuable diagnostic tool for pneumonia, both as part of clinical management [1] and for determining case status in epidemiological studies [2]. CXRs can be archived and systematically evaluated, enabling cross-study comparisons. However, CXR interpretations are subjective, making it difficult to achieve measurements that are reproducible, reliable, and valid [3-5]. Acknowledging this, the World Health Organization (WHO) developed a standardized methodology for the interpretation of pediatric CXRs (the "WHO methodology"), designed to optimize the identification of Streptococcus pneumoniae and Haemophilus influenzae type b (Hib) pneumonia in vaccine trials [2,6]. The WHO methodology has since been adopted by many studies of vaccine efficacy and effectiveness [7-11], a trial of indoor air pollution reduction [12], incidence and surveillance studies [13-15], and descriptive epidemiology of pneumonia cases [16,17]. Despite widespread use, there has been no evaluation of how best to implement the WHO methodology, especially beyond its initial application in vaccine trials.
Here we describe the process for CXR interpretation in a large childhood pneumonia study, evaluate the standardization of readers and observer variability, and assess the process of arbitration for discordant interpretations.

Data Collection
Pneumonia Etiology Research for Child Health (PERCH) is a multicountry, standardized, case-control study of the causes and risk factors of childhood pneumonia [18]. A total of 4232 cases of hospitalized, WHO-defined severe or very severe pneumonia in children aged 1-59 months were enrolled from August 2011 to January 2014. Nine sites in 7 countries were chosen to be representative of the epidemiological contexts where pneumonia is most prevalent: Dhaka and Matlab, Bangladesh; Basse, The Gambia; Kilifi, Kenya; Bamako, Mali; Soweto, South Africa; Nakhon Phanom and Sa Kaeo, Thailand; and Lusaka, Zambia. The institutional review board or ethical review committee approved the study protocol at each of the 7 institutions and at the Johns Hopkins Bloomberg School of Public Health. Parents or guardians of participants provided written informed consent.
A CXR was sought from each case as soon as practical after clinical evaluation and study enrollment; some children had repeat CXRs if clinically indicated. In cases where a CXR was not obtained, the reason was recorded. All CXRs were taken in either anterior-posterior or posterior-anterior format as required by the WHO methodology [2]. Most sites used digital CXR imaging equipment, except Zambia and Matlab where analog techniques were used. The Gambian site used an analog machine when there were technical problems with their digital system. At Nakhon Phanom and South Africa, analog CXRs were performed for 11 and 8 months, respectively, before digital systems were installed. All analog images were scanned into digital format [19]. All sites were assessed as meeting quality and safety requirements prior to study enrollment.

Chest Radiograph Interpretation
Two members from each of the 7 study sites (5 radiologists and 9 pediatricians with 0-28 years of postspecialization experience) formed the CXR reading panel. Four additional radiologists (3 with extensive WHO methodology experience) from Australia, Kenya, and the United Kingdom formed an arbitration panel to interpret CXRs discordant at the initial interpretation, and ensured consistency with previous studies by using a common arbitration process [2]. Members of the arbitration panel also provided a 2-day, in-person training workshop for the reading panel. To ensure this training was optimized for PERCH, the arbitration panel met first to calibrate the application of the WHO definitions to PERCH CXRs. Three members of the reading panel who were unable to attend the training viewed recorded lectures and met with another member of the reading panel to review key concepts.
Prior to interpreting PERCH study CXRs, all readers were assessed by interpreting 20 randomly selected WHO reference CXRs. Readers were required to correctly identify the reference conclusion for ≥50% of all images, ≥66% of images with consolidation, and ≥66% of normal images. Repeat training and assessment with additional sets of 20 images was performed until standardization was achieved. Continuing education was provided through monthly emails that reviewed key teaching points, and a voluntary reassessment with the first set of 20 WHO images.
Figure 1 shows the process for interpretation of CXRs. Arbitrators were blinded to previous conclusions except at final consensus discussions. Table 1 shows the classification of findings, the conclusions derived from these findings, and the arbitration process used [6]. The WHO methodology was optimized for "any consolidation" (also termed "primary endpoint pneumonia" as a specific reference to the outcome of interest in vaccine trials), and thus this conclusion is frequently evaluated.
Also outlined in Table 1 are alternate conclusions and arbitration processes used to evaluate the effects of 4 different interpretation methods on observer agreement, the distribution of conclusions, and the number of interpretations required.

Analysis
We assessed agreement for the primary reading and arbitration panels, as well as separately for each member of the primary reading panel. Observer agreement was evaluated by observed percentage agreement and the kappa statistic (κ), which provides a measure of agreement adjusted for chance agreement [20]. Fleiss' κ was used for interobserver calculations because PERCH used randomized reader pairs rather than observers with a constant identity across interpretations [20]. Cohen's κ was used for intraobserver assessment of repeat standardization assessments, and for interobserver calculations for individual conclusions to allow calculation of confidence intervals. For analyses of individual conclusions, a κ adjusted for prevalence and for differences in each reader's distribution of findings (also known as marginal distributions) was also calculated [21,22]. Because uninterpretable images are assumed to be a consequence of the imaging process, and image quality may contribute to variability in interpretation, images with one or more interpretations of uninterpretable were excluded from some analyses, as is common in evaluating observer agreement for CXRs [3,6]. The χ² goodness-of-fit test was used to assess the distribution of final arbitration discussion conclusions that agreed with each arbitrator's initial interpretation, using equal proportions (25%) as expected values.
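The agreement statistics described above can be sketched as follows. This is a minimal Python illustration, not the study's Stata code: `cohen_kappa` estimates chance agreement from each reader's marginal distribution, and `pabak` is the prevalence- and bias-adjusted κ for binary conclusions (Fleiss' κ, used for the randomized reader pairs, generalizes the same idea to interchangeable raters). The example ratings are invented, not PERCH data.

```python
from collections import Counter

def observed_agreement(pairs):
    """Proportion of paired ratings on which the two readers agree."""
    return sum(a == b for a, b in pairs) / len(pairs)

def cohen_kappa(pairs):
    """Cohen's kappa: observed agreement corrected for chance, where
    chance agreement is estimated from each reader's marginal totals."""
    n = len(pairs)
    po = observed_agreement(pairs)
    m1 = Counter(a for a, _ in pairs)   # reader 1 marginal counts
    m2 = Counter(b for _, b in pairs)   # reader 2 marginal counts
    pe = sum(m1[c] * m2[c] for c in set(m1) | set(m2)) / n ** 2
    return (po - pe) / (1 - pe)

def pabak(pairs):
    """Prevalence- and bias-adjusted kappa for binary conclusions;
    algebraically this simplifies to 2 * Po - 1."""
    return 2 * observed_agreement(pairs) - 1

# Invented example: 100 CXRs rated consolidation/normal by two readers.
pairs = ([("consolidation", "consolidation")] * 30
         + [("consolidation", "normal")] * 10
         + [("normal", "consolidation")] * 10
         + [("normal", "normal")] * 50)
# observed agreement 0.80; Cohen's kappa ~0.58; PABAK 0.60
```

Note how PABAK depends only on observed agreement, which is why it removes the prevalence and marginal-distribution effects discussed later.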
Data exploration and analyses were completed using Stata software version 12.1 (StataCorp, College Station, Texas).

Table 1 (excerpt). Definitions of Radiographic Findings

Other infiltrate: Linear and patchy densities (interstitial infiltrate) in a lacy pattern involving both lungs, featuring peribronchial thickening and multiple areas of atelectasis; it also includes minor patchy infiltrates that are not of sufficient magnitude to constitute primary endpoint consolidation, and small areas of atelectasis which in children may be difficult to distinguish from consolidation.

Pleural effusion: Presence of fluid in the lateral pleural space between the lung and chest wall; in most cases, this will be seen at the costophrenic angle or as a layer of fluid adjacent to the lateral chest wall; this does not include fluid seen in the horizontal or oblique fissures.

a The presence of consolidation or pleural effusion was described in the World Health Organization methodology as "primary endpoint pneumonia" rather than "consolidation" as a specific reference to the outcome of interest in bacterial vaccine trials. The descriptive term "consolidation" is preferred in a more general epidemiologic context such as PERCH.
b Atelectasis of an entire lobe that produces an opacity and a positive silhouette sign with the mediastinal border was considered to be consolidation.
c Where any reader or arbitrator reported a finding of consolidation alongside a finding of uninterpretable for other infiltrate (or vice versa), the interpretation was consolidation (or other infiltrate). That is, where a pathological finding was reported, this was prioritized over an uninterpretable finding when determining the interpretation for the image.
d This interpretation process was used to define chest radiograph (CXR) outcomes for PERCH cases. Other processes are examined here to illustrate the effects of different interpretation methods on CXR outcomes.
e For 64 images where the altered definition of uninterpretable produced discordant interpretations by 2 readers or 2 arbitrators, and no further arbitration interpretations were available, conclusions were imputed based on the distribution of conclusions from arbitration of uninterpretable images using the PERCH definitions.
f "Arbitration panel" = where the primary reading resulted in discordant interpretations for any conclusion, the CXR was randomized and independently interpreted by 2 arbitrators. Where these arbitrators' conclusions were discordant, the 2 arbitrators reached agreement through a consensus discussion. Arbitrators were aware of previous conclusions at the final arbitration discussion only. "Single arbitrator" = where the primary reading resulted in discordant interpretations for any conclusion, an arbitration decision was sought from a single interpretation by the most experienced arbitrator, or by the next most experienced arbitrator when available, or by the third most experienced arbitrator for remaining images. Arbitrators were not aware of previous conclusions.

Results

Seven (50%) readers passed the standardization assessment on the first attempt, 3 on a second attempt, and 4 on a third attempt. The voluntary standardization assessment 8 months after interpretations began was completed by 11 of 14 readers, with intraobserver agreement for the identification of any consolidation in the WHO reference CXRs ranging from 85% to 100% (mean, 91%) and κ values from 0.63 to 1.0 (mean, 0.82).
Observed agreement from the interpretation process is summarized in Figure 1. Of the 4172 CXRs reviewed, there was at least one interpretation of uninterpretable for 675 (16%) of primary readings and 497 (21%) of arbitration readings. Among images without an uninterpretable reading (ie, "interpretable" CXRs), interobserver agreement was highest for the detection of any consolidation for both the primary reading panel (78% observed agreement; κ = 0.50; 95% confidence interval [CI], .47-.53) and the arbitration panel (84%; κ = 0.61; 95% CI, .56-.65; Table 2). The adjusted κ for any consolidation was 0.56 and 0.67 for the primary and arbitration panels, respectively. There was variation in observer agreement for the detection of any consolidation between sites; however, much of this variation was not present after κ values were adjusted for prevalence and marginal distributions (Supplementary Figure 2). Differences between observed agreement and κ values were influenced by the prevalence of each conclusion more than by the different marginal distributions between readers (Supplementary Table 3 and Supplementary Figure 2). Considering all 5 conclusions, observed agreement for individual readers ranged from 27% to 55% (κ = 0.07-0.32). The reader with the lowest agreement across these 5 conclusions was an outlier, with results (27%; κ = 0.07; 95% CI, .02-.11) significantly lower than those of the reader with the next lowest κ (45%; κ = 0.19; 95% CI, .14-.23). These 2 readers with the lowest κ values had no prior experience in the WHO methodology and did not attend the in-person training. The most frequent type of discordance was between normal and other infiltrate, which accounted for 743 of 2358 (32%) CXRs discordant at the primary interpretation and 360 of 1214 (30%) CXRs discordant at arbitration (Table 3).
The 4 arbitrators interpreted an average of 1179 images (range, 1175-1187). Of these interpretations, 561-647 (range, 48%-55%) were discordant with the other arbitrator. We expected that each arbitrator would have an equal proportion (25%) of all consensus discussion conclusions agreeing with their initial interpretation; however, one arbitrator had a lower proportion (15%) and another a higher proportion (34%; P < .0001). The final arbitration discussion conclusion was different from both arbitrators' initial interpretations for 89 of 1214 (7%) CXRs.
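The equal-proportions test used above can be sketched as follows. The counts are hypothetical (not the study's data), and the statistic is compared against the df = 3 critical value rather than computing an exact P value.

```python
def chi2_gof_stat(observed):
    """Chi-square goodness-of-fit statistic against equal expected counts:
    sum of (observed - expected)^2 / expected over the categories."""
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)

# Hypothetical counts of consensus conclusions matching each of the
# 4 arbitrators' initial interpretations (expected: equal, 25% each).
stat = chi2_gof_stat([150, 250, 260, 340])
CRITICAL_0_05_DF3 = 7.815  # chi-square critical value, df = 3, alpha = .05
proportions_differ = stat > CRITICAL_0_05_DF3  # True for these counts
```

A large statistic, as in this invented example, indicates that the consensus conclusions were not distributed equally across arbitrators, mirroring the imbalance (15% vs 34%) reported above.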
The arbitration panel reviewed 184 CXRs for quality control. Of the 44 CXRs concordant for any consolidation at the primary reading panel, there was agreement on this conclusion by both arbitrators for 30 (68%) CXRs and agreement by at least one arbitrator for 41 (93%; Table 4). Across all 5 conclusions, there was concordance between arbitrators for 102 (55%) images and, of these, 83 (81%) had the same conclusion as the primary reading panel. After final arbitration discussions, there was concordance between the conclusion of the readers and arbitrators in 129 of the 184 CXRs (70%).
We evaluated 4 different interpretation processes that had alternate conclusions or arbitration methods and assessed their effect on the distribution of conclusions, observer agreement, and total number of interpretations compared to the PERCH interpretation process (Table 5). The process used in the vaccine trials [23], which considered images discordant between other infiltrate/normal or other infiltrate/uninterpretable as other infiltrate (ie, no arbitration), resulted in 23% of images with final conclusions different from those obtained under the PERCH process. As expected, this process also identified the highest proportion of images with other infiltrate (39% vs 24% using the PERCH process). Although a process using a single arbitration interpretation produced a distribution of conclusions similar to the PERCH process, 16% of CXRs had a different conclusion. Using a majority decision from the primary reading interpretations and that of a single arbitrator left 480 of 4172 (12%) CXRs without a conclusion under the PERCH process and 226 (5%) without a conclusion under the vaccine trial process (data not shown). The various processes also required different total numbers of interpretations; the PERCH process required the most, while the single arbitration process and the vaccine trial process required 25% and 24% fewer interpretations, respectively.

Discussion

This study is the largest published evaluation of the WHO methodology, and one of few studies in which standardization has been attempted across multiple sites with different epidemiological characteristics. Achieving standardization is important to provide confidence in the use of CXR results, including the interpretation of pneumonia etiology.
Our results show measures of observer agreement for the detection of any consolidation that are consistent with other high-quality studies of childhood pneumonia [9,13,24], and similar to other subjective diagnostic tests, such as cervical cytopathology [25] and prostatic histopathology [26]. Our experience reaffirms findings that observer agreement is best for consolidation and poorest for findings of other infiltrate [3,27]. The interpretation of observer variability requires consideration of study-specific factors that can influence κ, such as the prevalence of the conclusion under evaluation. Detailed understanding of the core components of the CXR interpretation process informs wider PERCH analyses and the transition of the WHO methodology from vaccine trials to other epidemiological contexts.
Standardized interpretation of CXRs is important to ensure that differences between sites or across time are due not to differences in CXR interpretation but to differences in the case mix of enrolled children. We minimized bias by ensuring readers did not interpret CXRs from their own site. This is important for a multisite study like PERCH, as comparisons by site will be central to some analyses.
Our structured training process aimed to achieve a common standard of interpretation with the WHO methodology, calibrated to CXRs from the PERCH study. Although readers did not have to correctly interpret 100% of test images to be eligible to interpret PERCH CXRs, the requirements were pragmatic but robust, with several readers requiring repeated attempts to pass. However, our ability to evaluate whether the training itself improved individual ability was limited because there were no pretraining assessments and the number of images interpreted for assessments was small.
Observer agreement for the primary reading panel declined between the first and second halves of the interpretation process, suggesting the readers had increasing difficulty in applying the interpretation criteria. Future studies may benefit from continuing education and regular standardization assessments.

Table 3. Summary of Discordant and Concordant Conclusions for Either the 2 Randomly Assigned Readers or the 2 Randomly Assigned Arbitrators
The WHO methodology was designed to optimize the detection of any consolidation (termed primary endpoint pneumonia for the vaccine trials), and this conclusion had the highest level of agreement in our study, similar to other pneumonia studies [9,13,23,24] and evaluations of the WHO methodology [3,27]. Interobserver agreement for any consolidation in both the Californian and Gambian pneumococcal conjugate vaccine (PCV) trials was κ = 0.58 (data were not reported for other trials) [9,23]. Similarly, a Mozambique pneumonia incidence study had an agreement of 77% (κ = 0.52) for any consolidation [13]. In an antibiotic treatment study in Brazil, agreement for the detection of any consolidation or other infiltrate was 87% (κ = 0.68); this higher κ likely reflects a case mix with a higher prevalence of consolidation because of an enrollment criterion requiring the presence of CXR infiltrates [24]. Relying solely on consolidation may underestimate the burden of disease [28], as suggested by estimates from the South African PCV trial, where only 38% of children with pneumococcal pneumonia were thought to have CXR consolidation [29]. While study methods and case selection criteria can influence prevalence estimates of consolidation [23,30], PERCH used a rigorously standardized study protocol and demonstrated a varied prevalence between sites [31]. Other radiographic appearances also capture cases of true pneumonia, pneumococcal or otherwise. Unfortunately, agreement on the presence of other infiltrates is more difficult to achieve [3,27]; our results show the lowest agreement for a finding of other infiltrate (Table 2) and that discordance is most common between interpretations of normal and other infiltrate (Table 3).
The limitation of the WHO methodology in identifying nonconsolidation findings is particularly important in contexts where the prevalence of consolidation is low and milder radiographic changes predominate, such as areas with access to early antibiotic therapy or widespread use of pneumococcal and Hib conjugate vaccines.
Despite some consistency between studies in observer agreement for any consolidation, it can be misleading to compare κ values without reference to differences in the prevalence of the conclusion under evaluation [22]. This can arise when comparing results for different CXR definition categories or between epidemiological contexts. We observed this paradox of prevalence for the conclusions of any consolidation and only consolidation: observed agreement was approximately 80% for both, yet κ was 0.50 and 0.33, respectively, because the prevalence of any consolidation was closer to 50% (Table 2 and Supplementary Table 3). A paradoxically high κ can also be produced if the readers conclude a different proportion of positive findings, although we did not observe this (Supplementary Table 3). Nonetheless, examining differences in marginal proportions offers an important check to demonstrate the interchangeability of readers [22], particularly for a large panel of readers from different regions with a range of professional experience. Agreement will also decrease as the number of conclusion categories increases, explaining why agreement was higher for the any abnormality interpretation process (which had 3 conclusion categories) than for the PERCH interpretation process (which had 5 conclusion categories; Table 4). Despite this, our results show consistency in the proportion of any consolidation (range, 25%-27%) identified by the 4 different interpretation processes (Table 4). Determining the "best" method for arbitration depends on the desire to maximize accuracy of interpretation of pneumonia cases within the study, the ability to standardize methods across studies to facilitate between-study comparisons, and financial and logistical constraints. Use of a separate, common arbitration panel was established in the WHO methodology [2] and adopted for vaccine trials [7,8,11,32] to ensure consistency between studies.
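The prevalence paradox described above can be demonstrated numerically. The two hypothetical 2×2 tables below (invented for illustration, not PERCH data) both show 80% observed agreement, yet Cohen's κ falls sharply when the finding is rare because chance agreement is much higher.

```python
def kappa_2x2(a, b, c, d):
    """Cohen's kappa from a 2x2 agreement table:
    a = both readers positive, d = both negative, b and c = discordant."""
    n = a + b + c + d
    po = (a + d) / n                    # observed agreement
    p1, p2 = (a + b) / n, (a + c) / n   # each reader's positive rate
    pe = p1 * p2 + (1 - p1) * (1 - p2)  # chance agreement
    return (po - pe) / (1 - pe)

# Both tables: 100 images, 80 concordant (80% observed agreement).
balanced = kappa_2x2(40, 10, 10, 40)  # finding prevalence ~50%: kappa = 0.60
rare = kappa_2x2(5, 10, 10, 75)       # finding prevalence ~15%: kappa ~ 0.22
```

The drop from 0.60 to roughly 0.22 with identical observed agreement mirrors the 0.50 vs 0.33 contrast between any consolidation and only consolidation reported above.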
Using arbitrators with extensive experience in the WHO methodology is favored over a consensus discussion between primary readers because the former are assumed to have higher agreement on arbitration images, which are the most difficult to interpret. While a process with a single arbitrator may be necessary in studies with logistical constraints, this is not favored because variability among arbitrators means reproducibility between studies may be limited.
We found that a majority of CXRs at arbitration required consensus discussion to reach a final conclusion, which likely reflects the complexity of those CXRs. While initial blinded review by 2 arbitrators before a final discussion necessitates additional interpretations, feedback from our arbitrators suggests this may not increase workload compared with a discussion alone. Therefore, initial blinded review by 2 arbitrators followed by consensus discussion for discordant images appears to be an effective method to resolve the interpretation of CXRs that are discordant at the primary reading. Because we observed differences in the proportion of conclusions from consensus discussions that agreed with each arbitrator's initial conclusion, future studies may benefit from ensuring these discussions are blinded to previous interpretations.
The PERCH study is the largest evaluation of the WHO methodology for the standardized interpretation of pediatric CXRs. Our results reinforce the reproducibility of detecting consolidation and the failure to achieve equally high concordance for other conclusions, including distinguishing normal from other infiltrate. The misclassification between these categories must be acknowledged in analyses drawn from studies that use CXR findings. While limiting the number of final conclusion categories will improve observer agreement, the conclusion definition is the primary influence on agreement. Furthermore, conclusions for CXRs discordant at the primary reading should be resolved through additional independent arbitration readings, with any further discordance resolved through consensus discussion blinded to previous interpretations. Finally, the training process, quality control process, algorithm for drawing final conclusions, and the effect of prevalence on observer agreement all influence study results and need to be reported in detail so that any cross-study comparisons take these differences into consideration. Chest imaging continues to be an important element of pneumonia epidemiologic research, and efforts to improve image interpretation and reduce observer variability, including the use of computer-aided detection or other imaging techniques such as ultrasound, warrant additional evaluation.