Interobserver Reliability of the Paris Classification for Superficial Gastrointestinal Tract Neoplasms: A Systematic Review

Abstract

Background and study aims: The Paris classification characterizes the morphology of superficial gastrointestinal tract neoplasms. This system has been shown to predict the risk of submucosal invasion in certain subtypes of lesions. There are limited data assessing its agreement amongst endoscopists. We performed a systematic review to summarize the available literature on the interobserver reliability (IOR) of the Paris classification.

Methods: We conducted a search through December 2020 for studies reporting IOR of the Paris classification. Studies were included if they quantitatively evaluated the IOR of the Paris classification with at least five participating endoscopists. Two authors independently screened studies and abstracted data using an a priori-designed data collection form. Study quality and risk of bias were evaluated using an adapted version of the Guidelines for Reporting Reliability and Agreement Studies.

Results: Of the 1,541 studies retrieved, 5 were included in the review. All studies were observational cohort studies published between 2014 and 2020. The IOR of the Paris classification was moderate amongst all four studies evaluating colorectal neoplasms (range, κ = 0.42 to κ = 0.54) and substantial in one study that evaluated gastric neoplasms (κw = 0.65). Three studies conducted an educational intervention, with variable methodology and no significant change in IOR.

Conclusions: IOR of the Paris classification is moderate for superficial colonic neoplasms. Further study is needed to determine the reliability of this system for superficial gastric lesions. Standardized training programs are required to investigate the impact of educational intervention on the IOR of the Paris classification amongst endoscopists.


Introduction
With an estimated incidence of 3.6 million cases, malignancies of the luminal gastrointestinal tract comprised almost one-fifth of all cancers in 2020. 1 Early precancerous forms of these neoplasms are those that are confined to the superficial layers of the gastrointestinal tract. [4] Developed in 2002 by a group of Western and Japanese physicians, the Paris Classification of Neoplastic Lesions of the gastrointestinal tract is a prevalent morphological classification system whose subtypes have been shown to predict the risk of submucosal invasion. 5,6 For example, while less frequently encountered, especially in Western studies, colonic lesions that are excavated or depressed are generally known to harbour a higher risk of invasion. 7 As recommended by the US Multi-Society Task Force on Colorectal Cancer, the Paris classification should be used as part of the initial assessment of colorectal polyps to assist in guiding endoscopic or surgical management. 8 There are, however, some data to suggest only moderate interobserver reliability (IOR) of this classification system among users, as well as considerable variation in the classification of flat (non-polypoid) lesions among expert endoscopists. 9 Moreover, despite its widespread use in clinical decision-making and comparative endoscopic research, the data surrounding its reproducibility amongst endoscopists are limited. As such, we performed a systematic review to investigate the IOR of the Paris classification system.

Materials and methods
We conducted a systematic review that was reported as per the Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) recommendations (Supplementary Table 1). 10 The protocol was registered on PROSPERO (CRD42021247175).

Search strategy
We searched the online biomedical databases MEDLINE, EMBASE (Excerpta Medica Database), Scopus, CINAHL Complete, and Cochrane Library with the assistance of an information scientist. Searches were executed on December 31, 2020. An example of a search query conducted for MEDLINE is shown in Supplementary Table 2.

Study selection
After duplicate records were removed, two reviewers independently screened all abstracts to identify any for full-text review. Any inter-reviewer discrepancies were resolved through discussion and consensus.

Eligibility criteria
We included a study if it reported a quantitative value on the IOR of the Paris classification, had a minimum of five participating endoscopists, was published in English, and was available in full-text format. We excluded studies for any of the following reasons: publication as an abstract, conference paper, commentary, or case report; no reported quantitative value for the IOR; and/or evaluation of the IOR for other morphological features (e.g., size, mucosal surface pattern) in isolation. If relevant data were not available in the primary manuscript, we contacted the corresponding authors in an attempt to increase the sample size.

Data extraction and quality assessment
A standardized data abstraction form was created a priori to extract prespecified variables from each included study. Data extraction was performed independently and in a blinded fashion by two reviewers, and any inter-reviewer discrepancies were resolved through discussion and consensus.
Two reviewers independently assessed study quality and risk of bias using a tool adapted from the Guidelines for Reporting Reliability and Agreement Studies. 11 We defined an article as "high quality" for a score of ≥11, "moderate quality" for a score of between 8 and 10, and "low quality" for a score of ≤7. Inter-reviewer discrepancies were resolved through discussion and consensus.
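The quality bands above can be expressed as a simple threshold rule. The following is a minimal sketch that assumes integer scores from the adapted tool; the function name `classify_quality` is hypothetical and is not part of the published instrument.

```python
def classify_quality(score: int) -> str:
    """Map an adapted-GRRAS score to the quality band defined in the text.

    Assumption: scores are whole numbers, so the bands >=11, 8-10, and <=7
    cover every possible value without gaps.
    """
    if score >= 11:
        return "high quality"
    if score >= 8:
        return "moderate quality"
    return "low quality"
```

For instance, a study scoring 9 on the adapted tool would fall in the "moderate quality" band under this rule.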

Outcome measures
The primary outcome was the IOR of the Paris classification for superficial gastrointestinal tract neoplastic lesions. Secondary outcomes evaluated the effects of an educational intervention on the IOR.
Agreement analysis is described based on the number of raters and possible responses per rater. When there are two raters involved, Cohen's kappa (κ), 12 Scott's Pi, 13 or Gwet's AC1 coefficient 14 can be used. Fleiss kappa (κ) 15 is used when there are three or more raters with the possibility of three or more responses per rater. κ can be weighted to account for the degree of difference between observers (κw). Proportion of agreement or pairwise agreement (PA) can also be used to describe interrater reliability; however, unlike kappa statistics, these do not account for the possibility of agreement occurring by chance. 11 Interpretation of κ statistics followed standard values, wherein κ = 0.0-0.20 indicates "slight agreement", κ = 0.21-0.40 indicates "fair agreement", κ = 0.41-0.60 indicates "moderate agreement", κ = 0.61-0.80 indicates "substantial agreement", and κ = 0.81-1.0 indicates "almost perfect agreement" as per Landis and Koch. 16
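The two-rater case and the Landis and Koch bands can be illustrated with a short sketch. This is not the analysis code of any included study: the function names `cohen_kappa` and `landis_koch`, the choice of linear weights for the weighted variant, and the example ratings are all assumptions made for illustration. The "poor agreement" label for negative values comes from Landis and Koch and is not listed in the text above.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b, categories, weighted=False):
    """Cohen's kappa for two raters (hypothetical helper, illustration only).

    With weighted=True, linear disagreement weights are applied, so
    `categories` must then be listed in ordinal order.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}

    def weight(i, j):
        # Disagreement weight: 0 for agreement; otherwise 1 (unweighted)
        # or the scaled ordinal distance (linear weights).
        if i == j:
            return 0.0
        return abs(i - j) / (k - 1) if weighted else 1.0

    observed = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    marg_a = Counter(index[a] for a in rater_a)
    marg_b = Counter(index[b] for b in rater_b)

    observed_dis = sum(weight(i, j) * c / n for (i, j), c in observed.items())
    # Disagreement expected by chance, from the two raters' marginal rates.
    expected_dis = sum(
        weight(i, j) * (marg_a[i] / n) * (marg_b[j] / n)
        for i in range(k)
        for j in range(k)
    )
    if expected_dis == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - observed_dis / expected_dis


def landis_koch(kappa):
    """Label a kappa value using the Landis and Koch bands quoted in the text."""
    if kappa < 0:
        return "poor agreement"  # band from Landis and Koch, not listed above
    if kappa <= 0.20:
        return "slight agreement"
    if kappa <= 0.40:
        return "fair agreement"
    if kappa <= 0.60:
        return "moderate agreement"
    if kappa <= 0.80:
        return "substantial agreement"
    return "almost perfect agreement"


# Hypothetical ratings of four lesions by two endoscopists (made-up data).
rater_1 = ["Is", "Is", "IIa", "IIa"]
rater_2 = ["Is", "Is", "IIa", "Is"]
kappa = cohen_kappa(rater_1, rater_2, categories=["Is", "IIa"])
print(kappa, landis_koch(kappa))  # 0.5 moderate agreement
```

In this made-up example the raters agree on three of four lesions (proportion of agreement 0.75), but half of that agreement is expected by chance, which is why κ falls to 0.5. Multi-rater designs such as those in the included studies would call for Fleiss kappa rather than this two-rater form.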

Results

Study selection
The study flow is outlined in Figure 1. The initial search identified 1,541 articles after duplicate removal. A total of five studies were included in the review.

Study characteristics and risk of bias
All included studies were observational cohort studies that were published between 2014 and 2020 (Table 1). Four out of the five studies were Western (North American and/or European) and one was Asian (single-centre study from South Korea). Only one study assessed superficial gastric lesions while the remaining four investigated colonic lesions. One study used images as the method of assessment while the remaining four used recorded video clips. Gastroenterology trainees were involved in four out of the five studies.
Study quality is summarized in Table 2. Three studies (60 percent) were classified as high quality. 9,19,21 The most common limitations found across the studies were methodologic. Only one study included a sample size calculation, and none of the studies described adequate sampling methods for their selection of endoscopists. Two studies failed to report measures of statistical uncertainty for their estimates of reliability. 18,20

Interobserver reliability of the Paris classification
Four studies reported a moderate IOR of the Paris classification (range, κ = 0.42 to κ = 0.54). 9,[18][19][20] All four of these studies assessed colonic lesions, whereas the only study that reported a substantial IOR evaluated gastric lesions. 21 Table 3 depicts various other objective measures for IOR of the Paris classification across all studies. While all included studies presented a kappa statistic, only one was weighted. 21 Two of the studies noted a proportion of agreement (PA) value for reliability and only one measured Gwet's AC1 coefficient. 20 A simplified morphological classification system was proposed by van Doorn et al., which classified lesions into three categories: pedunculated (Ip, Isp), elevated (Is, IIa, IIb), and depressed (IIc, III), where the "elevated" category combined sessile (Is) and flat (IIa, IIb) lesions. The study

Discussion
Since its development in 2002, the Paris classification has become widely used as a clinical decision-making tool and in comparative research to study differences in superficial gastrointestinal tract lesion morphology. Despite its endorsement by multiple professional societies, [23][24][25] no validity and reproducibility studies existed until 2014. van Doorn et al. were the first group to identify a moderate IOR (κ = 0.42) for this classification system amongst international experts identifying colorectal neoplastic lesions. 9 Subsequent study data, while limited, have demonstrated variable results amongst endoscopists globally.
In this study, we systematically reviewed data from eligible studies to assess IOR for the Paris classification. From 211 superficial colonic neoplasms assessed by 204 endoscopists across studies, we found that the IOR of the Paris classification is consistently moderate (range, κ = 0.42, 95 percent CI [0.38-0.46] to κ = 0.54, 95 percent CI [0.43-0.65]). The sole study in our review that demonstrated a substantial IOR evaluated gastric polyps. 21 Given that there was only one study (with eight endoscopists) that focused on gastric lesions, further data are required to reliably conclude possible differences in IOR when comparing lesion type. We observed a high degree of variability in the methodologic approach to the educational interventions conducted by each of the three studies that assessed this variable. Although van Doorn et al. and Cocomazzi et al. reported contradictory outcomes following their educational interventions, neither study demonstrated a statistically significant change in IOR. We note that although Kim's group reported an increase in interrater reliability following their educational intervention, their data cannot be reliably interpreted due to missing measures of statistical uncertainty. Further prospective studies with standardized programs are required to determine the impact of educational training on the IOR of the Paris classification amongst both trainees and experts at early and late follow-up intervals.

Table 4 (excerpt)
Column headings: Lag time following intervention a; Preintervention IOR (95 percent CI); Post-intervention IOR (95 percent CI)
Cocomazzi, 2020 19: All participants received learning materials and attended a 1-h conference regarding the Paris classification. They then evaluated twenty-five still endoscopic images of colonic lesions retrieved from literature (pre-intervention IOR). Respondents were then made aware of the correct answers and given a chance to address mistakes in a meeting. On the same day, they subsequently evaluated seventy video clips of colonic lesions that were 10 s to 4 min in length (post-intervention IOR). Video clips were obtained from colonoscopies performed at their endoscopy unit.

A simplified classification system was proposed by van Doorn et al. and externally validated by Cocomazzi's group. Both groups reported a slightly improved, moderate to substantial IOR in the assessment of colonic lesion morphology. Combining Paris Is, IIa, and IIb lesions into an "elevated category", and Paris IIc and III into a "depressed category" increases reliability amongst endoscopists given that there is known to be a high degree of variation in the proportion of lesions identified as "flat." 9 This simplified system is appropriate as the risks of dysplasia and submucosal invasion are similar within these categories. 26 As indicated by Cocomazzi et al., a major drawback to this system, however, is an inability to classify laterally spreading tumors (LSTs), which have prognostically and therapeutically relevant features of elevation or depression (e.g., IIc + IIa). 19,27 Additional investigation is required to address these deficiencies and to determine the ability of a simplified classification system to predict submucosal invasion.
There are limitations to our study. First, there is significant heterogeneity and data incongruence across the included studies in the review. This is likely explained by the variability that exists in the available methods of measuring and representing reliability amongst users. While each study consistently presented IOR in the form of kappa statistics, Ribeiro et al. applied a "weighted" kappa (κw). As this kappa allows disagreements to be weighted differently, these data should be interpreted with caution alongside studies that report non-weighted kappa. Similarly, van Doorn et al. and Kim et al. both used Fleiss kappa, a specific type of interrater reliability measure used when three or more raters are involved. 15 Second, given a low sample size and significant data variability across the five studies, we were unable to conduct meaningful statistical analyses from this review due to the risk of introducing selection bias. Furthermore, while there are some hypothesis-driven theories available in the literature, there is no validated or universally accepted method for the analysis of multiple kappa statistics from different populations. 28
Nonetheless, this is the first study to have systematically presented important IOR data for the Paris classification. As the Paris classification is a prominent system that is often used as one of the initial steps in gastrointestinal polyp characterization, this review has established an emerging platform for ongoing research to improve its reliability amongst endoscopists. Our findings suggest that the IOR of the Paris classification is only moderate for superficial colonic neoplastic lesions. While the Paris classification has been shown to predict the risk of submucosal invasion for certain subtypes of precancerous lesions, this review supports caution against its sole use in the initial endoscopic evaluation of superficial colonic neoplasms. When available, adjunctive endoscopic characterization techniques such as mucosal and vascular surface pattern examination should also be employed to further enhance the risk assessment of such lesions in guiding further management decisions. For gastric lesions, additional prospective observational studies are required to confirm whether the reliability of this classification system is higher. Finally, although educational interventions did not significantly improve IOR amongst endoscopists, further investigation with standardized training programs is warranted amongst both experts and trainees.

Table 1.
Baseline characteristics of the studies included in the systematic review.

Column headings: Study author, year; Study type; Region; Lesion type; Lesions (n); Method of evaluation; Participants (n); GI trainees (n); Experts (n) a
a Expert defined as an endoscopist who has performed over 1,000 colonoscopies and/or performs complex polypectomy.
b First value represents number of lesions depicted as images; second value represents number of lesions shown in video clips.

authors reported that the IOR of this system was still moderate, despite demonstrating a higher reliability score of κ = 0.55, 95 percent CI [0.51-0.58]. Cocomazzi et al. externally validated this simplified morphological system in their study by finding a similar increase in reliability, with a κw = 0.68, 95 percent CI [0.58-0.78]. The simplified system in Cocomazzi's study evaluated the performance of nine pedunculated (Ip, Isp), fifty-two elevated (Is, IIa, IIb, IIa+Is), and nine depressed (IIc, IIa+IIc, Is+IIc) lesions. The breakdown of lesion subtypes was not reported in van Doorn's study.

Three out of the five studies conducted an educational training intervention to assess the effects on the IOR of the Paris classification. While Cocomazzi et al. and Kim et al. both noted an increase in the IOR post-educational intervention, it remained in the moderate to substantial range. van Doorn et al. noted a decrease in IOR that resulted in a categorical decrease in kappa to "fair". When stratifying by subgroup, while Cocomazzi's educational intervention significantly increased IOR amongst experts and decreased it amongst trainees, Kim's educational intervention increased IOR in both experts and trainees. Two out of the three studies assessed the late effects of their educational intervention on study participants at three (van Doorn et al.) and four months (Kim et al.). Kim et al. found a greater increase in IOR in the early versus late assessment intervals for both experts and trainees. However, this study observed that the experts experienced a greater decline in IOR at the four-month follow-up interval in comparison to the trainee group. A summary of these interventions and their effects on reliability is described in Table 4.

Table 2.
Study quality and risk of bias assessment adapted from GRRAS. 11

…nce between trainees and experts in the study by Kim et al. Only two studies reported an IOR of lesion size. 9,18 van Doorn et al. reported a substantial IOR (κ = 0.72, 95 percent CI [0.65-0.79]) when the size of the lesions was classified into three categories: diminutive (<6 mm), small (6-9 mm), or large (>9 mm). When classifying colonic lesions as polypoid and non-polypoid, van Doorn et al. reported no substantial increase in reliability as the IOR remained only moderate (κ

Table 3.
Interobserver reliability of the Paris classification across studies.

Table 4.
Summary of educational interventions and effect on IOR of the Paris classification.