The Impact of Clinical Information on the Assessment of Endoscopic Activity: Characteristics of the Ulcerative Colitis Endoscopic Index of Severity [UCEIS]

Background and Aims: To determine whether clinical information influences endoscopic scoring by central readers using the Ulcerative Colitis Endoscopic Index of Severity [UCEIS; comprising ‘vascular pattern’, ‘bleeding’, ‘erosions and ulcers’]. Methods: Forty central readers performed 28 evaluations, including 2 repeats, from a library of 44 video sigmoidoscopies stratified by Mayo Clinic Score. Following training, readers were randomised to scoring with [‘unblinded’, n = 20, including 4 control videos with misleading information] or without [‘blinded’, n = 20] clinical information. A total of 21 virtual Central Reader Groups [CRGs], each of three blinded readers, were created. Agreement criteria were pre-specified. Kappa [κ] statistics quantified intra- and inter-reader variability. Results: Mean UCEIS scores did not differ between blinded and unblinded readers for any of the 40 main videos. UCEIS standard deviations [SD] were similar [median blinded 0.94, unblinded 0.93; p = 0.97]. Correlation between UCEIS and visual analogue scale [VAS] assessment of overall severity was high [r blinded = 0.90, unblinded = 0.93; p = 0.02]. Scores for control videos were similar [UCEIS: p ≥ 0.55; VAS: p ≥ 0.07]. Intra- [κ 0.47–0.74] and inter-reader [κ 0.40–0.53] variability for items and full UCEIS was ‘moderate’-to-‘substantial’, with no significant differences except for intra-reader variability for erosions and ulcers [κ blinded: 0.47 vs unblinded: 0.74; p = 0.047]. The SD of CRGs was lower than for individual central readers [0.54 vs 0.95; p < 0.001]. Correlation between blinded UCEIS and patient-reported symptoms was high [stool frequency: 0.76; rectal bleeding: 0.82; both: 0.81]. Conclusions: The UCEIS is minimally affected by knowledge of clinical details, strongly correlates with patient-reported symptoms, and is a suitable instrument for trials. CRGs performed better than individuals.


Introduction
Endoscopy is central to the assessment of disease activity in ulcerative colitis [UC] both in trials and in clinical practice, but assessment is often inconsistent. 1,2,3,4,5 Activity indices for UC, such as the Mayo Clinic Score, incorporate an endoscopic component and are commonly used to evaluate response to treatment in clinical trials. 6 Lack of consistency can affect the outcomes of trials, independently of the effect of treatment, and negatively affect decisions by regulatory authorities or clinicians. 7 The Ulcerative Colitis Endoscopic Index of Severity [UCEIS] was developed as a rigorous instrument for assessing endoscopic disease activity in UC. 5,8 The UCEIS was developed using a predefined protocol. Initially, the level of disagreement for 10 endoscopic items, each with 3-5 levels of severity, was determined among 10 investigators. 5 The intra- and inter-investigator variability for each item was then assessed by 30 different investigators, and a model constructed that best represented overall endoscopic activity evaluated on a visual analogue scale [VAS]. The final model consisted of three items: vascular pattern [3 levels], bleeding [4 levels], and erosions and ulcers [4 levels] [ Table 1]. 5,8 In practice, the UCEIS is scored from the worst affected area at video sigmoidoscopy and the final score is the sum of components ranging from 0 [normal] to 8 [most severe; it should be noted that this simplifies the original index which ranged from 3 to 11]. 5,8 The UCEIS accounted for 88% of the variance between observers in the overall assessment of endoscopic activity in the test cohort 5 and 86% in the validation cohort. 8 The aim of the present study was to understand the UCEIS in a clinical context. The primary objective was to determine whether the UCEIS is affected by clinical information, using an independent cohort of central reader investigators. 
Henceforth, the term 'readers' rather than 'investigators' or 'observers' will be used for consistency, since 'investigators' implies clinicians who recruit patients to trials and often have knowledge of symptoms, whereas 'readers' implies independence from the patient, and 'observers' is a generic term for either. Secondary objectives were to investigate potential benefits of a group of central readers for reducing variability, 9 and to compare scores with other indices and patient-reported symptoms.

Development of the UCEIS
The development of the UCEIS has been reported. 5,8 In short, the resource was a library of 670 videosigmoidoscopies from patients in three clinical trials, supplemented by individuals without UC and hospitalised patients with acute severe UC. 5 In phase 1, the authors determined agreement in overall endoscopic assessment and defined descriptive terms ['items']. Phase 2, conducted in a separate cohort of 30 investigators, rated items in 25 of 60 different videos and assessed overall severity on a 100-point VAS [0 = completely normal and 100 = worst ever seen]. 8 The UCEIS developed from this study included three items [ Table 1]. 5,8 The UCEIS is freely accessible to all at no cost, though the terminology is subject to copyright and acknowledgement. Phase 3, in another cohort of 25 investigators, demonstrated reproducibility of the UCEIS in 28 of 57 videos and rebased normality to '0', rather than '3'. 8

Table 1. The UCEIS: items, levels, and definitions used as anchor points for evaluating ulcerative colitis. 5,8

Descriptor [score most severe lesions] | Likert scale anchor points | Definition
Vascular pattern | Normal [0] | Normal vascular pattern with arborisation of capillaries clearly defined, or with blurring or patchy loss of capillary margins
Vascular pattern | Patchy obliteration [1] | Patchy obliteration of vascular pattern
Vascular pattern | Obliterated [2] | Complete obliteration of vascular pattern
Bleeding | None [0] | No visible blood
Bleeding | Mucosal [1] | Some spots or streaks of coagulated blood on the surface of the mucosa ahead of the scope that can be washed away
Bleeding | Luminal mild [2] | Some free liquid blood in the lumen
Bleeding | Luminal moderate or severe [3] | Frank blood in the lumen ahead of endoscope or visible oozing from mucosa after washing intra-luminal blood, or visible oozing from a haemorrhagic mucosa
Erosions and ulcers | None [0] | Normal mucosa, no visible erosions or ulcers
Erosions and ulcers | Erosions [1] | Tiny [< 5 mm] defects in the mucosa, of a white or yellow colour with a flat edge
Erosions and ulcers | Superficial ulcer [2] | Larger [> 5 mm] defects in the mucosa, which are discrete fibrin-covered ulcers when compared with erosions, but remain superficial
Erosions and ulcers | Deep ulcer [3] | Deeper excavated defects in the mucosa, with a slightly raised edge

The worst affected area of the colon visible at sigmoidoscopy is scored. The copyright of UCEIS is held by Watson Laboratories, a subsidiary of Actavis Inc., as successor in interest of Warner Chilcott and Procter and Gamble.

Results], assessments on each video had to align with the level of 'erosions and ulcers' assigned by the authors [SPLT,PK,BRY,WJS], and within one level for vascular pattern and bleeding. For readers initially failing to qualify, a retest was permitted; they had correctly to score two of three different videos for the item[s] that they had previously scored incorrectly. Readers who failed the second qualifier were excluded.

Video selection
A new library of 44 anonymised videos was created from 670 videos and supplements created for phases 1-3, stratified by clinical disease activity [Mayo Clinic Score; MCS] assigned on the date that they were derived. [According to MCS: stool frequency: 0 = normal, 1 = 1 to 2 more stools than normal, 2 = 3 to 4 more stools than normal, 3 = >4 more stools than normal; rectal bleeding: 0 = none, 1 = visible blood with stool < half the time, 2 = visible blood with stool > half the time, 3 = passing blood alone; Physician's Global Assessment: 0 = normal, 1 = mild, 2 = moderate, 3 = severe.] A total of 34 of the videos were selected [by PK and BRY] from sigmoidoscopies conducted to a standard procedure as part of clinical trials. 10,11 No video had been used in the earlier phases of UCEIS development. Three further videos were taken from patients with severe UC [recorded before colectomy] and three from subjects without UC [colorectal cancer screening]. Four videos were repeated as common controls between readers, drawn from the 34 videos of patients with active disease [two videos with MCS 1-2 and two with MCS 10-11].

Video allocation
Each reader performed 28 evaluations from the 44 videos, which included two repeats of non-control videos to evaluate intra-reader variation and the four common controls [ Figure 1]. Readers were either provided with clinical information on disease activity for each video [one or two sentences on age, symptoms and history, summarised by PK/BRY/ST; unblinded group] or with no accompanying clinical information [blinded group]. Patient information was extracted from the original trials. 10,11 To ensure some disparity between clinical information and endoscopic assessment, the two common control videos with MCS 1-2 in the unblinded group were assigned information more severe than reported [originally: rectal bleeding [RB] = 1 / stool frequency [SF] = 0 / Physician's Global Assessment [PGA] = 0; and RB = 0/ SF = 1/PGA = 0; both changed to: RB = 1/SF = 2/PGA = 1]. The two control videos with MCS 10-11 were given symptom information less severe [originally: RB = 3/SF = 3/PGA = 3; and RB = 2/SF = 3/ PGA = 3; both changed to: RB = 2/SF = 2/PGA = 2]. 1 Readers in the blinded group were not provided with any clinical information for the common control videos, as with their other videos.
To ensure that sufficient numbers of readers viewed the same videos to power the analyses, two 'pools' of 26 videos [n = 28, including the two duplicates] were created with 10 unblinded and 10 blinded readers randomly allocated to each pool.

Video evaluation
The 40 readers who passed training evaluated the UCEIS [ Table 1] in the worst area affected at videosigmoidoscopy. As with phase 3, 8 still photographs from the training were provided for reference during the evaluations and the overall assessment of endoscopic severity recorded on a 100-point VAS [0 = completely normal and 100 = worst ever seen]. Data were captured using a programme developed by one of the authors [PS] that ran simultaneously and saved responses after scoring each video.
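The scoring rule described above (three items scored at the worst affected area and summed to a total of 0 to 8) can be sketched as follows. This is a minimal illustration only: the class name `UceisReading` and its field names are hypothetical and are not taken from the data-capture programme used in the study; the maximum level per item follows Table 1.

```python
from dataclasses import dataclass

# Highest permitted level for each item, as listed in Table 1.
MAX_LEVEL = {"vascular_pattern": 2, "bleeding": 3, "erosions_ulcers": 3}


@dataclass(frozen=True)
class UceisReading:
    """One reader's item scores for the worst affected area of one video."""

    vascular_pattern: int
    bleeding: int
    erosions_ulcers: int

    def __post_init__(self):
        # Reject levels outside the anchor points defined for each item.
        for item, top in MAX_LEVEL.items():
            level = getattr(self, item)
            if not 0 <= level <= top:
                raise ValueError(f"{item} must be 0..{top}, got {level}")

    @property
    def total(self) -> int:
        """UCEIS total: simple sum of the three items (0 = normal, 8 = most severe)."""
        return self.vascular_pattern + self.bleeding + self.erosions_ulcers
```

For example, a reading of patchy obliteration, mucosal bleeding, and superficial ulcers would be entered as `UceisReading(1, 1, 2)`, and its `total` is the sum of those three levels.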

Assessment of the potential for central reader groups to reduce overall variability
The effect of central reading to improve consistency of scoring the UCEIS was examined through 'virtual' central reading groups [CRGs] of three readers. Some of the readers [6/40] were recruited specifically for their experience as central readers in other trials [see acknowledgements]; three were randomly allocated to the blinded group and three to the unblinded group [ Figure 1]. In the blinded group, each CRG consisted of one randomly chosen 'experienced central reader', together with two randomly selected 'standard readers' from the other seven in the group [all, therefore, scored the same videos]. This allowed 21 three-person virtual CRGs to be created, representing 21 possible pairings of two 'standard readers' with one of three 'experienced central readers'. Thus, each virtual CRG differed by at least one standard reader, to maximise independence. Adjudication of the UCEIS within the virtual CRG was accomplished as follows: if the UCEIS score was agreed between the two 'standard' blinded readers, then that score was considered the adjudicated CRG score; if their UCEIS scores did not agree, then the score from the 'experienced central reader' was included and the adjudicated CRG UCEIS score was set to the median of the three. Such detail matters when considering the implications of central reading.
12 ]. Standard deviations [SD] of the assessments on a per video basis were compared using the Wilcoxon signed rank test. The correlation between the UCEIS and overall endoscopic assessment of severity by VAS, as quantified by Pearson correlation coefficients, was compared between the blinded and unblinded groups. Comparison of these correlations allowed the accuracy of UCEIS assessments to be checked, in addition to quantifying the similarity of accuracy estimates between the blinded and unblinded groups. 
Pearson correlation coefficients were calculated from each reader's scores of the UCEIS for their set of videos and the mean VAS for the appropriate videos, derived from the responses of all other readers in the blinded or unblinded group. This addressed any lack of independence between UCEIS and evaluation of overall severity. Correlations were summarised by median, minimum, and maximum within each group, and compared using the Wilcoxon rank sum test. The UCEIS and overall [VAS] severity scores for the common control videos, which were presented with misleading symptom information to the unblinded group, were also compared using the Wilcoxon rank sum test. Intra- and inter-reader variability in blinded and unblinded groups was evaluated through kappa [κ] statistics, as qualitatively interpreted by Landis and Koch [κ: < 0.00, 'poor' agreement; 0.00-0.20, 'slight' agreement; 0.21-0.40, 'fair' agreement; 0.41-0.60, 'moderate' agreement; 0.61-0.80, 'substantial' agreement; 0.81-1.00, 'almost perfect' agreement]. 13 The standard kappa, which summarises exact agreement, was used for individual items. Since the overall UCEIS score is a 9-point [0-8] ordinal scale, a weighted kappa was also calculated to take account of close agreement, assigning a weight of 1 for precise agreement, 0.5 for scores that differed by 1 level, and 0 in all other cases. For intra-reader variability analyses, only data from duplicate videos were used. Inter-reader κ values were calculated by stratifying by reader pairs and using the common videos that they scored, but excluding second scoring of duplicate videos. An average of reader-pair κ values ['overall κ'] was calculated, where the weighting was the inverse of their variance.
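The kappa calculations described above can be illustrated with a short sketch. It assumes the usual agreement-weighted form of Cohen's kappa, κ = (Po − Pe) / (1 − Pe), with the weights stated in the text (1 for exact agreement, 0.5 for a one-level difference, 0 otherwise). The analyses in the study were run in SAS; the function names here are hypothetical, and the sketch is not a reproduction of that code. The variance of each reader-pair κ is taken as an input rather than estimated.

```python
from collections import Counter


def exact_weight(a, b):
    """Standard kappa: credit only for exact agreement (used for items)."""
    return 1.0 if a == b else 0.0


def close_weight(a, b):
    """Weights stated in the text for the full 0-8 UCEIS score."""
    diff = abs(a - b)
    return 1.0 if diff == 0 else 0.5 if diff == 1 else 0.0


def kappa(scores_a, scores_b, weight=exact_weight):
    """Agreement-weighted kappa for two sets of scores on the same videos."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    n = len(scores_a)
    # Observed weighted agreement across paired scores.
    observed = sum(weight(a, b) for a, b in zip(scores_a, scores_b)) / n
    # Agreement expected by chance from the two marginal distributions.
    marg_a, marg_b = Counter(scores_a), Counter(scores_b)
    expected = sum(
        weight(a, b) * count_a * count_b
        for a, count_a in marg_a.items()
        for b, count_b in marg_b.items()
    ) / n**2
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


def landis_koch(k):
    """Qualitative label for kappa, following the bands quoted in the text."""
    if k < 0.0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if k <= upper:
            return label
    return "almost perfect"


def overall_kappa(kappas, variances):
    """Inverse-variance weighted average of reader-pair kappas."""
    weights = [1.0 / v for v in variances]
    return sum(w * k for w, k in zip(weights, kappas)) / sum(weights)
```

Calling `kappa(first_read, second_read, weight=close_weight)` on a reader's duplicate videos gives an intra-reader weighted kappa under these assumptions; passing the default weight gives the standard kappa used for individual items.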

Secondary objective: assessment of the potential benefits of CRGs in reducing variability
To evaluate the effect of blinded central reader groups, for the 18 non-control videos read by all three experienced central readers, SDs were calculated across the seven blinded 'standard readers' and compared with those across the 21 virtual CRGs. Differences in SDs between groups were assessed by nonparametric methods.
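The construction of the 21 virtual CRGs and the adjudication rule can be sketched as below. The sketch assumes that each of the 21 pairs of the seven blinded 'standard readers' is combined with one randomly chosen 'experienced central reader', as described in the Methods; the function names, reader identifiers, and the use of the sample standard deviation are assumptions for illustration, not details confirmed by the study.

```python
from itertools import combinations
from statistics import median, stdev
import random


def build_virtual_crgs(standard_readers, experienced_readers, rng):
    """Pair every two standard readers with one randomly chosen experienced reader.

    Seven standard readers give C(7, 2) = 21 distinct pairs, hence 21 CRGs.
    """
    return [
        (first, second, rng.choice(experienced_readers))
        for first, second in combinations(standard_readers, 2)
    ]


def adjudicate(standard_a, standard_b, experienced):
    """Adjudicated CRG score for one video.

    If the two standard readers agree, their score stands; otherwise the
    experienced reader's score is included and the median of the three is used.
    """
    if standard_a == standard_b:
        return standard_a
    return median([standard_a, standard_b, experienced])


def crg_score_sd(scores_by_reader, crgs):
    """SD of adjudicated scores across CRGs for a single video.

    scores_by_reader maps a reader identifier to that reader's UCEIS score.
    The sample SD is used here; the text does not state which form was applied.
    """
    adjudicated = [
        adjudicate(scores_by_reader[a], scores_by_reader[b], scores_by_reader[e])
        for a, b, e in crgs
    ]
    return stdev(adjudicated)
```

Because the median of three values discards the most extreme score whenever the standard readers disagree, adjudicated scores would be expected to vary less than individual scores, which is the comparison made in this analysis.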

Secondary objective: comparison of UCEIS with other indices and patient-reported symptoms
The comparative indices selected were the Mayo Clinic Score [MCS], 14 partial MCS [excluding endoscopic subscore], 15 and the modified Baron score, 14 as recorded in the original clinical trials. To compare the UCEIS with patient-reported symptoms and variables recorded in practice, stool frequency [SF], rectal bleeding [RB], both SF and RB, and patient functional assessment [PFA] were derived from subscores of the MCS contemporaneously recorded for each video. These variables were compared with UCEIS scores for the blinded readers, excluding those for 'normal' and 'most severe' videos [for which there was no MCS] and common controls [which were repeats], by Spearman rank correlation. SF, RB, PFA, and partial MCS were also correlated with the modified Baron score, to determine how results compared with those calculated for the UCEIS. The relationship between the UCEIS bleeding item [scored 0-3] and patient-reported RB on a 3-point scale [0 = none to 2 = visible blood with stool > half the time or passing blood alone] was also evaluated. Finally, the distribution of RB, SF, and RB and SF items across UCEIS scores 0-8 was assessed for blinded video evaluations, excluding 'normal', 'most severe', and common control videos.
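For readers unfamiliar with the Spearman rank correlation used in these comparisons, a minimal sketch follows. It computes the Pearson correlation of average ranks, which is a common way to handle the ties that arise with short ordinal scales such as the 0-3 symptom subscores. The function names are hypothetical, and the study's own analysis was performed in SAS; whether that analysis treated ties in exactly this way is an assumption.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return pearson(average_ranks(x), average_ranks(y))
```

In this setting, `x` would hold the UCEIS scores for a set of videos and `y` the corresponding symptom subscore [for example RB] recorded for the same videos.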
Except as noted for comparison of mean scores by video, statistical significance was assumed at the 5% level [unadjusted p < 0.05], using the Statistical Analysis System [SAS, Cary, NC] software, version 9.2.

Reader qualification
A total of 47 readers underwent UCEIS training to reach the target of 40 for the study. Of those that qualified, 20 succeeded on their initial assessment and the remainder after scoring the additional set of three videos. The remaining seven physicians failed to qualify. Variance between qualifiers was not evaluated, because of disparity between reader numbers and variables.

Range of disease severity
Mean assessments of overall endoscopic severity by VAS [0-100] ranged from 1.65 for videos in the 'normal' stratum to 92.75 for videos in the 'most severe ever seen' stratum according to the 40 readers. Using the same videos, this corresponded to mean UCEIS scores of 0.15/8 for 'normal' subjects to 7.90/8 for 'most severe', indicating that the videos comprehensively covered the range of endoscopic severity of UC [ Figure 2].

Mean agreement, variability, and correlation in UCEIS and overall severity scores
Mean UCEIS scores did not differ between blinded and unblinded readers for any of the 40 videos in the main analysis set [Wilcoxon rank sum tests with Holm's multiplicity adjustment, all p-values ≥ 0.05] [ Figure 2A]. There was one video [severe UC recorded before colectomy] for which the VAS score was significantly higher in the blinded group [p = 0.045] [ Figure 2B]. There were no systematic differences found in the UCEIS SDs between the blinded and unblinded groups in the main dataset [median SD 0.94 vs 0.93, respectively; p = 0.97], although as expected, the SD was lowest at the severe end of the video spectrum [

Intra-reader and inter-reader agreement
Overall, intra-reader variability for the three items ranged from κ 0.47 to 0.74, indicating 'moderate' to 'substantial' agreement [ Table 2]. Intra-reader agreements for 'vascular pattern' and 'bleeding' items were similar for the blinded and unblinded groups, whereas the difference for the 'erosions and ulcers' item [blinded: κ 0.47 vs unblinded: κ 0.74; p = 0.047] just reached statistical significance. Clinical information tended to increase variability [i.e. reduced the κ] for the bleeding item and improved consistency between readers for erosions and ulcers. Weighted intra-reader kappas for the full
Inter-reader agreement for the items was also 'moderate' to 'substantial' [κ 0.40 to 0.71; Table 3]. There were no significant differences in inter-reader variability between blinded and unblinded groups for any of the items, whether analysed within the 40 videos in the main dataset [excluding the four common control videos and the second evaluation in each of the two repeated videos] or across the four common control videos. Weighted inter-reader kappas for the full UCEIS were 0.47 [95% CI = 0.46, 0.49] and 0.47 [95% CI = 0.44, 0.50] for the blinded and unblinded readers, respectively.

Potential benefits of central reader groups to reduce overall variability
The median SD was significantly lower in the blinded, virtual CRG than for 'standard' readers [0.54 vs 0.95, respectively; p < 0.001]. This was most apparent in videos representing UCEIS scores between 3 and 5, most likely to represent the spectrum of mild or moderate endoscopic disease severity [ Figure 4], which has substantial implications for clinical trials. The SD was least and also most similar at the extremes of the UCEIS range [≤ 2 and ≥ 6], which implies that the UCEIS has least variance for defining remission or severe endoscopic activity.
Table 4]. The modified Baron Score consistently correlated less well than the UCEIS with all patient-reported symptoms [ Scores for RB and SF items showed a clear correlation with endoscopic severity as measured by the UCEIS [ Figure 5]. When the UCEIS score was ≥ 5, then RB or an increase in SF was present at least 95% of the time. Such patient-reported symptoms inform clinical thresholds of the UCEIS for evaluating its relationship to outcomes, which are relevant to regulatory assessment.

Discussion
This study shows that clinical information has minimal impact on endoscopic scoring of disease activity determined by the UCEIS. It also characterises the performance of the UCEIS with regard to patient-reported symptoms and the impact of central reading. In an independent cohort of 40 central readers completing 28 evaluations from a new library of 44 videos, clinical information did not produce a significant change in the UCEIS in any video [0/40]. The UCEIS correlated well with patient-reported symptoms of stool frequency and rectal bleeding [or both], such that when endoscopic bleeding was scored 0 on the UCEIS [none], this corresponded to contemporaneously patient-reported RB of 0 or 1 95% of the time. The UCEIS had least variance for defining remission or severe endoscopic activity, but where variance was greatest [mild-moderate activity], central reading by a group of three readers had most impact. Central reading matters for clinical trial design, whereas the lack of impact of symptoms on endoscopic assessment matters in clinical practice. The collective performance of the three items accounts for 86% of the variance in the overall assessment of endoscopic severity. 8 Agreement in scoring of individual items is modest but it is the overall score that matters most, assessed in the most severely affected area at flexible sigmoidoscopy. In this study, the intra- [κ 0.47 to 0.74] and inter-reader [κ 0.40 to 0.53] agreement for the three components of the UCEIS [vascular pattern, bleeding, and erosions and ulcers] was consistent with that reported in phase 3 [intra: κ 0.47 to 0.87; inter: κ 0.48 to 0.54]. 8 Although the impact of training was not quantified in this study, it was considered appropriate as with any descriptive process such as endoscopy reporting. 
Consistency highlights a strength of the UCEIS in providing a simple, standardised reporting system for the endoscopic appearances of UC and confirms that it is a reliable instrument for assessing endoscopic disease severity.
Clinical information had minimal effect on the variability and reproducibility of UCEIS scoring. No significant differences between readers were demonstrated for any of the items, whether readers were provided with clinical information or not ['vascular pattern': κ 0.53 vs 0.50, respectively; 'bleeding': κ 0.44 vs 0.40; 'erosions and ulcers': κ 0.47 vs 0.48]. This is internally consistent with the finding that providing unblinded readers with clinical information apparently too severe or too mild for the common control videos did not influence their scoring of the items compared with blinded readers ['vascular pattern': κ 0.66 vs 0.67, respectively; 'bleeding': κ 0.56 vs 0.55; 'erosions and ulcers': κ 0.66 vs 0.71]. Although the 'erosions and ulcers' item registered statistical significance [blinded intra-reader κ = 0.47 vs unblinded: κ = 0.74; p = 0.047; Table 2], the confidence intervals are wide and consistent with 'moderate' to 'substantial' agreement whether blinded or unblinded to clinical information when viewing the videos. It will be difficult to improve on an index that [overall] accounts for 86% of variance between readers for the assessment of endoscopic disease activity. A Delphi procedure on the videos with most disagreement might further enhance agreement, but given the range of analyses conducted herein, some of the differences may be chance findings.
The UCEIS correlated well with patient-reported symptoms, including rectal bleeding, stool frequency [or both] and patient functional assessment [rank correlations 0.76 to 0.82]. There were strong correlations between the scores for individual items and overall UCEIS [ Figure 5], encouraging application in clinical practice. There was also 'substantial' correlation [0.78 to 0.86] with established Table 4). The UCEIS consistently correlated more strongly than the modified Baron score with all patient-reported symptoms. Unfortunately, a comparison between the UCEIS and the MCS for patient-reported symptoms was not possible, since these were derived from data used to calculate the MCS. The impact of patient symptoms on the endoscopic subscore of the MCS is not known. On the other hand, an independent comparative study on responsiveness using clinical trial data suggests that the UCEIS is marginally but consistently more responsive than the MCS. 16 This may have an advantage in early-phase drug development. It is in this field that a binary endpoint of remission / no remission is inefficient and there is value in assessing relative changes in mean scores. 16,17,18 This study has shown that variation is both least and most similar for blinded/unblinded groups at the extremes of the UCEIS range [≤ 2 and ≥ 6]. It means that the UCEIS has least variance for defining either remission or severe endoscopic activity. This is an asset when determining responsiveness [the ability to detect change], which was not evaluated in this or previous studies. Relevant to clinical trials, however, is the variance in the mid range [UCEIS 3-5], which was significantly reduced by central reading groups. This was assessed through random groups of three readers, one of whom was 'experienced' in central reading, to act as an adjudicator should there be disagreement between 'standard' readers, representing normal investigators. 
Central reading of endoscopy can affect the outcome of clinical trials in mild-moderately active UC. A study of a well-established mesalazine product for mild-moderately active UC did not reach significance compared with placebo [p = 0.069]. 7 After blinded central reading of endoscopic videos had excluded 31% of patients as ineligible under the criterion of an MCS subscore of > 2, the difference between the treatment groups readily achieved significance [29.0% vs 13.8%; p = 0.011]. This has implications for a charter on central reading, since the optimal configuration of central reading for scoring is unknown. 9 Options include an index reader, polling multiple central reader results, adjudicating reads on 'outliers', or random selection of videos to be read centrally. The effect of training and revalidation on central readers needs to be determined, since all factors will affect conclusions on drug efficacy and registration.
What still needs to be defined are the levels of the UCEIS for remission, mild, moderate, and severe disease. Taking account of the variance and performance characteristics, the authors speculate that a UCEIS of 0-1 indicates remission and > 6 represents a threshold for severe disease with prognostic implications, although thresholds and their implications need to be defined by prospective study in clinical trials, currently in progress. 17,18 A score of 0 is likely to become the aspirational goal for both regulatory trials and clinical care, 18 which should be index independent, since remission is remission. On the other hand, the UCEIS also seems likely to become the favoured instrument for early drug development, when a binary endpoint of remission / no remission is not efficient and there is enormous value in assessing relative changes in mean scores. 16,17,18 A further limitation of the current study is that the readers may not reflect endoscopists in clinical practice. Readers were selected, trained, and had images available during the rating, so whether clinical information has an impact in real life remains open to question. The answer is training. Although the UCEIS is simple in concept and easy to apply, the fact that 7/47 proposed readers failed to agree with defined interpretations indicates that training is a necessary component of its application, no less for evaluating colitis than for endoscopic polyp detection.
Next steps in the development of the UCEIS include establishing thresholds for remission and severity, and responsiveness to change. It would be valuable to examine how the UCEIS is affected by evaluation of colonic segments at full colonoscopy. 19,20 On the other hand, caution should be exercised to prevent the UCEIS becoming more complex than necessary. Of greater interest is correlation with histological disease activity or biomarkers, especially for the prognostic value in remission. Central readers can markedly decrease the variability in UCEIS scoring, particularly in the mild-moderate disease spectrum, which is most relevant to clinical trials. The UCEIS is simple to use, being derived from the sum of just three items. It accurately accounts for a wide range of disease severity associated with UC, is affected minimally if at all by clinical information, and is ready for practice after appropriate training.

Conflicts of interest
None declared.