Variation in descriptors of patient characteristics in randomized clinical trials of peptic ulcer repair: a systematic review

The ability to compare findings across surgical research is important. Inadequate description of participants, interventions or outcomes could lead to bias and inaccurate assessment of findings. The aim of this study was to assess consistency of description of participants using studies comparing laparoscopic and open repair of peptic ulcer as an example.


Introduction
The purpose of scientific enquiry is to expand the realms of knowledge and understanding. One of the key ways in which this can be achieved is through standardization or control of a range of variables to eliminate bias. The rigorous application of this approach allows us to add weight to the findings. It can also help to address some of the current issues around replication of findings, the so-called 'replication crisis'. The other way of assessing the consistency of findings is through systematic review and meta-analysis. This allows us to identify whether treatments work, and for whom. Reviews are, however, necessarily limited by the quality of the studies entered into them.
There are three main components within any surgical study: the subjects who were studied (in terms of both inclusion and exclusion criteria and unreported characteristics); the intervention or treatment(s) compared; and the outcomes reported. A mismatch between studies in any of these components has the potential to introduce heterogeneity into an analysis, potentially leading researchers to an incorrect conclusion. This is a particular risk as there has been a proliferation of systematic reviews and meta-analyses in recent years, some of which have been completed with methodological flaws 1,2. The potential error introduced by incomplete or inconsistent reporting of population characteristics could have several effects. One of these might be preventing the identification of characteristics associated with favourable or unfavourable outcomes. Inadequate characterization of patient populations also presents challenges to the external validity of studies, as it impairs the comparison of a study to real-world clinical populations.
Previous work [3][4][5] has shown a range of outcomes reported in surgical studies, with limited ability to compare across them. This has led to recent attempts to develop 'core outcome sets' across a number of clinical settings [6][7][8]. This is an established methodology 9 which aims to rationalize outcomes to an agreed set that can be reported and compared consistently. It does not prevent researchers from reporting additional outcomes.
It is not clear whether there are issues relating to the description of patients entered into surgical studies, although it is recognized that selective inclusion of patients could alter study findings 10. The aim of this analysis was to explore variation in reporting of baseline descriptors. This was conducted by assessing the literature comparing laparoscopic with open treatment of perforated peptic ulcer (PPU) as an example model, and exploring and quantifying any variation between these studies in their description of participants. This condition was selected as a subject of interest to the authors' research team.

Methods
This systematic review was performed with reference to the Cochrane Handbook 11, and is reported in line with PRISMA guidance 12. It was not registered prospectively. Systematic searches of the MEDLINE and Embase databases were performed using a predefined search strategy, adapted from the previous Cochrane systematic review 13. The search strategy is presented in Appendix S1 (supporting information).
Manuscripts reporting comparison of short-term clinical outcomes (up to 90 days) between laparoscopic and open repair of PPU, published in the English language at any time were eligible for inclusion. Papers reporting longer-term outcomes or non-clinical outcomes (such as health economic evaluation) were not eligible for inclusion. Single-arm (non-comparative) studies and case series were excluded. Abstracts were screened against selection criteria for eligibility independently by two reviewers, with conflicts assessed by a third reviewer. This process was repeated for full-text assessments.
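The dual-reviewer screening step described above can be sketched as follows. This is an illustration only and not the authors' tooling: the function name `resolve_screening`, the record identifiers and the True/False encoding of include/exclude decisions are all assumptions introduced here.

```python
def resolve_screening(reviewer_a, reviewer_b, arbiter):
    """Combine two independent screening decisions per record.

    reviewer_a, reviewer_b: dict of record id -> True (include) / False (exclude).
    arbiter: dict holding the third reviewer's decision for conflicting records.
    Returns (final decisions, ids of records that needed arbitration).
    """
    final, conflicts = {}, []
    for record_id in reviewer_a:
        a, b = reviewer_a[record_id], reviewer_b[record_id]
        if a == b:
            # Both reviewers agree, so their shared decision stands.
            final[record_id] = a
        else:
            # Disagreement: defer to the third reviewer.
            conflicts.append(record_id)
            final[record_id] = arbiter[record_id]
    return final, conflicts


# Hypothetical records, not abstracts screened in the review.
a = {"rec1": True, "rec2": False, "rec3": True}
b = {"rec1": True, "rec2": True, "rec3": False}
third = {"rec2": False, "rec3": True}
final, conflicts = resolve_screening(a, b, third)
print(final)
print(conflicts)
```

The same function could be applied once to abstract-level decisions and again to full-text decisions, mirroring the two screening rounds.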
Year of publication and study design were collated for each included manuscript. Data extracted from studies included all baseline descriptors presented in the study demographics table (typically Table 1), as well as any descriptors presented in the results paragraph. Baseline descriptors were those characteristics present at baseline and not impacted by the treatment decision, intervention or operative approach (such as age, sex and peritoneal contamination). Descriptors were classed as unique even if they reported the same measure (such as age or blood biochemistry measures) but used different cut-off levels, as these would affect the ability to interpret across studies. The number and nature of unique descriptors in each study was identified and aggregated across studies.
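One way to picture the counting rule above is to treat each descriptor as a pair of the measure and the form or cut-off in which it was reported, so that the same measure with a different cut-off counts as a separate descriptor. The sketch below assumes this encoding; the study labels and descriptor values are hypothetical and are not data extracted in the review.

```python
from collections import Counter

# Hypothetical extraction: study label -> set of (measure, reporting form) pairs.
studies = {
    "study_A": {("age", "continuous"), ("sex", "binary"), ("systolic BP", "mean")},
    "study_B": {("age", ">= 70 years"), ("sex", "binary")},
    "study_C": {("age", "continuous"), ("systolic BP", "< 90 mmHg")},
}

# Number of unique descriptors reported by each study.
per_study = {name: len(descs) for name, descs in studies.items()}

# Number of studies reporting each unique descriptor, aggregated across studies.
frequency = Counter(d for descs in studies.values() for d in descs)

print(per_study)
print(len(frequency))                    # size of the longlist
print(frequency[("age", "continuous")])  # studies reporting age as continuous
```

Under this encoding, age reported as a continuous measure and age reported with a cut-off appear as two entries in the longlist, which is the behaviour the rule describes.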
Baseline descriptors were presented to the research team as a long list. The team grouped descriptors according to the major concept they were thought to be measuring. After they had been grouped, a descriptive name was attached to each category. No bias assessment was performed, as this study reflects collation of content and not assessment of substance or results.
Descriptive reporting of the number of items reported was performed. With the hypothesis that reporting of descriptors may have improved in recent years, correlation between number of descriptors reported and year of publication was explored using Spearman's correlation. Significance was set at P = 0⋅050 a priori, and analyses were performed using R for statistics (The R Foundation for Statistical Computing, Vienna, Austria).
[Study flow diagram: full-text articles assessed for eligibility, n = 37; full-text articles excluded, n = 14 (conference abstract, n = 5; unable to retrieve, n = 3; not in English, n = 2; long-term clinical outcomes, n = 1; no comparator, n = 1; cost-effectiveness study, n = 1; letter, n = 1); studies included in qualitative synthesis, n = 23.]
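For readers unfamiliar with the statistic, Spearman's correlation is the Pearson correlation of the rank-transformed data, with tied values given the mean of their ranks. The sketch below shows that calculation in plain Python. It is not the authors' analysis code (the analyses were run in R, and the script is not given in the source); the function names and the example data are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's r_s: Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Hypothetical data, not the values extracted in the review.
years = [1996, 2002, 2004, 2009, 2013, 2016]
n_descriptors = [9, 12, 7, 12, 10, 8]
r_s = spearman(years, n_descriptors)
print(round(r_s, 3))
```

The sketch returns the coefficient only; the P value compared against the 0⋅050 threshold is not computed here. In R, `cor.test(x, y, method = "spearman")` reports both.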

Participant descriptor domains
After longlisting of the identified domains, the items were categorized into seven conceptual groups: demographics; measures of baseline health; laboratory tests; risk factors for development of PPU; vital signs; disease-specific characteristics; and presentation and pathway factors. The longlist of descriptors and frequency of reporting is shown in Table 2.
A summary of the number of descriptors reported per study in each domain is presented in Fig. 2.
Owing to the relatively small number of studies identified, a formal statistical analysis of variation between study designs was not performed. Fig. S1 (supporting information) shows general overlap in the ranges of number of descriptors reported in each study type. There was no correlation between year of publication and number of descriptors reported (r_s = 0⋅73). A scatter plot of number of descriptors by year of publication is presented in Fig. S2 (supporting information).

Demographics
Eleven demographic measures were identified with variable levels of reporting. Sex of participants was reported in 22 (96 per cent) of the 23 studies, and age as a continuous measure in 20 (87 per cent). Patient age was presented as a non-continuous measure in three studies (13 per cent), with variable age groupings used. The characteristics of minority ethnicity and socioeconomic status by quartile were reported in one study (4 per cent) each (Fig. 3a).
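The percentages quoted throughout the results are the number of reporting studies divided by the 23 included studies, rounded to a whole per cent. A small check of the demographic figures above, assuming simple rounding (the helper name `per_cent` is introduced here for illustration):

```python
def per_cent(n_reporting, n_studies=23):
    """Share of included studies reporting a descriptor, as a whole per cent."""
    return round(100 * n_reporting / n_studies)


for n in (22, 20, 3, 1):
    print(n, per_cent(n))
# 22 96
# 20 87
# 3 13
# 1 4
```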

Measures of baseline health
Twenty-eight measures were related to baseline or chronic health (Fig. 3b). ASA grade was the most frequently reported descriptor, in 13 studies (57 per cent). Charlson Co-morbidity Index, presence of chronic obstructive pulmonary disease and presence of diabetes mellitus were reported in three studies each (13 per cent). The descriptor 'co-morbidities present', reported as a binary 'yes' or 'no' status, was used in two studies (9 per cent), as were the presence of congestive cardiac failure, hypertension and renal disease (which was itself variably defined or undefined).

Risk factors for development of perforated peptic ulcer
Eight risk factors for PPU disease were reported in studies (Fig. 3c). These were smoking status (7 studies, 30 per cent), alcohol use (4 studies, 17 per cent), previous peptic ulcer disease (4 studies, 17 per cent), steroid use (3 studies, 13 per cent) and non-steroidal anti-inflammatory use (2 studies, 9 per cent). The remaining characteristics of use of ulcerogenic drugs, use of aspirin and 'patients with prognostic factors' were reported in one study (4 per cent) each.

Vital signs
There was considerable variability in the reporting of vital signs (Fig. 3d). Overall, nine descriptions of vital signs were recorded. Five of these were related to haemodynamic variables such as heart rate, systolic or diastolic BP. Preoperative shock was reported in five studies (22 per cent), mean systolic BP in two studies (9 per cent) and systolic BP of less than 90 mmHg on admission in two studies (9 per cent). Systemic inflammatory response syndrome was reported in four studies (17 per cent) and temperature at admission in two studies (9 per cent). Shock on admission, diastolic BP on admission and heart rate were reported in one study (4 per cent) each.

Disease-specific characteristics
Ten measures of disease-specific characteristics were identified. These included factors related to the ulcer, including site of perforation (9 studies, 39 per cent), defect size greater than 1 cm (3 studies, 13 per cent) and mean size of defect (2 studies, 9 per cent). Measures relating to the degree of peritoneal contamination were also reported: the Boey score was reported in five studies (22 per cent), the Mannheim Peritonitis Index in three studies (13 per cent), and a Mannheim Peritonitis Index score above 27 in one study (4 per cent). Need for blood transfusion, volume of blood transfused and need for preoperative ventilation were reported in one study (4 per cent) each.

Presentation and pathway factors
Four factors were reported relating to timing of presentation. These were admission within 24 h (3 studies, 13 per cent) and duration of symptoms (2 studies, 9 per cent); delayed presentation and pain for more than 24 h were reported in one study (4 per cent) each.

Coverage of domains by study
Reporting of factors in each domain was assessed both as presence or absence and as the proportion of identified characteristics in that domain (Table 1). Laboratory tests was the least frequently reported domain, covered in eight studies. Eleven studies addressed risk factors, 12 reported preoperative vital signs, and 11 reported disease-specific characteristics. Seventeen studies reported baseline health descriptors, and all 23 studies reported some aspects of patient demographics.
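The two coverage measures described above (presence or absence of a domain, and the proportion of that domain's longlisted characteristics a study reported) can be sketched as below. The domain contents and the single example study are hypothetical and much shorter than the longlist in the review; the function name `domain_coverage` is an assumption introduced here.

```python
# Hypothetical domain longlists and one study's reported descriptors.
domain_longlist = {
    "demographics": {"sex", "age (continuous)", "age (grouped)", "ethnicity"},
    "vital signs": {"preoperative shock", "mean systolic BP", "heart rate"},
    "laboratory tests": {"creatinine", "white cell count"},
}
study_reported = {"sex", "age (continuous)", "heart rate"}


def domain_coverage(reported, longlist):
    """Return {domain: (covered?, proportion of longlisted items reported)}."""
    coverage = {}
    for domain, items in longlist.items():
        n_hit = len(reported & items)  # descriptors the study shares with the domain
        coverage[domain] = (n_hit > 0, n_hit / len(items))
    return coverage


coverage = domain_coverage(study_reported, domain_longlist)
for domain, (covered, proportion) in coverage.items():
    print(domain, covered, round(proportion, 2))
```

Counting the studies with a covered domain gives the per-domain totals of the kind reported above, while the proportion shows how much of each domain a study described.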

Discussion
This study reviewed reporting of baseline characteristics of patients included in studies comparing laparoscopic with open repair of PPU. It demonstrated variable reporting of characteristics, with a range of measures used. These measures were, in turn, reported variably, with continuous variables frequently presented as categorical data. This poses challenges for the comparison of outcomes across studies.
The variation in descriptors between studies may in part be explained by variation in the prognosticators of clinical outcome reported in the literature. Even a single putative predictor, however, was described inconsistently. Systolic BP at admission, a putative predictor of mortality, was reported in five different ways in the present systematic review, making comparison across studies difficult. Many studies used scores or measurements that were composites or indirect descriptors, for example BMI or Boey score. Where these were used, summaries of the constituent data were not presented. This hints that descriptive data were collected but not reported. This may be for reasons such as brevity of report, or may relate to journal style or policies.
Although it is recognized that the evidence base for emergency surgery should be improved by way of high-quality RCTs 37,38, there is a long-standing issue of unrepresentative samples contained within clinical research 39. This affects both interpretation and comparison of results. These data show, through the example intervention of surgery for PPU, that rigorous and consistent characterization may be lacking in the acute surgical setting. This, in turn, may impact the external validity of studies, and prevent the identification of interventions that will truly improve outcomes for specific patient groups. A further benefit of standardizing descriptors for conditions under study might be to reduce heterogeneity of the populations under study, and thereby reduce heterogeneity in meta-analyses. Specifically, this might permit reliable comparisons of individual-level data in individual patient data meta-analysis in a way not currently achievable.
There are some limitations to this study that necessarily affect the strength of the conclusions. The study was not registered prospectively and no bias assessment has been performed, in common with other studies that longlist outcomes 3,4. However, the review employed standard systematic review techniques with dual assessors of studies for inclusion. The search terms were robust, which should ensure the widest possible sampling of the literature. It is plausible that poor characterization of the underlying patient groups impedes comparison of studies. Where populations with varied and unreported characteristics are compared, this may contribute to heterogeneity 40. Core outcome sets have been developed to deal with the related issue around outcomes. These are likely to reduce heterogeneity attributable to variable outcome reporting 41. Equally, improving descriptions of characteristics of interventions will likely address some of the heterogeneity attributable to surgical procedures 42. The logical extension of this is to attempt to standardize the reporting of at least key descriptors in studies. A review of the literature identified only one further study 43 that addressed the question of variable descriptor reporting. This looked at the issue within the context of lower back pain, and highlighted the same issues around barriers to comparison and appreciating clinical applicability.
The noted variation in baseline descriptors is unlikely to be limited to this setting alone. Other avenues for investigation might be the characterization of patients in benign versus malignant conditions, emergency versus chronic or elective conditions, and also quality of descriptor reporting according to study design. This is likely to be an issue in other acute surgical conditions, although to what extent remains to be defined. Further work is required to define this across other conditions and interventions.
The development of a core descriptor set should be part of the formal process for development of a core outcome set, as the two are intrinsically entwined. The development of such a methodology will require work with stakeholders to establish common definitions of included characteristics. Implementation of such an intervention would also require scrutiny. Specifically, at what number of descriptors would such a set become a burden that is ignored by researchers? Opinion is also required on which descriptors may bring particular challenges in data collection such that missing data could become a problem. The present study suggests that around 11 descriptors seem to be a broadly acceptable number.
It might be anticipated that key characteristics could be defined that map to the domains identified here. These are proposed in Table 2. The exact measurement of each domain is subject to debate. For example, renal function is measured preferentially using creatinine concentration in some settings, and blood urea nitrogen levels in others. There is also the potential that a key prognostic factor might not be presented here. Further work is required to establish a definitive core descriptor set for this condition that can be used across multiple health systems and study designs, with minimal burden to researchers.

Disclosure
The authors declare no conflict of interest.