Assessing the impact of open-label designs in patient-reported outcomes: investigation in oncology clinical trials

Abstract
Background Knowledge of treatment assignment may affect patient-reported outcomes (PROs), which is of concern in oncology, where open-label trials are common. This study measured the magnitude of open-label bias by comparing PROs for similar patient groups in oncology trials with different degrees of concealment.
Methods Individual patient data from ipilimumab arms of 2 melanoma and docetaxel arms of 2 non-small cell lung cancer (NSCLC) trials were adjusted for differences using propensity score weighting. Patients were aware of treatment assignment in CA184-022 and CheckMate 057 (open-label) but not in MDX010-20 and VITAL (blinded). Overall survival (OS) and mean changes from baseline to week 12 in the European Organization for Research and Treatment of Cancer Quality of Life Questionnaire-Core 30 (melanoma) and Lung Cancer Symptom Scale (NSCLC) scores were compared between open-label and blinded groups.
Results After adjustment, baseline characteristics were balanced between blinded (melanoma, n = 125; NSCLC, n = 424) and open-label (melanoma, n = 69; NSCLC, n = 205) groups. Study discontinuation and PRO completion rates at week 12 and OS were similar. There was no clear direction in differences in change scores between groups. In the melanoma trials, role functioning (mean = -5.2, 95% confidence interval [CI] = -15.4 to 5.0), global health status (mean = -1.3, 95% CI = -8.7 to 6.1), and pain (mean = 6.2, 95% CI = -1.8 to 14.2) favored the blinded group, whereas emotional functioning (mean = 2.2, 95% CI = -5.8 to 10.2) and diarrhea (mean = -8.3, 95% CI = -17.3 to 0.7) favored the open-label group. In the NSCLC trials, changes in dyspnea (mean = 5.4, 95% CI = -0.7 to 11.5) favored the blinded group and changes in appetite (mean = -1.2, 95% CI = -8.1 to 5.7) favored the open-label group. None were clinically or statistically significant.
Conclusions This study adds to the growing evidence demonstrating that concerns regarding open-label bias should not prohibit the interpretation of large and meaningful treatment effects on PROs.

Patient-reported outcomes (PROs) are important to consider alongside clinical outcomes in drug trials, and their incorporation during drug development continues to be emphasized by regulatory agencies such as the US Food and Drug Administration and the European Medicines Agency (1,2). The European Medicines Agency recommends blinded designs for PRO measurement in oncology studies to avoid concerns about bias in open-label settings; however, in circumstances where blinding is not possible, it states that "PRO data should also be supported by objective measures" (1). Meanwhile, the US Food and Drug Administration rarely considers open-label trials to be adequate to support labeling claims based on PRO instruments because of concerns that patients' knowledge of treatment assignment may lead to challenges in interpreting benefit (2).
Compared with outcomes proximal to the physiology of disease and its treatment, such as symptoms, distal concepts such as emotional and social function and quality of life (QoL) are hypothesized to be more susceptible to open-label bias, because knowledge of treatment assignment may instill hope in patients in the experimental group and disappointment in those in the control group (3). Additionally, there are concerns that in open-label settings, patients assigned to the control group may be less likely to complete PRO assessments compared with patients in the experimental group (3).
Although these concerns are prudent a priori, they warrant careful empirical evaluation to assess the actual risk, direction, and magnitude of open-label bias; otherwise, valuable PRO information risks being ignored. Furthermore, given different findings regarding the effect of blinding on effect size across various clinical areas (4,5), more research is needed within specific clinical contexts, especially oncology, where unblinded trials are common (6). Prior articles and systematic reviews that have explored these concerns in cancer trials using various approaches have not found clear evidence of differential effect by blinding status (7-12). In particular, compared with blinded designs, open-label designs rarely affected PRO completion rates (3,10) or baseline scores (10). However, a limitation of this prior work is that most of it (8-11) is based on published aggregate results, with few studies (12) using individual patient data (IPD). The focus on aggregate data, although helpful, can be limited in the ability to directly compare similar treatments with different levels of blinding while adjusting for cross-trial differences in populations. Despite this, there is a paucity of research using IPD.
Therefore, this study sought to measure the presence and magnitude of open-label bias by comparing PROs using IPD from melanoma and non-small cell lung cancer (NSCLC) trials with different degrees of concealment for treatment.

Study population
A targeted literature review was conducted using ClinicalTrials.gov and Project Data Sphere (PDS), a digital open-access data-sharing library, which includes de-identified IPD from academic and industry-sponsored phase 3 oncology trials. Eligible pairs of oncology trials were required to meet the following criteria: 1) included the same treatment and dose in at least 1 of the trial arms; 2) included the same cancer site and stage; 3) used the same PRO instrument; 4) differed in the level of blinding to treatment assignment; and 5) had IPD available and accessible for the treatment arm of interest (either through Bristol-Myers Squibb or publicly available through PDS). The search identified 6 trial pairs that met the inclusion criteria and were reviewed in depth. Four trial pairs were excluded because their data were subsequently retracted from PDS or were missing for the PROs. One trial pair in metastatic melanoma [MDX010-20 (NCT00094653) (13) and CA184-022 (NCT00289640) (14)] and 1 in advanced NSCLC [VITAL (NCT00532155) (15) and CheckMate 057 (NCT01673867) (16)] were comparable and met the criteria (see Table 1). A race variable was available in VITAL and CheckMate 057. In VITAL, race categories were either Caucasian/White or Other. In CheckMate 057, race categories were Asian, Black or African American, Native Hawaiian or other Pacific Islander, White, and Other. For consistent reporting, we recategorized the CheckMate 057 race categories to match those available in VITAL.
Patients were aware of the drug they received in CA184-022 (but not the dose) and in CheckMate 057 (open-label). They were unaware of the treatment and dose assignment in MDX010-20 and VITAL (blinded). Our study used data from the ipilimumab 3 mg/kg arm from CA184-022 (experimental) and MDX010-20 (active control), the docetaxel 75 mg/m² arm (active control) from CheckMate 057, and the placebo with docetaxel 75 mg/m² arm (placebo control) from VITAL.
Patients who were randomly assigned to and received the assigned treatment (melanoma: ipilimumab 3 mg/kg; NSCLC: docetaxel 75 mg/m²) and had PROs at baseline were included in the analysis. The PROs of interest were assessed at week 12 in both sets of trials. Assessments were performed regardless of treatment discontinuation for all trials except CheckMate 057, where they were performed on treatment visits only. A sensitivity analysis was conducted excluding patients who discontinued treatment but reported PRO data at week 12.

Study outcomes
Clinical outcomes included overall survival (OS) and progression-free survival (PFS). PROs included the European Organization for Research and Treatment of Cancer (EORTC) Quality of Life Questionnaire-Core 30 (QLQ-C30) (17) for the melanoma trials, and the Lung Cancer Symptom Scale (LCSS) (18,19) for the NSCLC trials.
We evaluated the 5 functioning scales of the EORTC QLQ-C30 (physical, social, role, cognitive, and emotional functioning), the 8 symptom scales (fatigue, nausea and/or vomiting, pain, dyspnea, insomnia, appetite loss, constipation, and diarrhea), and global health status (17). All items are scored on a scale of 0 to 100. For the functioning scales and global health status, higher scores indicate better functioning; for the symptom scales, higher scores indicate higher symptom burden.
The LCSS consists of 9 items: 6 measuring lung cancer symptoms (appetite loss, fatigue, cough, dyspnea, hemoptysis, and pain) and 3 related to symptom distress, interference with activity level, and global health-related QoL (HRQoL) (18,19). Scores for each item range from 0 to 100, with 0 representing the best possible score and 100 the worst (18,19). In addition to the item scores, following standard approaches, we computed the average symptom burden index score as the mean of the 6 symptom-specific scores, with higher scores indicating greater symptom burden.
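As an illustration only, the average symptom burden index described above can be sketched in Python as the mean of the 6 symptom item scores. The dictionary keys, the function name, and the choice to raise an error when an item is missing are assumptions made for this sketch; the handling of missing items in the actual analysis is not described here.

```python
# The 6 LCSS symptom items named in the text; the key names are illustrative.
LCSS_SYMPTOM_ITEMS = (
    "appetite_loss", "fatigue", "cough", "dyspnea", "hemoptysis", "pain",
)


def average_symptom_burden_index(item_scores):
    """Mean of the 6 symptom item scores (each 0-100, higher = worse)."""
    missing = [name for name in LCSS_SYMPTOM_ITEMS if name not in item_scores]
    if missing:
        # Assumption for this sketch: no imputation, fail loudly instead.
        raise ValueError(f"missing symptom items: {missing}")
    values = [item_scores[name] for name in LCSS_SYMPTOM_ITEMS]
    if any(not 0 <= value <= 100 for value in values):
        raise ValueError("LCSS item scores must lie between 0 and 100")
    return sum(values) / len(values)


# Hypothetical patient; the non-symptom item is ignored by the index.
example_patient = {
    "appetite_loss": 30, "fatigue": 50, "cough": 20,
    "dyspnea": 40, "hemoptysis": 0, "pain": 10,
    "symptom_distress": 35,
}
print(average_symptom_burden_index(example_patient))  # 25.0
```

The 3 remaining items (symptom distress, interference with activity level, and global HRQoL) are reported as separate item scores and do not enter the index.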

Statistical analysis
Continuous variables were summarized using means and 95% confidence intervals (CIs), and categorical variables were summarized using frequency counts and percentages. Propensity score weighting (20,21) was used to adjust for differences in the baseline characteristics between the open-label and blinded groups. Patient characteristics that were commonly available in each trial pair were included in the propensity score models (eg, age, sex, geographic region, disease stage, Eastern Cooperative Oncology Group performance status, baseline PRO scores). Means and frequencies of baseline characteristics before and after weighting were compared using standardized mean differences (SMDs). SMDs below the commonly accepted threshold of 0.1 indicate that the characteristics of the 2 groups are balanced (22).
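The balance check described above can be sketched as follows. This is a minimal illustration, not the analysis code: it assumes the propensity score weights have already been estimated and are passed in as plain lists, it uses a weighted variance normalized by the sum of the weights, and all function names and example values are hypothetical.

```python
from math import sqrt


def weighted_mean(values, weights):
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def weighted_variance(values, weights):
    # Assumption: variance normalized by the total weight.
    center = weighted_mean(values, weights)
    return sum(w * (x - center) ** 2 for x, w in zip(values, weights)) / sum(weights)


def standardized_mean_difference(x_a, w_a, x_b, w_b):
    """Absolute SMD: |mean_a - mean_b| / sqrt((var_a + var_b) / 2)."""
    pooled_sd = sqrt((weighted_variance(x_a, w_a) + weighted_variance(x_b, w_b)) / 2)
    if pooled_sd == 0:
        raise ValueError("pooled SD is zero; SMD is undefined")
    return abs(weighted_mean(x_a, w_a) - weighted_mean(x_b, w_b)) / pooled_sd


# Hypothetical ages in two small groups.
open_label_age = [50, 60, 70]
blinded_age = [40, 50, 60, 70]

unit_a = [1.0] * len(open_label_age)
unit_b = [1.0] * len(blinded_age)
smd_before = standardized_mean_difference(open_label_age, unit_a, blinded_age, unit_b)

# Hypothetical weights that pull the blinded group's mean age to 60.
blinded_weights = [0.5, 1.0, 1.0, 2.0]
smd_after = standardized_mean_difference(open_label_age, unit_a, blinded_age, blinded_weights)

print(smd_before > 0.1, smd_after < 0.1)
```

In this toy example the groups are imbalanced on age before weighting (SMD above 0.1) and balanced afterward (SMD below 0.1), mirroring the before and after comparison reported in the baseline tables.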
OS and PFS were compared before and after weighting between the open-label and blinded groups to assess similarities in the trial populations using Kaplan-Meier analysis and the log-rank test. Mean differences in scores from baseline to week 12 for all PROs were calculated as the difference between the mean change scores for the open-label and blinded groups. Statistical comparisons of PROs between the paired groups were conducted using Wilcoxon rank sum tests for continuous variables and χ² tests for categorical variables.
Clinical significance of the differences was assessed using 1) the SD of the baseline PRO scores and 2) the minimal clinically important difference (MCID) reported in the literature. Differences of half or more of the baseline SD were considered clinically significant (23). For the EORTC QLQ-C30 scales, an absolute difference of 5-10 points is considered a small change, and 10-20 points a moderate change (24); hence, MCID was defined as a difference of 10 points. For the LCSS, the MCID was defined as an absolute difference of 10 points (16,25).
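The mean difference in change scores can be sketched as below. The sign convention (open-label minus blinded) is an assumption, chosen because it matches the directions reported in the Abstract; the function names and the small numeric inputs to `mean_change` are hypothetical.

```python
def mean_change(baseline, week12, weights=None):
    """(Weighted) mean of week 12 minus baseline scores."""
    if weights is None:
        weights = [1.0] * len(baseline)
    changes = [later - earlier for earlier, later in zip(baseline, week12)]
    return sum(w * c for w, c in zip(weights, changes)) / sum(weights)


def difference_in_mean_change(open_label_change, blinded_change):
    # Assumed sign convention: open-label minus blinded.
    return open_label_change - blinded_change


def favored_group(difference, higher_is_better):
    """Group favored by a difference, given the direction of the scale."""
    if difference == 0:
        return "neither"
    open_label_better = (difference > 0) == higher_is_better
    return "open-label" if open_label_better else "blinded"


# Hypothetical scores for two patients, unweighted and weighted.
print(mean_change([50, 60], [45, 75]))              # 5.0
print(mean_change([50, 60], [45, 75], [1.0, 3.0]))  # 10.0

# Differences reported in the Abstract, with the scale direction.
print(favored_group(-5.2, higher_is_better=True))   # role functioning: blinded
print(favored_group(-8.3, higher_is_better=False))  # diarrhea: open-label
print(favored_group(5.4, higher_is_better=False))   # LCSS dyspnea: blinded
```

Because functioning scales and global health status improve as scores rise while symptom scales and LCSS items improve as scores fall, the same sign can favor different groups depending on the scale.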
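The two clinical significance criteria can be expressed as a small check. This is a sketch under stated assumptions: a difference exactly equal to the MCID is treated as meeting it, the baseline SD of 25 points is a hypothetical value rather than one reported in the trials, and the function name is illustrative.

```python
def clinical_significance(difference, baseline_sd, mcid=10.0):
    """Evaluate an absolute difference against both criteria in the text."""
    magnitude = abs(difference)
    return {
        # Criterion 1: half or more of the baseline SD.
        "half_sd": magnitude >= 0.5 * baseline_sd,
        # Criterion 2: the 10-point MCID (assumed inclusive at exactly 10).
        "mcid": magnitude >= mcid,
    }


# Diarrhea difference from the Abstract with a hypothetical baseline SD.
print(clinical_significance(-8.3, baseline_sd=25.0))
```

With these inputs neither criterion is met, because 8.3 is below both 12.5 (half of the hypothetical SD) and the 10-point MCID.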
All analyses were conducted using SAS Enterprise Guide version 7.15 and R version 3.6.1. Unless otherwise indicated, a P value less than .05 was the threshold for statistical significance based on 2-sided tests.
Patients in CA184-022 were blinded to the treatment dose, but all were certain to receive ipilimumab monotherapy; therefore, the 3 mg/kg ipilimumab-alone arm of the trial was considered the open-label group when compared with the 3 mg/kg ipilimumab-alone arm of MDX010-20 (blinded).

Study population and baseline characteristics
In the melanoma trials, 71 patients in the open-label and 131 patients in the blinded group received ipilimumab 3 mg/kg, and of those, 69 (97.2%) and 125 (95.4%), respectively, had PRO assessments at baseline and were included in the analysis. The PRO completion rate at week 12 was similar between the open-label and blinded groups (63.8% vs 66.4%). The study discontinuation rate at week 12 was also similar (16.9% vs 18.8%). Before weighting, the average age was 58.5 years in the open-label and 56.5 years in the blinded group, and less than half of patients were female (33.3% and 40.8%, respectively; Table 2). Large cross-trial differences prior to weighting were seen for metastatic stage at baseline and prior immunotherapy use, but these characteristics and others were balanced after weighting, with SMDs less than 0.1 and a P value of .05 or higher for all comparisons (Table 2).

Clinical outcomes
OS was similar between the open-label and blinded groups before and after weighting (after weighting: P = .89; Figure 1, A). However, differences in PFS remained after weighting, and PFS was lower for the open-label compared with the blinded group (P = .003; Figure 1, B).
Results from the sensitivity analysis excluding patients who reported PRO data at week 12 after having discontinued [...]

[...] groups (45.9% vs 49.1%). The study discontinuation rate at week 12 was also similar (13.7% vs 10.8%).
Before weighting, the average age was 62.1 years in the open-label and 59.6 years in the blinded group, and less than half of patients were female (42.9% and 34.4%, respectively; Table 3). Nearly all patients had metastatic disease (96.1% and 90.3%, respectively). Statistically significant cross-trial differences prior to weighting were seen for age, sex, region of recruitment, metastatic disease, symptom distress, and HRQoL, but these were likewise balanced following weighting, with SMDs less than 0.1 and P values of at least .05 for all comparisons (Table 3).

Clinical outcomes
OS and PFS were similar between the open-label and blinded groups before and after weighting (after weighting: P = .73 for OS; P = .85 for PFS; Figure 3, A and B).
Results from the sensitivity analysis excluding patients who reported PRO data at week 12 after having discontinued treatment were consistent with the main findings (Supplementary Figure 2, available online).

Discussion
In this analysis of IPD from melanoma and NSCLC trials, there was no evidence to suggest clinically or statistically significant differences between the open-label and blinded groups in any of the EORTC QLQ-C30 domains (eg, global health status, functioning, and symptom scales) or in the LCSS (eg, lung cancer symptoms, symptom distress, activity level, and HRQoL). Although small, nonsignificant differences were found between the groups after adjustment, there was no clear or consistent pattern of direction, with some differences favoring the open-label and some favoring the blinded group. For the EORTC QLQ-C30 scales in the melanoma trials, most mean differences in change scores were less than 5 points, and those that were larger were inconsistent in direction. For the LCSS in the NSCLC trials, most mean differences between the 2 groups were small (1-2 points), and the slightly larger differences of 5 points for dyspnea and pain favored the blinded group. No difference exceeded the 10-point MCID in any of the trials.
The PRO findings are consistent with the lack of differential effect seen in the clinical outcomes: OS was similar between the open-label and blinded groups for melanoma and NSCLC, and PFS was similar for NSCLC, both before and after weighting. A difference in PFS was observed between the 2 groups from the melanoma trials before and after weighting. We recognize that although we made the best attempt to balance patient characteristics between the trials using propensity score weighting, differences may remain in baseline characteristics that were not captured in the trial data, and such variables might play an important role in PFS. Of note, PFS was investigator determined in all 4 trials; hence, we do not expect the method of assessment to be a source of differences (26). Excluding patients who reported PROs after discontinuing treatment in the melanoma and NSCLC trials did not change our interpretation of the PRO findings.
In this study, PRO completion rates at week 12 were similar between the open-label and blinded groups in the melanoma and NSCLC trials, which is consistent with previous findings in the literature. Though Roydhouse et al. (7) and Anota et al. (10) found some differences in PRO completion favoring the experimental arm, treatment concealment generally seemed to have little impact on PRO completion rates (7,10). Consistent with the findings regarding PRO completion rates, the current study also found no evidence of significant bias associated with PROs because of the absence of blinding. These findings are consistent with those from Chakravarti et al. (27), who reported no clear pattern of overestimation of improvement in emotional domain scores in the open-label investigational arms compared with blinded investigational arms in 3 oncology trial pairs. In a separate analysis, Roydhouse et al. (12) used propensity score weighting and multiple imputation to compare global QoL, function, and symptoms in multiple myeloma trials using IPD and did not find evidence of consistent or meaningful differences by blinding status. Taken together, the current study adds to the growing literature demonstrating a lack of meaningful differential effect by blinding status in cancer trials, using a robust cross-trial comparison that incorporates IPD from melanoma and NSCLC trials using 2 different PRO instruments and corroboration with clinical outcomes.
Prior work has hypothesized that distal domains such as emotional function, social function, and global QoL may be more susceptible to open-label bias, with patients in the experimental arm feeling more optimistic and those in the control arm feeling disappointed with their treatment assignment (3). However, the present study found no evidence of clinically or statistically significant differences in either proximal or distal domains favoring the open-label group. Additionally, the NSCLC cross-trial comparison was based on data from the nonexperimental arms (which would have captured any disappointment patients felt regarding their treatment allocation), and there was no systematic indication of patients underreporting improvement in the open-label relative to the blinded group. Although patients in the NSCLC open-label group experienced slight improvement in some LCSS items and worsening in others, none of the differences were statistically significant.
A limitation of this study was that it compared treatment arms from separate trials. Thus, there is a risk of bias because of unobserved confounding factors even after adjusting for differences in the baseline characteristics between the trial arms, including potential biases introduced through differential missing data. As commonly seen in oncology trials, various circumstances can result in a high proportion of missing data (eg, disease progression, treatment, and study discontinuation). Although all patients with available PRO data were included in the analyses, patients who left the trial may have reported different outcomes had PRO assessments been collected. Additionally, this study was limited to trials of melanoma and NSCLC; hence, results may not be generalizable to other patient populations. Further studies of open-label bias are needed using different PRO instruments in other indications to assess the generalizability of these findings in other oncology populations.
In this evaluation of open-label bias using patient-level data, changes in EORTC QLQ-C30 domain scores in 2 melanoma trials and LCSS scores in 2 NSCLC trials were similar between patients in the open-label vs blinded groups. Any numerical differences were not consistent in direction and did not indicate clinically or statistically significant bias favoring the open-label group. This study adds to the growing body of evidence demonstrating that concerns regarding open-label bias should not prohibit the interpretation of large and meaningful treatment effects on PROs.

Funding
This work was supported by Bristol Myers Squibb Company.

Notes
Role of the funder: The funder did not play a role in the design of the study; the collection, analysis, and interpretation of the data; the writing of the manuscript; and the decision to submit the manuscript for publication.
Disclosures: JLB is an employee of Bristol Myers Squibb Company and a minor stockholder. JS, MY, and MG are employees of Analysis Group, Inc, which received consulting fees from the study sponsor to conduct this research. JR reports personal fees from Amgen, outside the submitted work, and consultancy with University of Birmingham Enterprise, outside the submitted work.