A Benchmark for Dose Finding Studies with Continuous Outcomes

An important tool to evaluate the performance of any design is an optimal benchmark proposed by O'Quigley and others (2002, Biostatistics 3(1), 51-56) that provides an upper bound on the performance of a design under a given scenario. The original benchmark can be applied to dose finding studies with a binary endpoint only. However, there is a growing interest in dose finding studies involving continuous outcomes, but no benchmark for such studies has been developed. We show that the original benchmark and its extension by Cheung (2014, Biometrics 70(2), 389-397), when looked at from a different perspective, can be generalised to various settings with several discrete and continuous outcomes. We illustrate and compare the benchmark performance in the setting of a Phase I clinical trial with continuous toxicity endpoint and in the setting of a Phase I/II clinical trial with continuous efficacy outcome. We show that the proposed benchmark provides an accurate upper bound for model-based dose finding methods and serves as a powerful tool for evaluating designs.


Introduction
A variety of dose finding methods for Phase I clinical trials aiming to find the maximum tolerated dose (MTD) have been proposed in the literature over the past three decades. The conventional way to assess the performance of a design is to conduct an extensive simulation study. One of the key characteristics of any dose-finding method is its accuracy, which is usually computed as the proportion of times the correct dose is selected. The majority of novel proposals are studied in scenarios chosen by the investigators themselves. This clearly adds subjectivity to the assessment of the method's operating characteristics, as one can always find scenarios in which the MTD identification is easier than in others. To solve this problem, O'Quigley and others (2002) proposed the non-parametric optimal benchmark that provides an upper limit of accuracy (in terms of proportion of correct selections) for dose finding methods based on a binary toxicity endpoint. The benchmark uses the concept of complete information, which assumes that the outcomes of each patient can be observed at all dose levels (in contrast to an actual trial in which a patient can be assigned to one dose only). The benchmark shows how 'difficult' the MTD identification is in the chosen scenario and provides an objective context for the performance evaluation of the design under investigation. Since its proposal, the benchmark has proven useful for assessing newly proposed designs comprehensively (see e.g. Paoletti and Kramar, 2009; Yin and Yuan, 2009). Additionally, based on the benchmark, Cheung (2013) derived sample size formulae for the continual reassessment method (CRM) by O'Quigley and others (1990).
The benchmark was originally proposed for studies with a binary endpoint. Motivated by more complex studies, for instance, Phase I/II clinical trials evaluating binary toxicity and efficacy endpoints simultaneously (Thall and Russell, 1998) or Phase I trials with multiple grades of toxicities (Lee and others, 2011), Cheung (2014) generalized the benchmark to both of these cases. This has broadened the application of the benchmark significantly. However, there is a growing number of Phase I and Phase I/II clinical trials involving continuous endpoints, and no corresponding benchmark exists yet. For example, Bekele and Thall (2004); Yuan and others (2007); Ivanova and Kim (2009); Bekele and others (2010); Ezzalfani and others (2013); Wang and Ivanova (2015) considered a continuous toxicity endpoint, while Bekele and Shen (2005); Hirakawa (2012); others (2015, 2017) studied Phase I/II trials with binary toxicity and continuous efficacy endpoints.
In this work, we propose a simple benchmark which can be applied to dose finding studies with continuous outcomes. The novel benchmark employs the same concept of complete information as the original method and is based on the well-known probability integral transform. This general method also makes it possible to construct a benchmark for designs with multiple correlated outcomes and several treatment cycles. It is shown that the evaluation of the novel benchmark does not require any information beyond what is already provided in the simulation study of a design. We apply the novel benchmark to evaluate the performance of two recently proposed dose finding methods: a design for a Phase I trial with a continuous toxicity endpoint and a design for a Phase I/II trial with binary toxicity and continuous efficacy endpoints.
In Section 2, we review the original benchmark and propose its generalization. We compare design proposals for Phase I and Phase I/II to the benchmark in Section 3 and conclude with a discussion.

Benchmark for Binary Endpoint
Consider a Phase I clinical trial with a binary toxicity outcome, dose-limiting toxicity (DLT) or no DLT, n patients and a discrete set of dose levels d 1 , . . . , d m . Let Y ij be a Bernoulli random variable taking value y ij = 0 if patient i has experienced no DLT at dose d j and y ij = 1 otherwise. This random variable is characterised by probability p j such that p j = P (Y ij = 1), i = 1, . . . , n. The goal of the trial is to find the maximum tolerated dose (MTD), the dose corresponding to a prespecified risk of toxicity, γ.
The non-parametric optimal benchmark uses the concept of complete information. For a given patient, the complete information consists of the vector of outcomes (DLT or no DLT) at all dose levels assuming that p 1 , . . . , p m are known. In other words, for a given patient one knows the maximum toxicity probability that this patient can tolerate. Formally, the information about the DLT of patient i at each dose level is summarised in a single value u i ∈ (0, 1), which is drawn from a uniform distribution, U(0, 1). For instance, u i = 0.3 means that patient i can tolerate doses d j with p j ≤ 0.3, but would experience a DLT if given dose d j′ with p j′ > 0.3. It follows that u i is transformed to y ij = 0 for doses with p j ≤ 0.3 and to y ij = 1 otherwise. The procedure is repeated for all n patients, which results in the vector of responses for each dose level y j = (y 1j , . . . , y nj ), j = 1, . . . , m. Let T (y j , γ) be a summary statistic for the dose level d j upon which the decision about the MTD selection is based. Conventionally, T (y j , γ) is chosen such that its minimum (or maximum) value corresponds to the estimated MTD. Therefore, the dose d j for which T (y j , γ) is minimised (maximised) over j = 1, . . . , m is declared the MTD in a single trial. The procedure is repeated for S simulated trials, and the proportion of trials in which each dose is selected as the MTD is then computed.
In the context of a Phase I clinical trial with a binary response,
$$T(y_j, \gamma) = \left| \frac{1}{n} \sum_{i=1}^{n} y_{ij} - \gamma \right| \qquad (2.1)$$
is a conventional choice for the MTD selection criterion. We refer the reader to the Web application by Wages and Varhegyi (2017) for the benchmark evaluation using criterion (2.1).
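The construction of this section can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the toxicity probabilities, sample size and number of trials are hypothetical, the selection criterion is assumed to be the absolute distance between the observed DLT proportion and γ, and ties are broken in favour of the lowest dose.

```python
import random

def binary_benchmark(p, gamma, n, n_trials, seed=0):
    """Proportion of simulated trials in which each dose is selected as the MTD."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    for _ in range(n_trials):
        # One profile u_i per patient summarises the outcomes at all doses.
        profiles = [rng.random() for _ in range(n)]
        best_dose, best_value = 0, float("inf")
        for j, pj in enumerate(p):
            # Patient i tolerates dose j when p_j <= u_i, and has a DLT otherwise.
            dlt_rate = sum(u < pj for u in profiles) / n
            value = abs(dlt_rate - gamma)  # assumed criterion, to be minimised
            if value < best_value:         # strict inequality: lowest dose wins ties
                best_dose, best_value = j, value
        counts[best_dose] += 1
    return [c / n_trials for c in counts]

# Hypothetical scenario: five doses, target toxicity 0.20 (true MTD is dose 3).
p = [0.05, 0.10, 0.20, 0.35, 0.50]
props = binary_benchmark(p, gamma=0.20, n=30, n_trials=2000)
print([round(x, 3) for x in props])
```

Because every patient contributes an outcome at every dose, the selection proportions depend only on the scenario and the sample size, not on any allocation rule.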

Benchmark for Continuous Endpoint
Consider now a Phase I clinical trial with continuous outcome Y ij at dose d j for patient i having cumulative distribution function (CDF) F j (y). The goal of the trial is to find the target dose (TD) which minimises (or maximises, as defined by an investigator) some decision criterion T (·). In simulations the CDF, F j , is chosen by an investigator and specifies the distribution of outcomes for a given dose d j , and the set of CDFs corresponding to doses d 1 , . . . , d m defines a simulation scenario. This simple fact is central to our proposal. To illustrate the construction of the novel benchmark step-by-step, we use a setting studied by Wang and Ivanova (2015) throughout this section.
Example 1. Wang and Ivanova (2015) considered a setting with m = 6 doses and a biomarker for toxicity measured on a continuous scale. In one of the simulation scenarios presented, it is assumed that a toxicity outcome Y ij given dose level d j has normal distribution $N(0.1j, (0.1j)^2)$, j = 1, . . . , 6. Then, the CDF F j is the CDF of a normal random variable with corresponding parameters, $\Phi(\cdot\,; \mu_j = 0.1j, \sigma_j^2 = (0.1j)^2)$. These CDFs will be used to obtain the benchmark in this scenario.
Let us denote the quantile transformation as
$$F_j^{-1}(u) = \inf\{y : F_j(y) \ge u\}, \quad 0 < u < 1. \qquad (2.2)$$
Then, the following result holds.
Probability integral transform. If U ∼ U(0, 1) is a uniform random variable on the unit interval, then F j is the cumulative distribution function of the random variable $F_j^{-1}(U)$.
This result is commonly used for inverse transform sampling (e.g. see Bekele and Shen, 2005, for an example in dose finding), which allows one to generate a random variable with any distribution F j .
Assume that the whole information about a patient's profile is summarised in a single value u i drawn from U(0, 1). For patient i with profile u i , the quantile transformation y ij = F −1 j (u i ) is applied to obtain a continuous outcome that this patient would have at dose d j , j = 1, . . . , m. Different dose levels are modelled by applying the quantile transformation using corresponding CDFs. This results in a vector of responses (y i1 , . . . , y im ), also called the complete information about patient i. The same procedure is repeated for all patients i = 1, . . . , n which, again, results in the vector of responses for each dose level y j = (y 1j , . . . , y nj ), j = 1, . . . , m.
The complete information for 5 patients with randomly generated profiles u 1 , . . . , u 5 is given in Table 1.
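As a hedged illustration of how such complete information can be generated for the scenario of Example 1, the sketch below uses `statistics.NormalDist.inv_cdf` from the Python standard library as the quantile transformation. The seed and the resulting profiles are arbitrary, so the printed values are not those of Table 1.

```python
import random
from statistics import NormalDist

m = 6
# Scenario of Example 1: Y_ij ~ N(0.1 j, (0.1 j)^2), j = 1, ..., 6.
dists = [NormalDist(mu=0.1 * j, sigma=0.1 * j) for j in range(1, m + 1)]

def complete_information(u):
    """Outcomes that a patient with profile u in (0, 1) would have at every dose."""
    return [d.inv_cdf(u) for d in dists]

rng = random.Random(1)  # arbitrary seed, for reproducibility only
profiles = [rng.random() for _ in range(5)]
table = [complete_information(u) for u in profiles]
for u, row in zip(profiles, table):
    print(f"u = {u:.3f}: " + "  ".join(f"{y:6.3f}" for y in row))
```

In this particular scenario the mean and standard deviation are both 0.1j, so each row equals 0.1j multiplied by the same factor 1 + Φ⁻¹(u i): a single profile moves the outcomes at all doses together.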
Recalling the decision criterion T (y j ) on which the TD selection is based, the dose level d j for which T (y j ) is minimised (or maximised) is declared as the TD in a single trial. For instance, if the goal of the trial is to find the dose having the average level of toxicity γ, the decision criterion (2.1) can be used. The benchmark can be constructed for various decision criteria and then be adapted to evaluate any design under investigation.
Example 1 (Continued). The goal of the trial considered by Wang and Ivanova (2015) is to find the dose with the mean response closest to the target response γ. The criterion of choosing the dose which maximises the probability of the average level of toxicity µ j being in the ε neighbourhood of γ was considered. Let g j (·|y j ) be a probability density function of µ j given the data y j . Then, the decision criterion takes the form
$$T(y_j) = \int_{\gamma - \varepsilon}^{\gamma + \varepsilon} g_j(u \mid y_j)\, du. \qquad (2.3)$$
The TD is the dose for which the criterion T (y j ) is maximised. Following the original framework, γ = 0.1 and ε = 0.01 are chosen. Using the complete information generated in Table 1 and the density function of the Normal distribution with corresponding mean and variance parameters yields: T (y 1 ) = 0.09; T (y 2 ) = 0.04; T (y 3 ) = 0.02; T (y 4 ) = 0.01; T (y 5 ) = 0.01 and T (y 6 ) = 0.01. The value of the criterion is maximised for dose level d 1 , which is selected as the TD in this single trial. The procedure is repeated for s = 1, . . . , S simulated trials to obtain the proportion of correct selections. The evaluation of the method by Wang and Ivanova (2015) using the proposed benchmark is provided in Section 3.1.
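The interval criterion of Example 1, the probability that µ j lies within ε of γ, can be evaluated as a difference of two normal CDF values. The sketch below leaves the mean and standard deviation of the density of µ j as explicit arguments, because the text does not spell out how they are obtained. Centring at the sample mean and scaling by the standard error is an assumption made here for illustration, and the data are hypothetical rather than taken from Table 1.

```python
from statistics import NormalDist, mean, stdev

def interval_criterion(mu_hat, sd, gamma, eps):
    """Mass that a N(mu_hat, sd^2) density for mu_j puts on [gamma - eps, gamma + eps]."""
    g = NormalDist(mu=mu_hat, sigma=sd)  # sd must be strictly positive
    return g.cdf(gamma + eps) - g.cdf(gamma - eps)

# Hypothetical outcomes of five patients at one dose.
y = [0.02, 0.11, 0.07, 0.19, 0.12]

# Assumed choice of density: centred at the sample mean, scaled by the standard error.
se = stdev(y) / len(y) ** 0.5
t_near = interval_criterion(mean(y), se, gamma=0.10, eps=0.01)  # target near the mean
t_far = interval_criterion(mean(y), se, gamma=0.20, eps=0.01)   # target far from the mean
print(t_near > t_far)
```

The criterion is largest for the dose whose estimated mean sits closest to γ relative to its spread, which is why it is maximised rather than minimised.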
Algorithm 1 provides step-by-step guidance on how the benchmark can be constructed based on S simulated trials.
Algorithm 1 Computing a benchmark for a single continuous outcome
1. Specify CDFs F j for all doses d j , j = 1, . . . , m and define the decision criterion T (·).
2. Generate a sequence of patients' profiles $\{u_i\}_{i=1}^{n}$ from U(0, 1).
3. Apply the quantile transformation $y_{ij} = F_j^{-1}(u_i)$ for i = 1, . . . , n, j = 1, . . . , m and store y j = (y 1j , . . . , y nj ).
4. Compute T (y j ) for all j = 1, . . . , m, find dose J for which T (y J ) is maximised (minimised) and set Z s = J.
5. Repeat steps 2-4 for s = 1, . . . , S simulated trials.
6. Use $\bar{Z}(j) = \sum_{s=1}^{S} I(Z_s = j)/S$ as the selection proportion of dose d j , j = 1, . . . , m.

The proposed benchmark can be applied to a wide range of distributions as it requires the quantile information only, which is available for many distributions in various statistical software (for example, qbinom, qnorm, qexp, etc. in R (R Core Team, 2015)). Note that the probability integral transform can also be applied to discrete random variables, in which case the quantile transformation $F_j^{-1}(\cdot)$ is given explicitly. It is easy to see that using the $F_j^{-1}(\cdot)$ corresponding to a Bernoulli random variable in Algorithm 1 results in the original benchmark construction proposed by O'Quigley and others (2002).
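The steps of Algorithm 1 can be sketched as a small generic function. This is an illustration under stated assumptions, not the authors' code: the function and argument names are hypothetical, the quantile functions come from `statistics.NormalDist`, and the decision criterion used in the example is the absolute distance between the sample mean and γ (to be minimised) rather than the interval probability used in Example 1.

```python
import random
from statistics import NormalDist, mean

def draw_profile(rng):
    """Uniform draw on the open interval (0, 1), as required by inv_cdf."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u

def benchmark(quantiles, criterion, n, n_trials, maximise=False, seed=0):
    """Selection proportion of each dose over n_trials complete-information trials."""
    rng = random.Random(seed)
    m = len(quantiles)
    counts = [0] * m
    pick = max if maximise else min
    for _ in range(n_trials):                                # step 5
        profiles = [draw_profile(rng) for _ in range(n)]     # step 2
        y = [[q(u) for u in profiles] for q in quantiles]    # step 3
        scores = [criterion(yj) for yj in y]                 # step 4
        counts[pick(range(m), key=scores.__getitem__)] += 1  # ties: lowest dose
    return [c / n_trials for c in counts]                    # step 6

# Equal-variance setting Y_ij ~ N(0.1 j, 0.2^2) with target gamma = 0.3 (dose 3).
gamma = 0.3
quantiles = [NormalDist(mu=0.1 * j, sigma=0.2).inv_cdf for j in range(1, 7)]
props = benchmark(quantiles, lambda yj: abs(mean(yj) - gamma), n=36, n_trials=1000)
print([round(x, 3) for x in props])
```

Under this assumed criterion the sample means at all doses are shifted by the same amount 0.2 z̄, where z̄ is the average of the n transformed profiles Φ⁻¹(u i), so dose 3 is selected when |0.2 z̄| < 0.05. With n = 36 this has probability P(|Z| < 1.5) ≈ 0.87 for a standard normal Z, which the simulated proportion should approach.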
The novel benchmark can also be applied to clinical trials with multiple endpoints. This construction is provided below.

Benchmark for Multiple Endpoints
In the setting with several endpoints, the correlation between them is important. Below, we describe the algorithm generating correlated outcomes in the benchmark framework. In fact, the approach described below has been known for a long time (Tate, 1955; Molenberghs and others, 2001). We apply it to an arbitrary distribution of outcomes to generate the complete information. We start from the case of binary toxicity and continuous efficacy, which has attracted a lot of attention in the literature recently.
Consider a Phase I/II clinical trial with toxicity outcome $Y^{(1)}_{ij}$ and efficacy outcome $Y^{(2)}_{ij}$, having CDFs $F^{(1)}_j$ and $F^{(2)}_j$, respectively, at dose level d j for patient i. We will use the setting studied by Bekele and Shen (2005) to illustrate the construction of the benchmark for multiple endpoints throughout this section.
The toxicity/efficacy profile of patient i is given by two characteristics: $u^{(1)}_i \in (0, 1)$ corresponding to toxicity and $u^{(2)}_i \in (0, 1)$ corresponding to efficacy. Firstly, we generate a bivariate standard normal vector $(x^{(1)}_i, x^{(2)}_i)$ with mean µ = (0, 0) and covariance matrix
$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix},$$
where ρ is the correlation coefficient. In a simulation study, the correlation coefficient, ρ, is specified by the investigator as part of the simulation scenario. By applying the CDF of the standard normal random variable, $(u^{(1)}_i, u^{(2)}_i) = (\Phi(x^{(1)}_i), \Phi(x^{(2)}_i))$, one can obtain two correlated random variables with uniform distributions. Then, the corresponding quantile transformations are applied to $u^{(1)}_i$ and $u^{(2)}_i$ marginally as described in Section 2.2, and the responses of patient i at dose levels d j are obtained as $y^{(1)}_{ij} = (F^{(1)}_j)^{-1}(u^{(1)}_i)$ and $y^{(2)}_{ij} = (F^{(2)}_j)^{-1}(u^{(2)}_i)$, where $F^{(1)}_j$ and $F^{(2)}_j$ denote the CDFs of the toxicity and efficacy outcomes at dose d j .
Example 2 (Continued). The correlation coefficient considered by Bekele and Shen (2005) is ρ = 0.25. The bivariate normal vector with mean µ = (0, 0) and covariance matrix Σ is initially generated: (x 1 , x 2 ) = (−0.892, 0.292). Then, the first patient has a toxicity profile $u^{(1)}_1 = \Phi(-0.892)$ and an efficacy profile $u^{(2)}_1 = \Phi(0.292) = 0.615$. The efficacy outcome at dose d 1 is $F^{-1}(u = 0.615;\ \lambda_1 \tau = 2.5,\ \tau = 0.1) = 26.3$ (applying the quantile transformation of the Gamma distribution). Subsequently, the vector of the complete toxicity information is (0, 0, 1, 1) and the vector of the complete efficacy information is (26.3, 74.6, 121.8, 134.3). The complete information for 5 patients with randomly generated profiles $(u^{(1)}_1, u^{(2)}_1), \ldots, (u^{(1)}_5, u^{(2)}_5)$ is given in Table 2.
Similar to the single endpoint case, the TD selection is based on a pre-specified decision criterion, T (y j ), which takes the minimum (maximum) value for the most desirable dose level. This would, however, involve the information for all endpoints of interest and can have a more complicated structure. In the context of a Phase I/II clinical trial the decision criterion is also known as a trade-off function (see e.g. Thall and Cook, 2004).
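The generation of correlated profiles and the marginal quantile transformations described above can be sketched as follows. The Python standard library has no Gamma quantile function, so a simple series-plus-bisection routine is written out here as a stand-in for a library call such as R's `qgamma`. The toxicity probabilities are hypothetical (the scenario table is not reproduced in this section), and toxicity is coded as in Section 2.1: a DLT occurs at doses with p j above the patient's profile.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def correlated_profiles(rho, rng):
    """Pair of U(0, 1) profiles obtained from a bivariate normal with correlation rho."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    x1 = z1
    x2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * z2
    return STD_NORMAL.cdf(x1), STD_NORMAL.cdf(x2)

def gamma_cdf(x, shape, rate):
    """CDF of a Gamma(shape, rate) variable via the incomplete gamma power series."""
    if x <= 0.0:
        return 0.0
    t = rate * x
    term = total = 1.0 / shape
    k = 1
    while term > 1e-15 * total:
        term *= t / (shape + k)
        total += term
        k += 1
    return total * math.exp(shape * math.log(t) - t - math.lgamma(shape))

def gamma_quantile(u, shape, rate):
    """Quantile transformation of the Gamma distribution, found by bisection."""
    lo, hi = 0.0, shape / rate + 1.0
    while gamma_cdf(hi, shape, rate) < u:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, shape, rate) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# First patient of Example 2: the normal vector quoted in the text.
u_tox, u_eff = STD_NORMAL.cdf(-0.892), STD_NORMAL.cdf(0.292)
y_eff_d1 = gamma_quantile(u_eff, shape=2.5, rate=0.1)  # the text reports 26.3

# Hypothetical toxicity probabilities for four doses (assumption, not from the paper).
p_tox = [0.05, 0.10, 0.20, 0.40]
y_tox = [1 if u_tox < p else 0 for p in p_tox]

# A freshly simulated patient with rho = 0.25.
u1, u2 = correlated_profiles(0.25, random.Random(0))
print(round(u_eff, 3), round(y_eff_d1, 1), y_tox)
```

Drawing the second normal component as ρ z₁ + √(1 − ρ²) z₂ is the two-dimensional Cholesky construction, so the pair has exactly the covariance matrix Σ before the normal CDF is applied.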
Example 2 (Continued). Bekele and Shen (2005) defined the target dose as the dose with the highest expected efficacy while being safe (p j < 0.35) and efficacious (λ j > 5). This translates into the criterion (2.5), which involves the probability density functions of an efficacy response and of a toxicity probability given the data $y^{(1)}_j$, $y^{(2)}_j$, respectively, and the controlling probabilities $\theta^{(1)}$, $\theta^{(2)}$. This decision criterion is used to construct the benchmark in this setting. Applied to the benchmark, the integrals in (2.5) are computed using density functions of the Beta distribution and the Normal distribution for toxicity and efficacy outcomes, respectively. Using the summary statistics given in Table 2 and controlling probabilities $\theta^{(1)} = \theta^{(2)} = 0.50$, the values of the criterion are computed for each dose, with T (y 4 ) = 0. The criterion is maximised for dose level d 3 , which is selected as the TD in this single trial. The procedure is repeated for s = 1, . . . , S simulated trials to obtain the proportion of correct selections. The evaluation of the method by Bekele and Shen (2005) using the proposed benchmark is provided in Section 3.2.
Similarly, the benchmark can be applied to an arbitrary number of endpoints. For instance, consider a Phase I/II trial in which toxicity and efficacy are evaluated in four cycles. Then, the profile of patient i is given by a vector of values $u^{(k)}_i$, one for each endpoint in each cycle, each drawn from U(0, 1), and the rest of the construction remains unchanged. The procedure to generate the benchmark for K endpoints is given in Algorithm 2.
Algorithm 2 Computing a benchmark for multiple outcomes
1. Specify K × K covariance matrix Σ and define objective function T (·).

In the following section, we illustrate the implementation of Algorithm 1 (in Section 3.1) and Algorithm 2 (in Section 3.2) in different clinical contexts.
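For K endpoints, the construction of Section 2.3 can be sketched as below. This is an illustrative reading of the procedure rather than a transcription of Algorithm 2: Σ is assumed to be a correlation matrix (unit diagonal), correlated normals are produced with a hand-written Cholesky factor, and the scenario and trade-off function are hypothetical stand-ins, not the criterion of Bekele and Shen (2005).

```python
import math
import random
from statistics import NormalDist, mean

PHI = NormalDist().cdf

def cholesky(sigma):
    """Lower-triangular L with L L^T = sigma (sigma must be positive definite)."""
    k = len(sigma)
    low = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = sum(low[i][r] * low[j][r] for r in range(j))
            if i == j:
                low[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                low[i][j] = (sigma[i][j] - s) / low[j][j]
    return low

def benchmark_multi(sigma, quantiles, criterion, n, n_trials, seed=0):
    """quantiles[k][j] maps a profile in (0, 1) to the outcome of endpoint k at dose j."""
    rng = random.Random(seed)
    low = cholesky(sigma)
    n_end, m = len(quantiles), len(quantiles[0])
    counts = [0] * m
    for _trial in range(n_trials):
        profiles = []
        for _patient in range(n):
            z = [rng.gauss(0.0, 1.0) for _ in range(n_end)]
            x = [sum(low[k][r] * z[r] for r in range(k + 1)) for k in range(n_end)]
            # PHI(x) is uniform on (0, 1) because sigma has unit diagonal.
            profiles.append([PHI(xk) for xk in x])
        scores = []
        for j in range(m):
            y = [[quantiles[k][j](u[k]) for u in profiles] for k in range(n_end)]
            scores.append(criterion(y))
        counts[max(range(m), key=scores.__getitem__)] += 1
    return [c / n_trials for c in counts]

# Hypothetical scenario: binary toxicity and normally distributed efficacy at four doses.
p_tox = [0.05, 0.15, 0.30, 0.50]
eff_mean = [20.0, 40.0, 60.0, 80.0]
quantiles = [
    [lambda u, p=p: 1 if u < p else 0 for p in p_tox],
    [NormalDist(mu=mu, sigma=10.0).inv_cdf for mu in eff_mean],
]

def trade_off(y):
    """Assumed criterion: mean efficacy if the observed DLT rate is below 0.35."""
    tox, eff = y
    return mean(eff) if mean(tox) < 0.35 else float("-inf")

sigma = [[1.0, 0.25], [0.25, 1.0]]
props = benchmark_multi(sigma, quantiles, trade_off, n=36, n_trials=500)
print([round(x, 3) for x in props])
```

The only change relative to the single-endpoint sketch is that each patient carries K correlated profiles instead of one; the quantile transformations are still applied marginally.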

Continuous Toxicity in Phase I Trials
The dichotomization of the toxicity endpoint (DLT/no DLT) in Phase I clinical trials restricts the available information about the drug's toxicity. In fact, a continuous toxicity endpoint can provide better insight into the drug's profile (Wang and others, 2000; Bekele and Thall, 2004; Wang and Ivanova, 2015).
Recently, Wang and Ivanova (2015) proposed the Bayesian Design for Continuous Outcomes (BDCO), which can be applied to clinical trials with a continuous toxicity endpoint. In short, BDCO assumes that outcome Y ij at dose d j for patient i has normal distribution $N(\mu_j, \sigma_j^2)$, where µ j is considered as a random variable itself. Based on the posterior distribution of µ j , BDCO is driven by the probability that µ j is within ε of the target, γ:
$$P(|\mu_j - \gamma| \le \varepsilon \mid y_j). \qquad (3.1)$$
The design targets the dose which maximizes the probability in (3.1). This is equivalent to maximising the decision criterion T (·) given in Equation (2.3). Below, we apply the proposed benchmark to the setting considered in the original paper using this decision criterion and compare its performance to that of BDCO. Recalling the setting by Wang and Ivanova (2015), we consider six scenarios with six dose levels d 1 , . . . , d 6 , a sample size of n = 36, parameter ε = 0.01 and two cases: (i) the case of equal variances in which outcome Y ij has normal distribution $N(0.1j, 0.2^2)$ and (ii) the case of unequal variances corresponding to normal distributions $N(0.1j, 0.1^2 j^2)$. In the six scenarios, the target values γ = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6 are used, respectively. As a consequence, the target dose is dose d 1 in scenario 1, d 2 in scenario 2, and so on. Table 3 shows the operating characteristics of the BDCO against the benchmark. The results of the BDCO are extracted from Table 2 in the original article, and the benchmark is evaluated using $S = 10^6$ trial replications.
Under Scenarios 2-5, the proportion of correct selections using the benchmark is 87%, which illustrates that they have the same level of "complexity". Conversely, the benchmark shows that it is easier to find the MTD if it is either the first or the last dose. Under all scenarios with equal variances, the BDCO has accuracy close to the benchmark. The ratio of the probability of correct selection of the BDCO relative to the benchmark ranges between 92% and 98% in these cases.
Under scenarios with unequal variances, the benchmark demonstrates that it is harder to find the MTD if the corresponding variance is high. For example, the benchmark leads to 86% of correct selections under Scenario 2 and 45% under Scenario 5. Again, it appears that it is easier to find the MTD when it is the first or the last dose for any method. BDCO shows very high accuracy in Scenarios 1-5 with unequal variances. The correct probability ratios never go below 91% and even reach nearly 100% under Scenario 5. Under Scenario 5, BDCO recommends the MTD in 45% of replications (as does the benchmark), but it recommends the highest dose d 6 systematically less often: 20% against 29% by the benchmark. This implies that BDCO tends to make more conservative decisions. Scenario 6 confirms this finding: the correct probability ratio equals 76%, which, however, is still high.
Overall, BDCO selects the correct dose uniformly less often than the benchmark in all scenarios (as expected), but the efficiency of the design is high. The minimum ratio of the probability of correct selection is 76%, which corresponds to highly variable outcomes. This indicates that the parameters of the BDCO are adequately calibrated and that the BDCO in the proposed form is able to find the MTD in various scenarios.

Continuous Efficacy and Binary Toxicity in Phase I/II Trials
Similarly to the continuous toxicity outcome, the continuous efficacy endpoint can provide better guidance on the target dose selection than the dichotomized one. One of the first designs proposed for Phase I/II clinical trials considering a continuous efficacy outcome is by Bekele and Shen (2005), who developed a Bayesian approach to model toxicity and a (continuous) biomarker of efficacy jointly. We denote this design by BS. Bekele and Shen (2005) introduced a latent normal random variable which is related to the observed binary toxicity. A bivariate normal distribution allows for different strengths of the dependence between toxicity and efficacy. Dose escalation/de-escalation decision rules are based on the posterior distribution of both toxicity and efficacy. The design was shown to have good operating characteristics in many scenarios. Therefore, the majority of subsequently proposed designs (e.g. see Hirakawa (2012) and Yeung and others (2015)) were compared to it. Below, we provide the comparison of the design by Bekele and Shen (2005) against the respective benchmark.
Recalling the framework by Bekele and Shen (2005), we consider an efficacy outcome at dose d j having a Gamma distribution Γ(λ j τ, τ ) with rate parameter τ = 0.1 and a DLT outcome having probability p j . A total of six scenarios with four dose levels per scenario are explored using the total sample size n = 36. The parameters λ j and toxicity probabilities p j are given in Table 4. In each scenario a weak association, ρ = 0.25, between the toxicity and efficacy biomarker is used. The target dose is defined as in criterion (2.5): the dose with the highest expected efficacy while being safe (p j < 0.35) and efficacious (λ j > 5). Table 4 shows the operating characteristics of the BS design against the respective benchmark. The results for BS are extracted from Table 1 of the original work, which uses 1000 replications, and the benchmark is evaluated using $S = 10^6$ trial replications.
Under Scenarios 1, 3 and 5 with an increasing dose-efficacy relationship, the BS design performs with high accuracy and the proportion of correct selections is close to the benchmark. Interestingly, the BS design recommends the target dose d 3 3% more often than the benchmark under Scenario 1. Given the number of replications for the BS and the benchmark, a 3% difference is significant. This can be an indication that the prior distribution used by BS favours d 3 . It would also explain the relatively lower performance under Scenario 4, in which the BS recommends the target dose d 2 in 83% of trials against 100% by the benchmark. The BS recommends the dose with the same efficacy, but noticeably greater toxicity, in 17% of trials. An alternative explanation of the difference in the proportion of selections under Scenario 4 can be a plateau in the dose-efficacy relation that is not modelled by the BS. Nevertheless, the ratio of correct probabilities is 83%, demonstrating good operating characteristics of the BS design.
Under the unsafe Scenario 2 and the inefficacious Scenario 6, the BS design comes to the correct conclusion in nearly the same proportion of trials as the benchmark. This shows the ability of the BS design to avoid unethical selections due to either high toxicity or low activity.
Overall, the benchmark confirmed that the BS design is flexible and can recommend the target dose under many different scenarios. It also offers a possible explanation for the super-efficient performance under Scenario 1 and points to potential challenges that the BS design can face in plateau dose-efficacy scenarios.

Discussion
In this work, a novel benchmark for dose finding studies is formulated. In essence, the novel benchmark is similar to the original proposal by O'Quigley and others (2002), as the whole information about a patient is summarised in a single value u, but it can also be applied to studies with continuous outcomes. In an era of increasing complexity of clinical trials, a procedure for evaluating the adequacy of novel dose finding methods is crucial. As shown above, the proposed benchmark provides an accurate upper limit on the performance of model-based dose finding designs. It is also able to reveal inadequacies in the model, parameter or prior specifications or, alternatively, to confirm the robustness of the design. The benchmark assesses the complexity of scenarios and can serve to standardize scenarios of varying difficulty. Therefore, it is recommended as part of the complete analysis of a dose finding design, as it helps to evaluate dose finding designs in a more comprehensive way. The possibility of applying the benchmark to several endpoints makes it possible to investigate the influence of correlated outcomes on a design's characteristics, which is an important aspect of Phase I/II dose finding studies. Moreover, it is worth investigating what correlation structure the method used for generating correlated outcomes implies for the endpoints of interest. Clearly, the outcomes of interest may no longer have the same correlation ρ. Finally, it is important to mention that while the benchmark is a useful tool for assessing the performance of any given dose finding method, it does not capture all aspects of the evaluation. For instance, it does not provide information on the distribution of dose allocation, the average number of DLTs or stopping rules. Developments in this direction would be of great value for the complete assessment of a design.