Efficient parameterization of large-scale dynamic models based on relative measurements

Abstract

Motivation: Mechanistic models of biochemical reaction networks facilitate the quantitative understanding of biological processes and the integration of heterogeneous datasets. However, some biological processes require the consideration of comprehensive reaction networks and therefore large-scale models. Parameter estimation for such models poses great challenges, in particular when the data are on a relative scale.

Results: Here, we propose a novel hierarchical approach combining (i) the efficient analytic evaluation of optimal scaling, offset and error model parameters with (ii) the scalable evaluation of objective function gradients using adjoint sensitivity analysis. We evaluate the properties of the methods by parameterizing a pan-cancer ordinary differential equation model (>1000 state variables, >4000 parameters) using relative protein, phosphoprotein and viability measurements. The hierarchical formulation improves optimizer performance considerably. Furthermore, we show that this approach allows estimating error model parameters with negligible computational overhead when no experimental estimates are available, providing an unbiased way to weight heterogeneous data. Overall, our hierarchical formulation is applicable to a wide range of models, and allows for the efficient parameterization of large-scale models based on heterogeneous relative measurements.

Availability and implementation: Supplementary code and data are available online at http://doi.org/10.5281/zenodo.3254429 and http://doi.org/10.5281/zenodo.3254441.

Supplementary information: Supplementary data are available at Bioinformatics online.


Overview
In the main manuscript, we discuss the analytic hierarchical optimization of scaling, offset, and noise parameters in a particular setting assuming additive Gaussian measurement noise. Here, we provide some background information and a more general description (Section 2). For the specific situation of objective functions based on ordinary differential equation (ODE) models, we outline for time-discrete measurements the adjoint approach of computing the objective gradient (Section 2.3). We describe how hierarchical optimization can be combined with adjoint sensitivities for efficient gradient-based optimization of large-scale problems (Section 2.4). Then, we discuss in detail the derivation of analytic expressions for scaling, offset, and noise parameters assuming an additive Gaussian noise model (Section 3). Last, we give details on the case study and its implementation (Section 4).

Hierarchical minimum is global minimum
To define the general concept of hierarchical optimization, we consider an objective function J : ℝ^m × ℝ^n → ℝ̄, (x, y) ↦ J(x, y), where ℝ̄ := ℝ ∪ {∞}. We want to minimize J, i.e. find inf_{x,y} J(x, y).
When J is defined only on a subdomain, we set J(x, y) = ∞ elsewhere. We will be interested in the biologically relevant situation that this infimum is finite and attained as a minimum, i.e. that there are x*, y* with J(x*, y*) = min_{x,y} J(x, y) ∈ ℝ. To find the minimum, we perform hierarchical optimization, i.e. we minimize in x using the conditional minimum in y given x. That is, we compute min_x min_y J(x, y), or, re-written,

\min_x \hat{J}(x), \qquad \hat{J}(x) := \min_y J(x, y).

The following proposition verifies that this procedure is indeed feasible: Let i := min_{x,y} J(x, y). Then Ĵ(x) = min_y J(x, y) ≥ i for every x, so inf_x Ĵ(x) ≥ i. Conversely, Ĵ(x*) ≤ J(x*, y*) = i, and for all (x, y) ∈ ℝ^m × ℝ^n holds Ĵ(x) ≤ J(x, y), so, taking the infimum, we also find inf_x Ĵ(x) ≤ inf_{x,y} J(x, y) = i. Hence min_x Ĵ(x) = min_{x,y} J(x, y).
Note that this only verifies that we recover the global optimum of J in Ĵ. While under certain conditions like convexity, further statements about changes in the problem structure can be made, in general we cannot say whether an optimization algorithm, started at the same point, will converge to the same (local) minimum for both problems.
Further note that when applying an iterative optimization algorithm to the hierarchical problem, yielding iterates (x_i, y_i)_i with y_i := argmin_y J(x_i, y), it is important to recalculate y_i for every new x_i to ensure conditional optimality.
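As an illustration of this proposition (a toy sketch we add here, not part of the study), consider J(x, y) = (x − 1)² + (y − x²)², where the inner minimum over y is known analytically as y*(x) = x². The hierarchical minimum and the joint minimum then agree:

```python
# Toy illustration: the hierarchical minimum equals the joint minimum.
# J(x, y) = (x - 1)^2 + (y - x^2)^2; the inner optimum is y*(x) = x^2.

def J(x, y):
    return (x - 1.0) ** 2 + (y - x * x) ** 2

def J_hat(x):
    # Conditional minimum over y given x, here available in closed form.
    y_star = x * x
    return J(x, y_star)

# Brute-force joint minimum on a grid ...
grid = [i / 100.0 for i in range(-300, 301)]
joint_min = min(J(x, y) for x in grid for y in grid)

# ... and hierarchical minimum on the same x-grid, recomputing y*(x)
# for every new x (conditional optimality must be re-established).
hier_min = min(J_hat(x) for x in grid)

print(joint_min, hier_min)  # both 0.0, attained at (x, y) = (1, 1)
```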
2 Hierarchical gradient-based optimization of ODE models

A common situation in systems biology is an ODE model of the form (this is the same as in the main manuscript, included here for completeness of discussion)

\dot{x}(t) = f(x(t), \theta, u), \qquad x(t_0) = x_0(\theta, u).    (1)

Here, the state vector x(t, θ, u) ∈ ℝ^{n_x} describes the involved species concentrations. The vector field f : ℝ^{n_x} × ℝ^{n_θ} × ℝ^{n_u} → ℝ^{n_x} describes the temporal evolution of the states. The vector x_0(θ, u) denotes the initial states at time t_0. The vector θ ∈ ℝ^{n_θ} denotes unknown parameters like reaction rates or initial concentrations, whereas u ∈ ℝ^{n_u} encodes (possibly time-dependent) differential experimental conditions like drug doses, which are given as an input to the system and need not be inferred. We first discuss the general problem and link to hierarchical optimization later. The function h : ℝ^{n_x} × ℝ^{n_θ} × ℝ^{n_u} → ℝ^{n_y} maps the system state to observables y(t, θ, u) ∈ ℝ^{n_y}, where typically n_y < n_x since not all states can be observed. These observables correspond to measured experimental data D = {ȳ_{i_t,i_y,i_u}}_{(i_t,i_y,i_u)∈I} for some index set I over time points 1 ≤ i_t ≤ n_t, observables 1 ≤ i_y ≤ n_y, and experiment-specific conditions 1 ≤ i_u ≤ n_u (where we w.l.o.g. include possible experimental or technical replicates in the experimental conditions). When the origin of a specific data point does not matter, we will simplify the notation to an arbitrary index i ∈ I. The measurements are linked to the states via

\bar{y}_{i_t,i_y,i_u} = h_{i_y}(x(t_{i_t}, \theta, u_{i_u}), \theta, u_{i_u}) + \varepsilon_{i_t,i_y,i_u}

for some noise model ε_{i_t,i_y,i_u}, a random variable. This noise model can again be subject to parameters, denoted σ = σ(θ) here, possibly parameter-dependent. Here, we will consider Gaussian noise ε_{i_t,i_y,i_u} ∼ N(0, σ²_{i_t,i_y,i_u}), the most commonly used noise model. We assume the noise on all data points to be independent; however, different noise models, e.g. for different conditions, are possible.
To infer the unknown parameters θ, we maximize the likelihood L(θ) of observing the measured data D given the parameters θ under the noise model. To improve convexity and for numerical reasons, we usually minimize the negative log-likelihood J(θ) := − log L(θ) instead. In particular, assuming Gaussian measurement noise, it takes the form

J(\theta) = \frac{1}{2} \sum_{i \in I} \left[ \log(2\pi\sigma_i^2) + \frac{(\bar{y}_i - y_i(\theta))^2}{\sigma_i^2} \right].    (2)

Gradient-based optimization
When we want to optimize J w.r.t. the parameters θ, we usually employ an iterative gradient-based scheme, assuming our objective is sufficiently smooth. That is because the negative gradient −∇_θ J gives the direction of steepest descent in the objective function landscape. Higher derivatives (like the Hessian ∇²_θ J) provide additional curvature information. In the following, we will for simplicity of notation abstract from the details of J. All we require is that our objective can be written in the form

J(\theta) = \sum_{j=0}^{N} J_j(x(t_j, \theta), \theta)

as a sum over the discrete time points t_0 < … < t_N, where we w.l.o.g. hide the dependence on data and experimental conditions. When we have multiple experimental conditions, experiments, or observables, one can simply sum over all of them inside the J_j. Then, we have for the derivative

\frac{dJ}{d\theta} = \sum_{j=0}^{N} \left[ \frac{\partial J_j}{\partial \theta} + \frac{\partial J_j}{\partial x} \frac{dx}{d\theta}(t_j) \right],    (3)

dropping the obvious arguments at the time points t_j. The interesting part here is the last summand, which involves dx/dθ, the derivative of the state x w.r.t. the parameters θ, which in turn is given only as the solution of an ODE. To compute ∇_θ J, there exist, apart from finite differences (which are often observed to be numerically unstable and computationally expensive), basically two approaches, which we will shortly discuss.

Forward sensitivity analysis
We can compute ∇_θ J once we have the state sensitivities s_x := dx/dθ, or equivalently the observable sensitivities s_y := dy/dθ = (∂h/∂x) s_x + ∂h/∂θ. They can be obtained by differentiating the ODE (1) w.r.t. θ, giving the forward sensitivity equations

\dot{s}_x = \frac{\partial f}{\partial x} s_x + \frac{\partial f}{\partial \theta}, \qquad s_x(t_0) = \frac{dx_0}{d\theta}.

These ODEs can be solved along with the original ODE, giving an enlarged system (x, s_{x,1}, …, s_{x,n_θ}) of dimension n_x(1 + n_θ), the solution of which is the state x along with its sensitivities s_x.
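For a scalar example (an illustrative sketch we add here, not the study's implementation), take ẋ = −θx, x(0) = 1, whose sensitivity s = dx/dθ obeys ṡ = −θs − x, s(0) = 0, with analytic solution s(t) = −t·e^{−θt}. Integrating the augmented system with forward Euler:

```python
import math

# Forward sensitivity analysis for dx/dt = -theta * x, x(0) = 1.
# The sensitivity s = dx/dtheta satisfies
#   ds/dt = (df/dx) * s + df/dtheta = -theta * s - x,   s(0) = 0.
theta, dt, T = 0.5, 1e-4, 1.0
x, s = 1.0, 0.0
for _ in range(int(round(T / dt))):
    # Augmented system, stepped jointly (RHS uses the old x, s)
    x, s = x + dt * (-theta * x), s + dt * (-theta * s - x)

# Analytic solution: x(t) = exp(-theta*t), s(t) = -t * exp(-theta*t)
print(x, math.exp(-theta * T))       # ~0.6065 each
print(s, -T * math.exp(-theta * T))  # ~-0.6065 each
```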

Adjoint sensitivity analysis
An alternative way of getting rid of the problematic term in (3) is adjoint sensitivity analysis, which introduces an adjoint state p(t) ∈ ℝ^{n_x}. It is convenient to choose p s.t. it satisfies p(t) = 0 for t > t_N, and on (t_j, t_{j+1}] p satisfies the backward differential equation

\dot{p} = -\left(\frac{\partial f}{\partial x}\right)^T p,    (4)

re-initialized at the measurement time points as

p(t_j) = \lim_{t \searrow t_j} p(t) + \left(\frac{\partial J_j}{\partial x}\right)^T.

Then, in the resulting expression the integrals over the state sensitivities vanish, and, summing over j = 0, …, N − 1 and simplifying the telescope sum, one obtains

\frac{dJ}{d\theta} = \sum_{j=0}^{N} \frac{\partial J_j}{\partial \theta} + \int_{t_0}^{t_N} p^T \frac{\partial f}{\partial \theta}\, dt + p(t_0)^T \frac{dx_0}{d\theta}.    (5)

This is independent of the state sensitivities dx/dθ, except for those at time t_0, which can however usually be easily computed. In turn, it now depends on the adjoint state p, whose dimension is n_x and does not increase with the dimension of θ. Thus, compared to forward sensitivity analysis, the computational effort for one gradient evaluation has changed from n_θ + 1 systems of size n_x to 2 systems of size n_x. In addition, a numerical quadrature is required, the effort for which is often negligible. Computation times may however also be affected by e.g. stiffness of the quadrature as well as the need to store results of the forward equation at intermediate points.
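Continuing the scalar toy model ẋ = −θx, x(0) = 1 (again an illustrative sketch, not the study's code): for a single data point d at t = 1 with J = ½(x(1) − d)², the adjoint state is initialized at t = 1 with ∂J/∂x = x(1) − d, obeys ṗ = θp backwards in time, and the gradient is recovered from the quadrature of pᵀ ∂f/∂θ:

```python
import math

# Adjoint sensitivity analysis for dx/dt = f = -theta*x, x(0) = 1, with one
# data point d at t = T and objective J = 0.5 * (x(T) - d)^2.
theta, d, dt, T = 0.5, 0.4, 1e-4, 1.0
n = int(round(T / dt))

# Forward pass: store the trajectory (needed by the backward pass).
xs = [1.0]
for _ in range(n):
    xs.append(xs[-1] + dt * (-theta * xs[-1]))

# Backward pass: p is re-initialized at the measurement time t_N = T with
# dJ/dx = x(T) - d, and obeys dp/dt = -(df/dx)^T p = theta * p.
p = xs[-1] - d
grad = 0.0
for k in range(n, 0, -1):
    grad += dt * p * (-xs[k])  # quadrature of p^T * df/dtheta; df/dtheta = -x
    p = p - dt * theta * p     # one Euler step backward in time

# Reference from forward sensitivities: dJ/dtheta = (x(T) - d) * s(T),
# with s(T) = -T * exp(-theta*T).
ref = (math.exp(-theta * T) - d) * (-T * math.exp(-theta * T))
print(grad, ref)  # agree up to discretization error
```

Note that only one additional scalar ODE (for p) is solved, regardless of how many parameters θ the gradient is taken with respect to.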

Combining hierarchical optimization with adjoint sensitivities
In this manuscript, we assume the particular unscaled observation function

\tilde{y}_i = \tilde{h}_{i_y}(x(t_{i_t}, \theta, u_{i_u}), \theta, u_{i_u}),

s.t. the former full (scaled) observation function becomes

y_i = s_i \tilde{y}_i + b_i,    (6)

with scaling parameters s and offset parameters b, to put the simulations on the same scale as the measurements. Then, simulations and data are related via ȳ_i = y_i + ε_i, encoding the noise model, which we here assume to be Gaussian, independent and additive (see main manuscript). Together, our negative log-likelihood objective function (2) thus reads

J(\theta, s, b, \sigma) = \frac{1}{2} \sum_{i \in I} \left[ \log(2\pi\sigma_i^2) + \frac{(\bar{y}_i - (s_i \tilde{y}_i(\theta) + b_i))^2}{\sigma_i^2} \right].    (7)

For ease of notation, we summarize the static parameters as w := (s, b, σ), such that our overall parameter vector becomes (θ, w) with dynamic parameters θ and static parameters w. Further, we abstract from any condition dependence. Then, we can write J = J(θ, w) s.t. only the dynamic parameters are involved in the ODE simulation, where h̃ denotes the mapping from the states x to the unscaled observables ỹ, which have to be scaled to y using w in order to allow for the comparison to the observed data. As in the discussion of forward and adjoint sensitivity analysis above (there yet without static parameters), we write our objective function (7), highlighting only time dependence (with N := n_t − 1), as

J(\theta, w) = \sum_{j=0}^{N} J_j(\tilde{h}(x(t_j), \theta), \theta, w).

Note that here we added some extra θ arguments in h̃ and the J_j compared to (6), in case one does not treat all parameters not directly involved in the ODE simulation of x as static parameters, but rather as dynamic parameters. This notation thus allows for more flexibility.
When we want to estimate the parameters, the most straightforward way is to treat both kinds of parameters equally, i.e. to optimize over the full vector (θ, w) simultaneously. This we refer to as the standard optimization approach. However, it can be favorable to use hierarchical optimization as introduced in the main manuscript on

\hat{J}(\theta) := \min_w J(\theta, w).

Now, assuming Ĵ is differentiable, the gradient is given as

\frac{d\hat{J}}{d\theta} = \left[\frac{\partial J}{\partial \theta} + \frac{\partial J}{\partial x}\frac{dx}{d\theta} + \frac{\partial J}{\partial w}\frac{dw}{d\theta}\right]_{w = w(\theta)}.    (8)

Here, assuming that the inner optimization problem was solved s.t.

\frac{\partial J}{\partial w} = 0    (9)

(necessary for a local minimum in the interior of the domain of definition, and for our applications always the case), the last term vanishes and we do not need to compute the term dw/dθ, as our gradient reduces to

\frac{d\hat{J}}{d\theta} = \left[\frac{\partial J}{\partial \theta} + \frac{\partial J}{\partial x}\frac{dx}{d\theta}\right]_{w = w(\theta)}.    (10)

Then, in one optimizer iteration, for given θ, we can first simulate the observables for the required time points and experimental conditions, then compute the optimal w(θ), and then compute the objective Ĵ as well as potentially derivatives to guide optimization. This was done using forward sensitivity analysis in Loos et al. [2018]. While we will not go into details here, the basic algorithm works as follows:

1. ỹ, d_θ ỹ = simulate forward given θ,
2. w, d_θ w = compute optimal static parameters given ỹ, D,
3. compute Ĵ and ∇Ĵ given ỹ, d_θ ỹ, w, d_θ w, D.

Here, in the first step the unscaled observables and their sensitivities w.r.t. θ are simulated, and in the third step all derivatives are gathered according to (10) to give the full gradient. An ODE simulation is involved only in the first step.
However, the question remained open whether the hierarchical approach can also be combined with adjoint sensitivities, which is of great interest for large-scale models. The main problem here is that, as we see in equation (4), in general the adjoint state depends on the data and the full simulated observables y via the residuals in the J_j. But unlike ỹ, y depends also on the static parameters w. So, we cannot first simulate the ODE and then compute optimal static parameters and the objective along with its gradient, as done in the forward approach. Nonetheless, combining the hierarchical approach with adjoints turns out to be not that difficult: Again, there is only one problematic term in (10) involving the state sensitivities dx/dθ, which depends on an ODE; all the other terms are in practice readily computed. In this term, it does not matter how w was computed, so it can be treated as constant and h as a function of only x and θ. So, evaluating ∂J_j/∂x at the fixed w(θ), we see that we can again apply the adjoint equation (4) to get an adjoint formulation comparable to (5), here taking the form

\frac{d\hat{J}}{d\theta} = \sum_{j=0}^{N} \frac{\partial J_j}{\partial \theta} + \int_{t_0}^{t_N} p^T \frac{\partial f}{\partial \theta}\, dt + p(t_0)^T \frac{dx_0}{d\theta}.

Then, in practice we can proceed as follows: First, we simulate the unscaled observables ỹ, then compute the optimal static parameters w, and then use both to compute the likelihood Ĵ as well as to simulate the adjoint state p, which allows us to compute ∇Ĵ. In algorithmic form:

1. ỹ = simulate forward given θ,
2. w, d_θ w = compute optimal static parameters given ỹ, D,
3. compute Ĵ given ỹ, w, D,
4. simulate adjoint state p given ỹ, w, d_θ w, D to get ∇Ĵ.
This algorithm requires no additional ODE simulations compared to the standard approach. Moreover, since the number of parameters is reduced, we can hope for reduced computation times, and in particular for a problem that is easier to optimize. It should be noted that, since the inner subproblem was assumed to be solved exactly in the sense of (9), all expressions involving ∂J/∂w vanish from the gradient, so that (8) effectively reduces to (3), with the additional application of the rescalings to h̃ in the J_j. Also, dw/dθ then does not need to be evaluated. From a practical perspective, this implies that, once computed, we can treat the static parameters w as constant when computing dĴ/dθ using the adjoint approach. Thus, we can easily re-use existing software implementations that allow passing parameters θ along which the gradient shall be computed, as well as constants (primarily used to express differential experimental conditions) along which no gradients are to be evaluated.
If the inner subproblem is not solved exactly in the sense of (9), these additional terms would have to be added to the gradient dĴ/dθ, which involves additional effort and is not addressed by the proposed approach.
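The key point that, with the inner problem solved exactly, the static parameters can be treated as constants in the gradient, can be checked on a toy problem. The sketch below (illustrative data and model, not from the study) computes the closed-form optimal scaling and offset, and compares the gradient that treats w = (s, b) as constant with a central finite difference of Ĵ(θ), which implicitly includes the dw/dθ dependence:

```python
import math

# Sketch: with dJ/dw = 0 at w(theta), the static parameters w = (s, b)
# may be treated as constants when differentiating J_hat(theta).
t_meas = [0.0, 0.5, 1.0, 1.5]
y_bar = [1.9, 1.6, 1.2, 1.0]  # hypothetical data

def h_tilde(theta):
    return [math.exp(-theta * t) for t in t_meas]  # unscaled observables

def inner(theta):
    # Closed-form optimal scaling s and offset b (sigma_i = 1).
    h = h_tilde(theta)
    n, Sh, Shh = len(h), sum(h), sum(v * v for v in h)
    Sy, Syh = sum(y_bar), sum(y * v for y, v in zip(y_bar, h))
    b = (Sy - Syh * Sh / Shh) / (n - Sh * Sh / Shh)
    s = (Syh - b * Sh) / Shh
    return s, b

def J_hat(theta):
    s, b = inner(theta)
    return 0.5 * sum((y - (s * v + b)) ** 2
                     for y, v in zip(y_bar, h_tilde(theta)))

def grad_w_const(theta):
    # Gradient treating s, b as constants (valid since dJ/dw = 0 at w(theta)).
    s, b = inner(theta)
    h = h_tilde(theta)
    dh = [-t * v for t, v in zip(t_meas, h)]  # d h_tilde / d theta
    return sum((s * v + b - y) * s * dv
               for y, v, dv in zip(y_bar, h, dh))

theta, eps = 0.7, 1e-6
fd = (J_hat(theta + eps) - J_hat(theta - eps)) / (2 * eps)  # includes dw/dtheta
print(grad_w_const(theta), fd)  # agree up to finite-difference error
```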

Properties of the different optimization approaches
In Table 1, the properties of the different optimization approaches discussed above (standard vs. hierarchical, and forward vs. adjoint sensitivities) are listed. Scalable means that the approach scales well with the number of parameters. Handles static parameters efficiently means that additional static parameters can be handled without considerable computational overhead. FIM uncertainty assessment means that the Fisher information matrix can be computed in order to assess parameter uncertainty. Least-squares based optimization means that (non-linear) least-squares optimizers can be used, which are based on the observable residuals, in contrast to objective-based optimization, which only requires the full objective function. Finally, improved convergence means that the approach has been observed to converge to the global optimum more often than the other approaches.
3 Analytic formulas for scaling, offset, and noise parameters for additive Gaussian measurement noise In this section, we derive general formulas for the scaling, offset, and noise parameters assuming additive Gaussian noise. We consider the observation function (6) and objective function (7) from the previous chapter.
In general, scaling parameters s, offset parameters b, and noise parameters σ can be shared among various data points, and either need to be estimated or can sometimes be fixed externally. Sharing parameters translates to using e.g. the same rescaling for all data points in a time series, or for data generated under the same conditions. To define which time points, observables, and experimental conditions share a scaling, offset, or noise parameter, one can define index sets

I^s_α, I^b_β, I^σ_γ ⊂ {1, …, n_t} × {1, …, n_y} × {1, …, n_u}

for α = 1, …, n_s, β = 1, …, n_b, γ = 1, …, n_σ, where n_s, n_b, n_σ denote the numbers of scaling, offset, and noise parameters, respectively, s.t. each data point belongs to exactly one I^s_α, one I^b_β, and one I^σ_γ. Then, all data points whose indices i = (i_t, i_y, i_u) are contained in the same group I^s_α share the same scaling parameter, and similarly for offset and noise parameters.
A common assumption in the following will be, first, that the groupings for scalings and offsets coincide, i.e. n_s = n_b and I^s_α = I^b_α for all α, and, second, that each I^σ_γ is a union of sets I^s_α, i.e. data points sharing a scaling parameter (and thus also an offset parameter) also share the noise parameter. This we will also refer to as the resolution of σ being coarser than that of s and b.
These two assumptions will be required only when all parameters are to be optimized. We will call variables fixed if they are not optimized as static parameters, but either specified externally to some fixed value, or optimized as dynamic parameters. This will be made explicit in the respective subsections.
For ease of notation, we write in the following

\bar{y}_i = s_i \tilde{h}_i + b_i + \varepsilon_i,

where i ∈ I is an arbitrary index over e.g. time points, observables, replicates, and experimental conditions, and h̃_i = h̃_i(t_i, θ, u_i) denotes the unscaled observable. So, in the following, we consider the negative log-likelihood for minimization, assuming a Gaussian noise variable, see (2). We can decompose the objective function J into sums over indices of interest, which we will w.l.o.g. do in the following, i.e. in the following sections, we only consider the part of the sum relevant for a given parameter.

Scalings and offsets
We assume that the resolutions of the scalings s and offsets b coincide, and that either σ² is not to be optimized, or its resolution is coarser than that of the scalings. So, we consider a reduced sum

J(s, b) = \frac{1}{2} \sum_{i} \frac{(\bar{y}_i - (s\tilde{h}_i + b))^2}{\sigma_i^2},

where either the σ_i² are fixed, or, if they are to be optimized as static parameters, one can imagine σ_i² = 1 for all i, as the shared noise parameter then drops out of the formulas here and can be determined afterwards.
To find the optimal parameters, we have the necessary conditions that the derivatives w.r.t. said parameters vanish:

\frac{\partial J}{\partial s} = -\sum_i \frac{(\bar{y}_i - s\tilde{h}_i - b)\tilde{h}_i}{\sigma_i^2} = 0, \qquad \frac{\partial J}{\partial b} = -\sum_i \frac{\bar{y}_i - s\tilde{h}_i - b}{\sigma_i^2} = 0,

i.e.

s = \frac{\sum_i (\bar{y}_i - b)\tilde{h}_i/\sigma_i^2}{\sum_i \tilde{h}_i^2/\sigma_i^2}, \qquad b = \frac{\sum_i (\bar{y}_i - s\tilde{h}_i)/\sigma_i^2}{\sum_i 1/\sigma_i^2}.

If either scaling or offset is fixed, we are done by inserting that value (in particular 1 or 0, assuming absolute data) into the respective formula for the other one. Note that also different values of the fixed variables for different data points would be possible. If both are to be optimized, we proceed by inserting the first equation into the second one, giving

\left(\sum_i \frac{1}{\sigma_i^2} - \frac{\left(\sum_i \tilde{h}_i/\sigma_i^2\right)^2}{\sum_i \tilde{h}_i^2/\sigma_i^2}\right) b = \sum_i \frac{\bar{y}_i}{\sigma_i^2} - \frac{\left(\sum_i \bar{y}_i\tilde{h}_i/\sigma_i^2\right)\left(\sum_i \tilde{h}_i/\sigma_i^2\right)}{\sum_i \tilde{h}_i^2/\sigma_i^2}.

If the factor on the left-hand side is non-zero, we find b by division; by the Cauchy-Schwarz inequality, the factor vanishes only if all h̃_i coincide. In that case, our simulations are not diverse enough to allow the identification of both s and b, and we can simply set b = 0. However, if such a case is detected, a warning to the user is advisable, as it indicates that the data and static parameters should be revised. In either case, we can then use b to compute s via the first formula.
Inspection of the second derivatives shows that

\frac{\partial^2 J}{\partial s^2} = \sum_i \frac{\tilde{h}_i^2}{\sigma_i^2} \ge 0, \qquad \frac{\partial^2 J}{\partial b^2} = \sum_i \frac{1}{\sigma_i^2} > 0,

and, J being a convex quadratic function of (s, b), the stationary points are indeed (global) minima.
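The closed-form solution can be checked numerically. The following sketch (illustrative, with hypothetical data; not the study's implementation) solves the weighted normal equations, flags the degenerate case of constant simulations, and verifies that perturbing (s, b) cannot decrease the objective:

```python
# Closed-form scaling/offset under heteroscedastic Gaussian noise
# (weights w_i = 1 / sigma_i^2), with the degeneracy check for
# insufficiently diverse simulations.

def optimal_scaling_offset(y_bar, h, sigma2):
    w = [1.0 / s2 for s2 in sigma2]
    Sw = sum(w)
    Swh = sum(wi * hi for wi, hi in zip(w, h))
    Swhh = sum(wi * hi * hi for wi, hi in zip(w, h))
    Swy = sum(wi * yi for wi, yi in zip(w, y_bar))
    Swyh = sum(wi * yi * hi for wi, yi, hi in zip(w, y_bar, h))
    denom = Sw - Swh * Swh / Swhh
    if abs(denom) < 1e-12:
        # All h_i coincide: s and b are not jointly identifiable.
        print("warning: simulations not diverse enough, fixing b = 0")
        b = 0.0
    else:
        b = (Swy - Swyh * Swh / Swhh) / denom
    s = (Swyh - b * Swh) / Swhh
    return s, b

def J(y_bar, h, sigma2, s, b):
    return 0.5 * sum((y - (s * hi + b)) ** 2 / s2
                     for y, hi, s2 in zip(y_bar, h, sigma2))

y_bar = [2.1, 1.4, 0.9]     # hypothetical measurements
h = [1.0, 0.6, 0.3]         # hypothetical unscaled simulations
sigma2 = [0.5, 1.0, 2.0]    # hypothetical noise variances
s, b = optimal_scaling_offset(y_bar, h, sigma2)
J0 = J(y_bar, h, sigma2, s, b)
# Any perturbation of the convex quadratic objective increases it:
assert all(J(y_bar, h, sigma2, s + ds, b + db) >= J0
           for ds in (-0.1, 0.0, 0.1) for db in (-0.1, 0.0, 0.1))
print(s, b, J0)
```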

Noise parameters
We consider the reduced objective function

J(\sigma^2) = \frac{1}{2} \sum_{i} \left[\log(2\pi\sigma^2) + \frac{(\bar{y}_i - (s_i\tilde{h}_i + b_i))^2}{\sigma^2}\right].

This sum will typically comprise multiple sums of the type discussed for the scalings and offsets. Here, it does not matter where the values for s_i and b_i come from. Setting the derivative w.r.t. σ² to zero, we find

\sigma^2 = \frac{1}{|I|}\sum_i (\bar{y}_i - (s_i\tilde{h}_i + b_i))^2.

Inspection of the second derivative shows that it is positive at this point (there it equals |I|/(2σ⁴)), confirming that we have a minimum.
A problem arises when the optimal noise parameter is σ = 0. This indicates that all (rescaled) simulations and data relevant for the given σ match perfectly (easily possible e.g. when there is just one data point per noise parameter). In this case, the objective function has a singularity, which is undesirable as it does not allow inferring the other parameters. Such constellations should be filtered out beforehand. If this is not possible, a practical solution would be to bound σ ≥ σ_0 for some small minimum σ_0 > 0. For the models and data used in this study, this was not an issue.
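The analytic noise parameter is simply the mean squared residual, which can be verified against a grid search over σ² (an illustrative sketch with hypothetical residuals, not the study's code):

```python
import math

# Closed-form noise parameter: sigma^2 equals the mean squared residual
# of the already rescaled simulations.
residuals = [0.3, -0.1, 0.2, -0.4]  # y_bar_i - (s_i*h_i + b_i), hypothetical

def nll(sig2):
    # Reduced negative log-likelihood as a function of sigma^2
    return 0.5 * sum(math.log(2 * math.pi * sig2) + r * r / sig2
                     for r in residuals)

sig2_opt = sum(r * r for r in residuals) / len(residuals)

# The analytic optimum is at least as good as a fine grid of alternatives:
grid = [0.01 + 0.001 * k for k in range(2000)]
assert nll(sig2_opt) <= min(nll(v) for v in grid) + 1e-12
print(sig2_opt)  # 0.075
```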

Objective gradient used for optimization
The objective function gradient then used in optimization (assuming s, b, σ ∉ θ; otherwise those derivatives need to be added) is given specifically by

\frac{d\hat{J}}{d\theta} = \sum_i \frac{s_i(s_i\tilde{h}_i + b_i - \bar{y}_i)}{\sigma_i^2}\left(\frac{\partial \tilde{h}_i}{\partial x}\frac{dx}{d\theta}(t_i) + \frac{\partial \tilde{h}_i}{\partial \theta}\right),

as one easily verifies. While this expression still contains the sensitivities of the states w.r.t. θ, we employed a reformulation using adjoint sensitivity analysis, as explained in the previous chapter. Here, the formula explicitly reads

\frac{d\hat{J}}{d\theta} = \sum_i \frac{s_i(s_i\tilde{h}_i + b_i - \bar{y}_i)}{\sigma_i^2}\frac{\partial \tilde{h}_i}{\partial \theta} + \int_{t_0}^{t_N} p^T \frac{\partial f}{\partial \theta}\,dt + p(t_0)^T\frac{dx_0}{d\theta},

with p satisfying p(t) = 0 for t > t_N and on (t_j, t_{j+1}] obeying the backward differential equation

\dot{p} = -\left(\frac{\partial f}{\partial x}\right)^T p, \qquad p(t_j) = \lim_{t\searrow t_j} p(t) + \sum_{i: t_i = t_j} \frac{s_i(s_i\tilde{h}_i + b_i - \bar{y}_i)}{\sigma_i^2}\left(\frac{\partial \tilde{h}_i}{\partial x}\right)^T,

where the re-initialization sum is taken only over summands involving the respective time point t_j. See also Stapor et al. [2018] for a more extensive derivation of the explicit formulas in the Gaussian case.
Note that in these formulas the dependence of the adjoint state re-initialization terms on the scaling parameters becomes apparent once more.

Mapping protein measurements to the ODE model
To map the measured proteins from, e.g., MCLP, we added an observable defined as the sum over all states that include the specific protein, weighted by their stoichiometric coefficients. Phosphorylation-site-specific antibodies were assumed to bind any modeled state of the given protein carrying the respective phosphorylation (irrespective of any other modification). Antibodies not specific to any phospho-site were assumed to bind any modeled state of this protein, modified or unmodified.

Model simulation
Model simulation and gradient computation were performed using AMICI (https://github.com/ICB-DCM/AMICI/, commit 974093). The SBML model was imported using the AMICI Matlab interface (Matlab, Mathworks, versions R2017a or R2017b) and simulated using either the Matlab interface (for fmincon optimization) or the C++ interface (other optimizers). For computing the gradient ∇J of the objective function, we employed the adjoint approach as implemented in AMICI.
All measurements were considered steady-state measurements, and the model was simulated from t_0 = 0, x_0 = 0 until t = 10⁸, which was shown to be sufficient for equilibration in previous studies.

Comparison of computation time for forward and adjoint sensitivities
We measured the computation time for forward and adjoint sensitivity calculation by randomly sampling 6 combinations of parameters and experimental conditions. For these 6 samples, we calculated forward and adjoint sensitivities for a randomly sampled subset of parameters, starting from one parameter up to all dynamic parameters (Figure S1A). For all dynamic parameters, this yielded a 2703-fold speedup. We subsequently used this approximate speedup to estimate the computation time needed for a full optimization using fmincon, by multiplying the computation time using adjoints with the speedup factor.

Optimization objective and gradient computation
With all optimizers, we used an objective of the form of (2). The model was simulated for each experimental condition, and, for the standard setting, the log-likelihood as well as, where required, its gradient as provided by AMICI were aggregated.
In the hierarchical optimization setting, simulations were performed in the same way, but then the optimal static parameters were computed, the observables rescaled, and the log-likelihood computed. For computing the gradient ∇J of the objective function, we employed the adjoint approach as explained above. For ease of implementation, the forward simulation was performed twice: once before computing the optimal static parameters, and afterwards again in order to compute the adjoint state. This was necessary because intermediate results from the forward simulation are required for the backward simulation, and we did not store them. So, we had a small computational overhead compared to the theoretically most efficient algorithm depicted in Figure 1 of the main manuscript.

Parameter optimization settings
For parameter optimization, we used log-transformed parameters, except for the offsets, which may also take negative values. Initial parameters were sampled within the interval [−2, 2] ([−10², 10²] for offsets), and for optimization, the parameters were bounded by [−5, 3] ([−10³, 10³] for offsets). Wider bounds sometimes led to numerical problems, because of an increased number of parameters for which the objective function or the gradients were not evaluable. We chose the parameter bounds to be wider than the sampling interval to avoid problems that occur, especially with Ipopt, when the starting parameters are too close to the bounds.
Initially, we encountered premature stopping of the optimizers due to failures to evaluate the objective function at certain parameter values. Therefore, we implemented an automated integration tolerance relaxation scheme: in case of integration failures, we repeated the respective simulation with absolute and relative tolerances relaxed by a factor of 100, starting from an initial absolute tolerance of 10⁻¹⁶ and relative tolerance of 10⁻⁸. This process was repeated up to three times.

Parameter optimization fmincon
For parameter optimization, the objective function for hierarchical and standard optimization was implemented in Matlab (Supplemental data, script path). Parameter estimation was performed using PESTO (https://github.com/ICB-DCM/PESTO/, commit 1e30f9). Matlab code was compiled using the Matlab Compiler toolbox and run on compute nodes equipped with one of the following: AMD Opteron processors with nominal frequency 2.4 GHz, 64 cores and 512 GB RAM; AMD Opteron processors with nominal frequency 2.4 GHz, 64 cores and 256 GB RAM; or AMD Opteron processors with nominal frequency 1.7 GHz, 48 cores and 192 GB RAM. Model simulations were parallelized across 32 cores using Matlab parfor. Individual local optimizations were run independently. fmincon was supplied with gradients as described above and called with the following options: MaxIter: 100, TolX: 10⁻⁸, TolFun: 0, MaxFunEvals: 10⁷, algorithm: interior-point, GradObj: on. We used the same AMICI version and settings as for the optimizers described in the following section. However, due to different hardware, the reported run times for fmincon cannot be compared directly to those of the other optimizers.
• SUMSL (Gay [1983], http://netlib.org/toms/, file 611.gz): SUMSL is originally provided as Fortran code. We used f2c (Fortran to C Translator, version 20100827) to convert it to C++ code. To allow for parallel usage of SUMSL, all static variables were changed to thread-local storage. Default settings were used, except for the maximum number of objective function evaluations (mxfcal): 10⁸.
Ceres and SUMSL do not natively support parameter bounds. Therefore, we used a naive implementation, reporting a failure to evaluate the objective function to the optimizer in case the proposed parameters were outside the specified bounds.
We interfaced above-mentioned optimizers using a custom C++ parameter estimation library parPE (https:// github.com/ICB-DCM/parPE/, commit 0e0fb1) which provides a generalized objective function for AMICI models as described above and allows for its parallel evaluation. parPE was linked with the AMICI-generated model C++ codes to generate executables for parameter estimation and model simulation.
Parallelization at the level of individual simulations corresponding to individual experimental conditions was realized using the IBM message passing interface (MPI) library in a master-worker architecture. The respective optimizer was run on the master process, and for objective function evaluation, batches of simulation jobs were sent to workers for evaluation. Aggregation of individual likelihoods and gradients, as well as, where applicable, the computation of optimal static parameters, was performed on the master process. Different local optimizations for a given setting were run either independently or in parallel. In the case of parallel optimizer runs, MPI-based parallelization of simulations was combined with pthreads-parallel optimizer runs.
This code was run on the SuperMUC Phase 2 supercomputer at the Leibniz Supercomputing Centre (Leibniz-Rechenzentrum, LRZ) of the Bavarian Academy of Sciences and Humanities, Garching, Germany. The employed compute nodes were equipped with Haswell Xeon E5-2697 v3 processors with nominal frequency 2.6 GHz, 28 cores per node and 64 GB RAM per node. Compute jobs were submitted using IBM LoadLeveler, using either 128 nodes (3576 cores) for parallel runs of 20 local optimizations within one job, or 20 individual jobs using 10-12 nodes (280-336 cores) each.

Data and code availability
All simulation, parameter estimation, data analysis and visualization codes are provided as Supplementary Code at http://doi.org/10.5281/zenodo.3254441. This archive additionally includes a human-and machine-readable definition of the parameter estimation problems considered in this manuscript in the PEtab format (https://github.com/ICB-DCM/PEtab). Intermediary data and result files are available at http://doi.org/10.5281/zenodo.3254429.
To reuse our implementation of the applied parameter estimation pipeline, we refer users to the https://github.com/ICB-DCM/parPE/ repository, which contains a constantly updated, more recent version than the one used for this study. The codebase comes with high-level documentation on how to use the workflow with different SBML models, and a test case of the hierarchical optimization using a toy example.

Synthetic data generation
We used synthetic absolute and relative viability data to evaluate the loss of information when using relative data and the differences in optimizer performance. The data were simulated using parameters from a former optimization run in which viability and protein measurements were used. Noise was added to the simulated data according to the estimated sigma parameter for the viability data. Notation (hatted quantities denote estimates from the optimization with viability + protein data):

θ̂ — estimated dynamic parameters,
ŝ — estimated scaling,
σ̂_viability — estimated sigma for the viability data.

With this notation, the synthetic noisy relative data were generated by

\bar{y}_{\mathrm{rel}} = \hat{s}\,\tilde{h}(\hat{\theta}) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \hat{\sigma}^2_{\mathrm{viability}}).

To get synthetic absolute data, the simulated relative data were first divided by the estimated scaling ŝ to make them absolute, and then noise was added:

\bar{y}_{\mathrm{abs}} = \tilde{h}(\hat{\theta}) + \varepsilon.
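The generation scheme can be sketched as follows (illustrative sketch with hypothetical values standing in for the estimates from the prior fit, not the actual pipeline code):

```python
import random

random.seed(42)  # reproducibility

# Hypothetical stand-ins for the estimates from the viability+protein fit:
s_hat = 3.2                    # estimated scaling
sigma_hat = 0.05               # estimated viability noise parameter
h_sim = [0.9, 0.7, 0.5, 0.35]  # simulated unscaled viability h(theta_hat)

# Relative synthetic data: scale, then add Gaussian noise.
y_rel = [s_hat * h + random.gauss(0.0, sigma_hat) for h in h_sim]

# Absolute synthetic data: divide the noise-free relative simulations
# by s_hat, then add noise.
y_abs = [(s_hat * h) / s_hat + random.gauss(0.0, sigma_hat) for h in h_sim]

print(y_rel)
print(y_abs)
```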

Parameter estimation
Parameters were estimated for three different settings:
1. using ȳ_abs as synthetic data,
2. using ȳ_rel as synthetic data with standard optimization,
3. using ȳ_rel as synthetic data with hierarchical optimization.
In each setting, Ipopt was used with 20 initial parameter vectors and at most 150 iterations. All settings used the same initial dynamic parameters.

Making relative data absolute
The relative simulations were made absolute by dividing by the estimated scalings,

\tilde{h} = y/s^*,

where * indicates the parameters optimized here. As measurements, the absolute measurements ȳ_abs were used, so the correlations were computed as corr(ȳ_abs, h̃).
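The correlation used here is the standard Pearson correlation coefficient; a minimal sketch with hypothetical values (not the study's data):

```python
import math

def pearson(a, b):
    # Pearson correlation coefficient of two equally long sequences
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical example: absolute measurements vs rescaled simulations
s_star = 2.0                     # optimized scaling
y_sim = [2.0, 1.6, 1.0, 0.8]     # relative simulations y = s* h
h = [y / s_star for y in y_sim]  # made absolute by dividing by s*
y_abs = [1.05, 0.78, 0.52, 0.41] # absolute measurements

print(pearson(y_abs, h))  # close to 1 for a good fit
```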

Analysis of simulated data
Here, we assess the quality of the simulated data. First, we compared the absolute data with and without added noise (Figure S2A), showing a substantial amount of noise. Next, we compared the simulated absolute data before adding noise with the real measured data from CCLE (Figure S2B). This reveals poor agreement between the simulated absolute and the measured relative data; therefore, it is necessary to estimate the scaling parameters. After scaling the simulated data and adding noise, we see that the simulated data are a realistic representation of the real measurements (Figure S2C).

Figure S2: A: Simulated absolute data before and after adding noise. B: Simulated absolute data before noise was added, and relative measured data from CCLE. C: Relative simulated data with noise, and relative measured data. Related to the main manuscript.

Figure S7: Pearson correlation for the best local optimization using standard and hierarchical optimization with datasets 1&2. All observables with more than 2 data points are shown. Related to main manuscript Figure 4.