Benchmark problems for dynamic modeling of intracellular processes

Abstract Motivation Dynamic models are used in systems biology to study and understand cellular processes like gene regulation or signal transduction. Frequently, ordinary differential equation (ODE) models are used to model the time and dose dependency of the abundances of molecular compounds as well as interactions and translocations. A multitude of computational approaches, e.g. for parameter estimation or uncertainty analysis have been developed within recent years. However, many of these approaches lack proper testing in application settings because a comprehensive set of benchmark problems is yet missing. Results We present a collection of 20 benchmark problems in order to evaluate new and existing methodologies, where an ODE model with corresponding experimental data is referred to as problem. In addition to the equations of the dynamical system, the benchmark collection provides observation functions as well as assumptions about measurement noise distributions and parameters. The presented benchmark models comprise problems of different size, complexity and numerical demands. Important characteristics of the models and methodological requirements are summarized, estimated parameters are provided, and some example studies were performed for illustrating the capabilities of the presented benchmark collection. Availability and implementation The models are provided in several standardized formats, including an easy-to-use human readable form and machine-readable SBML files. The data is provided as Excel sheets. All files are available at https://github.com/Benchmarking-Initiative/Benchmark-Models, including step-by-step explanations and MATLAB code to process and simulate the models. Supplementary information Supplementary data are available at Bioinformatics online.


Detailed description of provided files
In this section, an overview of the human-readable output that is available for each benchmark model will be given. This comprises a general overview, and condition-specific data and model files. In the git repository Benchmarking-Initiative/Benchmark-Models, MATLAB scripts are available to automatically read and simulate the models, and for performing a comparison to the supplied model values of our simulation.

General overview
The Excel files for model information comprise a general information, where the name of the model, its number of data points, parameters and experimental conditions are summarized, cf. Fig. S1a. In addition, the likelihood value of the best fit, if an error model is estimated, and the χ 2 value with fixed uncertainties are provided. Details can be found in Eqs. (1), (2). Moreover, the different model compartments and possible non-trivial model features are summarized. In the second sheet, all model parameters, their values at the best fit, potential logarithmic transformation, bounds and flags indicating whether they are estimated or fixed in the original model are provided (Fig. S1b).
The second to last sheet includes a comprehensive table of the distinct experimental conditions with their individual parameter assignments, number of time and data points and χ 2 value (Fig. S2). The predictor column indicates whether a different predictor than time is used, for example in a dose-response relationship. The parameter assignments are either fixed values or functions of other model parameters. The table is split into assignments that are condition-specific and differ between data files, noted in the first column, and model-wide transformations that are summarized in the lower part.
This information can be used together with the last sheet, which lists all original ordinary differential equations (ODEs) before assignments are applied (Fig. S3). The left-and right-hand sides are stated in distinct columns Supplementary Figure S1: Screenshots of general model information.
to simplify machine-readability. To obtain condition-specific ODEs that can be compared to the corresponding data, the specific assignments of the desired data set as well as the model-wide assignments have to be applied. Thereby, only one substitution is required, since no parameters should appear within the formulas that need to be replaced themselves. In rare cases, integration time also has to be replaced by the predictor that enters the ODEs. In addition, some parameters might require assignments from the definitions section, which are stated below the ODEs in the sheet Raw ODEs. After this substitution, all remaining parameters should appear in the sheet Parameters, and the data-specific observation functions and initial values for integration can be found in the condition-specific model file, see next subsection.
Supplementary Figure S2: Table of different experimental conditions within a model with corresponding parameter assignments.
Supplementary Figure S3: ODEs of the model before any parameter assignments were applied.

Condition-specific model files
Modeling problems can contain a large set of experimental conditions with distinct dynamics which are possibly linked to multiple data sets. In order to provide a simple and clear assignment of model equations for each data set, we provide the respective mechanistic model individually for each data file. The files report in the first sheet the corresponding ODEs (after the execution of the experiment-specific parameter assignments listed in the general overview, see Fig. S4). Similar to the general overview, the left-and right-hand sides are printed in different columns. Steps at discrete time points are denoted by heaviside functions that can be readily used within MATLAB.
Moreover, initial values x(0), which are either given by a parameter, a function representing an analytical steady state or a fixed value are listed (Fig. S5b). The first entry specifies the starting point that has to be set for numerical integration algorithms. In addition, the observation functions that map ODE solutions to the data are given, whereby a possible log scale, information about simultaneously estimated uncertainties and the error model are stated. Note that for observations on the log scale, a single absolute error parameter corresponds to a relative error parameter on the original data. Auxiliary definitions, e.g. a sum of all modifications of a protein measured as a total protein concentrations, are listed as well (Fig. S5a).
The information provided for each data set individually can directly be used to simulate the model using standard ODE solvers and to obtain model trajectories that can be compared to the particular data set. Tables providing the experimental data as well as model predictions which are valuable for testing correct implementation are stated in the following.
Supplementary Figure S4: ODEs of the model after condition-specific parameter assignments are applied.

Condition-specific data files
The data for individual experiments are provided as separate Excel files. File names and enumeration correspond to the respective model files and to the condition-specific parameter transformation table provided as (a) Example table containing experimental measurements including standard deviations representing measurement uncertainties. These standard deviations are either available as part of the experimental outcomes, or are otherwise taken from the fitted error model. part of the general information (Fig. S2).
Therein, the measurement time points are listed in the first column, followed by blocks of three columns for each measured observable. Each block provides measured values as first column and two columns for the standard deviations of measurement errors (Fig. S6a). If standard deviations are known from the experiments, they are provided in the third column. If such standard deviations are not available, we provide an estimated standard deviation from the error model in the second column of each block. The information from a fitted error model is only provided in case of absence of an experimental standard deviation. This strategy allows usage of the benchmark problems for approaches requiring measurement uncertainties for all data points.
The likelihood or χ 2 values provided in the general information use the same logic. Moreover, a sheet with the model output at each measurement point is provided, obtained with the provided parameter values (Fig. S6b).
If the corresponding observation is on the log-scale (Fig. S5a), the measurements and simulations are provided on the log-scale as well.
The simulated model response can be used to compare and test different algorithms and software implementations. In our case, CVODES from the SUite of Nonlinear and DIfferential/ALgebraic equation Solvers (SUNDIALS) (Hindmarsh et al., 2005) has been utilized for numerical integration. See Supplementary Section 3 for detailed information about numerical integration and optimization.

Usage of models and optional simulation of data
The provided benchmark models are available in the git repository Benchmarking-Initiative/Benchmark-Models together with directly usable MATLAB scripts to simulate the models, and simulated reference time courses which can be used for testing proper implementation in other software environments. Also, extensive information about usage of the models in either Data2Dynamics or with the independent MATLAB scripts is given in the GitHub Wiki pages.
Besides the condition-specific data files with experimentally measured data, an additional folder with similar files containing data simulated with the parameters of the best fit are provided. For simulation, the estimated standard deviation from the underlying error model, or the measured uncertainty if provided, is specified. The mentioned MATLAB scripts and Data2Dynamics include functions to automatically generate data that mirrors the experimental data concerning time points and number of measurements, but with the given parameters as ground truth. Moreover, an arbitrary noise level can be specified in terms of the estimated noise of the original model.

Numerical integration and parameter estimation
No analytic solutions are available for the ODEs of the provided benchmark models. Thus, numerical integration was performed via the CVODES integrator of the SUite of Nonlinear and DIfferential/ALgebraic equation Solvers (SUNDIALS) suite (Hindmarsh et al., 2005) which is tailored to solving stiff and non-stiff systems and computation of sensitivities. Absolute and relative tolerances were set for all models by default to 10 −6 , except for the models of Beer, Boehm, Crauste and Sobotta with a value of 10 −8 . Minimization of the negative log-likelihood was performed by lsqnonlin or fmincon (Coleman and Li, 1996) which are both available in MATLAB's optimization toolbox. We used MATLAB release R2017a. Thereby, mainly the default configuration settings were chosen. For fmincon, the algorithm-option was chosen as trust-region-reflective or interior-point, respectively. Additional changes to the default settings comprise: • In fmincon, user-defined gradient and Hessian were defined to perform Gauss-Newton optimization, equivalent to optimization by the user-defined Jacobian in lsqnonlin.
• The termination tolerance on first-order optimality and function value was set to 0 in order to terminate optimization only due to small parameter changes.
• Termination tolerance for the parameter changes was set to 10 −6 .
• The maximum number of iterations was set to 10000.
The derivative information required for deterministic optimization was computed via forward sensitivities, i.e., the ODE system was augmented by the appropriate sensitivity equations (Leis and Kramer, 1988) providing first order derivatives of the dynamics with respect to parameters and initial conditions. Bessel's correction is a general procedure to reduce the bias of the estimated error parameters, if a noise model is estimated simultaneously. However, it is rather rarely applied in Systems Biology and it provokes the undesired property that the result depends on the number of estimated parameters. Thus, for fitting the benchmark models Bessel's correction was omitted to enable the calculation of χ 2 terms within the objective function. Two times the negative log-likelihood (1) was used as objective function for parameter estimation and the result after optimization is provided as part of the general information. In addition, the best fit for a fixed error model is provided with its χ 2 value, (2) Minimizing (1) and (2) is equivalent in the case of known experimental errors, which was denoted as Ex in Table 1 of the main manuscript. The estimated and, if known, experimental uncertainties of the measurements are stored in the data Excel files. The files comprise 3 columns per observation, with (1) the experimental data point, (2) the error output from the estimated error model, and (3) the experimental uncertainties if they exist. This enables fitting the models in case an algorithm cannot handle simultaneous error estimation and requires data uncertainties for each data point.
The SBML files contain information about the observation function, and already comprise a log-transformation in case a log-transformation of the data is utilized in the respective benchmark model. Thus, mapping of data to the models provided in SBML format can easily be accomplished.

Additional model features
At this point, we provide further implementation details for some benchmark models. For additional and updated descriptions we refer to the GitHub repository and wiki at https://www.benchmarking.uni-freiburg.de/ and https://github.com/Benchmarking-Initiative/Benchmark-Models. For all models, we provide best fit parameters resulting from minimization of (1) which sometimes does not coincide with published parameters.
Several of the described benchmark models utilize input functions, such as step functions or splines. In the case of an external input, events are typically utilized that reset the integrator, and are specified in Table 1 of the main manuscript. A detailed description of event handling can be found in (Fröhlich et al., 2017). Within SBML, splines are provided as cubic functions with fixed parameters in the intervals between knots. In SBML, these parameters are re-set via events on each anchor point, whereas step functions are implemented as piecewise functions.
In the human-readable output format, a step function switching from level1 to level2 at t = switch time is described via level1 + (level2 − level1) · heaviside(switch time).

Bachmann
In Bachmann et al. (2011), several reparameterizations were applied compared to the standard rate-equation approach: • Some rate constants were defined relative to an initial concentration parameter, e.g. JAK2EpoRDeaSHP 1 relative to init SHP 1 and EpoRCISRemove, SHP 1ActEpoR, ST AT 5ActJAK2 relative to init EpoRJAK2. Similarly, CISRN AEqc and SOCS3RN AEqc were analyzed relative to init ST AT 5.
• ST AT 5ActEpoR has concentration unit [1/conc 2 ] before reparametrization and was analyzed relative to (init EpoRJAK2) 2 to obtain a parameter which is independent of concentration units.
• Overexpression parameters CISEqcOE and SOCS3EqcOE were analyzed as dimensionless fold-factors relative to the respective wild-type parameters.
• In addition, the following substitutions were performed as reparametrization: Model equations before and after substitutions are available in the human readable model definition files.

Becker
The model published by Becker et al. constitutes of two model components. The first component is an ODE model describing Epo receptor signaling. The second model component is a function which originates from the so-called Scatchard plot analysis. This function describes the relationship between bound and free ligand concentrations on the log-log scale. The slope provides an estimate for the number of receptors. In the Becker model, this slope parameter was simultaneously estimated with the remaining model parameters.
Compared to a standard rate-equation approach, two reparametrizations were in applied in the Becker model: 1. The initial state for the Epo receptor concentration was modeled relative to the Epo ligand amount. For this purpose, the initial condition init EpoR for Epo receptors is substituted by the product init Epo · init EpoR rel.
2. The binding constant of Epo to receptors termed k on was also parameterized relative to the initial Epo ligand amount. For this purpose k on from the standard rate-equation had been substituted with k on /init Epo.

Beer
The model published by Beer et al. is characterized by a large number of data points. These data points are the cells' absorbance evaluated every five minutes in the time range [0, 5, . . . , 3565] min. and occurs as 38 rather smooth and almost continuous time course, each with 714 data points.
Assuming independent measurement noise for each of these data points seems questionable from our point of view. Nevertheless this benchmark model constitutes a valuable test case for this kind of application data.
Within the chosen integrator and optimizer tolerances, the best optimum could not be found repeatedly for the model of Beer et al. which indicates that optimization did not reliably converge to a global optimum without manual fine-tuning (Fig. S9).

Brannmark
The  (Brännmark et al., 2010). The raw data had been generated in triplicates and means and standard errors of the means (SEMs) were analyzed in Brännmark et al. (2010). However, since the raw data was scaled to the range 0-100, several SEMs are zero which yields undefined terms in the loglikelihood (1). In our benchmark model we therefore omitted the published experimental errors and used an individual error parameter for each observable.

Chen
For the model of Chen et al., we used the SBML-format version from the Biomodels database (Le Novere et al., 2006) termed as BIOMD0000000255 -Chen2009 -ErbB Signaling. This model implements step functions as a sine function to smoothly connect the two levels specified in input functions.
The Chen model has 500 states and 191 parameters which both exceeds the number of data points (N=105). The total number of parameters was reduced with respect to the original publication, since the additional parameters defined in Chen et al. do not take part in the model dynamics. Within the chosen integrator and optimizer tolerances, optimization did not converge. The step in Fig. S13 results from initial parameter values with flat model trajectories due to small scaling factors and does not correspond to an optimum after successful fitting.
A successful optimization requires adjoint sensitivities that are not implemented in Data2Dynamics, see (Villaverde et al., 2018).

Crauste
Within the chosen integrator and optimizer tolerances, the best optimum could not be found repeatedly for the model of Crauste et al. which indicates that optimization did not reliably converge to a global optimum without manual fine-tuning (Fig. S14). Yet, with adapted optimization settings, the best fit value was found repeatedly.

Fujita
Within the chosen integrator and optimizer tolerances, the best optimum could not be found repeatedly for the model of Fujita et al. which indicates that optimization did not reliably converge to a global optimum without manual fine-tuning (Fig. S16).

Hass
The model of Hass et al. was implemented with initialization at negative time points at t = −30 min to account for prestimulation settings. If the model should be utilized for strictly positive times, the equations and the measurement times have to be reformulated accordingly.

Merkle
In the original publication, there are additional model versions and stages which were applied in combination with L1-penalization to select cell-type specific parameters. In this benchmark collection, we used the comprehensive model for CFU-E and H838 cells after identification of cell-type specific parameters which was termed as "parsimonious model" with "final parameter estimates with a non-regularized optimization". Within the Data2Dynamics examples, this model version is termed "Merkle JAK2STAT5 PCB2016, final model". The model constitutes of two submodels representing both cell types. A number of experimental conditions were added for model analysis and validation, which do not contain data but can be used to simulate the model.

Sobotta
The model of Sobotta et al. requires integration to be initialized at negative time points at t = −10 min. If the model should be utilized for strictly positive times, the equations and the measurement times have to be reformulated accordingly.

Swameye
The model published by Swameye et al. is available in several public versions. In the original publication (Swameye et al., 2003), a delay-differential equation was applied and the input was defined by linear interpolation between measurements. In (Raue et al., 2009), the delay was replaced by a linear chain and a cubic interpolation spline was applied to estimate the input. In (Schelker et al., 2012) the model was utilized to introduce comprehensive parameter estimation of ODE, observation-and input parameters. Following this progress, the spline is represented by a cubic spline with with five knots at t=0, 5, 10, 20, and 60 min. The parametrization by control points sp 1 , sp 2 , . . . , sp 5 as used in (Schelker et al., 2012) is utilized and the spline is evaluated at the log-concentration scale. Within this parametrization, the control points represent the estimated function values of the spline at the knot times. The spline parameters are comprehensively estimated with the remaining model parameters. The delay is represented with a five step linear chain which are linked by a single parameter.
Please note that there are several different implementations for cubic splines available. For benchmarking purposes, it is important to qualitatively stick to setup described here but any implementation for cubic splines might be applied. However, for perfectly reproducing the results of this paper or for comparisons of the obtained value of the objective function after fitting, it is essential to use exactly the same implementation. For this case, we provide appropriate C and MATLAB spline implementation files.

Weber
The model of Weber et al. was parameterized using relative data for several biochemical species. The relative data provide the ratios of the abundance of biochemical species under different experimental settings, e.g. the ratio of measured intensities at two time points. In the original publication, the corresponding ratios were computed from the simulation results and compared to the measured ratios. As many toolboxes do not support this, we implemented these relative data by introducing scaling constants.

Zheng
We used the aggregated data of Zheng et al. of the histone modifications, summing over the intermediate states for the labeling of the methyl group. The model directly models the overall methylation kinetics and includes an ODE for each of the 15 considered methylation states.

Waterfall-and scatter plots for the benchmark models
In order to assess the performance of numerical optimization utilized to fit the models and estimate the parameters, multi-start optimization has been applied. For each benchmark problem, 1000 random initial guesses were drawn, uniformly distributed within the specified bounds. For each initial guess, optimization was performed and the resulting optimized objective function values were sorted and plotted. Thereby, objective values were shifted in order to obtain an objective function value of 1 for the best fit. The resulting figures are shown in the following.
Following this procedure, reliable optimization is characterized by repeatedly finding the same optima, i.e., the same levels of the objective function which results in "steps" in the resulting plots. Moreover, existence of local optima is indicated by steps at different levels. Because the shape is reminiscent of a waterfall, this kind of depiction has been termed waterfall plot.
As a second kind of depiction, we plot the resulting value for the objective function against computation time in order to highlight whether there is a relationship between performance benefits and computational effort. 6 Example of turnover reaction The simple turnover reaction model used to illustrate the level sets shown in the main manuscript in Fig. 2B is depicted in Fig. S27A. This system corresponds to the ODĖ with initial condition x(0) = 0 and observation function which has the solution g(x(t, θ), θ) = s θ 1 θ 3 (1 − e −θ 3 t ).
The degradation rate θ 3 = 1 was assumed to be known and the scaling factor θ 2 and synthesis rate θ 1 were estimated from the data using the objective function with a Gaussian prior with mean=1 and standard deviation equals to 10 for the parameters θ 1 and θ 2 . The data y i was generated with additive independent normally distributed noise with σ = 0.05 around the true trajectory for θ = (1, 1, 1).
x 7 Impact of parameter transformation of the convexity of the objective function The impact of a log-transformation of model parameters on the convexity of the likelihood landscape was assessed for investigating a potential explanation for the performance benefits of numerical optimization at log scale.
An equivalent criterion for twice differentiable function J(θ) is that its Hessian is positive definite: Since derivatives (8) are difficult to be calculated for ODE models, e.g. because numerical integration errors leads to numerically instable finite differences, we employ definition (6) for evaluation of convexity. Because numerical methods only evaluate the objective function at discrete points in the parameter space we randomly draw a point for evaluating (7).
While a function is either convex or not, we interpret the faction of samples for which (7) does hold as the degree of convexity. Alternatively, one might consider the fraction for which (7) does not hold as the degree of non-convexity. Indeed, for a function with many local optima, one would expect that (7) is usually violated.
In this study, we considered four scenarios. Parameters that were either sampled uniformly in log-or linearscale; and the connection line was either constructed in log-or linear-scale. Default boundaries for all model parameters were -5 to 3 on a log-scale.
Detailed results of all benchmark models are shown in Fig. S28. For most of the models, sampling the parameters and connecting them in log-scale yields the highest number of samples for which (7) holds. This suggest that the objective function is more convex in log-than in linear-scale. In addition, the improved convexity in log-space was also visible when parameters were drawn in linear space.
The overall significance of convexity in parameters sampled in log-space against linear space is p = 0.045, tested by signed rank sum test with null hypothesis that the connecting line in log-space leads to the same proportion of convexity as the connecting line in linear space.
Since the Chen model could not be simulated for almost all sampled parameters, we excluded the model from this analysis. Supplementary Figure S28: Measure of convexity for different setups. Both sampling of the two parameter sets and the line connecting it were drawn either in log or linear space. The convexity of this set is shown for each of the 20 benchmark models as percentage. Each row is for one model, and the columns represent different parameter boundaries. The left column shows the results for halved parameters boundaries, the middle for the regular boundaries and the right for doubled boundaries.

Number of converged local optimization runs
The performance metric used in this manuscript accounts for the average computation time of a local optimizer and the percentage of converged runs. To assess whether there are differences for the latter between lsqnonlin, fmincon with algorithm-option trust-region-reflective and fmincon with algorithm-option interior-point, we evaluated the percentage of converged starts for all methods and benchmark problems. Our analysis revealed that lsqnonlin mostly outperforms fmincon with the algorithm-option trust-region-reflective and interior-point (Fig. S29).

Influence of threshold for defining convergence
In the main manuscript, we showed the results for which optimization runs were considered to be converged when the objective function value differs at most by 10 −1 from the globally best objective function value found across all runs for the given benchmark problem. The performance metric was then defined as average number of converged starts per minute (see Villaverde et al. (2018)), whereby models where only one single fit marks the globally best objective function value were omitted. We assessed the influence of this threshold on the results by evaluating additional thresholds for the analysis of the log-transformation (Fig. S30) as well as for the comparison of trust-region-reflective and interior-point (Fig. S31). The results qualitatively coincide for all thresholds.
We note that for all evaluations absolute differences between values of the log-likelihood functions was used. As (i) uncertainty analysis such as profile likelihoods as well as (ii) model selection criteria consider absolute differences, the optimizers need to achieve a small absolute error. Small relative errors are insufficient despite the fact that the optimal log-likelihood values for different models are on different scales.

Correlation of number of function evaluations and iterations
For the considered optimizers, we analyzed the correlation between function evaluations and iterations of the algorithms. For all optimizers, we observe a high correlation. The result for fmincon interior-point are shown in Figure S32. For the other optimizers, it is always one and two more function evaluations than iterations for lsqnonlin and fmincon trust-region reflective, respectively. This yields a correlation of 1.

Optimization results on simulated data
We generated simulated data with the estimated noise levels and repeated the optimization using lsqnonlin. The comparison of the performance of the optimization for the simulated data and the experimental data are shown in Fig. S33A. We tested the hypothesis that optimization for simulated data works significantly better. For this we analyzed the log-performance difference between simulated and experimental data, i.e., log-performance difference := log 2 (performance for simulated data)−log 2 (performance for experimental data).
A one-sided Wilcoxon signed rank test revealed that this difference is greater than zero with p-value < 0.1 and a one-sided t-test gave a p-value of 0.03 (Fig. S33B). The median of the log-performance difference is 0.19, while the mean is 0.83. For the subset of non-identifiable models, these differences were even bigger (median=0.26 and mean 0.9).

Identifiability analysis
Identifiability of the model parameters was analyzed in terms of unique parameter estimates, i.e. it has been investigated whether the minimum of the negative log-likelihood of the data is given by a unique parameter vector. Non-uniqueness might either originate from a-priori non-identifiability, i.e. from structural nonidentifiability of the model equations, or from practical non-identifiability raised by limited amount of data. Moreover, the assignment to structural and practical depends on the question whether observation parameters like scalings or offsets are considered as part of the model structure, or whether non-identifiabilities related to these parameters are practical because of practical limitations for directly observing the biological system.
Uniqueness of the estimated parameters was analyzed using the identifiability test by radial penalization (ITRP) (Kreutz, 2018). This approach omits classification into structural or practical and has two major steps: 1. Classical parameter estimation, e.g. by minimizing the negative log-likelihood yielding estimatesθ 2. Minimization of a penalized negative log-likelihood and evaluating whether the same value of the unpenalized part is obtained.
In the second step, the negative log-likelihood used as classical merit function is augmented by adding a penalty which is proportional to This penalty P (θ) has radial symmetry aroundθ and is minimal with P (θ) = 0 on a sphere with radius R. Such a radial penalty pulls away from the first estimateθ and thereby a new minimum is searched in arbitrary direction on the sphere with radius R. In our analysis we chose R = 1 on the log 10 -parameter scale as suggested in the original publication. For further details we refer to (Kreutz, 2018).

Analysis of sloppiness
The empirically observed Fisher information (Kreutz and Timmer, 2013) is the Hessian of the log-likelihood for the estimated parameters, and coincides with the Hessian of the least-squares objective function in the case of normally distributed noise. The diagonal elements (H) ii of this Hessian H are traditionally used for assessing parameter uncertainties and for the calculation of standard errors of estimated parameters.
For ODE models as they are applied in systems biology, it has been observed that the eigenvalues of the Hessian of the least-squares objective function are spread over several orders of magnitude, even if the parameters are considered at the log-scale (Gutenkunst et al., 2007). This characteristic has been termed as sloppiness (Waterfall et al., 2006). In (Gutenkunst et al., 2007), sloppiness was observed for every ODE model evaluated from a collection of 17 systems biology models from the literature and was therefore claimed as a universal property of such application models which hampers parameter estimation (Scheff et al., 2013;Transtrum et al., 2010).
In Tönsing et al. (2014), it has been shown that the origin of sloppiness partly originates from the eigenvalue calculation, and predominantly from the fact that correlated observations are provided by time-course observations of a subset of the underlying compounds. Moreover, it was argued (Apgar et al., 2010;Chis et al., 2016;Tönsing et al., 2014) that sloppiness of the eigenvalues of the Hessian does not automatically prevent identifiability of parameters and that parameter uncertainties are a matter of the experimental design and should not be interpreted as an universal characteristic.
In Gutenkunst et al. (2007), the approximation of a Hessian H sim , with χ 2 sim (θ) = with "normalization constants" s i , was calculated as the second derivatives of the model predictions g i with respect to the logarithms of the parameters are computationally challenging. Because our benchmark models provide experimental data, it enables the evaluation of sloppiness using derivatives (H) ij = ∂ 2 ∂θ i ∂θ j log(L(θ))| θ=θ ≈ ∂ ∂θ i log(L(θ))| θ=θ ∂ ∂θ j log(L(θ))| θ=θ (12) of log-likelihood log(L), i.e. evaluation of exactly the same objective function as used for estimation of the parameters.
In the setting of estimation from experimental data, the set of unknown parameters comprises parameters of the ODEs and, in contrast to the simulation-based setting investigated in Gutenkunst et al. (2007), also involves unknown parameters of the observation functions and of the error model. Our benchmark collection allows the comprehensive evaluation of sloppiness by consideration of dynamic parameters, observation-and error parameters as they occur in realistic applications.
As shown in the main manuscript, we found that in agreement with earlier observations, the models exhibit large spreads of the eigenvalue spectra. For 19 models, the spectra cover more than six orders of magnitude