Tracing the molecular basis of transcriptional dynamics in noisy data by using an experiment-based mathematical model

Changes in transcription factor levels, epigenetic status, splicing kinetics and mRNA degradation can each contribute to changes in the mRNA dynamics of a gene. We present a novel method to identify which of these processes is changed in cells in response to external signals or as a result of a diseased state. The method employs a mathematical model, for which the kinetics of gene regulation, splicing, elongation and mRNA degradation were estimated from experimental data of transcriptional dynamics. The time-dependent dynamics of several species of adipose differentiation-related protein (ADRP) mRNA were measured in response to ligand activation of the transcription factor peroxisome proliferator-activated receptor δ (PPARδ). We validated the method by monitoring the mRNA dynamics upon gene activation in the presence of a splicing inhibitor. Our mathematical model correctly identifies splicing as the inhibitor target, despite the noise in the data.

shows the pre-mRNA species (x pro , G-I, dotted) and mature mRNA (x mat , D-F, dotted).
The second model takes into account a multi-step promoter cycle, mRNA maturation and degradation (Fig. S4C). Here we explore a general multi-step model, which we will make more detailed (including explicitly elongation and splicing) in the subsequent sections for the purposes of data fitting. The system is described by the following set of equations: (2a)   Table S2. At the start of the simulation all the variables are 0 except for i =1: In Figs. S4D-I the time courses of the pre-mRNA and mature mRNA abundance are shown upon activation. The rate constants were adjusted so that the overall production, maturation and degradation rates and starting/ending steady state values of the mature mRNA are the same. Notably, for the multi-step model the dynamics vary significantly depending on the parameter regime, i.e. the relative rates of promoter cycling, mRNA maturation and degradation, but are always distinct from those of the simplified single step model. If degradation and maturation occur fast compared to the promoter cycling, both the mRNA and the pre-mRNA accumulation displays damped oscillatory dynamics for the more complicated model (Figs. S4D and G). This is explained by the fact that simultaneously activated cells go through a round of mRNA production leading to a rise and then during promoter silent periods part of the pre-mRNA is spliced and the mature mRNA is degraded resulting in a decrease (Fig. S4D). When the promoter is active again, it leads to an increase of RNA production and so on. In the single step model, on the other hand, the parameter regimes only affect the accumulation rate but not the overall form of the profile; in all cases both the pre-mRNA and mature mRNA increase monotonously without damped oscillations (Figs. S4D-I).
As shown above, a number of our ODE models produce damped oscillations in the dynamics of transcript concentrations. We emphasize that those models describe the average behaviour of a population of cells. Hence, it is relevant to shortly discuss some of the conditions that the single cells would have to obey in order for a population of them to show (damped) oscillatory dynamics of the average transcript concentrations. Firstly, the initial conditions of the cells should be close to identical when transcription is activated. Otherwise single cells would display different dynamics from the onset. Secondly, the cells induced at the same time and from similar state can stay in synchrony if the waiting times of the sequential processes that they carry out are 'precise'. By precise we mean that the cells have similar waiting times for the same processes, such as promoter activation, transcript elongation, etc.
This is not to be expected for first order reactions, for which the standard deviation in the waiting time equals the mean time; so it has a coefficient of variation (CV) of 1. Thus single reactions are extremely noisy and do not lead to synchrony between cells. However, when several (for instance, N) of those reactions are occurring in a sequence, then the waiting time of the complete sequence becomes much more precise: it has a coefficient of variation of 1/N. Therefore, multistep processes that occur in transcription initiation cycle and elongation can lead to passive, transient synchronisation of cells. Finally, due to small stochastic differences between cells -that inevitable will occur and accumulate -in all cases, the oscillations at the level of single cells will slowly desynchronize, which leads to dampening on the population levels.

Data analysis
We used data from two types of time course experiments: ligand activation and transcription inhibition (degradation where a, b, and c are component-dependent experimental constants and the bar indicates the mean value. The b (the Ct value, if the copy number equals 1 in the calibration experiment) and c (the exponential factor in the PCR, theoretically equal to -2 log10 = -3.32) values come from calibration experiments that establish the relationship between Ct-values and copy numbers in the cDNA sample ( Fig. S2). They allow determining the number for the respective molecule in the sample. Factor a is then used to obtain the copy number per cell as it is a product of the multiplication of: (i) the sample dilution factor (2,500 for mature mRNA, 625 for pre-mRNA), (ii) the inverse cDNA synthesis efficiency coefficient (1 for mature and 0.1 for pre-mRNA) and (iii) the inverse of the cell number in the culture (10 6 cells per culture dish). The cDNA synthesis efficiency coefficients are experimentally determined using an in vitro synthesized RNA added in known concentration to a cDNA synthesis mixture (see Methods for details). The resulting cDNA copy number was measured using standard curve  Tables S3-4).
In our fitting procedure we use the copy number data, see Tables S3, S4 and S5.
We discuss three sources of error in the data. ligand treatment and t=0 of the decay timecourse is observed. We have corrected for the discrepancy between t=180 of ligand treatment and t=0 of the decay timecourse by multiplying the copy numbers of the pre-mRNA decay data by a factor (0.3998) such that the average of the pre-mRNA ligand data at t = 180 min became equal to the average of the pre-mRNA decay data at t = 0 of that experiment. With this corrected pre-mRNA decay data we fit our model and we have assumed further conversion errors to be negligible.

Sample error:
The error resulting from the variation over cells. We assume the error in the copy number to be independent for each measurement and to be normally the standard deviation over more time points is questionable, since for some data courses (mature mRNA ligand and pre-mRNA degradation) the hypothesis of no correlation between time points is clearly rejected (Pearson correlation between time point and standard deviation for the mature mRNA ligand is 0.72 and for pre-mRNA degradation -0.78). Therefore, we use all data points in a Least Square Estimation fitting procedure, which is a standard procedure for the fitting over-determined systems. Assuming a scaled constant standard deviation for all time point error distributions, this results in a MLE. For the scaling we used a factor of 10 for the mature mRNA data as there is approximately one order of magnitude difference between the mean values of pre-mRNA and mature mRNA.

Mathematical models
In the ligand activation experiment the system is assumed to be initially in steady state. Let f be the ratio between the steady state value of pre-mature mRNA species (mature mRNA is less likely to reach steady state values as it has much slower degradation rate) in the steady states after and before the ligand addition. f is experimentally determined from the ratio of the average of 2-3 replicates of 5 late time points (t = 240, 255, 270, 285 and 300 min) and the average of all the data points at t = 0 (Table S3). For pre-mRNA the average at t ≥ 4 h is found to be 2.88 copies per cell, resulting in f of 2.69.

Model n = 0
The simplest model (see Fig. S4A) is given by the following equations: For the ligand activation experiment the input variable u ini = 1, for the degradation experiment, where initiation was stopped abruptly by addition of an inhibitor of initiation, u ini = 0.

Solutions
For this simple system a closed solution still gives insight, the pre-mature mRNA is a simple exponential function and the mature mRNA a combination of two exponentials.

Initial conditions
In the ligand activation experiment the system is assumed to be initially in steady state, but it is the steady-state condition without ligand addition. Since we assume ligand addition to influence only k ini the steady state before ligand addition should correspond to the solution of (1a) with k ini replaced by k ini /f. For the degradation experiment the initial states are unknown and have to be fitted to the data. So the initial conditions are given by the following equations: Where x 0 pro is an unknown parameter and f is an experimentally determined value of the RNA fold induction (f = 2.69).

Model n=2
The model that includes two step promoter activation (see Fig. S4B) is described by the following ODE system:

Solutions
For the n=2 simple system a closed solution exists but it does not provide useful insights.

Initial conditions
In the ligand activation experiment the initial state of the system is the steady state before ligand addition. The addition of the ligand is assumed to only have an effect TF binding step.
The basal level of the activating transcription factor is lower before ligand addition (hence To reach an f-times higher mRNA 'production' rate after ligand addition given two step initiation model, the following relationship between promoter constants should be satisfied: As f is an experientially determined factor (f = 2.69), this allows us to find the value of the k tf 0 : The initial conditions for system (6) are then given by the steady-state values of system (6) with the promoter equations (9a-c) replaced by: so that the initial conditions represent system before ligand addition. One of the promoter differential equations (6b) is replaced by mass conservation relation otherwise the system is singular.
We have used the following initial conditions for system (6): where u ini is defined as above, and the value is an unknown parameter to be fitted to the data.

Multi-step models n = 1, 3, 5, 10, 20
The multi-step model (see Fig. S4C) differentiates in the promoter cycle processes between activation, deactivation and reversion, and in the maturation between initiation/elongation, splicing, and maturation. It is given by the following equations:

Solution
Also this system of equations has a closed solution. Because we do not find it insightful, we omit this and use numerical solutions only.

Initial conditions
In the ligand activation experiment the initial state of the system is the steady state before ligand addition. The addition of the ligand is assumed to only have an effect on the first n1 0 steps of the activation process (see Fig. 1). The basal level of the activating transcription factor is lower in these first steps (hence k act 0 < k act ); if the ligand is added it reaches approximately the same level as the transcription factor for the second part of the activation.
To reach an f-times higher 'production' from the promoter cycle, the cycle time should be f times as fast, giving the following relationship between the activation rate constants before (k act 0 ) and after (k act ) the ligand addition.
The initial conditions for system (11) are then given by the steady-state values of system (11) with the activation equations (11a-b) replaced by Note that, if n1 = n1 0 , all corresponding k act in (11b) should also be replaced by k act 0 .
The system is singular because of the promoter cycle. Therefore one of the 'internal' equations, e.g. the first one above, is replaced by the mass conservation relation Let us denote these steady-state values by . Note that system (11) with its equation for activation replaced by (14) has to be solved for each new parameter vector in the search procedure.
We have used the following initial conditions for system (11): With u ini defined as above, and the value is an unknown parameter to be fitted to the data.

Fit procedure
We used the two time courses for the ligand activation and the time course for the pre-mature mRNA degradation to obtain the unknown parameters in our models. The mature mRNA degradation data (Fig. 2D) were employed for validation.

Model parameters
The parameter vector p to be fitted with the three time courses (pre-mature and mature mRNA from the ligand activation and pre-mature mRNA from the degradation experiment) is for the simple model (n = 0) given by for the two-step promoter activation model n=2 by and for the multi-step models (n = 1, 3, 5, 10, 20) The initial guesses for the parameter search are based on the rough manual fit of the experimental data. They are given by The initial guesses are then used to define the boundaries of the search domain used in the fitting procedure, which uses multiple initial parameter vectors. These search boundaries are [ ] for any kinetic constant k, and [ ] for any initial concentrations x 0 .

Model observables
We denote the observables by (o pR (t, p, u ini ), o mR (t, p, u ini )) for the pre-mature mRNA and the mature mRNA respectively. For the simpler models (n = 0 and n = 2) the model observables are and for the multi-step models (11a-g)

Object function
The distance measures per data set are where N denotes the number of time-points, n the number of data-points per time point, and ed the experimental data; the upper-indices stand for l-ligand addition experiment, ddegradation experiment, pR-pre-mature mRNA, mR-mature mRNA. The object function we use for the fit procedure is The first part (24a) If , and , the parameter vector with the highest object function is replaced by By repeating this, the worst fitting parameter vectors are removed continuously and replaced by ones with a better fit. Eventually, the points will form a cloud that gets denser and denser.
The algorithm stops when ( ( )) ( ( )) with s c the stop criterion. So the worst fit in the remaining collection has an at most % larger V value than that of the best fit. We used the following values: s c = 1.01, n Q = 300. The frequence with which a parameter value was found in new parameter vectors during the fitting is shown in Fig. S7.
The resulting optimal parameters are shown in Table S6.

Residual analysis
The first step in our a posteriori analysis is based on the residual values only. With this we can test whether we should reject the hypothesis that our mathematical model with the obtained parameter vector ̂ is an acceptable description of the data.
There are two types of residual test:

Size-based.
Here we test the probability that the obtained V( ) lies in the χ 2 distribution corresponding to our number of data-points and number of unknown parameters: χ 2 -test T χ 2 = V( )∈χ 2 (df). If the test value T > δ(α), the hypothesis is rejected. The δ is the value such that cumulative distribution function CDF (χ 2 (df)) = 1−α, with α the confidence level and df the degrees of freedom. The df is defined as df = N-m, where N is the number of data points fitted and m the number of parameters under assumption that the parameters are independent.

Correlation-based.
Here the probability is tested that the residuals are uncorrelated: runstest √ ∈ ( ), where R u is a function of sign changes in e( ), which is the vector of differences per timepoint between the model results and the average of the data points. Again if the test value T > δ(α), the hypothesis is rejected.
For all models 95% χ 2 -tests are passed, all pass the 95% runs-test with exception of n=0 model (both described above; the interval for 5% confidence interval is (-1.96,1.96)).

Model discrimination
Because models n = 0 and n = 2 have different number of parameters compared to multistep models with n = 1,3,5,10 and 20 we calculated Akaike Information Criterion (AIC) in order to objectively access which models describe the data better. The AIC is defined as: is the model complexity, defined as the number of independent parameters. For the n = 0 model , while for the n = 2 model and for multi-step models (rate constants describing mRNA production rate are co-dependent for all n>0 models, so the number of independent parameters is the total number of parameters minus 1). According to the AIC criterion all the multi-step models are statistically more likely than the simpler n = 0 and n = 2 models (Table S7).
Since all multistep models have similar properties according to the statistical tests we can try to further discriminate according to the value of the object function -smaller value means better overall fit. Additionally, smaller run-test values are also preferable. Keeping this in mind the better models seem to be n = 1, n = 3 and n = 5, although the differences between all multi-step models are not dramatic.

Sensitivity analysis (n = 5 model)
For local analysis the pairwise covariance coefficient of the parameters (Table S8)  Further, parameter dependent and independent 95% confidence intervals were calculated (Table 2). For global independent parameter sensitivity analysis, parameters were sampled one parameter at a time from a uniform distribution with limits four times lower and higher than the best-fit value. This resulted in N = 3,000 new parameter vectors with a single parameter perturbed. Then for each new parameter vector fold change of the object function compared the best fit was calculated. The results are plotted in Fig. S9.
Additionally, the co-dependent parameter analysis was done by sampling the three out of four co-dependent promoter activity rate constants (k act , k ini and k rev ) from respective uniform distributions. The ranges were as follows: 0.033 -6 min -1 for k ini , 0.025 -5 min -1 for k act and k rev .
The fourth constant (k dea ) was adjusted to keep the mRNA production rate at a fixed value determined by the following relationship: The rate of mRNA production found in the fitted model was 0.238. With exception of cases when k act deviates from the found by fitting by high percentage values the resulting fits were very close to the original fit (90% of fits have deviation less than 10% of the original).

Validation of the n = 5 model
To validate the n = 5 model the model output was compared to the mature mRNA decay data (Table S5). Since the mRNA decay was measured after 180 min of ligand treatment and addition of the inhibitor results in inhibition of initiation at the TSS the following initial conditions were used: Global analysis of the independent parameter sensitivity of the validation fit was carried out in the same manner as the model fit. The results indicate that the validation has the same sensitivity as the model fit with the exception of being insensitive to k spl value (Fig. S10).

Refitting the ligand induction time-course after splicing inhibition
The additional data after splicing inhibition was treated in the same way as the other datasets (Table S9).
The object function was defined as sum of the distance measures: Each single parameter was allowed to change within 4-fold of its original value and the model was fitted anew to the perturbed dataset (Table S9).
In the second round (run 2) in Table S19 the changed k spl (which produced the best fit in run1) was inserted into the model and the model was fitted varying each of the remaining rate constants by 4-fold. The list of the factors by which constants were adjusted and of the residuals is presented in Table S10.
The sensitivity of the refit to the individual constants perturbations performed in the same way as in the a posteriori analysis of the original model is presented in Fig. S14.
Finally, the dependence of the prediction quality on the specific parameter values was tested. All but the adjusted parameters were sampled independently according to the same procedure as for the parameter sensitivity analysis of the original fit. Then for each new parameter vector the fold changes of the object function compared to the optimal parameter vector were calculated for both (the original and perturbed) models. The relationship between fold change in the object function of the original model and the perturbed is plotted Fig. S13 (blue). Same type of sensitivity analysis was done by sampling co-dependent promoter activity rate parameters (k act , k dea, k ini and k rev ) according to same procedure as for the original fit (Fig.   S13, violet).  All models start with initial conditions corresponding to zero transcription activity (gene regulatory region in inactive state) and same steady state for mature mRNA.       Independent parameter changes were sampled for all parameters (except k spl and k deg ) at random using the same procedure as for the parameter sensitivity analysis described above.
Co-dependent changes in the promoter parameters (k act , k ini , k dea and k rev ) were sampled as well according to the described procedure. The respective fold deviation from the best fits of the original and perturbed models were calculated and plotted against each other. The results obtained with independent parameter sampling are shown in blue; the ones obtained with dependent parameter sampling are in purple.    Table S3: pre-mRNA species copy numbers of the PPAR ligand induction experiments. Calculated copy numbers of the experiments reported in Fig. 2A (pre-mRNA, observable 1).
Time t is in minutes, n refers to the number of replica data points at the same time point and N is the number of time points examined.    Table S6: Best fitting parameters. These parameters are used for the models n = 0, n=2, n = 1, n = 3, n = 5, n = 10, n = 20; k tf and k pol are indicated for model n = 2.   Table S8: Pairwise covariance coefficients of the parameters in the n = 5 model.