Constructing maps between distinct cell fates and parametric conditions by systematic perturbations

Abstract Motivation Cell fate transitions are common in many developmental processes. Therefore, identifying the mechanisms behind them is crucial. Traditionally, due to complexity of networks and existence of plenty of kinetic parameters, dynamical analysis of biomolecular networks can only be performed by simultaneously perturbing a small number of parameters. Although many efforts have focused on how cell states change under specific perturbations, conversely, how to infer parametric conditions underlying distinct cell fates by systematic perturbations is less clear and needs to be further investigated. Results In this article, we present a general computational method by integrating systematic perturbations, unsupervised clustering, principal component analysis, and fitting analysis. The method can be used to to construct maps between distinct cell fates and parametric conditions by systematic perturbations. In particular, there are no needs of accurate parameter measurements and occurrence of bifurcations to establish the maps. To validate feasibility and inference performance of the method, we use toggle switch, inner cell mass, and epithelial mesenchymal transition as model systems to show how the maps are constructed and how system parameters encode essential information on cell fates. The maps tell us how systematic perturbations drive cell fate decisions and transitions, and allow us to purposefully predict, manipulate, and even control cell states. The approach is especially helpful in understanding crucial roles of certain parameter combinations during fate transitions. We hope that the approach can provide us valuable information on parametric or perturbation conditions so some specific targets, e.g. directional differentiation, can be realized. Availability and implementation No public data are used. The data we used are generated by randomly chosen values of model parameters in certain ranges, and the corresponding parameters are already attached in supplementary materials.


TS
The TS model with ten parameters takes the form Unless otherwise noted, all parameters obey uniform distributions [1].
In the ODEs, g a represents the production rate of gene A and belongs to [1,100], and k a represents the degradation rate of gene A and the interval is [0.1, 1].H − (B) denotes inhibition rate of A by B, and is given by where τ b is the threshold of B and follows the half-function rule; n b is the integer Hill coefficient selected from [1,6].λ b is the fold change with 1/λ b ∈ [1,100].

EMT network
EMT can be described by the following set of ODEs with nine state variables and forty-four parameters [2].

Parameters
In order to better demonstrate wide application of the method, we apply it to study cell fate decisions in the inner cell mass composed of two antagonistic transcription factors, i.e., Nanog and Gata6, which control the differentiation of the ICM into Epi and PrE.See [3] for more details on the model.The network composed of Nanog, Gata6, Fgfr2 (FR), Fgfr/Erk (ERK), and the averaged level of Fgf4 secreted by the cell and its four neighbors (Fs), is shown in Fig. S1.The Fgf/RTK signaling pathway, Nanog and Gata6 are crucial factors in balancing cell fate specification between Epi and PrE.Through the influence of their Fgf/RTK environment, ICM cells will present a larger proportion of cells differentiating into PrE by culturing with recombinant Fgf4 [4], while inhibiting the Erk signaling pathway also prevents their differentiation into PrE and maintains in a pluripotent state [5].
To clearly distinguish three categories of steady states, we divide them into Epi-like, ICM-like, and PrE-like types, representing different cell phenotypes, respectively.To generate each phenotype, the extracellular concentration of Fgf4 is chosen as Fp ∈ [0 0.15] unit and other parameters are uniformly distributed in the interval (75%p 0 , 125%p 0 ), where p 0 is the basal value of p, as shown in Table.S4.
To construct the map, 10,000 parameter sets are randomly chosen, at which 5,417 parameter sets produce only Epi-like or PrE-like state by computing the ODE model of five state variables, namely 'G', 'N', 'FR', 'ERK', and 'Fs', representing the levels of Gata6, Nanog, Fgfr2, Erk, and the amount of Fgf4 proteins secreted, respectively.By the average linkage unsupervised hierarchical clustering analysis through the Euclidean distance metric method, the clusters form a tree-type structure based on the hierarchy, as shown in Fig. S2(A).Two qualitatively different clusters are obtained, corresponding to Epi-like and PrE-like state, respectively.The scatterplot of two steady state clusters against the first two principal components, i.e., PC1 and PC2, is shown in Fig. S2B.Note that to improve distinguishability of two distinct phenotypes, we remove the data of coexistence of bistable or tri-stable under the same parameter combinations.
Then, we use PLS regression to predict which state the system will stay at under a new parameter set.Now we fits the regression model using parameter sets and PC1 of the steady state sets as before, which contains 81.19% information revealed by the stable steady state data.The fitted response though PLS regression analysis is shown in Fig. 3. Classification results show good accuracy, and more than 98% correct one-to-one correspondence between a parameter set and its corresponding Epi-like or PrE-like phenotype can be predicted, as shown in Table 2.For any parameter set, when the fitted response value is lower than -0.00074, the phenotype will be judged as Epi-like.While when the value is higher than 0.4738, the phenotype will be judged as PrE-like, as shown in Fig. 3A.When cell fate is pre-determined, we perform regression for each phenotype separately and obtain better fitting results and importance order, as shown in Fig. 3B.For the tri-stable case, we also use elliptical regions for classification, as shown in Fig. S4A.Some data distributed in the same ellipse as Epi and PrE cells are considered as Epi-like (green dots in the ellipse of Epi-like region) or  Finally, we use boxplots to represent the differences in values of random parameters at different phenotypes and mark out the important ones, as shown in Fig. S4 B, which corresponds well to biological facts.For example, existing studies show that Nanog is necessary for the differentiation of Epi cells, whereas Gata6 is required for producing the PrE epithelium in vitro and in vivo [3].Moreover, the activation or the blockade of the Fgf/RTK pathway biases cell fate specification towards either PrE or Epi, respectively.Consistent with these evidences, box plots show the same tendency.More exactly, values of parameters vsn 1 and vsn 2 are obviously larger in Epi cells, which results in higher Nanog level.While the negatively correlated parameter kdn with apparently smaller values can induce the same tendency.Similarly, larger vsf r 2 and smaller kdg induce high expression of Gata6 in PrE cells.Fp, i.e., the extracellular concentration of Fgf4, is important in both cells.It is significantly low expressed in Epi and significantly high in PrE cells.

Main method
The approach is developed with the focus mainly on how to construct maps between distinct cell fates and parametric conditions by systematic perturbations.It starts by randomly perturbing all parameters based on basal values within a defined interval to generate steady state data, which are then standardized and normalized for next operation.

Data preprocessing
For each randomly perturbed parameter, we obtained the steady states.Because of their different units or orders of magnitude, each parameter needs standardization and each steady state should be normalized too before systematic analysis. Standardization: Normalization: Here min(p i ), max(p i ), µ(x), and σ(x) denote the minimum and maximum of p i , mean and standard deviation of x, respectively.

Identify the multi-stability
For each parameter set p = (p 1 , p 2 , • • • , p s ) T ∈ R s , one or more solutions can be obtained by solving the ODEs, i.e. monostability or multistability.We use it as the class label of training sets and then perform supervised kernel support vector machines (KSVM) classification by running holdout cross-validation.The test accuracy is around 98.6% and 87.6% for TS and EMT network, respectively.For each parameter set chosen within the perturbation intervals, we can predict whether it is monostable or multistable under the given parameter set.

Clustering
For the case of monostability, we use clustering to identify distinct robust state categories.Here we applied average linkage unsupervised hierarchical clustering analysis through the Euclidean distance metric method, as shown in

Dimensionality reduction
Steady state data are generally high-dimensional, and therefore hard to interpret intuitively.Dimensionality reduction needs to be performed by PCA for linearly divisible case or Kernel Principal Component Analysis (KPCA) for linearly indivisible case, which essentially seeks the derived coordinate to capture as much variability in the data as possible.

Divide the dataset
To verify the classification performance of the approach, for all clusters, 70% of the steady state sets, i.e., X q ′ ×m , and parameter sets related to X q ′ ×m , i.e., P q ′ ×s , form the training sets.Relatively, the remaining 30% act as the test sets.
Through the learning of training sets by PLS regression, we obtain the coefficients and substitute them into test sets.The theoretical details can be understood in the following explanations.The obtained results are compared with known clusters to validate classification efficiency of the method.

Principle of PLS regression and VIP score
To construct maps between cell fates and their perturbation mechanisms, many regression methods can be considered [6].Different from other linear models, such as least squares method, PLS regression is helpful to deal with the noisy and multicollinearity relationship.Compared with nonlinear models, PLS regression hyperplane is a multivariable linear equation, and its regression coefficients can visually indicate the importance ordering of parameters.Besides, PLS regression has stronger interpretability than machine learning methods, and the latter is more of a "black box" model, or more annoying computational complexity.Additionally, PLS regression deals with the multidimensional explanatory variables by extracting latent variables, which is fully consistent with the research related to biological networks.
Based on the actual characteristics of biological networks, the strength of simple PLS regression calculation, strong interpretability and relatively high prediction accuracy, we choose PLS regression to construct maps between cell fates and certain parametric conditions.
For the given explanatory variables P q ′ ×s and monostable steady state data X q ′ ×m , we first extract the first pair of vectors of them and call them t 1 and u 1 , which satisfy the following equations [7,8] which create a linear relationship of parameters and steady states, respectively, meaning the first axis of the elements of P q ′ ×s and X q ′ ×m .Meanwhile, for parameter sets, in order to predict steady state sets better, their principal components are required to have the largest covariance, i.e., max {cov(t 1 , u 1 )}.After a tedious manipulation, we obtain: where θ 2 i (i = 1, 2) is the maximum eigenvalue of matrices P ′ XX ′ P and X ′ P P ′ X, w 1 and c 1 are their corresponding unit eigenvectors, respectively.At this point, their correlativity has been obtained and the foundations have been established as follows where E, F , and G are the corresponding residual matrices in regression equation.Besides, we gain the regression coefficient vectors q 1 , r 1 , b 1 and get the relationship between w 1 and q When the precision does not meet expectation, the residual matrices are used as new explanatory matrix and new response matrix until the cutoff condition is met and if P q ′ ×s extracts h components we obtain Combining w T i t j = 1 (i = j) and w T i t j = 0 (i ̸ = j), and the matrix form can be expressed as where A = W B T .The variable importance in projection (VIP) for parameter sets is calculated by: i = 1, . . ., q ′ , j = 1, . . ., s, k = 1, . . ., h, where p ij , x i are the elements in parameter sets P q ′ ×s and PC1 of steady states X q ′ ×m , respectively; t k is the seleted PLS component; q ′ and s are the number and dimensionality of chosen parameter sets, h is the number of PLS components; q kj , b k , and a ij defines the linear relationship between p ij , t k , and x i , respectively.Besides, e ij , and f i are the residual errors of parameter sets and steady state sets.VIP scores are often used to identify the importance of each parameter in determining each cell fate.

Figure 1 :
Figure 1: The regulatory network controlling the differentiation of the inner cell mass into Epi and PrE phenotypes.

Figure 2 :
Figure 2: Two qualitatively different clusters, corresponding to Epi-like and PrE-like cells, respectively.(A) Dendrogram obtained from hierarchically clustering the steady state data.(B) The data are projected onto the plane of the first two principal components.

Figure 3 :
Figure 3: The correlation between the first principle component PC1 and the fitted response.(A) Two distinct steady state sets of Epi-like and PrE-like are integrated when fitting.(B) Better fitting when regression is performed for each phenotype separately.

Figure 4 :
Figure 4: Steady states and parameters distribution of different phenotypes.(A) Classification of the multiple stable states in the ICM network by confidence ellipses with the first two PCs.(B) Box plots of all parameters.
Fig 2B, Fig 3B and Fig 5A.Each column represents a state variable, i.e., a molecule, x i (i = 1, • • • , m), and each row represents a stable steady state x

Table 2 :
Classification accuracy of ICM network CIs Epi -like (780 Test sets) PrE-like (807 Test sets) PrE-like (green dots in the ellipse of PrE-like region) state.While others are considered as ICM-like state, as shown by the green dots in the blue ellipse.

Table 3 :
Except for extracellular concentration of Fgf4 , i.e., Fp, parameters are uniformly distributed in (75%p 0 , 125%p 0 ), where p 0 is the basal value of each parameter.