Scalable inference of cell differentiation networks in gene therapy clonal tracking studies of haematopoiesis

Abstract Motivation Investigating cell differentiation under a genetic disorder offers the potential for improving current gene therapy strategies. Clonal tracking provides a basis for mathematical modelling of population stem cell dynamics that sustain the blood cell formation, a process known as haematopoiesis. However, many clonal tracking protocols rely on a subset of cell types for the characterization of the stem cell output, and the data generated are subject to measurement errors and noise. Results We propose a stochastic framework to infer dynamic models of cell differentiation from clonal tracking data. A state-space formulation combines a stochastic quasi-reaction network, describing cell differentiation, with a Gaussian measurement model accounting for data errors and noise. We developed an inference algorithm based on an extended Kalman filter, a nonlinear optimization, and a Rauch-Tung-Striebel smoother. Simulations show that our proposed method outperforms the state-of-the-art and scales to complex structures of cell differentiations in terms of nodes size and network depth. The application of our method to five in vivo gene therapy studies reveals different dynamics of cell differentiation. Our tool can provide statistical support to biologists and clinicians to better understand cell differentiation and haematopoietic reconstitution after a gene therapy treatment. The equations of the state-space model can be modified to infer other dynamics besides cell differentiation. Availability and implementation The stochastic framework is implemented in the R package Karen which is available for download at https://cran.r-project.org/package=Karen. The code that supports the findings of this study is openly available at https://github.com/delcore-luca/CellDifferentiationNetworks.


Introduction
Haematopoiesis is the process responsible for maintaining the number of circulating blood cells that are undergoing continuous turnover.This process has a tree-like structure with haematopoietic stem cells (HSCs) at the root node (Cooper and Adams 2023).Each cell division gives rise to progeny cells that can retain the properties of their parent cell (self-renewal) or differentiate, thereby "moving down" the haematopoietic tree.As the progeny cells move further away from the HSCs, their pluripotent ability is increasingly restricted.Clarifying how HSCs differentiate is essential for understanding how they attain specific functions and offers the potential for therapeutic manipulation (Kawamoto et al. 2010).Several mathematical models have been proposed to describe haematopoiesis in vivo.One of the first stochastic models of haematopoiesis was introduced in the early 1960s suggesting that it is the population as a whole that is regulated rather than individual cells that behave stochastically (Till et al. 1964).
Since then, many other works aimed at describing the dynamics of haematopoiesis as a stochastic process.For example, in 2002 a single cell-based stochastic model was proposed to describe the evolution of HSCs as a result of switching between a nonproliferating and a proliferating environment, with transition probabilities, leading to a microenvironment-dependent competition of cells in a stochastic sense (Roeder and Loeffler 2002).By simulating several growth scenarios, the authors were able to describe stem cell kinetics, individual clone tracking, fluctuating clonal contribution, and to compare the theoretical results with experimental findings (Roeder et al. 2005).More recently, various studies analysed data generated by advanced lineage tracing protocols to calibrate novel mathematical models of HSCs differentiation.For example, in 2001 an inference method of a two-compartment hidden Markov model of haematopoiesis was proposed (Catlin et al. 2001).The model parameters were calibrated on time-series of cellular binary markers from a hybrid cats study.Since data from one of the two compartments were not observed, the parameters were estimated by solving the Kolmogorov forward equations and using a nonlinear least squares inference approach.Later in 2010 a stochastic model of haematopoiesis was proposed to keep track of HSCs differentiation in multiple cell compartments (Dingli and Pacheco 2010): the cells move within a single-branch structure with unknown probability from the smallest bone marrow compartment, housing active HSCs, up to the largest compartment hosting cells that are leaving the bone marrow to enter the circulation.The parameters of this model were calibrated on granulopoiesis data.Subsequently, in 2019 a multidimensional continuous-time Markov model was developed to describe the rates of cell differentiation in a hierarchical fashion (Pellin et al. 2019).The likelihood of this model is derived from a local linear approximation of the biochemical Master equation, and the parameters are estimated via a penalized least squares method from clonal tracking data.In late 2019, another stochastic process of haematopoiesis was formulated as a continuous-time, multi-type branching processes, whose parameters were estimated using momentbased techniques from cellular barcoding data (Xu et al. 2019).More recently in 2020, a Bayesian networks framework has been proposed to describe cell differentiation (Di Serio et al. 2020): the process of haematopoiesis was described through different levels of cell differentiation and the corresponding parameters were estimated using clonal tracking data from gene therapy clinical trials.Some of these methods are able to take into account missing cell types, such as those that are difficult to collect from the bone marrow (Catlin et al. 2001, Xu et al. 2019), but none considers the bias provided by false-negative clonal tracking errors.The state-of-the-art methods assume that missing clone observations correspond to minimal clones and set the corresponding counts to zero.But this hypothesis is too restrictive, because it does not take into account other technical sources of false-negative errors, such as low-informative sample replicates and threshold detection failure (Kim et al. 2019).Besides, it has also been shown that false-negative errors strongly depend on calling pipeline parameters, as well as read coverage (Bobo et al. 2016).
To overcome the limitations of the existent approaches, we propose a novel stochastic framework aimed at investigating mechanistic models of cell differentiation from clonal tracking data while cautiously treating all the undetected values as nonmeasurable states.More precisely, we model cell differentiation using a continuous-discrete state-space formulation including a system of Ito ˆ-type stochastic differential equations (SDE) describing clonal dynamics, coupled with a measurement model that links the sparse and noisy corrupted measurements to the underlying process' states.In Section 2, we provide a formal definition of our modelling approach along with an expectation-maximization (E-M) algorithm, based on extended Kalman filtering (EKF) and Rauch-Tung-Striebel (RTS) smoothing, to infer the unknown parameters.In Section 3, we extensively test the method on several simulation studies including a direct comparison with the prior art and we apply our framework to five in vivo high dimensional clonal tracking datasets, comparing four biologically plausible models of cell differentiation.In Section 4, we discuss our results from both a methodological and biological perspective.

Materials and methods
A concise graphical representation of our proposed framework of Kalman reaction networks (Karen) is shown in Fig. 1.The input consists of a clonal tracking dataset, including clone-specific information on the number of cells generated for each lineage over time, and a set of biochemical reactions coding for cell duplication, cell death, and cell differentiation.Inference is done via an E-M algorithm.The E-step is based on a Kalman filter/smoother, aimed at estimating the state variables given the parameters inferred from the M-step.While in the M-step, a nonlinear optimization method updates the unknown parameters given the states estimated by the E-step.Both steps are iterated until convergence is reached.The inferred cell differentiation network is returned as the main output.More details on the input syntax can be found in the documentation of our R package Karen, which we published along with this article.The following subsections provide details on the state-space formulation of cell differentiation and the E-M algorithm.

A stochastic model for cell differentiation
Here, we assume that cell duplication, cell death and cell differentiation can take place from time t to time t þ Dt in a combinatorial number of ways directly proportional to the cell counts at time t and the corresponding rate parameters.This hypothesis is equivalent to the chemical law of mass action (E ´rdi and To ´th 1989) whereby cell duplication, cell death, and cell differentiation are biochemical reactions that can be properly described by stochastic quasi-reaction networks.Consistently with the definition of a chemical reaction network of Supplementary Section S1, we consider a Markov process (1) of one clone and n cell types evolving in a time interval ðt; t þ DtÞ according to a set of K distinct biochemical reactions whose net-effect vectors fv k g K k¼1 and hazard functions fh k ðx t ; hÞg K k¼1 are defined as where i(k) and j(k) are the cell types possibly involved in the kth reaction, and jðkÞ 2 OðiðkÞÞ, where 2 Del Core et al.
OðiÞ ¼ fjjk ij > 0g (3) is called the offspring set of cell type i.The hazard functions include a linear growth term x iðkÞt a iðkÞ with duplication rate a iðkÞ > 0, a linear term x iðkÞt d iðkÞ for cell death with a death rate d iðkÞ > 0, and a linear term x iðkÞt k iðkÞjðkÞ for cell differentiation from cell lineage i(k) to cell lineage j(k) with rate k iðkÞjðkÞ > 0. The vector parameter appearing in the hazard functions, includes all the dynamic parameters, where k 0 iOðiÞ is the vector of all the differentiation rates from cell lineage i to its offspring set OðiÞ.Finally, we define the net-effect matrix and the hazard vector as

State-space formulation
Since the aim of this work is to calibrate the parameters of the continuous-time stochastic model defined by Equations (1-6) on clonal tracking data that have been collected at discrete time points, we use a continuous-discrete state-space model (Del Core 2023).In this formulation the dynamic component is the system of Ito ˆ-type SDEs where lðx; hÞ and bðx; hÞ are defined by Equations (1-6), combined with the measurement model where G t is a d Â n matrix selecting only the measurable states of x t subject to an additive noise r t , and x t is a shorthand notation for xðtÞ.The covariance matrix R t models the measurement noise as a linear function G t x t of the process states x t via the vector parameter q ¼ ðq 0 ; q 1 Þ 0 , thus allowing to increase noise intensity with the magnitude of cell counts.Our proposed state-space formulation of Equations ( 7) and ( 8) can be interpreted as a hidden Markov model where all the states in x are latent, and some of these are measured as y through the measurement model of Equation ( 8).

Optimal filtering and smoothing
Consider the state-space model defined by Equations ( 7) and ( 8).Let y 1:s be the vector of measurements collected at time for the distributions involved in the dynamic and measurement models of Equations ( 7) and ( 8), the aim of optimal filtering and smoothing is to estimate pðx k jy 1:s ; wÞ; w ¼ ðh 0 ; q 0 Þ 0 ; (10) called predictive (k > s), filtering (k ¼ s) and smoothing (k < s) distributions, as a replacement of the (usually intractable) distribution pðx 0:s jy 1:s Þ.Assuming a prior distribution x 0 $ N n ðx 0 jm 0 ; P 0 Þ for x t at t ¼ 0, the distributions of Equation ( 10) are Gaussian, whose first two moments, and the underlying vector parameter w, can be estimated by our proposed iterative algorithm which is summarized as follows (see Supplementary Section S6 for details): 1) Prediction: Solve the differential moment equations (DMEs) to obtain the first two moments m Ã k and P Ã k of the predictive distribution at time t k (k ¼ 1; . . .; s), where V h x is a reformulation of Vhðx; hÞ as a linear function of x.
2) Update: Compute the first two moments m k and P k of the filtering distribution at time t k (k ¼ 1; . . .; s) via the following correction step where e ðÁÞ is the matrix exponential operator.We use a gradient-based method to solve the optimization problem of Equation ( 13).The gradient r w 'ðwjy 1 ; . . .; y s Þ of the marginal log-likelihood of the measurements is estimated numerically with p þ q additional prediction and update steps, at time t k , k ¼ 1; . . .; s, to compute all the partial derivatives of l k ðwÞ and S k ðwÞ w.r.t.w, where p and q are the dimensions of h and q.Steps 1-4 are iterated until convergence is reached.The inference procedure is summarized in Supplementary Algorithm S2.

Transition probabilities
The transition probability p ij from cell type i to cell type j is defined as the multinomial probability where OðiÞ is the offspring set of cell type i, as defined by Equation (3).

Reaction constraints
To ensure identifiability of parameters in Equation ( 13) that involve only unobserved cell types, we use the following conservation laws where OðbÞ is the offspring set of cell type b, as defined by Equation (3), and the linear constraints where v u and v o are the sets of nonmeasured and measured cell types, BðxÞ represents the branch (Myeloid or Lymphoid) of cell lineage x, and finally AðxÞ is defined as the ancestor of cell type x.The constraint of Equation ( 16) can be viewed as a conservation law ensuring that all the cells differentiating from a cell lineage b to its offspring cells c j , j ¼ 1; . . .; jOðbÞj, are those that were previously produced by an ancestor a of b.The first constraint of Equation ( 17) assumes that an unobserved cell type a u may differentiate into an unobserved cell type b u with a rate that equals the average rate of differentiation of a u into its observed offspring cells b o j ; j ¼ 1; . . .; m.The last two constraints of Equation ( 17) state that an unobserved cell type a u may duplicate (or die) with a rate that equals the average duplication (death) rate of the observed cells a o j ; j ¼ 1; . . .; l, belonging to the branch Bða u Þ and sharing an ancestor with a u .

Haematopoietic models
The candidate models of cell differentiation are shown in Fig. 2. The simplest model (a) is a single-branch developmental tree where the HSCs produce all the blood cells through a single multipotent progenitor (MPP) that first differentiate into common myeloid-lymphoid progenitors (CMLP) giving rise to the mature blood lymphoid cells (NK, B, and T) without any further intermediate progenitor.Whereas the erythroid cells (platelets P and erythrocytes ERY) and myeloid cells (granulocytes G and monocytes M) are produced via the Megakaryocyte-erythroid progenitor (MEP) cells and the impaired adult myeloid progenitor (GMP) cells respectively.According to model (b) the lymphoid cells (T, B, NK) and the myeloid/erythroid cells (G, M, P, ERY) are generated through separate branches of differentiation.In particular, MPP cells first differentiate into either common myeloid progenitor (CMP) cells or common lymphoid progenitor (CLP) cells, before giving rise to MEP and GMP cells.This model is known as classical/dichotomic model of haematopoiesis (Kawamoto and Katsura 2009).In contrast, model (c) proposes the idea that myeloid progenitors represent a prototype of haematopoietic cells capable to produce both myeloid (G, M) cells, erythroid (P, ERY) cells, and lymphoid NK cells, whereas lymphoid T, B, and NK cells represent specialized types produced by the CLP cells.Indeed, CMP cells not only produce MEP and GMP cells, but also NK cells.This model is known as the myeloid-based model (Kawamoto and Katsura 2009).Finally, model (d) assumes that lymphoid T/B cells and lymphoid NK cells develop through different branches.While B and T cells are produced by the common progenitor CLP, the NK cells are generated by a separate progenitor NKP.

Model selection
The candidate models of cell differentiation are scored according to the Akaike information criterion (AIC) (Burnham et al. 2011), i.e.
where ' M is the marginal log-likelihood of the measurements of model M and p M its number of parameters.

Results
We tested our proposed method in several simulation studies based on the validation scenarios from Fig. 3.These networks are sufficiently complex and diverse, in terms of number of nodes and depth, to assess the performance of our method against recovery of the true data generative process and underlying parameters.After validating our method, we analysed data from two preclinical studies and three gene therapy clinical trials to shed light on cell differentiation and haematopoietic reconstitution.The specific results are reported in the next subsections.

Validation and comparison with the prior art
We tested our proposed method Karen and compared it with the state-of-the-art approaches, such as the generalized least squares (GLS) method (Pellin et al. 2019), the maximum likelihood method RestoreNet (Del Core et al. 2023), and the branchCorr method (Xu et al. 2019).The validation and comparisons were made in terms of robustness against (i) the sampling frequency s, (ii) the fraction f of false-negatives, and (iii) the magnitude of the measurement noise parameters q 0 and q 1 .To make the comparison of our method with the other candidates possible, we used the cell differentiation structure of Fig. 3a as the true generative model, whose type of biochemical reactions is the only one that can be handled by Xu et al. (2019).The corresponding biochemical reactions and the system of SDEs were defined following Equations (1-6), using the state-space formulation of Equations ( 7) and (8).Two different comparative synthetic studies have been designed.In the first one all the cell types were measured, thus branchCorr was not included, since it does not allow for observed progenitors.In the second study the synthetic HSCs and progenitors P1-P2 were considered as unobserved states, and therefore GLS and RestoreNet were excluded from this comparison, since both methods do not allow for unobserved states.We used the Euler-Maruyama Algorithm S1 of the Supplementary Information to forwardsimulate 100 independent stochastic trajectories of three clones from the true generative data process.Details on parameters and initial condition x 0 used for the simulations are reported in Supplementary Tables S1 and S2.Then we used our proposed framework Karen and the other methods to infer the unknown parameters.Inference with Karen has been carried out using Supplementary Algorithm S2, with and without the conservation laws of Equation ( 16) that ensure identifiability of the parameters involving only unobserved cell types.Results from Fig. 4 clearly indicate that our proposed method outperformed the prior art.In particular, Fig. 4a provides evidence that our proposed method is the most robust with respect to the false negative errors compared to the other methods, which provided more biased estimates for the parameters for a high percentage f ¼ 90% of missing data.Subsequently, Fig. 4b shows that a low sampling frequency (s ¼ 4) of the simulated trajectories did not affect the estimates provided by our proposed method, whereas those obtained with any of the competitor approaches were biased.Finally, Fig. 4c suggests that after increasing the magnitude of the measurement noise parameters q 0 and q 1 up to 10, our proposed method still provided better estimates compared to the other methods.Further results for different values of f, s, q 0 , and q 1 can be found in Supplementary Section S7.In conclusion, the results from our synthetic studies show that overall our method outperformed the prior art in terms of false negative errors, sampling frequency and measurement noise.Details on the computational complexity are reported in Supplementary Section S10.

Model misspecification
We tested our method against model misspecification with an in-silico study.The cell differentiation networks of Fig. 3b  and c were used as true generative models that we crosscompared for simulating and fitting.The corresponding biochemical reactions and the system of SDEs were defined following Equations (1-6), using the state-space formulation of Equations ( 7) and ( 8).We performed 100 independent simulations of the stochastic trajectories for three clones using the Euler-Maruyama Algorithm S1 of the Supplementary Information.Details on parameters and initial condition x 0 used for the simulations are reported in Supplementary Tables S3 and S4.Then we fitted both candidate models using the inference Supplementary Algorithm S2.Forward simulation and fitting has been carried out by using the conservation laws of Equation ( 16), so as to ensure identifiability of the parameters involving only unobserved cell types.As a result, Fig. 5a indicates that our method performed well in model selection, since the true models had the lowest median AIC, as defined by Equation ( 18), over 100 independent simulations.Moreover, Fig. 5a suggests that the true models yielded better fits in terms of the smoothing moments m s kjs and P s kjs (k ¼ 1; . . .; s) compared to the wrong models.

Scalability to complex networks
We performed a synthetic study to explore the scalability of our proposed method to more complex network structures.We used the cell differentiation network of Fig. 3d as the

Scalable inference of cell differentiation networks
indicate that in SFV-treated tumour-prone mice there was a more pronounced unbalance in cell differentiation from the multipotent progenitors (MPP) towards common myeloid progenitors (CMP) compared to the PGK treatment.Therefore our proposed framework Karen suggests that, in this particular study, the design of viral vector did not significantly affect the structure of cell differentiation in tumourprone mice, but had an impact on the transition probabilities pðMPP !CMPÞ and pðMPP !CLPÞ, representing differentiation from the multipotent progenitors compartment (MPP) in either common myeloid (CMP) or lymphoid (CLP) progenitors.

Rhesus Macaques study
We analysed an in vivo clonal tracking dataset collected from Rhesus Macaques (Wu et al. 2014).HSCs were first barcoded by using lentiviral vectors and then transplanted in three animals.Barcode retrieval was performed monthly via PCR on Granulocytes (G), Monocytes (M), T, B, and NK cells up to 9.5 months.Further details on transductions protocol and culture conditions can be found in Wu et al. (2014).Although the sample DNA amount was maintained constant during the whole experiment, the samples resulted in different magnitudes of reads (see Supplementary Table S11), making the data not directly comparable.Therefore, we rescaled the barcode counts as described in Supplementary Section S12 before analysis.The total numbers of clones that were collected range in 1165-1291, but we focused on the top 1000 most recaptured ones, across lineages and time, so as to further remove bias.We fitted the same four candidate models from the previous section on the clonal tracking data using Supplementary Algorithm S2.For each model the system of SDEs was defined following Equations (1-6), using the statespace formulation of Equations ( 7) and ( 8).In analogy with the previous section, inference has been performed by using the conservation laws of Equations ( 16) and ( 17) that ensure identifiability of the parameters involving only unobserved cell lineages.Each candidate model has been scored according to the Akaike information criterion (AIC) of Equation ( 18) and we report the results on model selection in Table 1.
According to the AIC, model (c) is the one that best fitted the clonal tracking data collected from the rhesus macaque study.The estimated cell differentiation network is reported in Fig. 6.This result suggests that the classical/dichotomic model (b) failed to describe adequately clonal dynamics in rhesus macaques, whereas the myeloid-based developmental model (c) better explained haematopoietic reconstitution.Our proposed framework Karen clearly indicates that myeloid progenitors represent a prototype of haematopoietic cells capable to produce both myeloid (G, M) cells, erythroid (P, ERY) cells and lymphoid NK cells in primates.

Gene therapy clinical trials
We considered clonal tracking data collected from six patients affected by three different genetic disorders.The six patients underwent a haematopoietic stem and progenitor cell (HSPC) gene therapy treatment.Vector integration sites in five cell lineages (G, M, T, B, and NK) were collected longitudinally from the peripheral blood of four patients affected by Wiskott-Aldrich syndrome (WAS) (Hacein-Bey Abina et al. 2015), two patients with b haemoglobinopathy, one with bS/ bS sickle cell disease (Ribeil et al. 2017), and one with b0/bE b thalassaemia (Thompson et al. 2018).Details on procedures, gene therapy protocols, and normalization methods  The following results stem from the analysis of the 1000 most recaptured clones, across lineages and time, in each clinical trial.The four haematopoietic models of Fig. 2 have been scored separately in each clinical trial using Supplementary Algorithm S2.For each model the system of SDEs was defined following Equations (1-6), using the state-space formulation of Equations ( 7) and ( 8).In analogy with previous sections, inference has been carried out by assuming the linear constraints of Equations ( 16) and ( 17) ensuring identifiability of the parameters that involve only the unobserved cell types.For each candidate model we computed the Akaike information criterion (AIC), as defined by Equation ( 18), and we report the results on model selection in Table 1.According to the AIC, model (d) is the one that best fitted clonal tracking data collected from each clinical trial, and the corresponding cell differentiation networks are reported in Fig. 6. Results suggest that a three-branches developmental model best explained haematopoietic reconstitution in these gene therapy clinical trials.While lymphoid (T, B) and myeloid/erythroid cells (G, M, P, ERY) developed in parallel through separate branches from different progenitors, NK cells appear to be sustained by a dedicated progenitors' cell population.

Discussion
We have proposed a novel stochastic framework for calibrating cell differentiation networks from partially observed highdimensional clonal tracking data.Our model is able to deal with experimental clonal tracking data that suffers from measurement noise and low levels of clonal recapture due to either threshold detection failures or false-negative errors.Our proposed framework Karen extends stochastic quasi-reaction networks by introducing an EKF and an RTS smoother.We have developed an E-M algorithm to infer the corresponding parameters.Simulation studies have shown the method's accuracy regarding inference of the true parameters, estimation of the first two smoothing moments of all the process states, and model selection using AIC.Simulation results indicated higher robustness of our proposed method compared to the state-of-the-art ones in terms of (i) low sampling frequency, (ii) limited clonal recapture, and (iii) high levels of measurement noise.Results from simulations suggest that our proposed framework scales to complex structures of cell differentiations in terms of nodes size and network depth.
Although the Gaussian assumption makes the analytical formulations of the likelihoods explicitly available, this approximation may become poor when the data contains outliers or shows non-Gaussian behaviours.This limitation can be overcome by using a distribution-free approach, such as the Kernel Kalman Rule, a recent nonparametric inference technique for high dimensional and possibly non-Gaussian nonlinear state-space models (Gebhardt et al. 2019).Besides, our framework considers reaction rates constant for the whole study period.Extensions that allow for modelling reaction rates as smooth functions of time or clinically relevant variables will be the goal of our future research.Our proposed method allowed to unveil the genotoxic impact on cell differentiation in tumour-prone mice.While the differentiation structure does not seem to be affected by the viral vector design, the transition probabilities from the multipotent progenitors to the intermediate progenitors do, showing a more pronounced unbalance towards the common myeloid progenitors under the SFV treatment compared to PGK.This can be biologically interpreted as a faster immune response to the higher inflammation caused by the toxic SFV treatment, compared to the nontoxic PGK one.Subsequently, the application of Karen to a rhesus macaque clonal tracking study unveiled for the lymphoid NK cells a different developmental pathway from the one detected for lymphoid T and B cells.That is, NK cells are produced by both common myeloid CMP and lymphoid CLP progenitors, whereas T and B cells are sustained only by the common lymphoid progenitors CLP.Results are consistent with those previously reported in Wu et al. (2014), where the authors demonstrated the presence of distinct subpopulations within the NK cells lineage, potentially deriving from alternative maturation processes.Finally, it is worth noting the agreement in the inferred

Figure 1 .
Figure 1.Analysis' flowchart: a clonal tracking dataset and the biochemical reactions (top-right) are the input of our framework Karen (left).It mainly consists of three parts: a filtering step, a maximum likelihood step, and a smoothing step.These steps are iterated until convergence is reached.The inferred cell differentiation network is returned (bottom-right).

Figure 2 .
Figure2.Graphical representation of cell differentiation networks proposed as candidate models in the in vivo studies (a-d).Grey and white nodes represent unobserved and observed cell types.Light-grey nodes are the cell lineages whose data were collected in all studies except for the genotoxicity one.Arrows represent cell duplication (green), cell death (red), and cell differentiation (blue).

Figure 3 .
Figure 3. Graphical representation of the cell differentiation networks used in the in-silico studies (a-d).Grey and white nodes represent unobserved and observed cell types.Arrows represent cell duplication (green), cell death (red), and cell differentiation (blue).

Figure 5 .
Figure 5. (a) For each true data generating process (row) and for each candidate model (column), the simulated process (empty dots), the synthetic data (full dots), the estimated smoothing moments (lines) of a single clone for each cell type (colours), and the median AIC across all simulations used to evaluate model misspecification.(b) Boxplots (y-axis) of the ratio between the estimated and true parameters for each reaction (x-axis) of the cell differentiation network of Fig. 3d used in the scalability study.(c) Estimated smoothing moments (lines), the true Markov states (empty dots), and the synthetic data (full dots) of one clone for each cell type (colours) for a single simulation of the cell differentiation network of Fig. 3d used in the scalability study.

Figure 6 .
Figure 6.Inferred cell differentiation networks having the lowest AIC, as defined by Equation (18), for the genotoxicity study for the comparison of the two viral vectors PGK and SFV, the rhesus macaque (RM) study, and the clinical trials WAS, b0bE and bSbS.Each arrow is weighted and coloured according to the transition probabilities estimated with Equation (15).

Table 1 .
For each in vivo clonal tracking dataset analysed (rows) and the candidate models a-d (columns) the AIC computed according to Equation (18).Scalable inference of cell differentiation networks can be found in Hacein-Bey Abina et al. (2015), Ribeil et al. (2017), and Thompson et al. (2018).Since data were already normalized to compensate for unbalanced sampling in VCN and DNA (Sherman et al. 2017), we did not apply any further transformation.The total clones that were collected are 156 654, 17 273, and 230 408, respectively, for WAS, bS/bS, and b0/bE clinical trials.