Machine learning based lineage tree reconstruction improved with knowledge of higher level relationships between cells and genomic barcodes

Abstract Tracking cells as they divide and progress through differentiation is a fundamental step in understanding many biological processes, such as the development of organisms and progression of diseases. In this study, we investigate a machine learning approach to reconstruct lineage trees in experimental systems based on mutating synthetic genomic barcodes. We refine previously proposed methodology by embedding information of higher level relationships between cells and single-cell barcode values into a feature space. We test performance of the algorithm on shallow trees (up to 100 cells) and deep trees (up to 10 000 cells). Our proposed algorithm can improve tree reconstruction accuracy in comparison to reconstructions based on a maximum parsimony method, but this comes at a higher computational time requirement.


INTRODUCTION
Single-cell lineage tracing, reconstructing the relationship between indi vidual di viding cells in a tissue or organism, has the potential to improve our understanding of many biological processes, including the major transitions in evolution ( 1 ), de v elopment from the founding zygote to a complex organism ( 2 ), stem cell properties in tissue ( 3 ), cancer cell differentiation ( 4 ) or metastasis ( 5 ), and pathways of tumour evolution ( 6 ).
The first full de v elopment lineage tree was traced for the embryonic cells of the nematode Caenorhabditis elegans by sketching the e v ents of cell division and development histories observed directly through light microscopy ( 7 ). Recently, platforms based on cell barcoding were proposed for tracking the lineage of individual cells at a high resolution. An integrase-based synthetic barcode system, int-MEMOIR, uses the serine integrase Bxb1 to perform ir-re v ersib le random modification of DNA recording arrays that can be read out using fluorescence in situ hybridization imaging ( 8 ). Fluorescent reporter assays allow the rapid characterization of these recording units ( 9 ). This experimental system is coupled with a time-lapse movie of the cells as they divide to provide a ground truth lineage tree ( 8 ). Another technique, substitution mutation-aided lineage tracing (SMALT) system, used a 3-kb readout sequence with 16 iSceI binding motifs to map single-cell resolution cell phylogenies during organ de v elopment ( 10 ). Additional strategies to achie v e lineage tracing at single-cell resolution hav e been de v eloped in the past few y ears: integr ation barcodes (designed as short DNA fragments placed in an expressed locus) and polylox barcodes (comprising a DNA cassette with multiple loxP sites in alternating orientations) ( 11 ). These, and future, technical advancements have to be matched by progress in relevant computational methods ( 12 ).
Various statistical methods for reconstructing gene trees and species trees ( 13 , 14 ) have been developed over the last few decades, but these methods have not been widely used on time and individual-cell resolved da tasets. Muta tions induced by cell barcoding methods ar e irr e v ersib le, which is different from the somatic m utations accum ulated during mitotic cell division ( 8 ). Previously, we have developed a machine learning (ML) approach for cell lineage reconstruction, with results submitted to the Allen Institute Cell Lineage Reconstruction DREAM Challenge ( 15 ). This was the best-performing algorithm for the reconstruction of in vitro cell lineages of trees with < 100 cells. It reached higher accuracy than other methods, such as distancebased method DCLEAR ( 16 ), maximum parsimony based method Cassiopeia-integer linear programming ( 17 ) and Cassiopeia-Greedy ( 17 ). Our proposed frame wor k was based on embedding single-cell data into a feature space in order to train an ML algorithm to predict the probability that cells are siblings.
In this study, we expanded this method by training an ML model to predict higher le v el relationships between cells. We refined our methodology to be applied to any number of possible edit states. To evaluate the performance of the algorithm, we used two sets of in silico data: (i) shallow trees (up to 100 cells), simulated using parametrization based on in vitro dataset for mouse embryonic stem cells ( 15 ); and (ii) deep trees (up to 10 000 cells), simulated using parametrization based on in vitro dataset for fly organ de v elopment ( 10 ). Further, we reconstructed trees of in vitro dataset for mouse embryonic stem cells ( 15 ). We measured the accuracy of lineage reconstruction using the normalized Robinson-Foulds (RF) scor e. For tr ees with up to 100 cells we additionally calculating four metrics: normalised Robinson Foulds score (RF), triplet score (TRP), quartet score (QRT), and clustering information score (CLI). We compared our reconstruction with a maximum parsimony method. We found that our proposed algorithm has an advantage over the maximum parsimony method in terms of reconstruction quality for both shallow trees and deep trees.

The in vitro datasets
Mouse embr y onic stem cell colonies. The data were obtained as part of the Allen Institute Cell Lineage Reconstruction Challenge through Synapse ID syn20692755 ( 15 ). In the experiment, the recording array (barcode) consisted of L = 10 recording units ( 8 ). Each recording unit was in one of three states: ground state (r epr esented as '1'), a deletion (r epr esented as '0') or an inversion (r epr esented as '2') of the DNA sequence. Each colony was started from an individual cell and colony growth was observed for 48 h. The experimental data comprise (i) an array of intMEMOIR readouts as a text file (also called matrix) and (ii) the ground truth lineage for the colony. Tree-like data structures were provided as a Newick file ( 18 ). The ground truth cell lineage tr ees wer e obtained fr om video-micr oscopy data ( 8 ).
Fly organ development. A SMALT system used a barcode with 16 iSceI binding motifs present with equal distance throughout the sequence ( 10 ). The recording array consisted of L = 2943 recording units. Each recording unit was in one of two sta tes: ground sta te (r epr esented as '0') or muta ted sta te (r epr esented as '1'). Binary sequence data wer e available for two specimens: one had 5002 cells and other 5420 cells. The datasets correspond to fly organ de v elopment from embryo to late-third instar larvae ( 10 ).

The in silico datasets
Stochastic simulation of cell division and barcode editing. Accumulation of stochastic mutations during cell division was sim ulated similarl y to ( 8 , 19 ). Each cell colony starts from an individual cell carrying an unedited array with length of L units. The model assumes that cells divide synchronously and at a constant rate. The initial cell then undergoes a series of cell divisions (Figure 1 A). After d divisions, the colony consists of N cells, where N = 2 d .
After e v ery cell di vision, each unedited target can mutate with a gi v en probability μ to one of se v eral possib le edited states s ∈ S . The state is chosen according to a probability ␣ s . The process of recording is irre v ersib le, and once a recording unit is edited, it can no longer change (Figure 1 B).
The probability that a recording unit has not been edited after d divisions is equal to P not edited ( d, μ, α) = ( 1 − μ) d . The ground truth lineage could provide necessary information on what is an expected lower number of divisions each cell has undergone in a colon y, f or example by counting the number of ancestors for phylogenetic nodes.
The probability of an edited unit going to state s is denoted by α s , such that | S| s= 1 = 1 , where S is a set of all possible edits and | S | denotes the size of set S . In the case of the mouse embryonic stem cell colonies dataset, | S | = 2 and S = { 0, 2 } , and in the case of the fly organ development dataset, | S | = 1 and S = { 1 } . The probability that a recording unit is in state s at division d , given that it was in the unedited sta te a t division d − 1, is equal to P edited ( d, μ, α) = ( 1 − μ) d−1 μα s . As the state values of lineage tree internal nodes are not available, we do not know which division incurred the editing. Therefore, the probability of observing the recording unit in state s after d gen- where C is the total number of colonies, n c is the number of cells in a colony c , d c , i is the number of divisions for cell i in colony c , o c , i , l is the observed state for unit l in cell i and colony c , and o c,i,l = { not edited , edited } . For parameter estimation, we used the R package AMISEpi, which has an implementation of adapti v e multiple importance sampling for Bayesian analysis ( 20 ). We set the priors to be uniformly distributed: μ ∼ U [0, 1] and α i ∼ U [0, 1].

Simulation of shallow trees (up to 100 cells).
We used par ameters (mutation r ate and probability of mutations) estimated from the in vitro mouse embryonic stem cell dataset ( 15 ). We used data on all colonies for parameter estimation.
Simulation of deep trees (up to 10 000 cells). We used parameters (mutation rate) estimated from the in vitro fly organ de v elopment dataset ( 10 ). We have used both datasets to estimate the mutation rate per division / target. We assumed that there were 13 divisions, which was closest to the number of cells available for dataset (2 13 = 8192).

Lineage tree reconstruction using the ML approach (AMbeRland-TR)
Feature engineering. For illustration purpose, we assume that the barcode units can be three possible states: ground state, a deletion or an inversion (Figure 1   of the e xtended recor d set { not mutated , S} as predictors for the ML model. This should produce (| S | + 1)! predictors.
We have used the R package phangorn ( 21 ) to extract information on relatedness between cells from the ground truth cell lineage trees. For each le v el of the tree, this produced two lists of cell pairs for each colony: cells that share an ancestor at le v el t and cells that do not share an ancestor at le v el t . Two e xamples of embedding single-cell data into a feature space for ML training are shown in Figure  1 D. The upper row corresponds to le v el 1 (sibling cells) and the lo wer ro w corresponds to le v el 2 (cousin cells). State indicates whether relationship is true or false.
ML tr aining, pr ediction and interpretation. We used a gradient boosting machine (GBM) to implement the outlined ML approach. All calculations were performed in R using package gbm ( 22 ). The following options were used to train the GBM model: distribution = 'bernoulli'; n.trees = 1000; interaction.depth = 10; n.minobsinnode = 5; cv.folds = 5; and tr ain.fr action = 0.5.
We use the relati v e importance of features, partial dependence plots and individual conditional expectation plots for model interpreta tion. Rela ti v e importance is based on the number of times a predictor is selected when training the model ( 23 ). Higher values of relati v e importance indicate larger influence on the response. Individual conditional expectation curves visualize the partial relationship between the pr edicted r esponse and a featur e for individual datasets ( 24 ). Partial dependence plots show the relationship averaged over all observations, which makes it easier to extract expected trends ( 24 ).
Clustering. We applied a custom hierarchical clustering method for building a cell lineage tree from predicted probabilities. Clustering begins at the lowest tree le v el, where all clusters contain an individual cell. Each possible cell pair is then ranked according to the predicted probability that they share an ancestor at this le v el. At consecuti v ely increasing le v els, pairwise comparison is performed between each lower le v el cluster, where the calculated probability is the maximum between any elements of the two clusters. Cluster pairs ar e order ed again according to this probability and are assumed to have the same parent node if its value is above the estimated threshold for this level. This process is repeated until one or two clusters are left. We assume only binary trees.

Lineage tree reconstruction using a maximum parsimony tree reconstruction method
Maximum parsimony based methods try to find the minimum number of changes necessary to describe the data for a gi v en tree. We perf ormed a maximum parsimon y reconstruction using the R package phangorn ( 21 ). Initial tree was r equir ed to start the maximum parsimony tr ee sear ch. This was done using the Hamming distance between barcodes [package DescTools ( 25 )] and unweighted pair group method with arithmetic mean clustering ( 26 ). We set the method as 'fitch' and minimum number of iterations in the ratchet as 100.

Scores
We assessed the accuracy of lineage tr ee r econstruction using four metrics: normalized RF score, TRP score, QRT score and CLI score. All scores have values between 0 and 1, with smaller values indicating larger similarity between ground truth lineage tree and reconstructed lineage tree.
Normalized RF score. The RF distance counts the number of splits that are unique to one of the two trees ( 27 ). An RF distance of 0 indica tes tha t all splits in both trees are the same. We used the RF.dist function from the R package phangorn to compute the normalized RF score ( 21 ).
TRP score. The TRP distance counts the number of subtrees of three taxa that are different in the two trees ( 28 ). The TRP distance was calculated using the tqDist algorithm ( 29 ) implemented in the R package Quartet ( 30 ). The TRP score was calculated by dividing the TRP distance by the total number of triplets shared between the two trees, i.e. N 3 ( 29 ).
QRT score. The QRT distance enumerates all subsets of leaves of size 4 and counts how often the topologies induced by the four leaves agree in the two trees ( 31 ). QRT di v ergence was calculated using the tqDist algorithm ( 32 , 33 ) implemented in the R package Quartet ( 30 ). The QRT score was calculated as one minus QRT di v ergence.
CLI score. CLI distance is a generalized RF metric based on the information content of the largest split ( 34 ). To compute the CLI score, we use the function ClusteringInfoDistance with the option normalize = TRUE from the R package TreeDist ( 35 ).

Set-up for lineage tree reconstruction
In silico shallow trees. To determine the effect of varying the number of possible states and the number of recording units on the accuracy of lineage reconstruction, we performed simulations with the number of states varied between 2 and 6 and a recorder carrying either 10 or 20 units. For each configuration, we simulated 1000 lineage trees that were used for ML training, and an additional 100 lineage tr ees that wer e used for the testing of lineage reconstruction methods. For each simulation, we sampled the depth of lineage from the empirical distribution associated with the mouse embryonic stem cell dataset and editing rate from the fitted posterior distribution.
Mouse embr y onic stem cell colonies. We used the same partition of the data as in ( 15 ), i.e. array readout data from 76 colonies along with the corresponding ground truth lineages as the training set and array readout data from 30 cell colonies as a testing set for accuracy evaluation. In silico deep tr ees. We simula ted division and target editing for 10 and 12 divisions, which produced colonies with 1024 and 4096 cells, respecti v ely. For trees with 1024 cells, we trained an ML model using data obtained from 10 individual trees, but a single simulated tree was used to train the ML model for the case with 4096 cells.

Comparison of tree reconstruction accuracy scores
First, we investigated performance of scores under conditions where a lineage tree is reconstructed erroneously. Figure 2 shows an example how scores compare for a tree with eight cells and few possible reconstructions. In case (i), cells A and C (which are cousins) have been assigned incorrectly. This resulted in the normalized RF score equal to 0.4, but had a small effect on the TRP score (0.07). In case (ii), we assign cells A and E incorrectl y, w hich incr eases all scor es, including the TRP score, to have values between 0.53 and 0.6. If two pairs of cells were assigned incorrectly [case (iii)], the normalized RF score is reduced by a factor of 3 and the QRT score by a factor of 2, but the value of the TRP score stayed the same as in case (ii). Next, we assumed that the number of divisions was estimated incorrectly for a pair of cells (

Estimating mutation rates
The mouse embr y onic stem cell dataset. The resulting colonies had from 4 to 39 cells; in total, ther e wer e 1453 individual cells in the dataset. The distribution of the number of divisions in the ground truth lineage trees is shown in Figure 3 A. Posterior distributions for fitted editing rate and the probability that an edited unit has a state '2' are shown in Figure 3 B and C, respecti v ely. We estimated the mean of marginal posterior distributions to be μ = 0.15 and α = 0.48. For simulation of cell division and record editing, we assume that all states have the same probability, i.e. α s = 1 / | S |.
The fly organ development dataset. Figure 3 D shows the posterior distribution of muta tion ra te. We estima ted the mean of marginal posterior distributions to be 0.0005. This is in agreement with ( 10 ), where it was concluded that ∼0.8-1.3 mutations were recorded on the readout sequence per cell generation.

Lineage reconstruction of in silico dataset: shallow trees
Results of simulations are shown in Figure 4 . Our MLbased algorithm outperformed the maximum parsimony based method in all four metrics. Comparing mean performance, we found a 43-50% improvement in the normalized RF score, a 19-32% improvement in the QRT score and a 36-45% improvement in the CLI score.
We found that increasing the number of target units had a stronger effect on lineage reconstruction accuracy than increasing the number of states. For a recording array using 20 targets instead of 10 targets, there was a 62% improvement in performance for the QRT score, a 56% improvement in performance for both the normalized RF score and CLI score, and a 28% improvement in perf ormance f or the TRP score. When the number of editing states was increased from 2 to 5, there was a 17% improvement in the normalized RF score for a recording array using 10 targets. There was a 68% improvement in the normalized RF score for a recording array using 20 targets. There was a 30% improvement in the QRT score for a recording array using 10 targets and a 49% improvement in the QRT score for a recording array using 20 targets.

Lineage tree reconstruction of the mouse embryonic stem cell dataset
A reconstruction was computed from the test dataset consisting of 30 cell colonies using only the intMEMOIR array readout ( 8 ). Figure 5 A shows a pairwise comparison between the two methods. Most of the points are above the diagonal, indicating that our ML-based algorithm outperformed the maximum parsimony based method for all four metrics. Comparing mean performance, we found a 17-23% improvement in the normalized RF score, a 7-37% improvement in the TRP score, a 7-11% improvement in the QRT score and a 12-15% improvement in the CLI score (Figure 5 B).
Partial dependence plots and individual conditional expectation plots can be used to analyse the relationship between features and the response. It can be seen from partial dependence plots (red line) in Figure 6 (upper row) that the probability of cells being siblings decreases when the value of feature 'F3' increases up to 3. This has a biological meaning: the number of pairwise bar codes wher e one cell stayed in the ground state and the other had undergone editing cannot be high if the cells are siblings, and the highest probability is when ther e ar e no such barcodes. The same relationship is true for the feature 'F5': the probability of cells being siblings decreases with increasing numbers of pairwise bar codes wher e one cell had undergone deletion and the other inversion. There is a linear relationship between the probability of cells being siblings and the number of pairwise barcodes where both cells have undergone deletions (feature 'F4'). These dependences for 'F3' and 'F4' become weaker when cells get further apart on the lineage tree (lo wer ro ws). The conditional expectation plots (grey lines) demonstra te tha t a t the indi vidual le v el, ther e ar e mor e complex relationships between features.

Lineage reconstruction of in silico dataset: deep trees
The performance of the AMbeRland-TR algorithm was consistent between reconstruction of trees with 1024 cells and 4096 cells, and in both cases outperformed the reconstruction by the maximum parsimony method (Figure 7 A). On average, there was a 50% improvement in performance for the normalized RF score in comparison to the maximum parsimony method. Improvement in reconstruction quality comes at higher computational time r equir ements. It takes 54 min on average to reconstruct a tree with 4096 tips using the AMbeRland-TR algorithm, which is two times longer than the time r equir ed for the maximum parsimony method (Figure 7 B).

Computational r equir ements
The computational cost of the AMbeRland-TR algorithm has the following components: extracting features from barcode data; extracting relationships between cells from the ground truth trees; training the ML model for each tree le v el; and r econstructing tr ees for testing data. We have combined all procedure into two tasks: (i) training data preparation and ML model training; and (ii) testing data preparation and tree reconstruction. For the purpose of evaluating time r equir ements, we assume that training and testing data have only a single cell colony. For deep trees, as shown in the previous section, it is enough to train an ML model on single tree. For shallow trees, the training dataset should contain enough trees to accommodate variate of barcode combination, but this should not be a burden as training data preparation for shallow trees is computationally fast. Overall, time r equir ed for training and testing tasks was similar for a range of number of cells we investigate ( Figure 8 ). For example, for a tree with 10 000 cells, it w ould tak e ∼24 h to train the models and 24 h for reconstructing a tree. All calculations were performed on a MacBook Pro with 2.4-GHz 8-core processor.  Each bar summarizes the score of 100 lineage tree reconstruction tests. We denote our ML-based approach as 'AMbeRland-TR' and lineage reconstruction using the maximum parsimony approach as 'MaxParsymony'. Accuracy metrics are normalized RF score, TRP score, QRT score and CLI score.

DISCUSSION
Machine learning is becoming an important tool that has considerable potential in biology ( 36 ), genetics and genomics, including the annotation of sequence elements and epigenetic, proteomic and metabolomic data ( 37 ), or solving problems arising in population and evolutionary genetics ( 38 ). So far, it has been underutilized for lineage tree re-construction. In this study, we introduced the frame wor k for constructing features and training an ML algorithm for experimental systems based on mutating synthetic genomic bar codes. We explor ed how the barcoding configuration influenced the performance of the algorithm by using a collection of simulated and biological datasets. There is no uni v ersal method to quantify topological similarities between lineage trees ( 39 ). The most widely used  score is the RF score, but it is very insensitive to discrepancies at higher le v els of lineage tree structures [ Figure 2 (iv)]. Assigning the correct pathway cells undergo during organism or cancer de v elopment is necessary in order to understand tissue-type dif ferentia tion. The QRT score systematically sho wed lo wer values indicating higher similarity between ground truth and reconstruction, e v en when reconstructed lineage trees lacked any structure [ Figure  2 (vi)]. Ther efor e, calcula ting a combina tion of accuracy scores is necessary in assessing the quality of lineage tree reconstruction.
Our in silico lineage reconstruction experiments showed that the ML-based approach is able to take advantage of high complexity of relationships between processes   (Figure 4 ). It has also showed that when engineering the barcode system, it is more desirable to increase the number of recording units on barcode arrays than to increase the number of possible mutation states.
Howe v er, increasing the number of edits from two states (i.e. single possible mutation) to three states (two possible mutations) could improve lineage reconstruction accuracy by at least 50% on average. The two in vitro datasets r epr esent differ ent lineage tracing configurations: number of states (3 versus 2) and number of recording units (10 versus 2943). The ML approach performed better than the Hamming distance approach on both datasets. The difference was not as striking as for the simula ted da ta. Possib le e xplanations include the following: much smaller training dataset, larger heterogeneity between colonies or the presence of noise from experimental readouts.
From the model fitted to the mouse embryonic stem cell dataset, we can get insight into the functional relationship between e xperimentally observ ed barcode values and cell relatedness. For all le v els of lineage tree hierarchy, we found that the features had a relati v e influence in the range from 11% to 24%; i.e. none of the six predictors had zero influence (Figure 6 A). Howe v er, the ranking of features varied with the relationship le v el. When predicting whether cells are siblings (le v el 1), the highest relati v e influence was the number of pairwise barcodes where one cell was in the ground state and the other was inverted. This feature had lower influence for higher relatedness le v els.
Our lineage reconstruction of the intMEMOIR dataset submitted to the Allen Institute Cell Lineage Reconstruction DREAM Challenge was the best-performing algorithm (mean RF score of 0.53 and TRP score of 0.52) ( 15 ). It was the best ranking method when benchmarked against a Bayesian phylogenetic frame wor k ( 40 ). By training the ML model to predict higher le v el relationships between cells, we have been able to further improve performance. We achie v ed a mean RF score of 0.31 and a mean TRP score of 0.41 ( Figure 5 ).
The area of lineage tracing is expanding fast, with new tools being de v eloped at the e xperimental and computational le v els. Jointl y profiling DN A methyla tion, chroma tin accessibility, gene expression and lineage information in single cells was made possible by de v eloping an inducib le lineage tracing mouse model with extremely large lineage barcode di v ersity ( 41 ). A computational pipeline allowing to predict cell lineages over several cell divisions solely from transcriptomic data alone was devised by le v eraging genes displaying conserved expression levels over cell divisions ( 42 ). Another important direction is to understand cell fate transitions during de v elopment. An internal cellular clock could be r ecover ed by integrating singlecell transcriptomics with lineage tracing ( 43 , 44 ). The ML approach has the potential to integrate barcode recordings with additional information, such as population dynamic parameters ( 40 ), single-cell gene expression ( 45 ), pr oteomics ( 46 ), micr osa tellite muta tions ( 47 ) or clonal correlations ( 48 ).
A key limitation of our proposed approach is that it requires the ground truth dataset, i.e. recorded barcodes and tracked lineage histories. Except in the mouse embryonic stem cell experiment ( 15 ), such data are not available. One solution would be to train ML on simulated data. Having a global database with data on different lineage tracing configurations and results could be a starting point for accumulating knowledge as r equir ed for simulations. Under time and budget restrictions, having ground truth data on a single lineage tree makes it possible to train ML models, as our analysis on deep tr ees indicates. Futur e work should explore these possibilities and evaluate how to practically go about the process of training and improving ML models for reconstructing whole organ or e v en whole body lineage trees.