Cell lineage inference from SNP and scRNA-Seq data

Abstract Several recent studies focus on the inference of developmental and response trajectories from single cell RNA-Seq (scRNA-Seq) data. A number of computational methods, often referred to as pseudo-time ordering, have been developed for this task. Recently, CRISPR has also been used to reconstruct lineage trees by inserting random mutations. However, both approaches suffer from drawbacks that limit their use. Here, we develop a method to detect significant, cell type specific, sequence mutations from scRNA-Seq data. We show that only a few mutations are enough for reconstructing good branching models. Integrating these mutations with expression data further improves the accuracy of the reconstructed models. As we show, the majority of mutations we identify are likely RNA editing events indicating that such information can be used to distinguish cell types.


INTRODUCTION
Several recent methods have been developed to infer psuedo-time and branching trajectories from time series single-cell RNA-seq (scRNA-Seq) data (1)(2)(3)(4)(5)(6)(7)(8)(9)(10). These methods rely on the assumption that cells that are in a similar state (developmental time, fate etc.) are also close in expression space. Based on these assumptions psuedotime methods construct models based on minimum spanning trees (MST) (2,9), clustering (4) or other graphical models (1,5,11) to connect cells that share the same state and identify branching events that lead to different cell fates. Such methods have been successfully applied to study several developmental and response processes including lung (7,10), neuron (7,10), myeloid (3,5), heart (12) and liver development (13), various treatment responses (14), aging (15) and more. While scRNA-Seq expression information is useful, it is also very noisy. Further, some recent studies indicate that a small subset of the genes, which are sometimes expressed at very low levels and so do not sig-nificantly impact overall expression similarities, can have a large impact on changing cell states (16). Indeed, in several cases the relationships identified by the pseudo-time ordering methods do not accurately capture known biological trajectories (10).
In addition to methods that rely on expression data, several genetic based lineage tracing methods have been developed over the last two decades, though these have not been combined with scRNA-Seq analysis (17). Recently, a number of methods that combine scRNA-Seq with Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) technology for lineage tracing were developed (18,19). These methods are based on the insertion of random mutations during cell division to a pre-determined RNA. Once RNA is sequenced the set of random mutations can be traced backwards to construct a phylogenetic tree which can then be used to assign cells to branches and fates. Such CRISPR based lineage tracing methods have been recently applied to study zebrafish development (18) and to study the lineages in mouse embryonic cells (20). Results indicate that for short durations (until the RNA mutations saturate) such method can indeed lead to good results when attempting to infer cell branching.
While CRISPR based methods are useful, they are limited in several ways (17,21). First, it is not clear how such method would be applied to higher organisms in vivo, especially when studying diseases and responses in humans. Second, the method requires genetic interventions which may alter wild-type behavior. Finally, the method to date is limited to short durations (given the length of the sequenced region) and so may not be appropriate for all studies.
An alternative to using CRISPR is to rely on de novo mutations. These have several advantages since they do not require any engineering, are not restricted in time and can be used for all species. The major challenge for using such an approach is the fact that such mutations are rare. For example, the mutation rate is ∼1.1e−8 per site per generation (22) in human, which is ∼35 mutations genome-wide per generation and so it is unlikely that many of them would be encoded in the coding regions that are profiled by scRNA-Seq. However, de novo mutations are only one reason why RNAs can differ between cells. Another reason is RNA edit-ing, which is a molecular process through which some cells make discrete changes to specific nucleotide sequences with an RNA molecular after it has been generated by RNA polymerase and was previously reported to involve in the cell differentiation process (23). Gommans et al. reported that RNA editing is a highly regulated process in higher organisms with editing levels specifically changing during development. They further claim that editing is often cell and tissue specific (24). Gagnizde et al. reported that A-to-I patterns reveal specific editing signatures distinguishing major cell types in the human brain, among which neuron and astrocytes constitute the most edited cell types (25). Combined, de-novo mutations and RNA editing can provide additional information that is not captured by the expression profiles themselves to aid in reconstructing the branching trajectories. To enable the use of such sequence information when reconstructing dynamic differentiation models from scRNA-Seq data we developed a new method for Trajectory inference Based on SNP information (TBSP) that identifies such mutations (we refer to them as SNPs though several are likely due to RNA editing as we discuss below). Once significant SNPs have been identified we use them to reconstruct a phylogenetic tree for the cells profiled. We show that the tree agrees quite well with known cell states for these cells even though we did not use the expression levels themselves to construct it. Next, we extend a previous method we developed to reconstruct dynamic models of cell differentiation so that it can utilize both expression and SNP data. As we show, the reconstructed models that utilize the SNP data further improve upon models generated by only using the expression level data indicating that SNPs provide information that is not captured by the expression levels themselves. We also discuss the biological meaning of the SNPs and argue that many of them are likely RNA editing events rather than de-novo mutations.

MATERIALS AND METHODS
TSBP starts by filtering potential SNPs in the scRNA-Seq data. Next cells in the input data are clustered using all significant SNPs. Starting from the initial clusters, we iterate (in an EM-like algorithm) between cell assignments to clusters and SNP selection until convergence or the maximal iterations. The final selected SNPs are used by TBSP to calculate distances between cells (based on the Hamming distance). Neighbor-joining is then used to construct trajectories for the cells based on the calculated distance matrix. Finally, TBSP combines SNP and expression data to provide a more comprehensive view of the developmental or progression trajectories. Figure 1 presents an overview of TBSP.

Detecting SNPs from single-cell RNA-seq data
We mapped all the scRNA-Seq reads to the reference genome using HISAT2 (26) with the default parameters. We next used the GATK variant-calling pipeline (27,28) to call all potential SNPs for each of the cells. The obtained SNPs were filtered by the VariantFiltration function included in the GATK pipeline with the recommended parameters (including QD < 2.0 which is cutoff recommended for obtaining significant variants). We further filter SNPs found in <10% of the cells (rare SNPs) or >80% of the cells (Universal SNPs). Rare SNPs are most likely false positives. On the other hand, Universal SNPs (which we term baseline SNPs) are uninformative and likely represent differences between the cell line or animal used for the experiment and the reference. We also tried other cutoffs such as 20% for Rare SNPs, and found the results to be very similar (Supporting Methods and Supporting Figure S1) In addition to baseline SNPs that appear in a large fraction of the cells, the method can also identify baseline SNPs in a fraction of the cells. This would happen if the gene in which this SNP resides is only expressed in a subset of the cells. Such SNPs are redundant with gene expression data and so do not provide any additional information. To remove these, we only use SNPs identified in regions where we have multiple aligned reads in most cells (>8 reads on average aligned in >80% of the cells). Since we find that most significant SNPs are only identified in a small fraction of cells (much smaller than 80%) such requirement means that identified SNPs represent real differences between the cells.

Identifying informative SNPs for trajectory inference
We first build a cell-SNP matrix (M) for all cells (denoted by C) and all SNPs after the initial filtering (denoted by X), where M(i, j) is a binary value, which tells whether SNP j is detected in cell i. As the initial set of SNPs X could possibly contain many false positive or non-informative SNPs, our objective here is to find the best subset of SNPs P from X, Nucleic Acids Research, 2019, Vol. 47,No. 10 e56 which can best distinguish cells of different sub-types.
We initialize the cell clusters (denoted by N) using Kmeans (29) on rows (cells) of the cell-SNP matrix M(i, j) where i ∈ C, j ∈ X . The number of clusters is determined using Silhouette score (30). To find the most discriminative SNPs among the full set, for each cluster I in the N, we search for the best SNP set P(I).
To find such SNP subset P = I∈N P(I), we use an EMlike algorithm that tries to infer the best SNPs for splitting the data into k groups: for each sub-population (I) (cluster) in the data, we identify the set of SNPs that best separates the cells in I from all other cells (C − I), where C represents all the cells in the data.
where N denotes all the sub-population (clusters) in the data and P(I) denotes the signature SNPs for I ∈ N. Since this is a challenging combinatorial problem we use a greedy algorithm to find a local optimal solution. See Supporting Methods for complete details. The initial P matrix might contain both false positive and non-informative SNPs. To refine the matrix and the clustering, we iterate using the identified SNP set. In each iteration, we re-cluster the cells and use the new clusters to re-identify SNPs. This is repeated until convergence or when the maximal iterations are reached. When the algorithm converges, we are left with a selected set of SNPs (P). Similar to all clustering methods this approach can be sensitive to the initial clustering result and so we include additional SNPs if they improve the Silhouette score. See Supporting methods for details.

Inferring the trajectory using identified SNPs
Using the selected SNPs we utilize a distance-based neighbor-joining algorithm (31) to construct an initial trajectory. Each cell is represented using a binary SNP vector The distance between two clusters (I i , I j ) is calculated as the average distance between every pair of cells in the two clusters . These cluster distances are used as the input to the Neighbor-joining algorithm to build the cell trajectory.

Integrating the identified SNPs with the expression-based trajectory inference
The SNPs we selected provide information that is complementary to the profiled scRNA-Seq expression data. We have thus next integrated SNP information with our previously developed method for reconstructing dynamic regulatory networks from scRNA-Seq data, scdiff (10,32). As the single-cell RNA-seq data is very noisy, it's not accurate to estimate the gene expression in each cluster (state) based on only the direct single-cell RNA-seq measurement. An additional source of information is needed to overcome the noisy nature of the scRNA-Seq data. In scdiff, we use the TF-gene regulatory networks as extra information, which impacts the state transition of the underlying Kalman Filter model. scdiff starts with building the initial tree-structured trajectories by clustering the cells and connecting the cell clusters. Next, scdiff iteratively refine the initial trajectories by integrating the extra TF-gene regulatory networks. It re-estimates the gene expression for each state (cluster) of the trajectory tree using Kalman Filter, which utilizes both the direct scRNA-Seq observation (emission model) and also the TF-gene regulatory information (transition model). With the re-estimated expression for each cluster, all the cells will be re-assigned, and the trajectories will be re-inferred based on the new cell assignments. Such process will be iterated until convergence or maximal iterations. The converged trajectories will be the final predictions together with predicted TFs, which are critical to the state transition in the Kalman Filter.
Here, We integrated the SNP information into scdiff. First, we build up the initial trajectories in the same way as scdiff. Next, we use not only the expression information (and the TF-gene regulatory information) as in scdiff but also the SNP information to refine the initial trajectories. We use the clusters from the initial trajectories to identify informative SNPs (though we do not iterate, just use the greedy heuristic on a fixed set of clusters) list(P) to help reassign cells to states. This new assignment combines the expression profile and SNP for each cell c i as follows: = argmax s P(s)P(c i |s) = argmax s P(s) * ps(c i |s) * pe(c i |s) = argmax s log P(s) + log ps(c i |s) + log pe(c i |s) where P(c i , s) represents the probability of c i in cluster s and pe(c i |s) denotes the conditional probability of c i in cluster s based on expression, which is calculated in the same way as in scdiff (10). See supporting methods for how we compute ps(ci|s), the conditional probability of cell c i in cluster s based on SNP information. The initial trajectories will be refined by the aforementioned cell assignments and the initial clusters will be also updated. Next, we re-identify the informative SNPs using the same method described above on the updated clusters. We iterate the above process until convergence or maximal iterations. The converged trajectories will be the final predictions. Here, we demonstrated the integration of the SNP information using the scdiff method. However, such SNP information can also be integrated into other existing single-cell expression based methods. TBSP provides a Cell-SNP matrix, which describes the signature SNP vector for each of the cells in the dataset and thus can be used to refine the cell assignments/trajectories in the existing expression-based methods.

Post analyses of the predicted SNPs
We used PAVIS (33) to annotate the genomic position (Exon, Intron, 3'UTR and so forth) of the predicted SNPs.
If the predicted SNP is located in the promoter (+5 kb), gene body or downstream (1 kb) of a gene, such gene will be regarded as the SNP target. We use PANTHER (34) GO enrichment tool to analyze the GO terms associated with the targets genes of the predicted SNPs.

Software availability
The TBSP software is freely available on GitHub at https: //github.com/phoenixding/tbsp with a detailed user manual and a test set.

Differentiation trajectories can be inferred based on SNP information
We first tested the reconstruction of temporal and spatial trajectories using only the SNP information. For this we used several different scRNA-Seq datasets including the 'Neuron' scRNA-Seq expression data which studies neuron reprogramming (35), the 'Liver' scRNA-Seq data which studies human liver bud development from pluripotency (13) in 2D culture and 3D liver buds (LB) and a 'Lung' scRNA-Seq expression dataset which profiles distal lung epithelium differentiation (36). The Neuron dataset has four time points and a total of 252 cells. The Liver dataset has 765 cells, which falls into four stages (iPSC → DE → HE, IH → MH, LB and others). The Lung dataset has three time points and a total of 152 cells. For all these datasets the original papers provide some information (based on known markers) about the expected trajectories or organizations, and these can be used to test the accuracy of the SNP based analysis. For the mouse 'Neuron' data (35), the SNP-based trajectories inferred by TBSP are consistent with current knowledge ( Figure 2). SNP data was informative enough to correctly cluster the cells (Supporting Table S1). Next, we looked at the trajectory inferred from these SNPs. As can be seen in Figure 2A (13). Similar to the mouse neuron data, cells are clustered well using only SNP information. As for the trajectory analysis, the original study (13) reported a bifurcation in 2D and 3D trajectories. In the 2D culture, the iPSC cells differentiate to mature hepatocyte-like (MH) cells, which are different from the liver bud (LB) and mesenchymal stem cell (MSC)-LB cells in their 3D differentiation counterparts. This is also the branching determined based on the SNP information.
For the human Liver differentiation data (13), the SNPbased trajectories also agree with marker based reconstruction ( Figure 2B). First, as with the mouse neuron data, cells are clustered well using only SNP information (Supporting Table S1). As for the trajectory analysis, the original study (13) reported a bifurcation in 2D and 3D trajectories. In the 2D culture, the iPSC cells differentiate to mature hepatocyte-like (MH) cells, which are different from the liver bud (LB) and mesenchymal stem cell (MSC)-LB cells in their 3D differentiation counterparts. This is also the branching determined based on the SNP information by TBSP. SNP based models mix IH and MH cells, which is inconsistent with prior knowledge that indicates that IH cells are progenitors of MH cells. However, such a mixture of MH and IH cells also be observed in the Monocle results (Supporting Figure S3), which indicates that the cell PAGE 5 OF 9 Nucleic Acids Research, 2019, Vol. 47, No. 10 e56 states of the IH and MH may be quite similar. Based on both TBSP and Monocle models, HE and a small fraction of early IH cells seem to serve as progenitors for the branching.
For the mouse 'Lung' differentiation data (36), the SNPbased predicted trajectories are also partially supported by the known model (Supplement Supporting Figure S4). The first time point (E14.5) is associated with a number of unique clusters (1, 3 and 4) residing at the beginning of the tree while more mature epithelial cells (mainly Bi-potential Progenitors (BP), Alveolar Type 2 and ciliated cells) are clustered together afterward, and the last to branch are Type 1 cells. On the other hand, the SNP-only model incorrectly assigns the E16.5 time point to a later branching location than its actual position in the process. Still, when the SNP information is combined with expression data, it still improves the accuracy of the reconstructed trajectories as we discuss below and in Supporting Results.
We also compared TBSP with Monocle2 (3). These comparisons demonstrated the advantages of our SNP-based methods as shown in Supporting Figure S3. For example, when analyzing the neuron reprogramming dataset, d2 induced cells are displayed on a separate branch in the Monocle 2.4.0 results while neuron cells are on the top branch. In contrast, the reconstructed model based on SNP data correctly connects them. Please refer to the Supporting Results for the complete details on all 3 datasets.

Unique SNPs are associated with the different clusters
The reconstruction above used 36, 55 and 33 SNPs for the Neuron, Liver and Lung data respectively. As can be seen in Figure 3, several of these are associated with one or only a few of the clusters in each model. For example, in the liver data, SNPs 31-38 are only enriched in Cluster 2 (Human umbilical vein endothelial cell (HUVEC) cells). SNPs 49-52 are only enriched in Cluster 0 (MH,immature hepatoblastlike (IH) cells) and SNPs 15-20 are only enriched in Cluster 4 (iPSC). Clusters of cells that are connected in development (for example, one is right after the other) usually share overlapping SNP patterns. An example of this is liver Cluster 4 (iPSC) and Cluster 5 (definitive endoderm (DE), hepatic endoderm (HE)). See Supporting Results for detailed discussion. We have also plotted the SNP distribution along the known trajectories(Supporting Figure S5). For the neuron data, we see several mutations that are only associated with the initial state. We also see some cell types (for example d2 intermediate and d2 induced or d5 earlyiN and Neuron) that share the same mutations while other types (for example, Fibroblasts) do not. This supports their use in unsupervised model reconstruction. For the liver data, we observe the main difference between MSC, LB cells and IH, MH cells. iPSC cells, also display a subset of unique SNPs. For the lung data, we observe a clear separation between progenitor cells E14 and terminal cells (AT1, AT2, Clara, and ciliated).

Predicted SNPs may represent RNA-editing changes
As noted in Methods, we removed baseline mutations that are only identified because some genes are only expressed in the subset of the cells. All of the mutations we identified overlap genes that are expressed in the majority of cells. Given the relatively low genomic mutation rate and the small number of cell divisions in the data we studied (less than 10), it is unlikely that most of the SNPs we identified represent de novo mutations. We have thus looked for alternative explanations for the significant SNPs we found. One such possibility is RNA editing, which is a molecular process through which some cells make discrete changes to specific nucleotide sequences with an RNA molecular after it has been generated by RNA polymerase and was previously reported to involve in the cell differentiation process (23).
The predicted SNPs are dominated by A/G ( A → G, G → A) and C/T (U) ( C → T, T → C) substitutions (Figure 4 A). Note that these substitutions are quite similar. The direction of the A/G substitution (A to G or G to A) depends on the reference and the A/G and C/T are essentially the same substitution at different strands. In the neuron data, A/G and C/T substitutions account for 34.4% and 51.7% respectively(combined total of 86.1%). For the Liver data, A/G accounts for 39.2% and C/T accounts for 41.2%, (80.4%) and in the lung data, A/G and C/T substitutions account for 81.4% of all predicted SNPs. A/G and C/T substitution dominance were also shown for RNA-editing (37).
In addition to the type of substitution, their locations also match. Predicted SNPs are found mostly in non-coding regions ( Figure 4B) especially in 3'UTR regions. For example, in the neuron data, 44.4% of the SNPs are found in the 3'UTR, 19.4% of the SNPs are found in the intronic regions, 2.8% upstream of the gene (5kb), 2.8% are found in the 5'UTR region and 5.6% are found in Exons, 25% are found in the other region including gene downstream(1kb) and intra-genic regions. See Supporting Results for the distribution in other datasets. This also agrees with the fact that most RNA editing sites are located in the non-coding region (37).
We found that predicted SNPs are located near the Alu elements, which is also observed for RNA-editing sites (38). We also looked at the intersection between SNPs identified by TBSP and previously identified RNA-editing sites from the RADAR database (39). For the human data (for which we have many more known sites compared to the mouse) we find that 3 of the 55 SNPs we identified for the liver data are found in RADAR (P-value: 2.4e−4). Note that current knowledge of RNA editing sites is still limited. Finally, we used an RNA-editing site prediction tool RED-MEL (40) to score each of the predicted liver SNPs. Twenty one out of 55 predicted SNPs are identified as RNA-editing sites by RED-MEL (P-value = 0). See Supporting Results for details.
To further investigate the origin of the SNPs identified, we have analyzed data from a study that jointly performed RNA and DNA seq analysis in single cells (41). We downloaded DNA-seq and RNA-seq for 112 cells from this study and used our method to identify mutations in both. Under the parameters used for other studies we analyzed, we obtained 7 DNA SNPs (all in non-coding regions) and 31 SNPs from RNA-seq. None of the 31 RNA-seq SNPs were found in the DNA-seq results. Therefore, these mutations are only identified at the RNA but not the DNA level supporting their likely assignments as RNA-editing events. Please see Supporting Table S2 for the list of SNPs identified.

GO terms associated with the predicted SNPs
We also looked at the function of genes for which we identified SNPs in each of the datasets (Methods). In the neuron data, we found 26 such genes associated with the 36 predicted SNPs. The most significant GO terms associated with these 26 genes are 'Regulation of protein depolymerization' (P-value = 1.32e−4, FDR = 1) and "Regulation of protein complex disassembly" (P-value = 1.98e−4, FDR = 1), which are consistent with the potential protein degradation related functions of RNA-editing previously reported by study (42). In the Liver data, we found 42 genes associated with the 55 predicted SNPs. These 42 genes are enriched with 'protein targeting (P -value = 4.51e−9, FDR = 7.05e−5)' and 'contranslational protein targeting to membrane (P-value = 2.02e−08, FDR = 1.06e−4)', which is also supported by (42). For the Lung data, we found 24 genes associated with 33 predicted SNPs. The top GO

SNP information further improved expression based pseudo time reconstructing methods
We next used the SNP data to improve the reconstruction of expression based trajectory inference methods. For this, we extended scdiff, a method we have previously developed to reconstruct trajectories from scRNA-Seq data (10). Briefly, scdiff is a probabilistic method that integrates expression data with TF-gene interaction data to learn a branching model and assign cells to states. We used the SNP data to further improve cell assignment and state inference (Methods). Results of the combined model are shown in Figure 5. As can be seen, for the Neuron data, the SNP based model leads to trajectories that are more consistent with prior knowledge (35 To further study if SNP information is complimentary to expression data we combined the two and used the combined dataset as input for another trajectory inference method, Monocle (version 2.4.0) (3). For each cell we concatenated the SNP vector for that cell to the expression values to create a new input (Supporting Methods). We next ran Monocle on the SNP plus expression input to reconstruct trajectories. Results indicate that SNP data improves the trajectories reconstructed by Monocle. The major difference between the expression only and expression plus SNP trajectories is the position of d2 induced cells. In the expression plus SNP model, d2 induced cells are predicted to be progenitors (including for Neuron cells), consistent with their known role. In contrast, when only using expression data as input for Monocle, d2 induced cells are located at a separate branch without any descendants. See Supporting Results and Supporting Figures S8, S9 for the complete details.

TBSP scalability
While all three datasets discussed above studied dynamic processes, they have only profiled hundreds of cells each (though with relatively high overage). This number is quite low compared to more recent studies that usually profile thousands of cells. To test the scalability of TBSP, we have also used it to analyze a larger dataset which profiled close to 4000 Hematopoietic stem/progenitor cells (HSPCs) from mice bone marrow (43). While the data was collected over two days in order to study cell differentiation, all HSPCs were pooled before profiling and so unlike the longitudinal datasets discussed above, no time information is available for cells in this dataset. Still, TBSP was able to identify several significant SNPs for this data and these were used to cluster the cells and derive trajectories. Results are presented in Supporting Figure S10. Even though it does not use the expression data itself, TBSP derived trajectories agree well with the observation in the original paper of near-continuous differentiation process, from Haematopoietic stem cell/multipotent progenitor to Haematopoietic progenitor cell. Both the differentiation direction and the continuous process are captured by the SNP-based model, in which the trajectory starts from Cluster 0 (dominated by Haematopoietic stem cell), ends at Cluster 1 (which is heavily dominated by Haematopoietic progenitor cells) and the percentage of Haematopoietic progenitor cells in the cluster is increasing along the trajectories. See Supporting Results for complete details. We also include information on runtime and computing resources for different number of cells (we simulated up to 30 000 cells with 1.5M reads on average for each cell). Please refer to the Supporting Figure S11 for the runtime and memory requirements.

DISCUSSION
Existing methods for the analysis of time series scRNA-Seq data mostly utilize the expression levels for each cell. Such methods reconstruct developmental or response trajectories based on similarities between their expression (often in reduced dimensional space). While these methods have been successfully applied, it is also clear that in many cases that cannot accurately capture the dynamic process that they are modeling due to noise, dropouts and the impact of low expressed genes.
Here, we presented a complementary approach, TBSP, which utilizes what, until now, was a discarded part of the data: The errors in the sequenced RNAs. While some of these may indeed be just that (errors), others, especially those that pass stringent filtering criteria and that are identified in multiple cells, are likely to be true differences. As we show, by using the identified SNPs we can construct reasonable trajectories assigning cells to different branches even without using the expression level information. We applied TBSP to four different datasets ranging in size from less than 200 to close to 4000 cells. As we show, in all cases the method was able to obtain good clusters and trajectories when only using SNP data. When combined with expression data to resulting models are even better and improve upon expression only models.
Some of the SNPs we identified may represent de-novo mutations inserted during cell division. However, it is unlikely that the majority are indeed such mutations given the small number of expected de-novo mutations in coding regions for the data that we studied as we estimated in the introduction. Instead, we argue that many of these likely represent RNA-editing events. Several lines of evidence support this claim. First, the type of mutation we observed, A/G or C/T substitutions, is consistent with RNA-editing sites are also mostly substitutions (44). Second, the locations of these mutations are enriched near the Alu elements, which has been also reported for RNAediting sites (45). Third, the identified SNPs significantly overlap known RNA-editing sites reported in RADAR database (46). Fourth, many reside in genes that are known to be associated with processes that are regulated by RNAediting such as protein degradation (42).
While SNP only models provide useful information about cell fate and branching, their performance is limited. Since the method depends on the mutation changes it often requires longer time scales than expression changes as we observed for the lung data. However, when SNP information is combined with gene expression data, the resulting models can improve upon the expression only models. For the lung data, the AT2 dominated cluster is the sibling node of AT1 dominated cluster when SNP information is combined with expression whereas the AT2 dominated cluster is the descendant of the AT1 dominated cluster in the expression only model (36).
While we believe that the integration of SNP and expression information would be useful for many studies, we note that it may not be a viable option in some cases. For example, when sampling rates are very short it is unlikely that many SNPs would be identified by our method TBSP even if large changes in expression occurs. Low coverage would also impact the accuracy of the method. In addition, many unique molecular identifier (UMI) technologies sacrifice the full-length coverage to sequence part of the primer used for cDNA generation (47,48), which would reduce the ability to detect SNP. Still, several existing and new datasets are sequencing full-length cDNA, and these can benefit from the method we presented.
Software implementing TBSP is freely available at GitHub (https://github.com/phoenixding/tbsp). As the number and types of biological processes that are studied using scRNA-Seq data increases, methods that can accurately infer developmental and response trajectories become an important part of the analysis and modeling process. We hope that TBSP, which aims at better uti-lization of existing data, would aid researchers seeking to analyze such time series scRNA-Seq data.