SCODE: an efficient regulatory network inference algorithm from single-cell RNA-Seq during differentiation

Abstract Motivation The analysis of RNA-Seq data from individual differentiating cells enables us to reconstruct the differentiation process and the degree of differentiation (in pseudo-time) of each cell. Such analyses can reveal detailed expression dynamics and functional relationships for differentiation. To further elucidate differentiation processes, more insight into gene regulatory networks is required. The pseudo-time can be regarded as time information and, therefore, single-cell RNA-Seq data are time-course data with high time resolution. Although time-course data are useful for inferring networks, conventional inference algorithms for such data suffer from high time complexity when the number of samples and genes is large. Therefore, a novel algorithm is necessary to infer networks from single-cell RNA-Seq during differentiation. Results In this study, we developed the novel and efficient algorithm SCODE to infer regulatory networks, based on ordinary differential equations. We applied SCODE to three single-cell RNA-Seq datasets and confirmed that SCODE can reconstruct observed expression dynamics. We evaluated SCODE by comparing its inferred networks with use of a DNaseI-footprint based network. The performance of SCODE was best for two of the datasets and nearly best for the remaining dataset. We also compared the runtimes and showed that the runtimes for SCODE are significantly shorter than for alternatives. Thus, our algorithm provides a promising approach for further single-cell differentiation analyses. Availability and Implementation The R source code of SCODE is available at https://github.com/hmatsu1226/SCODE Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
Conventional bulk RNA-Seq reveals the average gene expression of an ensemble of cells, and therefore does not permit the analysis of detailed states of individual cells. With the advancement of single-cell RNA-Seq (scRNA-Seq), we can now quantify the expression of individual cells and analyze detailed differences among cells (Kolodziejczyk et al., 2015). This enables several analyses such as the identification of cell types (Buettner et al., 2015;Zeisel et al., 2015), especially of rare cells (Grun et al., 2015;Jiang et al., 2016) and the estimation of cellular lineages (Burns, 2015;Treutlein et al., 2014).
In analyses by scRNA-Seq, the reconstruction of cellular differentiation processes attracts attention as a novel approach to revealing differentiation mechanisms (Trapnell, 2015). The differentiation process can be reconstructed using dimension reduction (Ji and Ji, 2016;Trapnell et al., 2014) and stochastic processes (Matsumoto and Kiryu, 2016), for example, and the degree of differentiation (in pseudo-time) of each cell is characterized by the position in the reconstructed process. By investigating the expression pattern in pseudo-time, genes can be clustered into multiple groups with different biological functions (Trapnell et al., 2014). Moreover, the regulatory cascade of cellular state transitions, such as differentiation, can be inferred by comparing the timings of upand down-regulation (Eckersley-Maslin et al., 2016;Li et al., 2016;Matsumoto and Kiryu, 2016).
In addition, scRNA-Seq also enables the calculation of accurate correlations of expression between genes because scRNA-Seq can distinguish the detailed states of individual cells without contamination from multiple cell types. The accurate co-expression pattern of each cell type (progenitor cells and multiple types of differentiated cells) can reveal the key regulatory factors for lineage programming (Pina et al., 2015).
In this way, expression dynamics in pseudo-time and accurate relationships among genes can be inferred from scRNA-Seq data. For the next step in differentiation analyses using scRNA-Seq, it is important to reveal the regulatory interactions among genes that bring about the observed expression dynamics during differentiation, namely, gene regulatory network (GRN) inference from scRNA-Seq data. Pseudo-time can be regarded as time information, and hence, scRNA-Seq performed on cells undergoing differentiation can be regarded as time-course expression data at a high temporal resolution. Although several algorithms have been proposed to reconstruct GRN from time-course data (Lee and Tzou, 2009), most of them are not suitable for scRNA-Seq data, such as that collected over continuous time and with a large number of samples. Moreover, time complexity is a serious problem, and runtime becomes infeasibly long with large numbers of samples and genes for the network inference from time-course data.
Recently, Boolean network-based algorithms have been proposed for inferring GRN from single-cell data (Chen et al., 2015;Lim et al., 2016;Moignard et al., 2015). Although these algorithms have revealed some interesting regulatory relationships, their time complexity increases significantly as the number of genes and cells increases, and they have thus been applied to data with a small number of genes. In addition, the expression data must first be converted into binary data for Boolean network inference, and therefore the relationship between networks and the underlying dynamics becomes obscured (Woodhouse et al., 2016).
As another approach, ordinary differentiation equations (ODEs) have been used to describe regulatory network and expression dynamics. ODEs can describe continuous variables over continuous time and the underlying physical phenomena, and therefore they are suitable for inferring GRN from scRNA-Seq during differentiation.
Although several ODE-based network-inference algorithms have been proposed (Di Bernardo et al., 2004;Gardner et al., 2003), most of them are not suitable for the differentiation case because these algorithms assume a steady-state condition. There are some ODE-based algorithms that infer GRNs such that the observed expression dynamics can be reconstructed from the optimized ODE (Bansal et al., 2006). However, time complexity is still a serious problem for such ODE-based algorithms (Lee and Tzou, 2009). Previous research has described optimizing an ODE by using singlecell data and pseudo-time to infer key GRNs (Ocone et al., 2015). Although it is a suggestive approach, the optimization assumes that the GRNs are given and learns the ODE for a specific GRN. Therefore, a novel and efficient algorithm is necessary to learn GRNs from ODEs designed for scRNA-Seq performed on differentiating cells and for a large number of samples and genes.
Accordingly, we developed an approach to describe regulatory networks and expression dynamics with linear ODEs as well as a novel, highly efficient optimization algorithm, SCODE, for scRNA-Seq performed on differentiating cells by integrating the transformation of linear ODEs and linear regression. In the Methods section, we show that linear ODEs can be transformed from fixed-parameter linear ODEs if they satisfy a relational expression. We also show that the relational expression can be estimated analytically and efficiently by linear regression. In addition, SCODE uses a small number of factors to reconstruct expression dynamics, which results in a marked reduction of time complexity. In the Results sections, we described the application of SCODE for three scRNA-Seq datasets during differentiation. First, we validated that the optimized ODEs can reconstruct observed expression dynamics accurately. Second, we evaluated the inferred network by comparing it to the transcription factor (TF) regulatory network database based on DNaseI footprints and transcription factor binding motifs. SCODE performed best with two of the datasets and was the close second best algorithm for the remaining dataset. Third, we compared the runtimes of the algorithms, and SCODE was significantly faster than previous algorithm that was designed for time-course data. Moreover, SCODE is faster than some algorithms that do not use time parameters. These results illustrate the remarkable efficiency of SCODE. Lastly, we analyzed the network inferred from a dataset and determined that the de novo methyltransferases Dnmt3a and Dnmt3b might be key regulators of differentiation.
In this paper, we propose a novel algorithm for scRNA-Seq performed on differentiating cells to reconstruct expression dynamics and infer regulatory networks with a highly efficient optimization method. We believe that our approach will substantially advance the development of regulatory network inference and promote the development of further single-cell differentiation analyses and bioinformatics methods.

Describing regulatory networks and expression dynamics with linear ODEs
In this research, we focus on TFs and inferring TF regulatory networks. First, we describe TF expression dynamics throughout differentiation with linear ODEs: where x is a vector of length G (G is the number of TFs) that denotes the expression of TFs and A is a square matrix with dimensions equal to G that denotes the regulatory network among TFs. We infer the TF regulatory network by optimizing A such that the ODE can successfully describe the observed expression data. The above model and most of the algorithms which infer GRN from RNA-Seq make no distinction between mRNA and corresponding TF levels. Although several studies have shown poor correlation between mRNA and protein levels (Maier et al., 2009), the correlation increases in the case of differentially expressed mRNAs (Koussounadis et al., 2015), and changes of mRNA and protein levels are highly correlated (Lundberg et al., 2010). Therefore, TF levels can be represented by mRNA levels with respect to differentially expressed TF genes during differentiation. The delay due to translation and translocation also may impact our model. The time scales of translation and translocation are seconds (Morisaki et al., 2016) and several tens of minutes (Aymoz et al., 2016), which are significantly shorter than that of differentiation (days). Therefore, our assumption that identifies mRNA level with corresponding TF levels will be appropriate to describe the regulatory relationships and dynamics during differentiation.
The observed expression data consist of a G Â C matrix (X ðeÞ ), where C is the number of cells. In addition, the time parameter of a cell c is given as t c . Therefore, our objective is to optimize A such that dx ¼ Axdt can properly represent X ðeÞ at a corresponding time point.
Here, A contains G Â G parameters and an efficient parameter optimization algorithm is necessary for large values of G. This is because the time complexity is typically OðG 3 Þ for operation on a G Â G matrix, and it will exceed OðCG 3 Þ to optimize A with a general algorithm. As experimental technologies have advanced, the number of cells that may be subjected to scRNA-Seq has been increasing, and hence C can be quite large. Therefore, we developed a novel algorithm to optimize A efficiently, even if both G and C are large, by integrating the transformation of linear ODEs and linear regression.

Deriving A from a linear ODE transformation
At first, we consider the following linear ODE: where z is a vector of length G and B is a known square matrix. If we know a matrix W that satisfies x ¼ Wz, we can derive the ODE of x by transforming the ODE of z as follows: Therefore, if the parameter B of dz ¼ Bzdt and the relationship x ¼ Wz are given, we can derive A from WBW À1 .

Estimating W using linear regression
To infer A, we have to estimate a matrix W that satisfies x ¼ Wz.
Here, we assume that the problem of W inference can be regarded as a linear regression problem. Initially, from dz ¼ Bzdt, we calculate z at t ¼ t c for each cell and generate a G Â C matrix (Z ðeÞ ) ( Fig. 1(a)).
With this Z ðeÞ , we optimize W to successfully represent the relationship X ðeÞ ' WZ ðeÞ , which results in x ' Wz. The above problem can be regarded as solving the linear regression for each gene, as follows: where is a noise term. Therefore, W can be optimized analytically and efficiently by linear regression for each TF ( Fig. 1(b)), and A can be efficiently calculated from WBW À1 .

Dimension reduction of z
The basic idea of reduction is that the patterns of expression dynamics are limited and expression dynamics can be reconstructed with a small number of patterns. For the next step, we consider a small vector z to represent the original expression dynamics. Hereafter, z is a vector of length D, with D ( G. In this case, W is a G Â D matrix, and hence we used a pseudo-inverse matrix W þ instead of the inverse matrix, and A is derived from A ¼ WBW þ . The matrix W is estimated as before, via linear regression. By using a small vector z, the time complexity of estimation of W becomes much lower. Recently, such dimensionality reduction approach has also been proposed to infer network (Wang et al., 2016). Although it is a sophisticated algorithm, it is designed for discrete time-course data and small samples, and is not suitable for scRNA-Seq data.

Optimizing B
Thus far, we have assumed B is given. To represent the original expression dynamics with small values of D, we optimize B for the The expression dynamics can be reconstructed from the optimized ODE of x next step. We suppose that the appropriate value of B satisfies the condition that the Z ðeÞ generated from dz ¼ Bzdt can predict X ðeÞ with WZ ðeÞ accurately. Therefore, we evaluate the appropriateness of the matrix B with the following residual sum of squares (RSS): In this research, we assume B is a diagonal matrix and the elements B ii satisfy b min B ii b max (we set b min and b max to -10 and 2, respectively). This limitation is acceptable because large and small values of B ii represent a dynamics of sharp change and seem to be an inefficient basis for reconstructing the expression dynamics.
We optimize B by random sampling and iterative optimization so that the RSS decreases ( Fig. 1(c)). The brief pseudocode is given below (see the Supplementary text for the detailed procedure). Fig. 1(d)).

Time complexity
The time complexity of optimizing W and B is OðIðGD 3 þ GCD 2 ÞÞ, where I is the number of iterations of B optimization. The time complexity of calculating A is OðDG 2 Þ. Because we assume that D is small, the total time complexity is about OðIGC þ G 2 Þ. As matrix operations on A, such as multiplication, have a time complexity of OðG 3 Þ, our algorithm is highly efficient to infer regulatory network even though it integrates time-course information into the model.

Other network inference approaches
For comparison, we also developed a simple network inference algorithm based on linear regression that predicts expression of a particular TF from the expression of the remaining TFs as follows: With this method, the optimized W 0 is regarded as a regulatory network. In this research, we optimized W 0 using two criteria. The first criterion is based on normal linear regression, and we estimated W 0 with the lm function in R. The second criterion is based on lasso regression, and we estimated with the msgps package in R, which automatically selects the optimal degrees of freedom (Hirose, 2012). We used the Bayesian information criterion for model selection in msgps.
In addition, we inferred networks with GENIE3 (Huynh-Thu et al., 2010), which also predicts TF expression from the expression of other TFs by using regression trees. The performance of GENIE3 was best in the DREAM5 Network Inference challenge for population data (Marbach et al., 2012).
We also inferred networks with Jump3 (Huynh-Thu and Sanguinetti, 2015), which is the expansion of GENIE3 for reconstructing a network from time-course expression data. Jump3 is based on jump trees and showed high performance for multiple time-course datasets.

Dataset
We analyzed three time-course scRNA-Seq datasets by the following procedures. First, transcripts per million reads (TPM) and fragments per millions of kilobases mapped (FPKM) were transformed as log(TPM þ 1) and log(FPKM þ 1), and we regarded these logtransformed values as the expression value. Next, we calculated the averaged expression of each TF at each time point and calculated the variance of the averaged expression for each TF. For TF data, we used Riken TFdb for mouse (Kanamori et al., 2004), and animalTFDB for human (Zhang et al., 2015). (Riken TFdb contains not only TFs but also their related genes, and we called all genes in the database transcription factors in this study.) Lastly, we regarded the TFs with large variances as variable TFs during differentiation. Hereafter, we used the top 100 variable TFs for network inference. For these 100 TFs, we estimated pseudo-time (t c ) with Monocle (Trapnell et al., 2014). We also excluded 100 randomly selected cells from the training data in order to use them as test data to evaluate adequate sizes of z (D).

Data1: mouse ES cells to primitive endoderm cells
The first time-course scRNA-Seq dataset (at 0, 12, 24, 48 and 72 h) analyzed was derived from primitive endoderm (PrE) cells differentiated from mouse ES cells (by using G6GR ES cells (Shimosato et al., 2007)), containing 456 cells. This dataset was produced with RamDA-Seq, a novel scRNA-Seq protocol developed by our laboratory (in submission).

Data2: mouse embryonic fibroblast cells to myocytes
The second dataset was derived from scRNA-Seq data obtained to examine direct reprogramming from mouse embryonic fibroblast (MEF) cells to myocytes at days 0, 2, 5 and 22 (Treutlein et al., 2016). This dataset contained 405 cells.

Network validation method
To validate the inferred networks, we used the Transcription Factor Regulatory Network database (http://www.regulatorynetworks. org), which was constructed from DNaseI footprints and TFbinding motifs (Neph et al., 2012;Stergachis et al., 2014). We integrated the TF regulatory networks of all cells for human and mouse, and extracted 100 Â 100 TF regulatory networks for each dataset. We regarded these TF regulatory networks as correct networks for each dataset and calculated the AUC values of the inferred networks. The AUC values were calculated by regarding the directed edges that show higher absolute values as representing reliable regulatory relationships. We removed self-loop regulation and TFs that do not have an edge in the correct network from AUC calculation in order to avoid biases.
ii ( uniform random value 2 ½b min ; b max end for 3 Results

Selection of the size of z (D) and reproducibility of A
Our model was overfitted to the training data, and the inferred A was unstable with needlessly large D. Additionally, the model cannot reconstruct expression dynamics with insufficiently small values of D. Therefore, the selection of appropriate values for D is necessary, and we applied SCODE to training data and evaluated the validity of the optimized model on the basis of the RSS of independent test data for various values of D (D ¼ 2, 4, 6 and 8). For each D, we executed SCODE 100 times independently, and the first, second and third quantiles of the RSS values of test data are shown in Figure 2(a). For every dataset, the median of RSS is almost saturated at D ¼ 4.
Because we used random sampling during optimization, we validated the reproducibility of the optimized A. We calculated the correlation coefficient among optimized A for the top 50 replicates (in ascending order of RSS values) of test data for each D. The corresponding first, second and third quantiles of correlation coefficients are shown in Figure 2(b). For D ¼ 4, the medians of the correlation coefficients are 0.71, 0.94 and 0.88 for each dataset. The medians tend to decrease for large D because the matrix A is unstable with needlessly large D. The medians also decrease for small D, possibly because the optimized A is trapped in local optima. In summary, the correlations among replicates are high, and therefore, an optimized matrix A is stable for D ¼ 4.
Because the RSS values for test data are almost saturated and the estimated A are stable with D ¼ 4, we used D ¼ 4 unless otherwise specified. For optimized A of each D, we used the mean of optimized A of the top 50 replicates, hereafter.

Validation of A optimization with simulation data
Next, we investigated whether SCODE can infer genuine A by using simulated data. Because the dynamics of x become unrealistic with randomly determined A, we used previously inferred A (for D ¼ 4) as genuine A and simulated data with the same condition for each dataset (such as the same pseudo-time). We also added uniform random numbers ( 2 ½À0:1; 0:1) to simulated data as a noise term. We optimized A for each simulated dataset 100 times, and Figure 3 shows the first, second and third quantiles of the correlation coefficients between the genuine A and optimized A for each D. The medians are 0.70, 0.71 and 0.91 for D ¼ 4, and 0.61, 0.48 and 0.49 for D ¼ 6. Therefore, SCODE can accurately infer the genuine A with appropriate D, and can roughly infer A with slightly different D values unless we set extremely large or small D.

Reconstruction of expression dynamics
Although RSS values for test data were almost saturated at D ¼ 4, this does not necessarily mean that SCODE can successfully learn the dynamics. Next, we investigated whether the optimized ODE can accurately reconstruct observed expression dynamics to verify the optimization of SCODE (Fig. 1(e)). For each set of dynamics, the initial values (x at t ¼ 0) were set to the mean expression of 0-h or day 0 cells. At first, we compared the reconstructed dynamics with observed data in the principal component analysis (PCA) space (Fig. 4). For every dataset, SCODE was able to reconstruct the dynamics with D ! 4.
Next, we compared the reconstructed dynamics with observed expression dynamics for some TFs (Sox2, Utf1, Epas1 and Foxq1) in Data1 (Fig. 5). The analysis for every TF and dataset is described in the Supplementary text. Although the reconstructed dynamics of SCODE with D ¼ 2 differ from the observed data, the model with D ! 4 successfully reconstructed complicated dynamics, such as   Fig. 4. PCA of scRNA-Seq data for each dataset. Each circle represents a cell, and its color represents experimental time (from light gray to black). The reconstructed expression dynamics are projected onto PCA space and are represented by colored lines (green, yellow, orange and red correspond to D ¼ 2, 4, 6 and 8, respectively) transient patterns. Therefore, we concluded that SCODE can successfully optimize A and learn the ODE of x

Validation of inferred network
We also evaluated the inferred network of each algorithm including the correlation network by comparing them to TF regulatory networks based on DNaseI footprints and TF-binding motifs (see Section 2.4). Because the runtimes of Jump3 are significantly large for large numbers of cells, we used 25 cells at even intervals in the pseudo-time order as the data for Jump3. The AUC values of each method for each dataset are shown in Table 1.
For Data1 and Data2, the AUC values of SCODE are significantly larger than those of the other algorithms. This is because our model considers the dynamics of expression and fully uses time information. Although Jump3 is also designed for time-course expression data, the AUC values are not high. This is because Jump3 is not designed for scRNA-Seq conducted during differentiation, but is designed for multiple time-course data. This suggests the necessity of a novel computational algorithm designed for scRNA-Seq data.
The performance of SCODE is second, but almost equal to the best performance for Data3. Given that the reconstructed path in PCA space is a little out of alignment for Data3 (Fig. 4), our model based on linear ODEs might be slightly insufficient to describe the expression dynamics of Data3.
In summary, our algorithm can infer TF regulatory networks with high performance in comparison to other network inference algorithms, especially for Data1 and Data2. This result implies the importance of time parameters in network inference and the necessity of a novel network inference algorithm designed for scRNA-Seq data obtained during differentiation.
We also compared the result of SCODE of Data3 with TRRUST, which is a human transcriptional regulatory network database based on small-scale experimental studies (Han et al., 2015). As a result, SCODE successfully identified the relationship included in TRRUST that GATA3, which is the known marker for the differentiation of Data3, positively regulate ZEB1 (see Supplementary text for the detailed explanation).

Runtimes
We investigated the runtime of each method and the runtimes for Data1 are shown in Table 2. The runtime of Jump3 is calculated using the data from 25 cells as stated above. The runtime of SCODE is 11 seconds and is significantly smaller than that of Jump3. Moreover, the runtime of SCODE is smaller than those of msgps and GENIE3, which do not consider time dynamics. These results show that SCODE can infer regulatory networks efficiently, even though it considers a time parameter in its model.

Network analysis
Lastly, we investigated the structure of the inferred regulatory network of Data1. At first, we defined the threshold a as the value of the 1000th largest absolute value in A, and we counted the number of positive edges (A ij ! a) and negative edges (A ij Àa) for TF j. Figure 6(a) shows the total counts for each TF in decreasing order. About 39% of edges are included in the top 10 TFs, and this result implies the existence of key regulators for differentiation. Interestingly, most TFs mainly have either positive or negative edges, and this result suggests that TFs might mainly work as either activators or inhibitors in differentiation. This tendency was shared with Data3, but was not seen in Data2 (see Supplementary text). This result might reflect a difference in the systems; Data1 and Data3 represent differentiation from ES cells, while Data2 represents direct reprogramming from MEF cells.
We also visualized the top 10 TF results Figure 6(b). Interestingly, Dnmt3a and Dnmt3b, which are the de novo DNA    To maintain the pluripotency of ES cells, Dnmt3a and Dnmt3b seem dispensable, and these genes must be unimportant for ES cells (Tsumura et al., 2006). However, several studies have suggested the importance of Dnmt3a and Dnmt3b in differentiation. For example, these TFs restrict the lineage-specific function of TFs during differentiation via DNA methylation (Oda et al., 2013). In addition, Dnmt3a is essential for hematopoietic stem cell differentiation and it seems to enhance differentiation by epigenetic silencing of multipotency genes (Challen et al., 2012). Thus, Dnmt3a and Dnmt3b are known to affect differentiation based on de novo DNA methylation. In this study, these genes were inferred to regulate several TFs positively. Because DNA methylation essentially silences expression, these TFs might be regulated positively indirectly via the inactivation of negative regulators of these TFs. Although the direct targets of Dnmt3a and Dnmt3b are obscure, our result suggests that they are the key regulators of this differentiation.

Discussion
The advancement of scRNA-Seq and the analysis of differentiation reconstruction and pseudo-time have elucidated differentiation mechanisms. The inference of regulatory networks associated with differentiation is necessary to further our understanding of differentiation and development. In the inference of regulatory networks, it is important to fully use pseudo-time information and expression dynamics. However, there are no efficient algorithms for inferring the regulatory networks of many TFs from continuous time expression data. Thus, we developed SCODE, an efficient algorithm based on linear ODEs. SCODE is based on the transformation of linear ODEs and linear regression, and the time complexity is significantly small.
We applied SCODE to three scRNA-Seq datasets during differentiation and showed that SCODE can successfully optimize ODEs so that these ODEs can reconstruct observed expression dynamics. In the validation of the inferred network, the AUC values of SCODE were higher than those of other methods in almost of all cases. The runtime of SCODE is significantly smaller than that of Jump3, which also infers networks from time-course data. Additionally, SCODE is faster than GENIE3, which does not use time information. These performance results show the efficiency of SCODE.
Single-cell sequencing technologies are developing rapidly, and the number of scRNA-Seq datasets produced from differentiating cells will therefore increase. Our novel and efficient method for inferring regulatory networks demonstrated high performance and will therefore enhance the analysis of regulatory networks. Moreover, our model can reconstruct expression dynamics accurately. This means that we can simulate expression dynamics (such as those associated with the knockout of a TF) by using an optimized model, and such simulation-based analyses will be useful for many types of research, such as detection of drivers of differentiation. Thus, SCODE is useful not only for regulatory network inference, but also for various analyses using simulation, and therefore, our research is a promising computational tool for further single-cell sequence analyses.