D-EE: Distributed software for visualizing intrinsic structure of large-scale single-cell data

Abstract

Background: Dimensionality reduction and visualization play vital roles in single-cell RNA sequencing (scRNA-seq) data analysis. While they have been extensively studied, state-of-the-art dimensionality reduction algorithms are often unable to preserve the global structures underlying data. Elastic embedding (EE), a nonlinear dimensionality reduction method, has shown promise in revealing low-dimensional intrinsic local and global data structure. However, the current implementation of the EE algorithm lacks scalability to large-scale scRNA-seq data.

Results: We present a distributed optimization implementation of the EE algorithm, termed distributed elastic embedding (D-EE). D-EE reveals the low-dimensional intrinsic structures of data with accuracy equal to that of elastic embedding, and it is scalable to large-scale scRNA-seq data. It leverages distributed storage and distributed computation, achieving memory efficiency and high-performance computing simultaneously. In addition, an extended version of D-EE, termed distributed optimization implementation of time-series elastic embedding (D-TSEE), enables the user to visualize large-scale time-series scRNA-seq data by incorporating experimentally temporal information. Results with large-scale scRNA-seq data indicate that D-TSEE can uncover oscillatory gene expression patterns by using experimentally temporal information.

Conclusions: D-EE is a distributed dimensionality reduction and visualization tool. Its distributed storage and distributed computation technique allow us to efficiently analyze large-scale single-cell data at the cost of constant time speedup. The source code for the D-EE algorithm, based on C and MPI and tailored to a high-performance computing cluster, is available at https://github.com/ShaokunAn/D-EE.

Reviewer #2: The paper by An et al. presents a distributed implementation of the elastic embedding (EE) algorithm that allows to run EE on large datasets. The authors argue that EE (and its time-series version TSEE, previously developed by the same authors) is a good choice of an embedding algorithm for single-cell data, and so their implementation allows to actually use it in practice. Previously existing implementations did not scale beyond a sample size of a few thousand cells.
This is a solid work on implementing a parallelized/distributed version of EE. The EE/TSEE algorithms can indeed be an attractive choice of an embedding algorithm for single-cell data, so this paper makes a contribution that is relevant for the field. Importantly, this paper does *not* suggest any new algorithm, but simply presents an efficient implementation of an existing algorithm. If this is in scope of the "Technical note" section of GigaScience, then I can recommend acceptance.

Reply:
Thanks for your thoughtful comments. We followed the instructions of GigaScience and submitted our manuscript as a "Technical Note", since "Technical Notes should present an open-source software tool or an experimental or computational method, test or procedure for the analysis or handling of large-scale data."

Comment:
The main issue I have is that it seems the authors used a supercomputer to run their analysis, and it is not clear to me if this implementation would allow to analyze the same dataset (n=250k) on a normal desktop computer. For competitor algorithms like t-SNE and UMAP this dataset size is not a problem even on a laptop. If D-EE can only deal with this dataset on a supercomputer, then it's clearly much much less efficient than competitors. The authors should clarify this point, see also below.
MAJOR COMMENTS
* Even though the main contribution of the paper is efficient EE implementation, the authors do not provide almost any benchmarking results. What I would like to see reported in the text: a) What exact hardware was used for the experiments? On page 4 the authors say that they used up to 4000 CPU processes. What kind of computer allows to run 4000 processes? Was it a computing cluster? Please give exact details of the used hardware. Caption of Figure 3 says it was a supercomputer, but more details should be given in the text.

Reply:
Thanks for your thoughtful comments. In the revised manuscript, we provided detailed information about the computing cluster on which we tested and implemented our software in the main text as follows: "… The numerical tests are carried out on the LSSC-IV supercomputer. The 400 computing nodes of LSSC-IV each comprise two 18-core Intel Xeon Gold CPUs with 192 GB local memory, and are interconnected via a proprietary high-performance network." (see the first paragraph of "D-EE achieves high strong scaling efficiency" on Page 4)

Comment:
b) What was the actual runtime on this supercomputer? At a minimum, report the physical time for iPSCs dataset in the text (e.g. 1 hour, or 20 hours, or 30 minutes). Even better would be a figure (perhaps another panel in Figure 3) that plots the time as a function of sample size. What is the largest sample size that can be feasibly analyzed on the supercomputer?

Reply:
Thank you for your suggestion.
In the revised manuscript:

Firstly, we tested our software extensively with various numbers of processes on the iPSCs dataset and provided detailed runtime information: "To test the parallel efficiency of D-EE on the large-scale dataset, we apply D-EE to the iPSCs dataset (∼250k cells) using 500, 1,000, 2,000, and 4,000 processes, respectively, and for each setting of the number of processes we run it at least twice. The averaged computation times of D-EE for the iPSCs dataset are 5.83 hours, 3.19 hours, 2.36 hours, and 2.02 hours when the number of processes is 500, 1,000, 2,000, and 4,000, respectively..." (see subsection "D-EE achieves high strong scaling efficiency" on Page 4)

Secondly, we explored the relationship between runtime and sample size, displayed in Figure 3C, and described the results as follows: "We further evaluate the computational times of D-EE on our supercomputer. Three test cases with sample sizes of 10k, 50k, and 100k are run on 500, 1,000, and 2,000 processes, respectively. When using the same number of processes, the computational time naturally increases as the total sample size increases (as shown in Figure 3C)." (see the first paragraph on Page 5)

Thirdly, we estimated the largest sample size that can be feasibly analyzed on the LSSC-IV supercomputer on which we implemented our software. Specifically, a dataset with 10k samples run with 8 processes consumed about 10 GB of memory. Given the memory of our supercomputer and the quadratic relationship between memory cost and sample size N, we estimate that the largest sample size that can be analyzed on our supercomputer is about 850k. We provide this information in the instructions for D-EE on GitHub: https://github.com/ShaokunAn/D-EE.
Comment: c) Most importantly --most users will not have access to a supercomputer (!!). The text should clearly state how large a dataset can be feasibly analyzed on a typical desktop computer (8 CPU threads, 16 Gb RAM) and perhaps on a typical powerful lab computing station (e.g. 40 CPU threads, 256 Gb RAM). Maybe include the runtimes using such hardware into the figure that I suggested above.

Reply:
Thanks for pointing out this problem. In the revised manuscript, we made it clear that a large computing cluster is recommended when running D-EE on large-scale datasets, and provided information about the largest sample size that can be feasibly analyzed on two typical computing devices, as follows: "In practice, for the analysis of large-scale single-cell datasets, a workstation with multiple CPUs and large memory is recommended. Meanwhile, we also test D-EE on datasets of different sample sizes, and find that D-EE runs efficiently on a typical personal computer (e.g., 8 CPU threads, 16 GB RAM) with sample sizes of up to 12k, and on a conventional workstation (e.g., 40 CPU threads, 256 GB RAM) with sample sizes of up to 48k." (see the second paragraph on Page 5) For sample sizes of up to 250k, as in our cases, D-EE has to be run on a supercomputer due to the large memory consumption. As mentioned in the revised manuscript, improving D-EE is discussed as future work: "In a future study, to resolve the limitation of D-EE on huge-scale data computation, we can accelerate D-EE by adopting either the fast Fourier transform as used in FIt-SNE, or the state-of-the-art neural network framework used by net-SNE [28]. On the other hand, since huge-scale single-cell datasets can be highly redundant, we can also select a subset of informative samples using the advanced geometric sketching tool [29] prior to the application of D-EE." (see the last paragraph of the Conclusion section on Page 5)

Comment:
d) In case this implementation only works well on a supercomputer, and is not very helpful on a personal desktop computer, this should be stated very prominently in the abstract/intro/discussion/etc. Currently the abstract says that D-EE allows "to efficiently" analyze large RNAseq data. But if I must have a supercomputer for such an analysis, then it's not really very "efficient".
The abstract should be clear about this. Note that t-SNE and UMAP can easily analyze n=250k on a desktop computer or a laptop. The authors have to discuss this in the Discussion.

Reply: Thanks for your valuable suggestion. In the revised manuscript, we made it clear in the abstract that "…Its distributed storage and distributed computation technique allow us to efficiently analyze large-scale single-cell data at the cost of constant time speedup." We also discussed the limitations of D-EE in the revised manuscript: "We demonstrate that D-EE and D-TSEE work efficiently on large-scale datasets on a supercomputer. However, the proposed distributed algorithm D-EE still has a disadvantage due to the huge computational cost and storage for relatively large numbers of cells. Therefore, D-EE is limited in handling and analyzing huge-scale datasets with numbers of cells on the order of millions [27]. In comparison, the state-of-the-art accelerated implementations of t-SNE (e.g., FIt-SNE) and UMAP have close-to-linear computational complexities, showing great efficiency on huge data analysis." (see the second paragraph of the Conclusion on Page 5).
Finally, in the revised manuscript, we discussed potential improvements of D-EE for the computation of huge datasets: "In a future study, to resolve the limitation of D-EE on huge-scale data computation, we can accelerate D-EE by adopting either the fast Fourier transform as used in FIt-SNE, or the state-of-the-art neural network framework used by net-SNE [28]. On the other hand, since huge-scale single-cell datasets can be highly redundant, we can also select a subset of informative samples using the advanced geometric sketching tool [29] prior to the application of D-EE." (see the last paragraph of the Conclusion section on Page 5)

Comment:
* The authors need to give exact details of how they ran t-SNE and UMAP for Figure 4 (page 4). Was it default parameters of some implementation? Name the implementation, give the version, say that it was default parameters. If you changed some parameters from the default ones, specify all of them.
In particular, for t-SNE I am worried that a random initialization was used for Figure 4. Random initialization is a bad idea, as stated e.g. in the cited Ref [8]. UMAP uses nonrandom init, so this is not a fair comparison. I suggest to use PCA initialization as explained in Ref [8] and also as is default in the latest versions of FIt-SNE and openTSNE.
Apart from that, t-SNE is sensitive to the learning rate (see https://www.nature.com/articles/s41467-019-13056-x), so it should be set appropriately, e.g. to n/12, which is also the default in FIt-SNE and openTSNE.

Reply:
Thank you for your valuable suggestion. In the revised manuscript, we followed the instructions of the reference you provided and tuned the parameters of both t-SNE and UMAP, finding that the performance of t-SNE can be largely improved. We added the detailed implementations of t-SNE and UMAP, as well as their versions, as follows: "The t-SNE is conducted by the FIt-SNE method [23] with a PCA initialization, and we choose the learning rate as 1/12 of the sample size, following [3], for better preservation of the global structures. UMAP is conducted by adjusting the number of nearest neighbors (NNs) to balance the preservation of local and global structures. We find that UMAP is not sensitive to the number of NNs between 30 (the default) and 500 (the square root of the number of samples), and we thus set the number of NNs to 100 in our study. Both t-SNE and UMAP are implemented with the Seurat software (version 3.2.1). In our implementations, both D-EE and D-TSEE are used with their default parameters." (see the first paragraph of subsection "D-EE and D-TSEE recover intrinsic low-dimensional structures of large-scale scRNA-seq data" on Page 5)

Comment:
* page 1 -- You might want to consider citing https://www.nature.com/articles/s41467-019-13056-x after "widely used in the single-cell community".

Reply:
Thanks for pointing out this valuable reference. In our revised manuscript, we cited the reference in 3 places: (1) Introduction: "…For example, the celebrated t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm [2] is widely used in the single-cell community [3]." (see Page 1) (2) Methods: "…It is worth noting that a newly proposed combined perplexity has been applied to t-SNE [3], which greatly enhances the performance of t-SNE in the preservation of global structures. The combined perplexity could also be adopted by D-EE in a future update." (see Page 2) (3) Results: "…The t-SNE is conducted by the FIt-SNE method [23] with a PCA initialization, and we choose the learning rate as 1/12 of the sample size, following [3], for better preservation of the global structures." (see Page 5)

Comment:
* page 2 -- "EE extends t-SNE", not sure this is a good way to put it. EE is not really an "extension" of t-SNE, is it? It's a different but very related algorithm. Consider revising.

Reply:
Thanks for your suggestion. In the revised manuscript, we modified the sentence as "…To achieve this goal, EE penalizes the placement of latent points in close proximity away from dissimilar data points in high-dimensional space…" (see Introduction on Page 2)

Comment:
•page 2, a bit later -- "the current implementations ... is" -> are

Reply: Thanks for pointing out this issue. We have corrected these typos in the revised manuscript.
Comment: •page 2, after the first formula: as far as I know, the original EE paper used one fixed value of sigma (kernel width) and not adaptive sigma_n as you do here. I suggest to mention this and explain/cite where the sigma_n comes from. Looks like it's from Ref [11] (last sentence in that paragraph). If so, write that you follow [11] in using adaptive sigma_n based on perplexity (?), and you use perplexity=X by default.

Reply:
Thanks for your valuable suggestion. In the revised manuscript, we provided a detailed description of the parameters σ_n as well as the perplexity parameter (default 20) as follows: "…The σ_m in w+_mn is a sample-specific scaling parameter. It is estimated adaptively by solving a sample-specific root-finding problem, such that the sample-specific distribution over its neighbors has the desired perplexity (see [12] for details). We set the default value of the perplexity to 20 in this study." (see Methods on Page 2) We also pointed out that "…It is worth noting that a newly proposed combined perplexity has been applied to t-SNE [3], which greatly enhances the performance of t-SNE in the preservation of global structures. The combined perplexity could also be adopted by D-EE in a future update." (see Methods on Page 2)

Comment:
•page 4, last sentence of "Data description": this is not enough detail. Logarithmic means log(x+1)? How exactly did you find variable features? How many? Please provide full description and cite some literature where you took the recommendations from. If you used a particular software to do it, say so and cite.

Reply:
In the revised manuscript, we described the data preprocessing in detail as follows: "Firstly, we normalize the gene expression in each sample as follows: we divide each gene read count by the total read counts of its cell, multiply by a scale factor of 10^4, add one, and then take a logarithmic transformation. Secondly, we select the top 2,000 variable genes using the default "vst" method of the Seurat package, i.e., the variance-stabilizing transformation [21]." (see the second paragraph of Results on Page 4) The package used for preprocessing is specified as follows: "The obtained single-cell read count data are pre-processed with the Seurat package (version 3.2.1) [19,20]." (see the second paragraph of Results on Page 4)

Comment:
•page 4: "lambda used is set to the default values" -- what values are default? How can the reader know that? Please specify exact values of all parameters.

Reply:
Thanks for pointing out this issue. In the revised manuscript, we added the default value of the parameter lambda: "In single-cell data analysis, with λ = 10, EE can achieve robust performance with high accuracy [5]. Therefore, we set the default value of λ as 10 for D-EE." (see Methods on Page 2)

Comment:
page 4, description of Figure 2: Frobenius norm only makes sense if you used the same initialization. Please specify in the text that it was the same and also specify what the initialization was (random?).

Reply:
Thanks for your suggestion. In the revised manuscript, we provided the definition of the Frobenius norm and stated that the same initialization was used when performing the consistency experiments: "…We use the same initialization, generated from a Gaussian distribution, as that in the original EE Matlab code…" (see Results on Page 4) "…To measure the consistency quantitatively, we calculate the relative error, defined as relative error = ‖A − B‖_F / ‖A‖_F, where ‖·‖_F is the Frobenius norm of a matrix, and A and B represent the outputs of EE and D-EE, respectively. The Frobenius norm of a matrix A ∈ R^(m×n) is defined as ‖A‖_F = sqrt(Σ_{i=1}^m Σ_{j=1}^n |a_ij|²)." (see Results on Page 4)

Comment:
•page 5, "oscillatory patterns" -- I am not sure I see the oscillatory patterns in the figure. I suggest the authors highlight it in the figure somehow.

Reply:
Thank you for pointing this out. In the revised manuscript, we added labels and marks on Figure 5 to better illustrate the oscillating pattern.

In contrast, Elastic Embedding (EE), a nonlinear dimensionality reduction method, attempts to preserve both local and global structures underlying the data [?]. To achieve this goal, EE penalizes the placement of latent points in close proximity away from dissimilar data points in high-dimensional space, thus resolving the difficulty of global structure preservation (see [?], or Methods for details). EE has attracted increasing interest among statistical researchers [?]. It has also shown remarkable performance in visualizing the intrinsic structures of scRNA-seq data [? ? ?]. However, the current implementations of the EE algorithm are not scalable in the sample size N (e.g., the number of cells), so EE cannot be used for large-scale scRNA-seq datasets. For example, the storage of the attractive and repulsive weight matrices of the EE algorithm is O(N^2). Therefore, we present a distributed optimization implementation of EE, termed D-EE. D-EE not only reveals the low-dimensional intrinsic structures of data with the same accuracy as EE, but is also scalable to large-scale scRNA-seq data. It leverages distributed storage and distributed computation, achieving memory efficiency and high-performance computing simultaneously (Figure 1). In addition, a distributed optimization implementation of the time-series EE (TSEE) algorithm [?], termed D-TSEE, is also provided for visualizing large-scale time-series scRNA-seq data. In this study, we demonstrate the power of D-EE and D-TSEE on both simulated and real data. Both D-EE and D-TSEE (1) achieve the same accuracy as EE and TSEE, respectively, and (2) achieve high strong-scaling performance on large-scale datasets.

Elastic Embedding algorithm
EE was proposed in [?]. It optimizes an energy function containing attractive and repulsive terms.
Given N samples Y = {y_1, y_2, . . . , y_N}, where y_i ∈ R^D represents the high-dimensional coordinates of sample i, the goal of EE is to map the data from the high-dimensional space onto a low-dimensional representation X = {x_1, x_2, . . . , x_N}, with x_i ∈ R^d and d ≪ D, by minimizing the energy function

E(X, λ) = Σ_{n,m} w+_nm ‖x_n − x_m‖² + λ Σ_{n,m} w−_nm exp(−‖x_n − x_m‖²),

where w+_nm = exp(−‖y_n − y_m‖² / (2σ_n²)) and w−_nm = ‖y_n − y_m‖². The first term acts as an attractive force to preserve local distances, while the second term acts as a repulsive force to preserve global structures, or to separate latent points. The parameter λ ∈ R+ trades off the two terms; a larger value implies that preservation of global structures is more important. In single-cell data analysis, with λ = 10, EE can achieve robust performance with high accuracy [?]. Therefore, we set the default value of λ as 10 for D-EE.
The σ_m in w+_mn is a sample-specific scaling parameter. It is estimated adaptively by solving a sample-specific root-finding problem, such that the sample-specific distribution over its neighbors has the desired perplexity (see [?] for details). We set the default value of the perplexity to 20 in this study. It is worth noting that a newly proposed combined perplexity has been applied to t-SNE [?], which greatly enhances the performance of t-SNE in the preservation of global structures. The combined perplexity could also be adopted by D-EE in a future update.
An extension of EE, TSEE [?], was recently proposed to handle dimensionality reduction for time-series scRNA-seq data. It works by minimizing an EE-type energy function in which the pairwise dissimilarities additionally involve t_nm, the dissimilarity in time of pairwise points, and a parameter β trades off the weights between dissimilarities in time stages and in expression space.

Numerical optimization of EE
Since the optimization solution of TSEE is basically the same as that of EE, we only give the numerical solution of EE. First, we denote W^P = {w+_nm} and W^N = {w−_nm}. Owing to the sample-specific parameters {σ_n}, W^P is not a symmetric matrix, so we symmetrize it by taking W^P := W^P + (W^P)^T. Next, the diagonal elements of W^P and W^N are set to zero. Finally, each element is normalized by dividing by the sum of all elements in the matrix.
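For a single process, the symmetrization and normalization steps amount to the following sketch. In D-EE these operations are distributed by rows across processes; this serial in-place version (our own helper, `symmetrize_normalize`) is only for illustration.

```c
/* Symmetrize (W := W + W^T), zero the diagonal, and normalize an N x N
 * row-major weight matrix in place so that its entries sum to one. */
void symmetrize_normalize(double *W, int N) {
    for (int i = 0; i < N; i++)
        for (int j = i; j < N; j++) {
            double s = W[i * N + j] + W[j * N + i];
            W[i * N + j] = W[j * N + i] = s;
        }
    for (int i = 0; i < N; i++) W[i * N + i] = 0.0; /* drop self-weights */
    double sum = 0.0;
    for (int k = 0; k < N * N; k++) sum += W[k];
    for (int k = 0; k < N * N; k++) W[k] /= sum;
}
```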
To solve the optimization problem, the classic quasi-Newton methods update X^{k+1} according to X^{k+1} = X^k + α_k P^k in the k-th iteration, where α_k is the step length determined by a line search procedure, and P^k is the search direction obtained by solving a Jacobian system B^k P^k = −G^k. In this equation, B^k is positive-definite to guarantee the decrease of the objective function, and G^k = L^k X^k is the gradient of the objective function in the k-th iteration. These procedures are repeated until a certain termination criterion is satisfied. During the iterations, B^k generally needs to be updated as well.
When optimizing EE-like problems, a technique termed the partial-Hessian optimization strategy has been proposed to employ partial information of the Hessian, L^P [?], which is the Laplacian of W^P and is invariant across iterations. This invariance makes it possible to utilize preconditioning approaches, e.g., LU decomposition, to improve calculation efficiency. The effectiveness of the resulting direction, called the spectral direction, has been validated experimentally in previous work [?].

D-EE algorithm
We provide a distributed optimization implementation of EE, termed D-EE. An overview of the proposed D-EE algorithm is given in Figure 1. Throughout the optimization, multiple processes are employed for computation and storage of the data. In Figure 1, two processes, P_0 and P_1, are taken as an example. To achieve high computing performance and memory efficiency simultaneously, our distributed algorithm divides the data (W^P, W^N, and G^k) by rows across the assigned processes. To avoid frequent communication, the whole original high-dimensional data Y is read and stored in each process, and the low-dimensional embedding X is kept in each process as well, since the storage consumed by Y and X is small compared with the other N × N matrices used during the computation. It is worth noting that, since most of the computation of each row in one matrix depends only on the same row of the other matrices (see the approximate computational complexity of D-EE in the following sections for details), the partition procedure we design in the D-EE algorithm is an almost optimal partition in parallel computing, as it optimally balances computation and communication. On the one hand, the total computational cost of the D-EE algorithm is almost the same as that of the centralized EE algorithm. On the other hand, most procedures in the D-EE algorithm are communication-free, as shown by the black arrows in Figure 1. Even for the procedures that do require communication, shown by the blue arrows in Figure 1, the communication volume is of much lower order than the cost of computation.
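The "equally, or almost equally" row division can be sketched as the standard almost-equal block split used in MPI programs. This helper, `row_partition`, is our illustration of the scheme, not code from the D-EE repository.

```c
/* Almost-equal row partition of an N-row matrix over P processes: the
 * first N % P ranks receive one extra row.  Returns the first row index
 * and the row count (M_i) owned by the given rank. */
void row_partition(int N, int P, int rank, int *first, int *count) {
    int base = N / P, extra = N % P;
    *count = base + (rank < extra ? 1 : 0);
    *first = rank * base + (rank < extra ? rank : extra);
}
```

With this split, row counts across ranks differ by at most one, so the per-process work on the N × N matrices is balanced.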

Computation of matrices D, W^P, W^N
As mentioned before, the matrices W^P and W^N depend on the high-dimensional data Y and {σ_n}. Since each σ_n is obtained by solving a root-finding problem on the n-th row of the distance matrix D, each matrix is equally, or almost equally, partitioned into multiple nonoverlapping parts by rows and stored in multiple processes, as shown in Figure 1A. Let us denote D = [D_1, · · · , D_P], where the submatrix D_i, of size M_i × N, is stored in the i-th process, and P is the number of processes used. Here [· · · ] represents stacking by rows, as a column vector of blocks. Similar notations are used for the other N × N matrices. Clearly, each row of the matrices D, W^P, W^N depends on all of the original high-dimensional data Y. Therefore, we load a copy of Y into each process to avoid frequent communication.
In the centralized implementation of EE, the parameters σ_n, n = 1, . . . , N, are calculated by iteratively solving a sequence of root-finding problems. The iteration method for the root-finding problems is improved by reordering the computation of {σ_n} according to the distances of all samples (Y), i.e., the complete distance matrix D [?]. The reordered root-finding problems are then solved sequentially, taking the solution of the previous one as the initial value of the next. Since the parameters are distributed across different processes, this sequential root-finding approach clearly cannot be parallelized without modification. In the D-EE algorithm, we calculate {σ_n} in the following parallel way. First, we decompose {σ_n} into P subsets Σ_i, i = 0, · · · , P − 1. The elements of the i-th subset Σ_i are computed and stored in the i-th process. Similar to the centralized EE algorithm, we then reorder Σ_i according to the distance matrix D_i and iteratively solve the corresponding root-finding problems within the i-th process. Given the distributions of the initial data Y and of the matrices established before, the D-EE algorithm calculates {σ_n} in parallel and communication-free. The efficiency of the root-finding approach is also guaranteed by the local ordering. Since we only change the order and the initial guesses of the root-finding problems, the solutions obtained by D-EE are almost the same as those from EE. With the whole original high-dimensional data Y and the subset Σ_i, we can then compute the submatrices W^P_i and W^N_i in the i-th process. Thus, we obtain a parallel and communication-free approach to compute the matrices D, W^P, W^N.

Normalization of W^P and W^N
After computing the matrices W^P and W^N, each process sets the diagonal elements belonging to it to 0 in parallel. Then, we set W^P := W^P + (W^P)^T so that W^P becomes a symmetric matrix.
Let us denote (W^P)^T := Ŵ^P = [Ŵ^P_1, · · · , Ŵ^P_P], where the submatrix Ŵ^P_i has size M_i × N. In the i-th process, we first obtain the elements of the submatrix Ŵ^P_i from the other P − 1 processes by communication and then compute W^P_i := W^P_i + Ŵ^P_i. Here point-to-point communication happens, and the communication volume for each process is O(N²/P).
To normalize the matrices W^P and W^N, each element should be divided by the sum of all elements in the corresponding matrix. The sum of all elements of W^P is computed in parallel as follows. First, each process calculates the sum of all elements of its submatrix W^P_i independently. We denote the sum of all elements of W^P and of W^P_i as S and S_i, respectively. Then we compute S = Σ_{i=1}^P S_i through an MPI_Allgather operation. Here all-to-all communication happens, and the communication volume for each process is O(P). Finally, we normalize W^P in each process by taking W^P_i = W^P_i / S in parallel, without communication. The normalization of W^N is done in a similar way.
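The two-stage summation can be simulated serially: each "rank" sums its own row block (the local S_i), and the partial sums are then reduced, which is what the MPI collective performs across processes in the real distributed setting. This sketch is our own, not D-EE's MPI code, and it reuses an almost-equal row split.

```c
/* Serial simulation of the distributed summation over an N x N row-major
 * matrix W, split by rows across P simulated ranks. */
double distributed_sum(const double *W, int N, int P) {
    double S = 0.0;
    for (int rank = 0; rank < P; rank++) {
        int base = N / P, extra = N % P;
        int count = base + (rank < extra ? 1 : 0);
        int first = rank * base + (rank < extra ? rank : extra);
        double Si = 0.0; /* local partial sum of this rank's rows */
        for (int r = first; r < first + count; r++)
            for (int c = 0; c < N; c++) Si += W[r * N + c];
        S += Si; /* stands in for the all-to-all reduction */
    }
    return S;
}
```

Only the P partial sums cross process boundaries in the real algorithm, which is why the communication volume is O(P) rather than O(N²).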

Computation of low-dimensional embedding X
After normalizing W^P, its Laplacian L^P, which is needed for the subsequent determination of the descent direction, is computed in parallel as follows. In the i-th process, we calculate the elements l+_mn of the row block L^P_i from the corresponding rows of W^P, where l+_mn and w+_mn are the elements of the matrices L^P and W^P, respectively. Since the two matrices are partitioned by rows in the same way, the computation of L^P is also communication-free.
The low-dimensional embedding X is obtained by solving the optimization problem with the partial-Hessian optimization strategy. During the quasi-Newton procedure, a dense linear system L^P P^k = −G^k must be solved in parallel. In the D-EE algorithm, we perform an LU decomposition of L^P. Considering that L^P is positive semi-definite, but not positive definite, a small value μ is added to the diagonal of L^P in practice. In the following sections, we still use L^P to denote the adjusted matrix. The LU decomposition L^P = LU is done with PETSc, which provides uniform and efficient access to all linear system solvers in the package, including parallel and sequential, direct and iterative [? ? ?]. Here, L and U are the corresponding lower and upper triangular matrices, respectively. With the LU decomposition, the dense linear system L^P P^k = −G^k is replaced by two triangular systems, L P̂^k = −G^k and U P^k = P̂^k, which are solved by forward and backward substitution, respectively.
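The diagonal shift, the factorization, and the two triangular solves can be illustrated serially. D-EE delegates this step to PETSc in parallel; this small Doolittle LU without pivoting (`lu_solve_direction`, our own simplified sketch, adequate because the shifted Laplacian is symmetric positive definite) is only meant to show the structure of the computation.

```c
/* Solve (A + mu*I) p = -g via in-place LU and two triangular solves.
 * A is n x n row-major and is overwritten by its LU factors (unit-lower
 * L strictly below the diagonal, U on and above it). */
void lu_solve_direction(double *A, const double *g, double *p, int n, double mu) {
    for (int i = 0; i < n; i++) A[i * n + i] += mu; /* diagonal shift */
    for (int k = 0; k < n; k++)                     /* Doolittle LU */
        for (int i = k + 1; i < n; i++) {
            double f = A[i * n + k] / A[k * n + k];
            A[i * n + k] = f;
            for (int j = k + 1; j < n; j++) A[i * n + j] -= f * A[k * n + j];
        }
    for (int i = 0; i < n; i++) {                   /* forward: L z = -g */
        double s = -g[i];
        for (int j = 0; j < i; j++) s -= A[i * n + j] * p[j];
        p[i] = s;
    }
    for (int i = n - 1; i >= 0; i--) {              /* backward: U p = z */
        double s = p[i];
        for (int j = i + 1; j < n; j++) s -= A[i * n + j] * p[j];
        p[i] = s / A[i * n + i];
    }
}
```

Because the factors can be reused, only the two O(n²) triangular solves are repeated in later iterations, mirroring the reuse of L and U described below.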
As shown in Figure 1D, the partitions of L^P, L, and U are the same as that of W^P. Let us denote P^k = [P^k_1, · · · , P^k_P], where the submatrix P^k_i is stored in the i-th process, and a similar partition is performed on G^k. Based on the partitions of L^P, L, U, P^k, and G^k, the computational complexities per process of the LU decomposition and the backward substitution are O(N³/P) and O(N²/P), with corresponding communication volumes of O(N²/P) and O(N/P), respectively. According to this analysis, the LU decomposition is done only in the first iteration of the quasi-Newton method, and the matrices L and U are stored and reused during the whole quasi-Newton procedure. The gradient G^k on the right-hand side of the linear system L^P P^k = −G^k is calculated as G^k = L^k X^k, where the N × N matrix L^k depends on the matrices W^P, W^N, and Ker. Here the elements of the matrix Ker are defined as ker_mn = exp(−‖x_m − x_n‖²), and the elements of the matrix L^k are defined as l^(k)_mn = w+_mn − λ w−_mn ker^(k)_mn. As shown in Figure 1E, the partitions of the matrices L^k and Ker are the same as those of W^P. In D-EE, we store all elements of X^k in each process, as with the original high-dimensional data Y. Thus, we can compute the matrices L^k and Ker in parallel in the same way as the matrix W^N, which means this procedure is also communication-free.
After solving the linear system, we obtain the search direction P k . Then, we update X k+1 according to X k+1 = X k + α k P k , where α k is determined by a line search approach. As men-tioned before, the low-dimensional embedding X k is stored sequentially, but P k is distributed stored. Thus, we rst compute the elements of submatrix P k+1 i in the i-th process and then gather all elements of P k in each process by the all-gather function in MPI. Here all-to-all communication happens, and the order of communication volume for each process is O(Nd).
In line search steps, we need to calculate the energy function E(X, λ) several times, which is computed in parallel according to the following formula The summation included in the parentheses is calculated in each process simultaneously and then gathered by the MPI allgather function. Here all-to-all communication happens, and the communication volume for each process is O(P).

Data Description
We test the accuracy and scalability of D-EE on three datasets. The rst simulated dataset [? ], named PHATE data for convenience, consists of 1,440 samples and 60 features. It is a complex tree structure which simulates a cellular developmental process, namely, progressions, branch or split in progressions and end state of progression, composed of ten branches in total. We rst perform principal component analysis (PCA) on the original data, reserving a 1,440 samples ]. Firstly, we normalize the gene expression in each sample as follows: we divide each gene read count by the total read counts for each cell, and then multiply a scale factor of 10 4 and plus one, followed by taking a logarithmic transformation. Secondly, we select the top 2,000 variable genes using the default "vst" method of Seurat package, i.e., variance-stabilizing transformation [? ]. Finally, we conduct PCA on the processed data, and select the top 50 largest principal components, resulting a 4,423 samples × 50 features matrix as input of EE and D-EE.
The third data is a large-scale time series scRNA-seq dataset containing~250k cells [? ]. The data characterizes reprogramming of broblasts to induced pluripotent stem cells (iPSCs), which was collected at half-day intervals across 18 days, resulting in 39 time points. Since the nal time point of the iPSCs status was not annotated temporally, we therefore set the nal point as 20-th day as the input to D-TSEE. We pre-process this data with Seurat package as well. Same as the pre-process of HSPCs data, we rst lter cells and genes to include cells where at least 200 features are detected and to include genes detected in at least 50 cells, obtaining 259,081 cells and 19,427 genes. After that, we perform logarithmic transformation, select variable features and perform PCA as described in HSPCs dataset, obtaining a 259,081 samples × 50 features matrix as input of dimensionality reduction methods.

D-EE achieves high strong scaling e ciency
We evaluate D-EE using both simulated and real scRNA-seq datasets. The numerical tests are carried out on the LSSC-IV supercomputer. The 400 computing nodes of LSSC-IV are comprised of two 18-core Intel Xeon Gold CPUs with 192 GB local memory, and are interconnected via a proprietary high performance network. First, we employ PHATE data and HSPCs data to test the consistency between D-EE and EE results. We employ 36 processes in both D-EE algorithms. The low dimensions are set to 2 for the convenience of visualization for both datasets, and the parameter λ used is set to the default value 10. We use the same initialization generated from Gaussian distribution as that in the original EE Matlab code. D-EE achieves consistent 2-dimensional embedding as that of EE (Figure 2). To measure the consistency quantitatively, we calculate the relative error which is de ned by where · F is the Frobenius norm of matrix, A and B represent the output of EE and D-EE, respectively. Frobenius norm of a matrix A ∈ R m×n is de ned as Their relative errors of D-EE in HSPCs dataset and PHATE dataset are 2.42 × 10 -6 and 1.60 × 10 -6 , respectively, thus further validating the consistency of results by D-EE and EE.
To test performance of parallel e ciency of D-EE on the large-scale dataset, we apply D-EE to the iPSCs dataset (∼ 250k cells) using 500, 1,000, 2,000, and 4,000 processes, respectively, and for each setting of number of processes we run at least twice. The averaged computation times of D-EE for iPSCs dataset are 5.83 hours, 3.19 hours, 2.36 hours, and 2.02 hours, when the numbers of processes are 500, 1,000, 2,000, and 4,000, respectively. Next, we evaluate the parallel performance of D-EE based on two widely used indexes, the strong scaling speedup ratio and the parallel e ciency. The strong speedup ratio is de ned as where T s and T p are the time of computation by using single process and p processes, respectively. The parallel e ciency is de ned as The ideal strong speedup should be p, and the corresponding parallel e ciency should be 1 when p processes are used. However, it is impossible to run the data with 250k samples on a single process owing to limited memory and low e ciency of EE algorithm. Thus, we take the time of computation of 500 processes as T s with s = 500, and the parallel e ciency is then reformulated as When adopting the speedup ratio and parallel e ciency as the indexes for scaling, a strong scaling performance at remarkable speedup is observed when increasing CPUs from 500 to 4,000 processes ( Figure 3A, B).

First et al. | 5
We further evaluate the performance of D-EE on computational times in our supercomputer. Three test cases with sample sizes being 10k, 50k, and 100k are run on 500, 1,000, and 2,000 processors, respectively. It is shown that when using the same number of processes, it is naturally that the computational time increases as total sample size increases (as shown in Figure 3C).
In practice, for analysis of large-scale single-cell dataset, a workstation with multiple CPUs and large memory is recommended. Meanwhile, we also test D-EE on datasets at di erent sample sizes, and nd that D-EE e ciently implements on a typical personal computer (e.g., 8 CPU threads, 16 Gb RAM) with up to a sample size of 12k, while on a conventional workstation (e.g., 40 CPU threads, 256 Gb RAM) with up to a sample size of 48k, respectively.

D-EE and D-TSEE recover intrinsic low-dimensional structures of large-scale scRNA-seq data
We illustrate the application of D-EE and D-TSEE on the large-scale iPSCs dataset. We visualize the iPSCs data on the 2-dimensional space using t-SNE, UMAP, D-EE and D-TSEE, respectively ( Figure 4). The same pre-processed single-cell data is used for all four methods. The t-SNE is conducted by the FIt-SNE method [? ] with a PCA initialization, and we choose the learning rate as 1/12 of the sample size, according to [? ] for better preservation of the global structures. UMAP is conducted by adjusting the number of nearest neighbors (NNs) to balance the preservation of local and global structures. We nd that UMAP is not sensitive when choosing the number of NNs from 30 (defaults) to 500 (square root of the number of samples), and we thus set the number of NNs to be 100 in our study. Both t-SNE and UMAP are implemented by the Seurat software (version 3.2.1). In our implementations, both D-EE and D-TSEE are used with their default parameters.
We color the cells based on time stages on their lowdimensional embeddings obtained by the four dimensionality reduction results (Figure 4). We nd that t-SNE preserves the time lineage structure of data in an "S" shape with small gaps (

Conclusion
In this work, we develop a novel tool, D-EE, for visualizing large-scale scRNA-seq data. D-EE applies distributed storage and distributed computing techniques to a powerful nonlinear dimensionality reduction method, Elastic Embedding. The distributed computational strategies implemented in D-EE allow it to achieve not only strong scalability on large-scale datasets but also exactly the same optimization solution as the original EE, by fully utilizing the whole dataset. Numerical experiments validate the correctness and parallel efficiency of D-EE. Considering the emergence of time-series scRNA-seq data, our D-TSEE tool further allows us to efficiently perform dimensionality reduction on large-scale single-cell data by employing experimentally temporal information. Moreover, when temporal information is available, D-TSEE can reveal dynamic gene expression patterns, providing insights for subsequent analysis of molecular mechanisms and dynamic transition progression.
We demonstrate that D-EE and D-TSEE work efficiently on large-scale datasets on a supercomputer. However, the proposed distributed algorithm D-EE still incurs a heavy computational and storage cost when the number of cells is relatively large. Therefore, D-EE remains limited in handling and analyzing huge-scale datasets with cell numbers on the order of millions [? ]. In comparison, the state-of-the-art accelerated implementations of t-SNE (e.g., FIt-SNE) and UMAP have close-to-linear computational complexity, showing great efficiency on huge-scale data analysis.
In future work, to resolve the limitation of D-EE on huge-scale data, we could accelerate D-EE by adopting either the fast Fourier transform used in FIt-SNE or the state-of-the-art neural-network framework used by net-SNE [? ]. Alternatively, since huge-scale single-cell datasets can be highly redundant, we could first select a subset of informative samples using the advanced geometric sketching tool [? ] prior to applying D-EE.

Availability of source code and requirements
• Project name: D-EE
• Project home page: https://github.com/ShaokunAn/D-EE
• Operating system(s): Linux
• Programming language: C, R
• Other requirements: Multi-core processor; an MPI implementation (e.g., OpenMPI or IntelMPI) installed on each node of the cluster; a reasonably fast interconnecting infrastructure; PETSc 3.11.4 or higher
• License: GNU General Public License
• biotools: d-ee

Availability of supporting data and materials
The PHATE data supporting the results of this article is available in the GitHub repository [? ]. The iPSCs data is available in the NCBI repository under accession number GSE122662 [? ]. The HSPCs data is available in the NCBI repository under accession number GSE72857, and the dataset used in our study was downloaded from https://github.com/ManuSetty/wishbone.

Figure 1. Overview of the D-EE algorithm. The D-EE algorithm can be decomposed into five parts. Part A: the computation of matrixes D and W^P. To obtain W^P, the parameters {σ_n} are determined by solving a series of root-finding problems. Part B: the symmetrization and normalization of the matrix W^P. Part C: the computation and normalization of W^N. Part D: the computation of the Laplacian matrix L^P together with its LU decomposition. Part E: the computation of X^{k+1} by iteratively solving an optimization problem with classic Quasi-Newton methods. In the k-th iteration, L^k is computed based on W^P, W^N, and X^k, which are used to obtain the gradient G^k = L^k X^k. The descent direction P^k is then determined by solving a linear system L^P P^k = -G^k. Finally, X^{k+1} is updated according to X^{k+1} = X^k + α_k P^k, where the step size α_k is calculated by a line search method.

Consent for publication
Not applicable.

Competing Interests
The authors declare that they have no competing interests.

Background
The advent of single-cell sequencing provides high-dimensional profiles of cellular states at single-cell resolution (e.g., single-cell RNA sequencing (scRNA-seq) of transcriptomes), offering the opportunity to unveil intrinsic biological processes and mechanisms. Dimensionality reduction and visualization methods have been extensively studied, as they play vital roles in revealing the intrinsic structures underlying high-dimensional scRNA-seq data [1]. Nonetheless, it remains challenging for state-of-the-art dimensionality reduction and visualization methods to preserve both the local and global structures of data in low-dimensional space. For example, the celebrated t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm [2] is widely used in the single-cell community [3]. It emphasizes the preservation of local structures, but it often distorts global structures [4,5,6]. As a solution, the Uniform Manifold Approximation and Projection (UMAP) algorithm [7] was developed with the aim of preserving global structures, drawing increasing attention in the single-cell data analysis community [8]. However, a recent study showed that UMAP does not improve upon t-SNE in this regard when using the same initialization [9], making the validity of UMAP debatable.
Compiled on: September 29, 2020. Draft manuscript prepared by the author.
In contrast, Elastic Embedding (EE), a nonlinear dimensionality reduction method, attempts to preserve both the local and global structures underlying the data [4]. To achieve this goal, EE penalizes the placement of latent points in close proximity when their corresponding data points are dissimilar in high-dimensional space, thus resolving the difficulty of global structure preservation (see [4], or Methods, for details). EE has attracted increasing interest among statistical researchers [10]. It has also shown remarkable performance in visualizing the intrinsic structures of scRNA-seq data [1,5,11]. However, the current implementations of the EE algorithm do not scale with sample size N (e.g., the number of cells); for example, storing the attractive and repulsive weight matrixes of the EE algorithm requires O(N^2) memory. Thus, EE cannot be used for large-scale scRNA-seq datasets. Therefore, we present a distributed optimization implementation of EE, termed D-EE. D-EE not only reveals the low-dimensional intrinsic structures of data with the same accuracy as EE, but is also scalable to large-scale scRNA-seq data. It leverages distributed storage and distributed computation, achieving memory efficiency and high-performance computing simultaneously (Figure 1). In addition, a distributed optimization implementation of the time-series EE (TSEE) algorithm [11], termed D-TSEE, is also provided for visualizing large-scale time-series scRNA-seq data. In this study, we demonstrate the power of D-EE and D-TSEE on both simulated and real data. Both D-EE and D-TSEE (1) achieve the same accuracy as EE and TSEE, respectively, and (2) attain high strong-scaling performance on large-scale datasets.

Elastic Embedding algorithm
EE was proposed in [4]. It optimizes an energy function containing attractive and repulsive terms.
Given N samples Y = {y_1, y_2, ..., y_N}, where y_i ∈ R^D represents the high-dimensional coordinates, the goal of EE is to map the data from the high-dimensional space onto a low-dimensional representation X = {x_1, x_2, ..., x_N} with x_i ∈ R^d and d ≪ D by minimizing the energy function

$$E(X, \lambda) = \sum_{n,m=1}^{N} w^{+}_{nm} \|x_n - x_m\|^2 + \lambda \sum_{n,m=1}^{N} w^{-}_{nm} \exp(-\|x_n - x_m\|^2),$$

where $w^{+}_{nm} = \exp(-\|y_n - y_m\|^2 / (2\sigma_n^2))$ and $w^{-}_{nm} = \|y_n - y_m\|^2$. The first term acts as an attractive force to preserve local distances, while the second term acts as a repulsive force to preserve global structures, i.e., to separate latent points. The parameter λ ∈ R^+ trades off the two terms, and a larger value implies that preservation of global structures is more important. In single-cell data analysis, EE achieves robust performance with high accuracy at λ = 10 [5]. We therefore set the default value of λ to 10 in D-EE.
The σ_n in w^+_{nm} is a sample-specific scaling parameter. It is estimated adaptively by solving a sample-specific root-finding problem, such that the sample-specific distribution over its neighbors has a desired perplexity (see [12] for details). We set the default perplexity to 20 in this study. It is worth noting that a recently proposed combinational perplexity has been applied to t-SNE [3], greatly enhancing the performance of t-SNE in preserving global structures; the combinational perplexity could also be adopted by D-EE in a future update.
An extension of EE, TSEE [11], was recently proposed to handle dimensionality reduction of time-series scRNA-seq data. It works by minimizing an EE-type energy function augmented with pairwise temporal dissimilarities, where t_{nm} represents the dissimilarity in time between pairwise points, and β trades off the weights between dissimilarity in time stages and in expression space.

Numerical optimization of EE
Since the optimization procedure of TSEE is essentially the same as that of EE, we only give the numerical solution of EE. First, we denote W^P = {w^+_{nm}} and W^N = {w^-_{nm}}. Owing to the sample-specific parameters {σ_n}, W^P is not symmetric, so we symmetrize it by taking W^P := W^P + (W^P)^T. Next, the diagonal elements of W^P and W^N are set to zero. Finally, each element is normalized by dividing by the sum of all elements in the matrix.
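These three steps (symmetrize, zero the diagonal, normalize by the total sum) can be sketched for a dense row-major matrix as follows; this is a serial illustration, not the distributed code:

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Symmetrize W := W + W^T, zero the diagonal, then normalize so that
 * all entries sum to one. W is row-major N x N. */
void symmetrize_normalize(double *W, size_t N) {
    for (size_t n = 0; n < N; n++)
        for (size_t m = n + 1; m < N; m++) {
            double s = W[n * N + m] + W[m * N + n];
            W[n * N + m] = s;
            W[m * N + n] = s;
        }
    for (size_t n = 0; n < N; n++) W[n * N + n] = 0.0;
    double total = 0.0;
    for (size_t k = 0; k < N * N; k++) total += W[k];
    for (size_t k = 0; k < N * N; k++) W[k] /= total;
}
```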
To solve the optimization problem, classic Quasi-Newton methods update X^{k+1} according to X^{k+1} = X^k + α_k P^k in the k-th iteration, where α_k is the step length determined by a line search procedure, and P^k is the search direction obtained by solving the system B^k P^k = -G^k. Here, B^k is positive definite to guarantee descent of the objective function, and G^k = L^k X^k is the gradient of the objective function in the k-th iteration. These procedures are repeated until a termination criterion is satisfied. In general, B^k also needs to be updated in each iteration.
When optimizing EE-like problems, a technique termed the Partial-Hessian optimization strategy has been proposed that employs partial information of the Hessian, namely L^P [13], which is the graph Laplacian of W^P and is invariant across iterations. This invariance makes it possible to utilize preconditioning approaches, e.g., LU decomposition, to improve computational efficiency. The effectiveness of the resulting direction, called the Spectral Direction, has been validated experimentally in previous work [13].

D-EE algorithm
We provide a distributed optimization implementation of EE, termed D-EE. An overview of the proposed D-EE algorithm is given in Figure 1. Throughout the optimization, multiple processes are employed for computation and storage. In Figure 1, two processes, P_0 and P_1, are taken as an example. To achieve high computing performance and memory efficiency simultaneously, our distributed algorithm divides the data (W^P, W^N, and G^k) by rows across the assigned processes. To avoid frequent communication, the whole original high-dimensional data Y is read and stored in each process, and the low-dimensional embedding X is kept in each process as well, since the storage consumed by Y and X is much smaller than that of the other N × N matrixes used during computation. It is worth noting that, since the computation of each row of one matrix generally depends only on the same row of the other matrixes (see the approximate computational complexity of D-EE in the following sections for details), the partition we design in the D-EE algorithm is nearly optimal in parallel computing, as it optimally balances computation and communication. On the one hand, the total computational cost of the D-EE algorithm is almost the same as that of the centralized EE algorithm. On the other hand, most procedures in the D-EE algorithm are communication-free, as shown by the black arrows in Figure 1. Even for the procedures that do require communication, shown by the blue arrows in Figure 1, the communication volume is of a much lower order than the cost of computation.

Computation of matrixes D, W P , W N
As mentioned before, the matrixes W^P and W^N depend on the high-dimensional data Y and {σ_n}_{n=1}^N. Since each σ_n is obtained by solving a root-finding problem on the n-th row of the distance matrix D, each matrix is equally, or almost equally, partitioned into multiple nonoverlapping parts by rows and stored across the processes, as shown in Figure 1A. Let us denote D = [D_1, ..., D_P], where the submatrix D_i of size M_i × N is stored in the i-th process, and P is the number of processes used; here [· · ·] denotes stacking the blocks vertically. Similar notation is used for the other N × N matrixes. Since each row of the matrixes D, W^P, and W^N depends on all of the original high-dimensional data Y, we load a copy of Y into each process to avoid frequent communication.
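The text does not spell out how the row counts M_i are chosen; a standard balanced block partition (an assumption on our part, in the style of typical PETSc row distributions) can be sketched as:

```c
#include <assert.h>
#include <stddef.h>

/* Balanced block partition of N rows over P processes: the first N % P
 * processes get one extra row. Fills row_start and row_count (= M_i)
 * for process i. */
void block_partition(size_t N, size_t P, size_t i,
                     size_t *row_start, size_t *row_count) {
    size_t base = N / P, extra = N % P;
    *row_count = base + (i < extra ? 1 : 0);
    *row_start = i * base + (i < extra ? i : extra);
}
```

Adjacent blocks tile [0, N) exactly, so no row is stored twice and none is dropped.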
In the centralized implementation of EE, the parameters σ_n, n = 1, ..., N, are calculated by iteratively solving a sequence of root-finding problems. The iteration for these root-finding problems is improved by reordering the computation of {σ_n}_{n=1}^N according to the distances among all samples (Y), i.e., the complete distance matrix D [12]. The reordered root-finding problems are then solved sequentially, taking the solution of the previous problem as the initial value of the next. Since the parameters are distributed across different processes, this sequential root-finding approach cannot be parallelized without modification. In the D-EE algorithm, we calculate {σ_n}_{n=1}^N in parallel as follows. First, we decompose {σ_n}_{n=1}^N into P subsets Σ_i = {σ_n}_{n=M_i+1}^{M_{i+1}} with i = 0, ..., P-1. The elements of the i-th subset Σ_i are computed and stored in the i-th process. Similar to the centralized EE algorithm, we then reorder Σ_i according to the distance matrix D_i and iteratively solve the corresponding root-finding problems within the i-th process. Given the distributions of the initial data Y and the matrixes established before, the D-EE algorithm calculates {σ_n}_{n=1}^N in parallel and communication-free. The efficiency of the root-finding approach is also preserved by the local ordering. Since we only change the order and initial guesses of the root-finding problems, the solutions obtained by D-EE are almost the same as those of EE. With the whole original high-dimensional data Y and the subset Σ_i, we can then compute the submatrixes W^P_i and W^N_i in the i-th process. Thus, we obtain a parallel and communication-free approach to compute the matrixes D, W^P, and W^N.

Normalization of W P and W N
After computing the matrixes W^P and W^N, each process sets its own diagonal elements to 0 in parallel. Then, we take W^P := W^P + (W^P)^T so that W^P becomes symmetric.
Let us denote (W^P)^T := Ŵ^P = [Ŵ^P_1, ..., Ŵ^P_P], where the submatrix Ŵ^P_i has size M_i × N. In the i-th process, we first obtain the elements of the submatrix Ŵ^P_i from the other P-1 processes by communication and then compute W^P_i := W^P_i + Ŵ^P_i. Here point-to-point communication happens, and the communication volume for each process is O(N^2/P). To normalize the matrixes W^P and W^N, each element is divided by the sum of all elements in the matrix. The sum of all elements in W^P is computed in parallel as follows. First, each process calculates the sum of all elements in its submatrix W^P_i independently; denote the sums of all elements in W^P and W^P_i by S and S_i, respectively. Then the total sum is obtained as $S = \sum_{i=1}^{P} S_i$ through an MPI_Allgather operation. Here all-to-all communication happens, and the communication volume for each process is O(P). Finally, we normalize W^P in each process by taking W^P_i := W^P_i / S in parallel, without communication. The normalization of W^N is done in a similar way.
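A serial stand-in for this two-stage normalization, with each "process" reduced to a row block and the MPI_Allgather replaced by an in-memory combine, might look like:

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Sum of all entries in one process's row block (count rows of width N):
 * the local S_i of the normalization step. */
double block_sum(const double *rows, size_t count, size_t N) {
    double s = 0.0;
    for (size_t k = 0; k < count * N; k++) s += rows[k];
    return s;
}

/* Combine the gathered partial sums into S = sum_i S_i (the step done via
 * MPI_Allgather in D-EE) and scale the local block by 1/S. */
double normalize_block(double *rows, size_t count, size_t N,
                       const double *partials, size_t P) {
    double S = 0.0;
    for (size_t i = 0; i < P; i++) S += partials[i];
    for (size_t k = 0; k < count * N; k++) rows[k] /= S;
    return S;
}
```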

Computation of low-dimensional embedding X
After normalizing W^P, its Laplacian L^P, which is needed for the subsequent determination of the descent direction, is computed in parallel as follows. In the i-th process, we calculate the elements of the submatrix L^P_i via $l^{+}_{mn} = \delta_{mn} \sum_{k=1}^{N} w^{+}_{mk} - w^{+}_{mn}$, where δ_{mn} is the Kronecker delta and l^+_{mn} and w^+_{mn} are elements of the matrixes L^P and W^P, respectively. Since the two matrixes are partitioned by rows in the same way, the computation of L^P is also communication-free.
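Computing one locally stored row of the Laplacian needs only the same row of W^P, which is why the step is communication-free. A minimal sketch (row-major layout assumed):

```c
#include <assert.h>
#include <stddef.h>

/* Row m of the graph Laplacian L = diag(W 1) - W:
 * given row m of W (length N), write row m of L into Lrow. */
void laplacian_row(const double *Wrow, double *Lrow, size_t N, size_t m) {
    double deg = 0.0;
    for (size_t k = 0; k < N; k++) deg += Wrow[k];   /* row sum (degree) */
    for (size_t n = 0; n < N; n++) Lrow[n] = -Wrow[n];
    Lrow[m] += deg;                                  /* diagonal term */
}
```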
The low-dimensional embedding X is obtained by solving the optimization problem with the Partial-Hessian optimization strategy. During the Quasi-Newton procedure, a dense linear system L^P P^k = -G^k must be solved in parallel. In the D-EE algorithm, we perform an LU decomposition of L^P. Because L^P is positive semi-definite but not positive definite, a small value μ is added to its diagonal in practice; in the following sections, we still use L^P to denote the adjusted matrix. The LU decomposition L^P = LU is done with PETSc, which provides uniform and efficient access to all linear system solvers in the package, both parallel and sequential, direct and iterative [14,15,16]. Here, L and U are the corresponding lower and upper triangular matrixes, respectively. With the LU decomposition, the dense linear system L^P P^k = -G^k is replaced by two triangular systems, $L \tilde{P}^k = -G^k$ and $U P^k = \tilde{P}^k$, which are solved by forward and backward substitution, respectively.
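D-EE delegates the factorization and triangular solves to PETSc; purely as a serial illustration of the same algebra (the diagonal shift μ followed by a Doolittle LU without pivoting, which is adequate once L^P + μI is positive definite), one might write:

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* In-place Doolittle LU factorization with diagonal shift mu, no pivoting.
 * A is row-major n x n; on exit the strict lower triangle holds L (unit
 * diagonal implied) and the upper triangle holds U. */
void lu_decompose(double *A, size_t n, double mu) {
    for (size_t i = 0; i < n; i++) A[i * n + i] += mu;  /* A := A + mu*I */
    for (size_t k = 0; k < n; k++)
        for (size_t i = k + 1; i < n; i++) {
            double f = A[i * n + k] / A[k * n + k];
            A[i * n + k] = f;
            for (size_t j = k + 1; j < n; j++)
                A[i * n + j] -= f * A[k * n + j];
        }
}

/* Solve (L U) x = b by forward then backward substitution. */
void lu_solve(const double *A, size_t n, const double *b, double *x) {
    for (size_t i = 0; i < n; i++) {           /* L y = b (unit diagonal) */
        double s = b[i];
        for (size_t j = 0; j < i; j++) s -= A[i * n + j] * x[j];
        x[i] = s;
    }
    for (size_t i = n; i-- > 0; ) {            /* U x = y */
        double s = x[i];
        for (size_t j = i + 1; j < n; j++) s -= A[i * n + j] * x[j];
        x[i] = s / A[i * n + i];
    }
}
```

Because the factorization is reused across Quasi-Newton iterations, only the cheap substitutions are repeated, mirroring the cost analysis below.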
As shown in Figure 1D, L^P, L, and U are partitioned in the same way as W^P. Let us denote P^k = [P^k_1, ..., P^k_P], where the submatrix P^k_i of size M_i × d is stored in the i-th process, and a similar partition is applied to G^k. Based on the partitions of L^P, L, U, P^k, and G^k, the per-process computational complexities of the LU decomposition and of the substitutions are O(N^3/P) and O(N^2/P), with corresponding communication volumes of O(N^2/P) and O(N/P), respectively. Note that the LU decomposition is performed only in the first iteration of the Quasi-Newton method; the matrixes L and U are stored and reused throughout the Quasi-Newton procedure. The gradient G^k on the right-hand side of the linear system L^P P^k = -G^k is calculated as G^k = L^k X^k, where the N × N matrix L^k depends on the matrixes W^P, W^N, and Ker. Here the elements of the matrix Ker are defined as $\mathrm{ker}^{(k)}_{mn} = \exp(-\|x_m - x_n\|^2)$, and the elements of L^k are defined as $l^{(k)}_{mn} = w^{+}_{mn} - \lambda w^{-}_{mn} \mathrm{ker}^{(k)}_{mn}$. As shown in Figure 1E, the matrixes L^k and Ker are partitioned in the same way as W^P. In D-EE, all elements of X^k are stored in each process, just like the original high-dimensional data Y. Thus, we can compute the matrixes L^k and Ker in parallel in the same way as the matrix W^N, so this procedure is also communication-free.
After solving the linear system, we obtain the search direction P^k and update X^{k+1} = X^k + α_k P^k, where α_k is determined by a line search. As mentioned before, the low-dimensional embedding X^k is stored in full in each process, whereas P^k is stored in a distributed manner. Thus, we first compute the elements of the submatrix P^k_i in the i-th process and then gather all elements of P^k in every process with MPI's all-gather function. Here all-to-all communication happens, and the communication volume for each process is of order O(Nd).
In the line search steps, we need to evaluate the energy function E(X, λ) several times, which is computed in parallel according to

$$E(X, \lambda) = \sum_{i=1}^{P} \Big( \sum_{n \in I_i} \sum_{m=1}^{N} \big[ w^{+}_{nm} \|x_n - x_m\|^2 + \lambda w^{-}_{nm} \exp(-\|x_n - x_m\|^2) \big] \Big),$$

where I_i denotes the set of row indexes assigned to the i-th process. The summation in parentheses is calculated in each process simultaneously, and the partial sums are then gathered by the MPI all-gather function. Here all-to-all communication happens, and the communication volume for each process is O(P).

Data Description
We test the accuracy and scalability of D-EE on three datasets. The first, simulated dataset [17], named the PHATE data for convenience, consists of 1,440 samples and 60 features. It has a complex tree structure that simulates a cellular developmental process (progressions, branches or splits in progressions, and end states of progression), comprising ten branches in total. We first perform principal component analysis (PCA) on the original data, retaining a 1,440 samples × 7 features matrix.
The second dataset characterizes the process of mouse hematopoietic stem and progenitor cells (HSPCs) bifurcating into myeloid and erythroid precursors [18], and consists of 4,423 samples. The single-cell read count data is preprocessed with the Seurat package (version 3.2.1) [19,20]. First, we normalize the gene expression in each sample as follows: we divide each gene read count by the total read count of the cell, multiply by a scale factor of 10^4, add one, and take a logarithmic transformation. Second, we select the top 2,000 variable genes using the default "vst" method of the Seurat package, i.e., variance-stabilizing transformation [21]. Finally, we conduct PCA on the processed data and select the top 50 principal components, resulting in a 4,423 samples × 50 features matrix as input to EE and D-EE.
The third dataset is a large-scale time-series scRNA-seq dataset containing ~250k cells [22]. The data characterizes the reprogramming of fibroblasts into induced pluripotent stem cells (iPSCs), collected at half-day intervals across 18 days, resulting in 39 time points. Since the final iPSC time point was not annotated temporally, we set it to day 20 as input to D-TSEE. We preprocess this data with the Seurat package as well. As for the HSPCs data, we first filter cells and genes, keeping cells in which at least 200 features are detected and genes detected in at least 50 cells, which yields 259,081 cells and 19,427 genes. We then perform the logarithmic transformation, select variable features, and perform PCA as described for the HSPCs dataset, obtaining a 259,081 samples × 50 features matrix as input to the dimensionality reduction methods.

D-EE achieves high strong-scaling efficiency
We evaluate D-EE using both simulated and real scRNA-seq datasets. The numerical tests are carried out on the LSSC-IV supercomputer. Each of the 400 computing nodes of LSSC-IV comprises two 18-core Intel Xeon Gold CPUs with 192 GB of local memory, and the nodes are interconnected via a proprietary high-performance network. First, we employ the PHATE and HSPCs data to test the consistency between the D-EE and EE results. We run D-EE with 36 processes on both datasets. The embedding dimension is set to 2 for the convenience of visualization, and the parameter λ is set to its default value of 10. We use the same Gaussian-distributed initialization as in the original EE Matlab code. D-EE achieves 2-dimensional embeddings consistent with those of EE (Figure 2). To measure the consistency quantitatively, we calculate the relative error, defined as

$$\mathrm{relErr} = \frac{\|A - B\|_F}{\|A\|_F},$$

where $\|\cdot\|_F$ is the Frobenius norm and A and B are the outputs of EE and D-EE, respectively. The Frobenius norm of a matrix $A \in \mathbb{R}^{m \times n}$ is defined as $\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2}$. The relative errors of D-EE on the HSPCs and PHATE datasets are 2.42 × 10^-6 and 1.60 × 10^-6, respectively, further validating the consistency between the results of D-EE and EE.
To test the parallel efficiency of D-EE on the large-scale dataset, we apply D-EE to the iPSCs dataset (~250k cells) using 500, 1,000, 2,000, and 4,000 processes, running each setting at least twice. The average computation times of D-EE on the iPSCs dataset are 5.83, 3.19, 2.36, and 2.02 hours for 500, 1,000, 2,000, and 4,000 processes, respectively. Next, we evaluate the parallel performance of D-EE using two widely used indexes, the strong-scaling speedup ratio and the parallel efficiency. The strong-scaling speedup ratio is defined as $S_p = T_s / T_p$, where T_s and T_p are the computation times using a single process and p processes, respectively. The parallel efficiency is defined as $E_p = S_p / p$. The ideal strong-scaling speedup is p, with a corresponding parallel efficiency of 1 when p processes are used. However, it is impossible to run the data with 250k samples on a single process, owing to limited memory and the low efficiency of the EE algorithm. Thus, we take the computation time with 500 processes as T_s with s = 500, and the parallel efficiency is then reformulated as $E_p = \frac{s\,T_s}{p\,T_p}$. With the speedup ratio and parallel efficiency as scaling indexes, strong-scaling performance with remarkable speedup is observed when increasing from 500 to 4,000 processes (Figure 3A, B).
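Plugging the reported timings into these definitions gives, for example, an efficiency of about 0.91 at 1,000 processes relative to the 500-process baseline. A sketch:

```c
#include <assert.h>
#include <math.h>

/* Strong-scaling speedup relative to a baseline time Ts. */
double speedup(double Ts, double Tp) {
    return Ts / Tp;
}

/* Parallel efficiency relative to an s-process baseline:
 * E_p = (s * Ts) / (p * Tp). */
double efficiency(double s, double Ts, double p, double Tp) {
    return (s * Ts) / (p * Tp);
}
```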

We further evaluate the computation time of D-EE on our supercomputer. Three test cases with sample sizes of 10k, 50k, and 100k are run on 500, 1,000, and 2,000 processes, respectively. As expected, when using the same number of processes, the computation time increases with the total sample size (Figure 3C).
In practice, for the analysis of large-scale single-cell datasets, a workstation with multiple CPUs and large memory is recommended. We also test D-EE on datasets of different sample sizes and find that D-EE runs efficiently on a typical personal computer (e.g., 8 CPU threads, 16 GB RAM) up to a sample size of about 12k, and on a conventional workstation (e.g., 40 CPU threads, 256 GB RAM) up to a sample size of about 48k.
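These sample-size limits are consistent with a back-of-the-envelope memory estimate: EE keeps several dense n × n matrices (e.g., attractive and repulsive affinities and pairwise distances) in double precision, so memory grows quadratically with the number of cells. The sketch below is a rough estimate under the assumption of roughly five resident dense matrices, not an exact accounting of D-EE's allocations.

```python
def dense_matrix_gib(n, bytes_per_entry=8):
    """Memory footprint of one dense n x n matrix of doubles, in GiB."""
    return n * n * bytes_per_entry / 2**30

# Assumed: about 5 dense n x n matrices resident during optimization
for n, ram_gib in [(12_000, 16), (48_000, 256)]:
    total = 5 * dense_matrix_gib(n)
    print(f"n={n:6d}: ~{total:5.1f} GiB for 5 dense matrices "
          f"(machine has {ram_gib} GiB)")
```

At n = 12k this is a few GiB (fits in 16 GB with room for the OS and other buffers); at n = 48k it is on the order of 85 GiB, which a 256 GB workstation can hold but a laptop cannot.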

D-EE and D-TSEE recover intrinsic low-dimensional structures of large-scale scRNA-seq data
We illustrate the application of D-EE and D-TSEE on the large-scale iPSCs dataset. We visualize the iPSCs data in 2-dimensional space using t-SNE, UMAP, D-EE, and D-TSEE, respectively (Figure 4). The same pre-processed single-cell data is used for all four methods. t-SNE is conducted with the FIt-SNE method [23] using a PCA initialization, and we set the learning rate to 1/12 of the sample size, following [3], for better preservation of global structure. For UMAP, we adjust the number of nearest neighbors (NNs) to balance the preservation of local and global structure. We find that UMAP is not sensitive to the number of NNs between 30 (the default) and 500 (the square root of the number of samples), and we thus set the number of NNs to 100 in our study. Both t-SNE and UMAP are run through the Seurat software (version 3.2.1). Both D-EE and D-TSEE are used with their default parameters.
We color the cells by time stage on the low-dimensional embeddings obtained by the four dimensionality reduction methods (Figure 4). We find that t-SNE preserves the time-lineage structure of the data in an "S" shape with small gaps (Figure 4, upper-left panel); UMAP also shows a time-lineage structure, but with a large gap between time stages 5.5 and 6 (Figure 4, upper-right panel). In contrast, both D-EE and D-TSEE preserve continuous time-lineage structures in the low-dimensional space (Figure 4, lower panels).
We further explore the expression patterns of the genes Sox2, Sox4, and Nanog on both the D-EE and D-TSEE embeddings (Figure 5). These genes are key regulators of stem cell differentiation and reprogramming [24,25,26]. Previous studies have shown evidence that these genes may oscillate during developmental progression [11,26]. These genes display oscillatory expression patterns in the early stage of iPSCs in the D-TSEE view (Figure 5), providing useful information and clues for downstream analysis.

Conclusion
In this work, we develop a novel tool, D-EE, for visualizing large-scale scRNA-seq data. D-EE applies distributed storage and distributed computing techniques to a powerful nonlinear dimensionality reduction method, elastic embedding. The distributed computational strategies implemented in D-EE allow it to achieve not only strong scalability on large-scale datasets but also exactly the same optimization solution as the original EE, by fully utilizing the whole dataset. Numerical experiments validate the correctness and parallel efficiency of D-EE. Considering the emergence of time-series scRNA-seq data, our D-TSEE tool allows us to efficiently perform dimensionality reduction on large-scale single-cell data by employing experimentally obtained temporal information. When such temporal information is available, D-TSEE can reveal dynamic gene expression patterns, providing insights for subsequent analysis of molecular mechanisms and dynamic transition processes.
We demonstrate that D-EE and D-TSEE work efficiently on large-scale datasets on a supercomputer. However, the proposed distributed algorithm D-EE still suffers from large computational and storage costs when the number of cells is very large. Therefore, D-EE is limited in handling huge-scale datasets with numbers of cells on the order of millions [27]. In comparison, the state-of-the-art accelerated implementations of t-SNE (e.g., FIt-SNE) and UMAP have close-to-linear computational complexity, showing great efficiency on huge-scale data.
In future work, to resolve the limitation of D-EE on huge-scale data, we could accelerate D-EE by adopting either the fast Fourier transform as used in FIt-SNE, or the neural-network framework used by net-SNE [28]. On the other hand, since huge-scale single-cell datasets can be highly redundant, we could also select a subset of informative samples using the geometric sketching tool [29] prior to applying D-EE.

Availability of source code and requirements
• Project name: D-EE
• Project home page: https://github.com/ShaokunAn/D-EE
• Operating system(s): Linux
• Programming language: C, R
• Other requirements: multi-core processor; an MPI implementation (e.g., OpenMPI or IntelMPI) installed on each node of the cluster; a reasonably fast interconnect; PETSc 3.11.4 or higher
• License: GNU General Public License
• biotools: d-ee

Availability of supporting data and materials
The PHATE data supporting the results of this article are available in the GitHub repository [17]. The iPSCs data are available in the NCBI repository under accession number GSE122662 [22]. The HSPCs data are available in the NCBI repository under accession number GSE72857, and the dataset used in our study was downloaded from https://github.com/ManuSetty/wishbone.

D-EE solves the EE optimization problem with a classic quasi-Newton method iteratively. In the k-th iteration, L_k is computed based on W_P, W_N, and X_k, and is used to obtain the gradient G_k = L_k X_k. The descent direction P_k is then determined by solving the linear system L_P P_k = −G_k. Finally, X_{k+1} is updated according to X_{k+1} = X_k + α_k P_k, where the step size α_k is calculated by a line search method.
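The iteration sketched above can be illustrated on a toy problem in NumPy. This is a dense, single-process re-implementation for exposition only, not the distributed PETSc code: the graph-Laplacian construction, the small ridge added to make L_P solvable, and the Armijo backtracking line search are standard choices assumed here, and constant factors in the gradient are absorbed by the step size.

```python
import numpy as np

def laplacian(W):
    """Graph Laplacian L = D - W of a symmetric weight matrix."""
    return np.diag(W.sum(axis=1)) - W

def ee_objective(X, Wp, Wn, lam):
    """EE objective: attractive term + lam * repulsive (Gaussian decay) term."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return (Wp * D2).sum() + lam * (Wn * np.exp(-D2)).sum()

def ee_iteration(X, Wp, Wn, lam=10.0, eps=1e-4):
    """One update: build L_k, gradient G_k = L_k X_k, solve L_P P_k = -G_k, line search."""
    n = X.shape[0]
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    Lk = laplacian(Wp - lam * Wn * np.exp(-D2))  # L_k depends on W_P, W_N and X_k
    G = Lk @ X                                   # G_k = L_k X_k (constants absorbed)
    Lp = laplacian(Wp) + eps * np.eye(n)         # SPD system matrix built from W_P
    P = np.linalg.solve(Lp, -G)                  # descent direction P_k
    alpha, f0, slope = 1.0, ee_objective(X, Wp, Wn, lam), (G * P).sum()
    while ee_objective(X + alpha * P, Wp, Wn, lam) > f0 + 1e-4 * alpha * slope:
        alpha *= 0.5                             # backtracking (Armijo) line search
        if alpha < 1e-12:
            break
    return X + alpha * P                         # X_{k+1} = X_k + alpha_k P_k

# Toy run: random symmetric affinities and a random 2-D initialization
rng = np.random.default_rng(0)
n = 30
Wp = rng.random((n, n)); Wp = (Wp + Wp.T) / 2; np.fill_diagonal(Wp, 0)
Wn = rng.random((n, n)); Wn = (Wn + Wn.T) / 2; np.fill_diagonal(Wn, 0)
X = rng.normal(size=(n, 2))
before = ee_objective(X, Wp, Wn, 10.0)
after = ee_objective(ee_iteration(X, Wp, Wn), Wp, Wn, 10.0)
```

In D-EE itself these dense objects are distributed across processes and the linear system is handled by PETSc; the toy loop only shows the per-iteration logic.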

Declarations
We are pleased to submit our revised manuscript "D-EE: software for visualizing low-dimensional intrinsic structures of large-scale single-cell RNA-sequencing data" for consideration by GigaScience.
Thank you for your valuable suggestion. We have registered our proposed software with both the bio.tools and SciCrunch.org databases, and provided the biotools identifiers in our revised manuscript.
We are also very grateful for the referees' insightful comments and suggestions, which were very helpful for improving the research and revising the manuscript. The manuscript has been fully revised following the reviewers' comments and suggestions, making our results more solid and convincing. We summarize the major changes as follows: 1) As suggested by Reviewers #1 and #2, in the revised manuscript we tuned the parameters of t-SNE and UMAP and provided detailed implementation information for both. We found that their performance can be further improved, especially for t-SNE (see the first paragraph of the subsection "D-EE and D-TSEE recover intrinsic low-dimensional structures of large-scale scRNA-seq data" on Page 5). We also toned down our claims and removed the conclusion that "EE outperforms t-SNE and UMAP".