GPseudoRank: MCMC for sampling from posterior distributions of pseudo-orderings using Gaussian processes

A number of previous approaches to pseudotime estimation have provided point estimates of the ordering of cells for scRNA-seq data, while more recently, Gaussian process latent variable models and MCMC methods have been applied to understand the uncertainty associated with the pseudotemporal ordering. We present a new type of Gaussian process latent variable model for pseudotemporal ordering, which samples a distribution on the probability space of the orderings, that is on the group of permutations, rather than on the hugely high-dimensional vector space of possible pseudotimes, as done by previous models. We determine the best proposal distribution for our Metropolis-Hastings sampler for different types of data in an extensive simulation study, and show on a microarray data set that it is both able to capture complicated posterior distributions with modes close to pseudotime estimates found by state-of-the-art methods for point estimation of pseudotime orderings, and identify a global maximum of the distribution close to the true order. Finally, in an application to scRNA-seq data we demonstrate the particular potential of our method to identify phases of lower and higher pseudotime uncertainty during a biological process. Software in the form of Matlab code, together with sample input data sets, is available on request from the first author.


Introduction
In order to obtain information about the changes of gene activity during a process such as a response to an infection, or the transition between cellular states, the study of time trajectories of mRNA expression levels helps us infer which genes regulate the process at various stages. For instance, we might be interested whether up-or down-regulation of specific genes in mRNA expression occurs early or late in the process.
Single cell RNA sequencing (scRNA-seq) provides mRNA expression levels of genes for individual cells. For a description of the technical aspects of scRNAseq, see [18]. scRNA-seq has shown that gene expression across cells is heterogeneous in many situations and contexts, part of which is attributable to technical noise and part of which is genuine cell hetereogeneity. See, for instance, [6,40].
As the cells are destroyed as a result of the measurement process during scRNAseq, the method only provides a single measurement per cell [37]. Therefore, it is not possible to obtain time series data following the development of one single cell. Typically, we obtain measurements for large numbers of cells at the same time point, or at a few different capture times. However, individual cells progress through changes at different time scales [39], and it is possible to obtain a form of time series data even from cross-sectional data by statistical means, an approach referred to as pseudotemporal ordering. Previous approaches to pseudotemporal ordering are described in Section 2.  There are two types of pseudotime algorithms, the first estimating the ordering of the cells in terms of a permutation of the set {1, . . . T }, where T is the number of cells, the second providing estimates of the developmental stage of each individual cell in terms of a continuous pseudotime parameter. The proposed method, GPseudoRank is the first fully Bayesian pseudotime method to sample from a posterior distribution of permutations rather than continuous pseudotimes, capturing uncertainty and even bi-and multi-modality of the orderings in terms of permutations.

Background
There are a number of previous approaches to pseudotemporal ordering. Most of them are based on considering cells as elements in the space R ng , where n g is the number of genes. The gene expression measurements for each cell are represented as a vector in R ng . For review papers on existing pseudotemporal methods see [3,9].
Wanderlust [4] first computes a k nearest neighbours graph for the highdimensional data, which connects cells with short distances in R ng , that is cells with similar gene expression profiles. Then it repeatedly applies a randomised shortest path algorithm that orders the cells, in order to obtain an average pseudotime for each cell.
Wishbone [32], based on Wanderlust, uses diffusion maps [11] to reduce the dimensionality of the data, reduce noise and avoid the problem of short circuits, that is cells being near each other in the ordering which are far from each other in terms of their developmental stages. Unlike Wanderlust, Wishbone is able to identify a branching point as well. It uses randomly sampled cells to improve an initial trajectory based on shorted paths distances. Diffusion maps have also been used for pseudotemporal ordering in other publications [2,16,17].
SLICER [41] also uses a k nearest neighbours graph. This algorithm first applies LLE (local linear embedding) [31] to reduce the dimensionality of the space. Several methods for pseudotime estimation are based on minimal spanning trees (MST). Waterfall [34] first uses PCA to reduce the dimensionality of the gene-space from n g to 2. It then performs k-means clustering in the space of dimension 2 to group and computing an MST over the cluster centroids to find a pseudotime path. The pseudotimes of the individual cells are positioned based on their distance from the path found by computing the MST over the centroids.
CellTree [14] uses latent Dirichlet allocation (LDA) [5] in a Bayesian frame-work to construct point estimates of distance matrices between cells. LDA identifies underlying 'topics'. Different mixtures of topics explain the gene expression levels of different cells. CellTree uses Gibbs sampling to obtain histograms of the topics of the different cells, and compares these histograms to obtain the matrix of pairwise differences required for the computation of the MST, rather than propagating the uncertainty obtained through the Gibbs sampling throughout the entire algorithm. Therefore, the output of the algorithm is a point estimate.
There are several other methods using MST and clustering methods to obtain a point estimate of a pseudotemporal order. TSCAN [22] is based on the construction of an MST between centroids of clusters, and, like Waterfall, uses PCA for dimensionality reduction before performing the clustering and the ordering. Unlike Waterfall, it does not use k-means clustering but finite mixtures of Normal distributions. Mpath [10] uses MST and hierarchical clustering to find pseudotemporal orderings, constructing a network of neighbouring 'landmark' clusters, clusters of cells of a certain minimum size and purity. Another well-known method using MST and clustering is Monocle 2 [28], which applies a recent machine learning algorithm called graph structure learning [26].
All the algorithms discussed so far provide point estimates for the orders, and do not sample from a posterior distribution to quantify the uncertainty of the ordering. There are two existing methods for pseudotime estimation which use MCMC to sample from a posterior distribution [8,30]. They use Gaussian processes (GPs, see Section 3.1) to model the data. However, the algorithms presented in [8,30] sample from posterior distributions of pseudotime vectors R n , rather than sampling the ordering as a permutation.
To our knowledge there is no previous algorithm sampling directly from a posterior distribution of pseudo-orders. In addition, our approach is the first to provide a sampling strategy specifically tailored to the problem of pseudotemporal ordering, which is shown to perform well with noisy data and little information on capture times. Unlike for other discrete sampling problems with applications in computational biology, like networks or phylogenetic trees, there have not been any specific strategies developed to sample from complicated posterior distributions on a sample space of orderings to our knowledge. Moreover, our approach jointly models individual gene trajectories shares information across them, while approaches using MST or other dimensionality reduction methods are not able to borrow strength across different gene trajectories.
In summary, there are a number of existing approaches to estimating pseudotimes, but very few of them provide an estimate of the uncertainty of the the pseudotimes. The approaches that do so [8,30] provide posterior distributions of the pseudotimes of the individual cells, and not directly of the ordering of the cells. Moreover, our model is one of very few [30] for pseudotime ordering that also models not only the uncertainty of the ordering, but also the uncertainty of the actual trajectory of the process given noisy observations. Further, our approach provides the most detailed and stringent convergence analysis.

Single-cell trajectories as stochastic processes
Our approach does not only order the data, but it also models the variation of the individual mRNA expression levels conditional on each ordering. In fact, our model has two components of uncertainty: first, the uncertainty of the pseudotime ordering, and second the uncertainty from the stochasticity of the process. We assume that there is an underlying stochastic process which all of the cells follow. Figure 2 illustrates the mean and standard deviation of this process. Each sample from this process is a trajectory. Figure 2 illustrates three possible trajectories from the same process. The actual measurements are then noisy observations of this trajectory. The lines represent sample trajectories of the GP, and the points in the corresponding colours are the corresponding noisy measurements. The stochastic process we use to model the gene expression trajectories are Gaussian processes (GPs) [29]. GPs have been widely used to model time series data for biological systems in areas other than pseudotemporal ordering, in particular for the modelling of microarray time series data. For instance, there are applications to differential expression analysis for time series microarray data [36,23], for the modelling of gene expression profilebs in an ODE based model to identify possible targets of transcription factors [21], for the inference of gene regulatory networks [1], or to model the parametric components for time-course data in mixture models [12,19,20,24].
The proposed approach uses Gaussian processes (GPs) to model the dynamics of each individual RNA expression profile, conditional on the ordering. GPs are stochastic processes defining families of functions in terms of a mean function and a covariance function. More precisely, f ∼ GP (µ, Σ), if for any vector of input points τ = τ 1 · · · τ T we have f (τ 1 ) · · · f (τ T ) = N T (µ(τ ), Σ(τ , τ )). We use radially symmetric functions for Σ. That is, the covariance between f (t) and f (t) for any t andt depends only on a distance function d(t,t) of the input points.
We use the squared exponential covariance function, that is were d refers to the Euclidean distance. We refer to σ 2 w as the scale parameter, and to l as the length scale.
where, unless Σ is known a priori, it will depend on the ordering o. In practice, to assume a zero-mean GP we subtract the overall mean across all genes and cells from our given matrix of gene expression levels (see Sections 3.5.5 and 3.5.6).

Accounting for irregular pseudotimes: pseudotime versus rank time
If we may assume pseudotime points to be approximately uniformly distributed, then the input points for the GP modelling may be chosen equidistant. We refer to this approach as rank time. However, if the rate of development during a biological process changes substantially, then the GP might not be stationary with respect to rank time, and we need to apply the concept of pseudotime as a latent variable measuring biological development [8,30,42]. By estimating or sampling pseudotime vectors as vectors in R n , existing GP methods for pseudotime estimation [8,30,42] provide a latent pseudotime parameter, which measures biological development. We have extended our model to estimate both pseudotime and rank time simultaneously, for the case of pseudotime differing significantly from rank time. Thus, our approach provides both pseudotimes and rank times automatically, and if, for instance we want to compare genetic and epigenetic data from different cells undergoing the same biological process, we do not need to introduce additional steps to convert the different pseudotimes to a common timescale as in [42].
Our proposed method samples orderings directly, and we obtain a rank time t j ∈ (0, 1) for each cell as follows: if c j is the position of cell j in the ordering, then the corresponding rank time is t j = cj − 1 2 T , where T is the number of cells. Thus rank time is approximately uniformly distributed and therefore an approximation to linear time, rather than pseudotime. The advantage of rank time as opposed to pseudotime is the fact that it allows us to compare directly genetic versus epigenetic measurements on different cells of the same population, process, and cell type. Pseudotime measures the biological development, that is pseudotime intervals are shorter than rank time intervals during periods of slower development, whereas pseudotime intervals are longer than the corresponding rank time ones during periods of faster development. Rank time is similar to the concept of master time developed in [42], where the authors map pseudotime to corresponding uniform quantiles and then use linear interpolation or Gaussian process regression to obtain approximations to genetic and epigenetic measurements of different cells corresponding to the same uniformly distributed master time.
Our model provides pseudotime values corresponding to given orderings as follows: if o = o 1 · · · o T is a given ordering, and the n × T dimensional vector of ordered RNA expression levels, then we set τ = τ 1 · · ·τ T , withτ This constructs pseudotimes as sums of Euclidean distances between neighbouring cells. This approximates geodesic distances on the underlying manifold [38].  Figure 3 illustrates the difference between rank time and pseudotime. We use simulated data corresponding to a situation where there is slower development during the first part of the process. We see from the GP interpolation that for the rank time approximation the curve is flatter at the beginning and then becomes steeper as the second phase of faster development sets in. This effect is not there if we use pseudotime instead of rank time, as pseudotime intervals stretch and shrink to compensate for different developmental speeds.

Extending the model: GP parameters
For real datasets the parameters of the GPs underlying the gene expression trajectories are not known. As integrating them out would computationally be all but unfeasible, we sample them. A distribution over both orders and GP parameters is, however, almost guaranteed to be multi-modal, in particular with local modes where less likely orders are compensated by a very noisy or short length scale GP. In order to avoid this, our method uses informative priors on the GP parameters reflecting our knowledge that likely orders are associated with a higher signal-to-noise ratio than less likely orders. In addition, to avoid getting trapped in local modes with both a high scale parameter σ 2 w and a high noise parameter σ 2 , we sample only σ 2 w and set σ 2 = V − σ 2 w , where V is the sample variance taken across the entire n g × T matrix of gene expression levels of T cells for n g genes.
We use log-normal prior distributions for the scale parameter σ 2 w and the length scale l, and a uniform prior on the orders. Formally, our model is as follows: where κ is a squared exponential covariance function for a GP (see equation 1).
Let V be the overall variance of the n g × T matrix of gene expression data. We use the following prior distributions for the GP parameters: where v = 0.1 for the microarray data set considered in [43] (see Section 3.5.5) and v = 0.01 for the scRNAseq data set [33] (see Section 3.5.6). The stronger prior for the scRNA-seq data set was chosen to account for the high noise levels and the fact that the prior with high signal and low noise may otherwise lead to getting trapped in local modes with very short length scales and orders that would be unlikely given longer length scales.

Fully Bayesian approaches and MCMC
We use a Bayesian approach for inference, as we believe that a full understanding of the uncertainty of the ordering of the cells is desirable, as different orderings will lead to different gene trajectories, which in turn will lead to different inferences about differential expression and gene regulatory networks. We use a Bayesian approach and an inference method which samples from the posterior distribution of the model. MCMC is probably the most widely used method for this.

Simulation studies
The following simulation studies illustrate the efficacy of the individual moves and of combinations of different moves for different types of data. We simulate 50 genes with 90 cells. For each of the simulation studies we generate 16 data sets. On each of these data sets we run MCMC chains using all the possible combinations of our 4 proposed moves. For the chains with combinations of more than one moves, we apply each of them with equal probability.
We perform 3 different simulations to illustrate the strengths of the individual moves, as different moves are particularly powerful in different situations. The first simulation is relatively informative data, with detailed information on capture times and a substantial, though not very high, noise level. The second simulation has fewer capture times, and the third is more noisy data.
To focus on the convergence of the orders and on their proposal distribution, we use the known true parameters in the model, and do not sample them. We use regular spaced pseudotimes, as we simulated them from a uniform U (0, 1) distribution.
We run all the chains without adaptation of the proposal distribution, which, provided we have found reasonably good fixed parameters for the proposal moves, has not shown to accelerate convergence in general.

Simulation 1
We generate 16 data sets as follows: For the GP parameters, we draw 16 samples each from the following distributions: Using the parameters sampled above, we simulate 50 orderings with 90 input points each from the 16 zero-mean GP with the squared exponential covariance matrix specified by the 16 different sets of GP parameters. The input points are drawn from a uniform distribution U (0, 1). For this simulation, we assume three capture times, with the first 30 cells drawn from the first capture time, the second 30 cells from the second capture times, and the remaining 30 cells from the last capture time.

Fewer capture times
To gain better understanding of how the moves perform in different situations relevant to the analysis of single-cell data, we next look at how the different combinations of moves perform when there are fewer capture times. The setup is identical to simulation 1, however instead of three capture times with 30 cells each, we only have two capture times, with the first 30 cells belonging to the first, and the remaining 60 belonging to the second capture time.

Higher noise
Finally, as single-cell data tends to be quite noisy, we also explore the performance of the individual moves in the presence of higher noise levels. We use the same number of capture times, cells and genes as in simulation 1. We sample L and σ w for the generation of simulated data from the same distribution as with simulation 1, but this time we sample σ from the same distribution as σ w , which means that on average we have a signal-to-noise ratio of 1.

Microarray data
To validate the proposed algorithm, we apply it to a data set with known true order, the whole leaf Arabidopsis thaliana microarray data set [43].
[43] studied the response of Arabidopsis thaliana to infection by the fungal pathogen Botrytis cinerea, generating microarray time series data over 48 h, with measurements at intervals of 2h. We compare our distribution of posterior orders to the true order and to estimates produced by two established pseudotime methods, TSCAN [22] and SLICER [41]. We subtract from the data its sample mean across all cells and genes. For the analysis with TSCAN and SLICER we use the standard settings implemented for the respective algorithm, and we use the 150 genes mentioned in the paper by Windram et al. [43]. With SLICER, we tried several different values for the number of edges of the nearest neighbours graph in the low dimensional space, and found that the values leading to the order closest to the true one are 4 and 5. Like Reid and Wernisch [30], we group the known 24 capture times into 4 groups of 6, assuming that we do not have any further information, in order to test the algorithm.

scRNA-seq data
Shalek et al. [33] examined the response of primary mouse bone-marrow-derived dendritic cells in three different conditions using single-cell RNA-seq. We apply our model to the data on the lipopolysaccharide stimulated (LPS) condition.
Shalek et al. [33] identified four modules of genes. For the ordering, we use a total of 74 genes from the four modules with the highest temporal variance relative to their noise levels [30]. The number of cells is 307, with 49 unstimulated cells, 75 captured after 1h, 65 after 2h, 60 after 4h, and 58 after 6h. We subtract from the data its sample mean across all cells and genes.
By default, we set n 0 = T 4 , and we use this default setting for the simulation studies, but we often use a different value to optimise the acceptance rate.
Adaptive version: During a short period at the beginning, we adjust n 0 to achieve an optimal acceptance rate between 0.2 and 0.3, but requiring For many data sets the acceptance rate for this move will therefore be above 0.3.
2. Move 2-swaps of cells with short L 1 -distances: we compute a distribution over the pairs of cells proportional to a squared exponential of the L 1distance between the cells, that is ), where d refers to the L 1 distances to cells as elements in R ng , where n g is the number of genes. We then sample one pair of cells from p ij , to swap them. If P k = 1, 2, 3, 4, 5, 6, 7, 8 is the current ordering, and cells 2 and 5 are close in terms of their L 1 -distance, then the proposed ordering could be P k+1 = 1, 5, 3, 4, 2, 6, 7, 8.
Adaptive version: During a short period of the beginning, we temper or anneal the distribution f from which we sample the pairs of cells by sampling from f a instead of f , in case of very high or low acceptance rates.
3. Move 3 -reversals: again we compute a distribution over the pairs of cells proportional to a squared exponential of the L 1 -distance between the cells, that is ). Now the proposed move is to reverse the ordering between these two sampled cells, where the reversal includes the pair of cells sampled. If P k = 1, 2, 3, 4, 5, 6, 7, 8 is the current ordering, and cells 2 and 5 are close in terms of their L 1 -distance, then the proposed ordering could be P k+1 = 1, 5, 4, 3, 2, 6, 7, 8.
Adaptive version: During a short period of the beginning, we temper or anneal the distribution from which we sample the pairs of cells in case of very high or low acceptance rates.
By default, we set n 3 = T 20 , and n 3a = T 12 . Adaptive version: At the beginning, we adapt n 3 and n 3a in case of acceptance rates above 0.3 or below 0.1, but we require T 50 ≤ n 3 ≤ T 10 and 4 ≤ n 3a ≤ T 10 . 5. Move 5-reversing the entire ordering: with a low probability p 5 we reverse the entire ordering, which is accepted with probability 1. This move allows us to exploit the symmetry of our posterior distribution for convergence assessment. If P k = 1, 2, 3, 4, 5, 6, 7, 8 is the current ordering, then the proposed ordering would be P k+1 = 8, 7, 6, 5, 4, 3, 2, 1.
For the simulation studies we apply all possible combinations of moves 1-4.
For the microarray data we apply one move 3, as it was shown to be the best sampling strategy for multi-modal distributions. For the scRNA-seq data set, we use all the moves. For the microarray data set we use γ 2 = 1000 and an additional tempering factor a = 0.1. For the scRNA-seq data set we set γ 1 = γ 2 = 4000, without any tempering factor for moves 2 or 3.

Order symmetry and MCMC parameters
As our posterior distribution is a symmetric function of the order, each order and its reverse will be sampled with equal probability from the posterior distribution. However, we are only interested in those orders that correlate positively with the capture times of the cells. Therefore, we reverse the orders negatively correlated with the capture times.
For the simulated and the microarray data sets we run chains for 100,000 iterations and apply a thinning factor of 10. For the scRNA-seq data we use the same thinning factor but 500,00 iterations. For each of the data sets, we run 12 chains from random starting orders and with random GP parameters sampled from the prior distribution. For the starting orders we restrict the randomness to permuting the data randomly within capture times, but not across capture times.

Assessment of MCMC convergence
MCMC convergence needs to be checked rigorously for the following two reasons. First, we need to check that our proposal distribution effectively reaches all the regions of high density of the posterior distribution, and does not get trapped around one single mode. Second, we need to make sure that we have run our chain long enough for it to have converged to its invariant distribution. The Gelman-RubinR-statistic is an established method to assess convergence. It was first proposed in [15], and later corrected to account for sampling variability [7]. TheR-statistic requires several MCMC chains with over-dispersed initial states, and estimates the potential scale reduction factor, the factor by which the pooled variance across all the chains is larger than the within-sample variance. For convergent chains,R approaches 1 as the number of samples tends to infinity. According to [7], we may assume that convergence has been reached ifR < 1.2 for all model parameters. We use the more stringent recommendation to run MCMC-chains until a value of 1.1 has been reached for theR-statistic [35].
Apart from determining the length of the burn-in and checking whether a set of MCMC chains have converged, we may also use theR-statistic to assess the speed of convergence for different proposal distributions. More previously, for different values of > 0, we check whether theR reaches below 1 + for a given number of samples. A different way of assessing the speed of convergence is to measure the number of samples required for the chain to reach below 1 + .

Comparison of move efficiency
To compare the efficacy of the individual moves described in Section 3.6 in the simulation setting described in Section 3.5.1, we construct for each of the 16 simulated data sets 5 different starting orderings (one for each simulated data set) by permuting the cells within, but not across, the three capture times. For each combination of moves, and each of the 16 simulated data sets, we run 5 chains starting from the 5 starting orderings for 100,000 iterations, with a thinning factor of 10. We compute for each combination of moves and for each of the 16 sets of 5 chains the Gelman-RubinR-statistic as in [7] on the log-likelihood as follows: For each batch for which we compute theR-statistic on only the second half, discarding the first 50% as burn-in. We first compute the statistic after 50 thinned samples, which would correspond to 500 samples without the thinning. Then we recompute the statistic after each additional 20 thinned samples. That is, we obtain values for the statistic at the following thinned sample numbers: 51, 71, 91, 101, ... . To compute the Gelman-Rubin statistic as in [7], we use the R-package coda citecoda.
For each ordering in each chain we compute the corresponding positions of the cells, that is the inverse permutation of the given ordering. Then we compute for each sample the L 1 -distance of the corresponding cell positions to the reference positions 1 : 100. For each combination of moves and each of the 16 sets of 5 chains we compute theR-statistic on these L 1 -distances. Again we compute thê R-statistic with the same intervals as above and discarding the first 50% of each batch as burn-in. Thereby, for each combination of moves and for each of the 16 simulations, we obtain two curves ofR-statistics, one for the log-likelihood and one for the L 1 -distances.
We assess the moves according to the following criteria: Whether they have converged before 10,000 thinned samples according to theR-statistic, at different levels. That is, we check whether for all the 16 simulations for a given combination of moves the chains have converged at the 1.1, 1.07, 1.05, and 1.02 levels. For the moves for which there is convergence at a given level we compute the average number of thinned samples to convergence, where the average is based on computations of theR-statistic at intervals of 20. We also compute the maximum number of thinned samples to convergence, again based on computations of theR-statistic at intervals of 20.

Simulation study 1
For the following combinations of moves the chains on all simulated data sets have anR < 1.02 before the last sample, both for the log-likelihood and the L 1 -distances: 4; 1,4; 2,4; 1,3,4; 1,2,3,4; Simulation study 1 therefore indicates that these combinations of moves are the most suitable ones for sampling from data sets with medium, but not very high, noise levels, and with relatively good information on capture times. The worst performing combinations of moves with these sets of simulated data is clearly move 3, for which some of the 16 data sets haveR-statistics which remain above the 1.07 threshold. Among the best performing moves, it is less clear-cut which one is the combination of moves leading the fastest convergence, as the sample size of our quite extensive simulation study is still limited. However, assigning ranks to a number of criteria, we see that move 4 is ranked first in seven of these criteria. For a more detailed analysis, see in the supplementary materials. Figure 4 compares the best and the worst performing move in the set-up of simulation 1. We plotted the Gelman-Rubin statistics for the log-likelihoods for all the 16 simulated data sets.

Simulation 2: fewer capture times
The performance of the different combinations of different moves is very different to what we saw in simulation 1. In fact, as illustrated by Figure 5, when comparing the performance of single moves, we now see that move 3 performs better than any other single move. In fact, move 3 is the only move for which allR-statistics go below 1.1 within the first 10,000 thinned samples. With move 1 and move 4, theR-statistic does not go below 1.1 for any single simulated data set. Fortunately, we can improve convergence in this case by combining at least three moves, as long as they do not include both move 1 and move 4. In fact, the only combination of moves for which not for a single one of the simulated data sets theR-statistics gets below 1.1 is the combination of moves 1 and 4, which performed well in simulation 1, when we had more capture times. Fortunately, very reasonable levels of convergence are still attained with combinations of moves more suited to the challenge of fewer capture times. In fact, for the following combinations of moves, all theR-statistics are below 1.05 by the last sample: 1,3; 2,3; 1,2,3; 2,3,4; 1,2,3,4 The combination of moves that is ranked first according to the largest number of criteria is the combination 1,   (table 1) listing the performance of the combinations of moves on our criteria outlined in this Section and provide plots not only of theR-statistics for the log-likelihood, but also for the L 1 -differences from the reference permutation.

Simulation 3: higher noise
Our moves still perform well in this situation, with the exception of move 3, which does not even reach the 1.1 level for the Gelman-Rubin statistic for all the simulated data sets. The moves which are ranked the highest in the largest number of our categories are move 4, and the combination 1,4 (see table 3 in Section A.1 of the supplementary materials for details). The following other combinations of moves also provide convergence for all the simulated data sets at the 1.05 -level: 2; 1,2; 2,4; 3,4 and all combinations of at least three moves. Again, we refer the reader to the supplementary materials for detailed tables and Figures.

Adaptive moves
For this simulation study we used suitable parameters for the definition of the moves, and did not adapt them during a fixed number of iterations at the beginning. We repeated all the simulations using adaptations of the moves at the beginning as described in Section 3.6. However, our results indicate that convergence tends to be better without the adjustment, provided that parameters for the moves with good acceptance rates can be found, for instance by running the sampler for a short number of iterations with different settings. A detailed comparison without the non-adaptive and the adaptive version can be found in table 4 in Section A.1 of the supplementary materials.

Validation on microarray data
We first apply GPseudoRank with irregular pseudotimes. The irregular pseudotimes reflect the different speed of biological development during different phases of the process, even if we know that the real time points are equidistant with 2h each between them. We again use the Gelman-Rubin statistic to assess convergence (see Figure 7). We plotted the Gelman-Rubin statistic for both the log-likelihood and the L 1distances from the true order. From Figure 7 we see that it would be easily sufficient to run the chain for 36,000 samples, that is 3,600 thinned samples, as by that number both of theR statistics have reached below 1.1. Discarding half of the 3,600 thinned samples as burn-in as recommended in [35], we discard a burn-in of 1,800 samples.
We also plotted the L 1 -distances from the real order for GPseudoRank, as well as for two algorithms providing point estimates. As illustrated by Figure 8 the distribution of GPseudoRank is bi-modal, while the estimates provided by SLICER [41] and TSCAN [22] are both close to one of the modes of the distribution of GPseudoRank. This illustrates how important is it to sample from the distribution of the orderings rather than just obtaining a point estimate.
In practice, when we do not know the true order and two ordering algorithms provide very different point estimates, then we do not know which one is the closer to the true underlying biological order. Therefore, it is preferable to quantify the uncertainty, rather than only providing point values. The run time of GPseudo rank with irregular pseudotimes for 100,000 iterations for the Windram data is about 7min 20s on an Intel Xeon X5 2.0GHz CPU.

Pseudotemporal uncertainty varies during a response to an infection
We again check convergence carefully using theR-statistics on both the loglikelihood and the L 1 -distances from a reference order ( figure 11). This time the true order is unknown. We could take any fixed reference order, so we use 1 : 307. The run time of GPseudo rank for 500,000 iterations for the Shalek data is about 3 hours on an Intel Xeon X5 2.0GHz CPU. It should be noted that because of the focus of this paper on showing the good convergence properties of our method we run the chain for long to show that high levels of convergence can be achieve. However, according to the recommendation to run chains until anR < 1.2 has been reached ( [35]), 10,000 thinned samples, that is a total of 100,000 samples would easily be sufficient even with a stricter threshold of 1.1. We discard a burn-in of 5,000 thinned samples at the beginning of each chain. Figure 12: L 1 -distances from the real order: Shalek data Figure 12 underlines again that it is important to provide a posterior distribution over the orders, rather than a point estimate, as again the two methods for point estimation give rather different results, and not knowing the uncertainty will lead to over-confidence in the results. Of course, now with the true order not known, it is not clear which of the two point estimates is closer to the truth.
In addition to providing the uncertainty of the orders, our approach also shows how this uncertainty varies during the course of the process.

Possible alternatives to MCMC
There are, of course, alternatives to MCMC for sampling from posterior distributions, in particular importance sampling and extensions to it such as annealed importance sampling [27], or annealed SMC sampling [13]. However, the geometry of our posterior distribution makes it all but impossible to implement importance sampling. While annealed SMC sampling could be an interesting alternative for future work, we have implemented a proposal distribution for MCMC that is particularly suitable to our problem and has good convergence properties even if the data are very noise or little is known about their capture times.

Two-step approaches to branching processes
One drawback of our model is that, like the other pseudotime models using GPs, it is not able to model branching processes in a one-step process. How-ever, it should be noted that while models using GPs do not directly infer the branching structure of a branching process, they have been used for variational inference in a two-step process that first infers the pseudotimes and then identifies the branches [25]. Along the same lines, our model could be used to provide a full posterior distribution of the pseudotimes, which could be used as a basis to infer branches depending on the pseudotimes, although this would be computationally more expensive than the variational approach. Alternatively, we could first use a well-established existing method that provides a point estimate of the branching structure, such as TSCAN or SLICER, and then use the method proposed in this paper to provide full inference of the posterior distribution of the pseudotimes of each branch, an approach that is likely to be more efficient computationally.

Conclusions
We presented a new type of Gaussian process latent variable model for pseudotemporal ordering, which samples a distribution on the probability space of the orderings, that is on the group of permutations, rather than on the hugely highdimensional vector space of possible pseudotimes, as done by previous models. We have implemented a Metropolis-Hastings sampler on our sample space, which is still very large and difficult to sample from. We found it necessary to develop novel moves in order to explore the posterior effectively. Our proposal distribution allows the sampler to make long distance moves in this space with a good acceptance rate.
An extensive simulation study examined how different combinations of moves perform in different situations relevant to the analysis of single-cell data. Combinations of all the four moves proposed perform well in all the situations considered. We therefore generally recommend to use a combination of all four moves. Combinations of three moves also perform well, except for those containing both moves 1 and 4, which do not perform well with fewer capture times.
If we have several capture times with not very many cells each, then move 4, on its own or combined with move 1, will probably provide the fastest convergence. However, even if there are several capture times, there tends to be an overlap between them with real data, as opposed to the situation we simulated. In fact, towards the end of a biological process, the last two or so capture times might overlap considerably (see Section 4.3). This means that de facto we have one capture time less.
Important insights are also gained from the validation of our algorithm on microarray data. As illustrated by Figure 8, point estimates of cell orderings, even if computed using state-of-the-art pseudotime estimation methods, only provide estimates near one mode of the distribution of the orderings. Even if for this data set we know the real order, orders near the second mode are still likely orders, and with a data set without a known true order it would be important to sample the full distribution, in particular for a data set where the two modes are so far apart. This illustrates the particular importance of methods accounting for uncertainty and possible multi-modality in pseudotime ordering.
Finally, an application to an scRNA-seq data set illustrates the good convergence properties of our algorithm, and a comparison to point estimation methods of orderings illustrates once more the need for a method of estimating orderings that fully captures the uncertainty in a Bayesian way. One particular interesting aspect of the uncertainty of the ordering captured by our algorithm is the fact that the uncertainty is lowest during the middle of the process, where the heterogeneity of cells with regard to their progress through the response to the infection is highest. Finally, table 4 compares the non-adaptive simulations with adaptive ones, where we adapt the parameters for the proposal moves at every 50th iteration during the first 5,000 samples.