A descriptive marker gene approach to single-cell pseudotime inference

Campbell, Kieran R; Yau, Christopher

doi:10.1093/bioinformatics/bty498

Abstract

Motivation

Pseudotime estimation from single-cell gene expression data allows the recovery of temporal information from otherwise static profiles of individual cells. Conventional pseudotime inference methods emphasize an unsupervised transcriptome-wide approach and use retrospective analysis to evaluate the behaviour of individual genes. However, the resulting trajectories can only be understood in terms of abstract geometric structures and not in terms of interpretable models of gene behaviour.

Results

Here we introduce an orthogonal Bayesian approach termed ‘Ouija’ that learns pseudotimes from a small set of marker genes that might ordinarily be used to retrospectively confirm the accuracy of unsupervised pseudotime algorithms. Crucially, we model these genes in terms of switch-like or transient behaviour along the trajectory, allowing us to understand why the pseudotimes have been inferred and learn informative parameters about the behaviour of each gene. Since each gene is associated with a switch or peak time the genes are effectively ordered along with the cells, allowing each part of the trajectory to be understood in terms of the behaviour of certain genes. We demonstrate that this small panel of marker genes can recover pseudotimes that are consistent with those obtained using the entire transcriptome. Furthermore, we show that our method can detect differences in the regulation timings between two genes and identify ‘metastable’ states—discrete cell types along the continuous trajectories—that recapitulate known cell types.

Availability and implementation

An open source implementation is available as an R package at http://www.github.com/kieranrcampbell/ouija and as a Python/TensorFlow package at http://www.github.com/kieranrcampbell/ouijaflow.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

The advent of high-throughput single-cell technologies has revolutionized single-cell biology by allowing dense molecular profiling for studies involving 100–10 000 s of cells (Kalisky and Quake, 2011; Macaulay and Voet, 2014; Shapiro et al., 2013; Wills and Mead, 2015). The increased availability of single-cell data has driven the development of novel analytical methods specifically tailored to single cell properties (Stegle et al., 2015; Trapnell, 2015). The difficulties in conducting genuine time-series experiments at the single-cell level has led to the development of computational techniques known as pseudotime ordering algorithms that extract temporal information from snapshot molecular profiles of individual cells. These algorithms exploit studies in which the captured cells behave asynchronously and therefore each is at a different stage of some underlying temporal biological process such as cell differentiation. In sufficient numbers, it is possible to infer an ordering of the cellular profiles that correlates with actual temporal dynamics and these approaches have promoted insights into a number of time-evolving biological systems (Bendall et al., 2014; Haghverdi et al., 2016; Hanchate et al., 2015; Qiu et al., 2011; Reid and Wernisch, 2016; Setty et al., 2016; Shin et al., 2015; Trapnell et al., 2014).

A predominant feature of current pseudotime algorithms is that they emphasize an ‘unsupervised’ approach. The high-dimensional molecular profiles for each cell are projected on to a reduced dimensional space by using a (non)linear transformation of the molecular features. In this reduced dimensional space, it is hoped that any temporal variation is sufficiently strong to cause the cells to align against a trajectory along which pseudotime can be measured. This approach is therefore subject to a number of analysis choices including gene selection, dimensionality reduction technique, and cell ordering algorithm, all of which could lead to considerable variation in the pseudotime estimates obtained. In order to verify that any specific set of pseudotime estimates are biologically plausible, it is typical for investigators to retrospectively examine specific marker genes or proteins to confirm that the predicted (pseudo)temporal behaviour matches a priori beliefs. An iterative ‘semi-supervised’ process maybe therefore be required to concentrate pseudotime algorithms on behaviours that are both consistent with the measured data and compliant with a limited amount of known gene behaviour.

2 Approach

In this paper we present an orthogonal approach implemented within a Bayesian latent variable statistical framework called ‘Ouija’ that learns pseudotimes from small panels of putative or known marker genes (Fig. 1A). Our model focuses on switch-like and transient expression behaviour along pseudotime trajectories, explicitly modelling when a gene turns on or off along a trajectory or at which point its expression peaks. Crucially, this allows the pseudotime inference procedure to be understood in terms of descriptive gene regulation events along the trajectory (Fig. 1B). As each gene is associated with a particular switch or peak time, it allows us to order the genes along the trajectory as well as the cells and discover which parts of the trajectory are governed by the behaviour of which genes. For example, if the pseudotimes for a set of differentiating cells run from 0 (stem cell like) to 1 (differentiated) and only two genes have switch times less than 0.25 then a researcher would conclude that the beginning of differentiation is regulated by those two genes. We further formulate a Bayesian hypothesis test as to whether a given gene is regulated before another along the pseudotemporal trajectory (Fig. 1C) for all pairwise combinations of genes. Furthermore, by using such a probabilistic model we can identify discrete cell types or ‘metastable states’ along continuous developmental trajectories (Fig. 1D) that correspond to known cell types.

Fig. 1.

Open in new tab Download slide

Learning single-cell pseudotimes with parametric models. (A) Ouija infers pseudotimes using Bayesian nonlinear factor analysis by decomposing the input gene expression matrix through a parametric mapping function (sigmoidal or transient). The latent variables become the pseudotimes of the cells while the factor loading matrix is informative of different types of gene behaviour. A heteroskedastic dispersed noise model with dropout is used to accurately model scRNA-seq data. (B) Each gene’s expression over pseudotime is modelled either as a sigmoidal shape (capturing both linear and switch-like behaviour) or through a Gaussian shape (capturing transient expression patterns). These models include several interpretable parameters including the pseudotime at which the gene is switched on and the pseudotime at which a gene peaks. (C) The posterior distributions over the switch and peak times can be inferred leading to a Bayesian statistical test of whether the regulation of a given gene occurs before another in the pseudotemporal trajectory. (D) Ouija can identify discrete cell types that exist along continuous trajectories by clustering the matrix formed by considering the empirical probability one cell is before another in pseudotime

3 Materials and methods

3.1 Overview

The aim of pseudotime ordering is to associate a G-dimensional expression measurement to a latent unobserved pseudotime. Mathematically we can express this as:

\underset{Expression}{\underset{︸}{y_{n}}} = \underset{Mapping}{\underset{︸}{f}} (\underset{Pseudotime}{\underset{︸}{t_{n}}}) + \underset{Noise}{\underset{︸}{ϵ_{n}}}

(1)

where the function f maps the one-dimensional pseudotime t_n for cell n to the G-dimensional observation space. The challenge lies in the fact that both the mapping function f and the pseudotimes are unknown. Our objective here is to use parametric forms for the mapping function f that will enable relatively fast computations whilst characterising certain gene expression temporal behaviours. The specification of a statistical pseudotime algorithm therefore comes down to the choice of the mean function f and the noise model on ϵ (see Supplementary Text Section S5 for an in-depth discussion).

3.2 Input data normalization

We index N cells by

n \in 1, \dots, N

and G genes by

g \in 1, \dots, G

⁠. Let

y_{n g} = {[Y]}_{n g}

denote the log-transformed non-negative observed cell-by-gene expression matrix. In order to make the strength parameters comparable between genes we normalize the gene expression so the approximate half-peak expression is 1 through the transformation

y_{n g} \to y_{n g}^{'} = y_{n g} / s_{g}

where s_g is a gene-specific size factor defined by

s_{g} = \frac{1}{| Y_{g}^{*} |} \sum_{y_{n g}^{*} \in Y_{g}^{*}} y_{n g}^{*}

(2)

and

Y_{g}^{*} = {y_{n g} : y_{n g} > 0}

⁠.

3.3 Noise model

Our statistical model can be specified as a Bayesian hierarchical model where the likelihood is given by a bimodal distribution formed from a mixture of zero-component (dropout) and an non-zero expressing cell population. If μ(t_n, Θ_g) is the mean for cell n and gene g (evaluated at pseudotime t_n with gene-specific parameters Θ_g) then

\begin{matrix} β_{0}, β_{1} \sim Normal (0, 0.1) \\ π_{n g} \sim Bernoulli ({logit}^{- 1} (β_{0} + β_{1} μ (t_{n}, Θ_{g}))), \\ p (y_{n g} | π_{n g}, μ_{n g}, σ_{n g}) = π_{n g} δ (y_{n g}) \\ + (1 - π_{n g}) T_{ν} (y_{n g} | μ (t_{n}, Θ_{g}), σ_{n g}^{2}), \end{matrix}

(3)

where π_ng is the probability of observing a dropout (zero-count) in cell n gene g and T_ν is the density function of the Student-t distribution with ν degrees of freedom.

The relationship between dropout rate and expression level is expressed as a logistic regression model (Kharchenko et al., 2014). Furthermore, we impose a mean-variance relationship of the form $σ_{n g}^{2} = (1 + ϕ) μ (t_{n}, Θ_{g}) + ϵ$ where ϕ is the dispersion parameter with prior $ϕ \sim Gamma (α_{ϕ}, β_{ϕ})$ ⁠, which is motivated by empirical observations of marker gene behaviour (Supplementary Text S4.1).

3.4 Mean functions

We then need to specify the form of the mean functions μ(t_n, Θ_g), for which we consider both sigmoidal and transient gene behaviour. For genes we expect to be a priori switch-like we model

μ (t_{n}, Θ_{g}) = \frac{2 η_{g}}{1 + exp (- k_{g} (t_{n} - t_{g}^{(0)}))},

(4)

where k_g and

t_{g}^{(0)}

denote the activation strength and activation time parameters for each gene and η_g the average peak expression with priors

η_{g} \sim Gamma (δ / 2, 1 / 2)

⁠,

k_{g} \sim Normal (μ_{g}^{(k)}, 1 / τ_{g}^{(k)})

⁠,

t_{g}^{(0)} \sim {TruncNorm}_{[0, 1)} (μ_{g}^{(t)}, 1 / τ_{g}^{(t)})

⁠. If available, user-supplied prior information can be encoded by specifying priors on the parameters

μ_{g}^{(k)}, τ_{g}^{(k)}, μ_{g}^{(t)}, τ_{g}^{(t)}

⁠. Otherwise, inference can be performed using uninformative hyperpriors on these parameters. Specifying

μ_{g}^{(k)}

encodes a prior belief in the strength and direction of the activation of gene g along the trajectory with

τ_{g}^{(k)}

(inversely-) representing the strength of this belief. Similarly, specifying

μ_{g}^{(t)}

encodes a prior belief of where in the trajectory gene g exhibits behaviour (either turning on or off) with

τ_{g}^{(t)}

encoding the strength of this belief.

For the transient case we have

μ (t_{n}, Θ_{g}) = 2 η_{g} exp (- λ b_{g} {(t_{n} - p_{g})}^{2}),

(5)

where we take λ = 10 to be a constant and with prior structure

η_{g} \sim Gamma (δ / 2, 1 / 2)

⁠,

p_{g} \sim {TruncNorm}_{[0, 1)} (μ_{g}^{(p)}, 1 / τ_{g}^{(p)})

⁠,

b_{g} \sim {TruncNorm}_{[0, \infty)} (μ_{g}^{(p)}, 1 / τ_{g}^{(p)})

⁠, where informative priors may be placed on p and b as before.

3.5 Inference

Under this framework learning the single-cell trajectory becomes Bayesian inference of $p (t, Θ | Y)$ —the joint posterior distribution of the pseudotimes and gene behaviour parameters given the expression data. We performed posterior inference using Markov Chain Monte Carlo (MCMC) stochastic simulation algorithms, specifically the No U-Turn Hamiltonian Monte Carlo approach (Homan and Gelman, 2014) implemented in the STAN probabilistic programming language (Carpenter et al., 2017). The parameter ϵ = 0.01 is used to avoid numerical issues in MCMC computation. For larger marker gene panels, such as in the cell cycle analysis section, we used stochastic gradient variational Bayes implemented in Edward (Tran et al., 2016) to perform approximate Bayesian inference.

4 Results and discussion

4.1 Pseudotime inference from small marker gene panels

The transcriptomes of both single cells and bulk samples exhibit remarkable correlations across genes and transcripts. Such concerted regulation of expression is thought to be due to pathway-dependent transcription (Braun et al., 2008; Tegge et al., 2012) and is necessary for the field of network inference from gene expression data (Langfelder and Horvath, 2008). An example of such transcriptome wide correlations can be seen in Figure 2A for the Trapnell et al. (2014) dataset, where hierarchical clustering reveals a block-diagonal structure, implying an intrinsic low-dimensionality of the data that can be efficiently compressed using techniques such as principal components analysis (Supplementary Fig. S1).

Fig. 2.

Open in new tab Download slide

Transcriptome-wide pseudotimes can be inferred from small marker gene panels. (A) A gene-by-gene correlation matrix for the Trapnell et al. (2014) dataset reveals similarities in the transcriptional response of hundreds of genes. The redundancy of expression implies the information content of the transcriptome may be compressed through techniques such as principal components analysis (PCA) or by picking informative marker genes. (B) Comparison of pseudotimes fitted using Ouija on a small panel of marker genes to transcriptome-wide fits (using the 500 most variable genes) across five datasets using the algorithms Monocle 2, DPT and TSCAN. The marker gene fits show high correlation to the transcriptome-wide fits with the exception of the Shin et al. (2015) dataset. (C) Gene expression profiles for two marker genes ID1 and MYOG from the Trapnell et al. (2014) dataset. The solid red line denotes the maximum a posteriori (MAP) Ouija fit while the grey lines show draws from the posterior mean function. (D) Gene expression profiles for the same genes for the algorithms DPT, Monocle 2 and TSCAN show similar expression fits, demonstrating equivalent pseudotemporal trajectories have been inferred. The solid red line denotes a LOESS fit (Color version of this figure is available at Bioinformatics online.)

This redundancy of expression is often exploited by statistical models of single-cell RNA-seq data. Examples include Heimberg et al. (2016) where the intrinsic low-dimensionality is used to reconstruct transcriptome-wide gene expression from ultra-shallow read depths; Cleary et al. (2017) apply compressed sensing techniques to reconstruct high-dimensional gene expression profiles from low-dimensional random projection; and McCurdy et al. (2017) who propose a column subset selection procedure where a small number of genes are chosen to represent the full transcriptome. The compressibility of transcriptome data is likewise exploited by many single-cell pseudotime inference algorithms via initial dimensionality reduction steps. For example, Monocle (Trapnell et al., 2014) reduces the expression data down to 2 dimensions using independent component analysis, while both TSCAN (Ji and Ji, 2016) and Waterfall (Shin et al., 2015) apply PCA to reduce the data down to 2 dimensions. The implication behind such approaches that there is sufficient information in just two dimensions of the data via a linear projection to learn ‘transcriptome-wide’ pseudotime and that the majority of expression is redundant given the low-dimensional projection.

In Ouija, we exploit the high gene-gene correlations by modelling a small number of marker genes that are representative of the whole transcriptome. Such an approach is advantageous as by modelling the data directly rather than a reduced-dimension representation we can understand the pseudotimes for each cell in terms of the behaviour of genes through time rather than abstract notions of manifolds embedded in high-dimensional space. This takes the form of a nonlinear factor analysis model, departing from previous models that have relied upon linear factor analysis (Campbell and Yau, 2017a; Pierson and Yau, 2015) by introducing sigmoidal nonlinearities and transient expression functions, both of which have been successfully applied previously in post-processing of single-cell trajectories (Campbell and Yau, 2016, 2017b; Sander et al., 2017).

We then turn to the question of how to choose the small number of marker genes in order to fit the pseudotimes. In single-cell pseudotime studies, the cells under examination undergo a known biological process such as differentiation or cell cycle. Importantly, key marker genes associated with these processes are usually known a priori by investigators. These marker genes act as positive controls whose behaviour is used post-hoc to confirm the validity of the transcriptome-wide pseudotime fit. Examples include the markers of myoblast differentiation MYH3, MEF2C and MYOG in Trapnell et al. (2014); the markers of neurogenesis Gfap and Sox2 in Shin et al. (2015); and in Li et al. (2016) the authors tabulate the marker genes they expect to be involved in the process along with their expected behaviour along the differentiation trajectory. Given both the widespread a priori knowledge of such markers and their requirement to validate transcriptome-wide pseudotime fits, we therefore propose to derive pseudotimes directly from such markers.

We first sought to test whether our model applied to small panels of marker genes could accurately recapitulate the transcriptome-wide pseudotimes inferred by popular pseudotime methods. We applied Monocle 2, DPT and TSCAN to five publicly available single-cell RNA-seq datasets (Chu et al., 2016; Dulken et al., 2017; Shin et al., 2015; Trapnell et al., 2014; Zhou et al., 2016) using the 500 most variable genes as input [the default in packages such as Scater (McCarthy et al., 2017) for PCA representations]. For each dataset, we then inferred pseudotimes using Ouija based only on a small number of marker genes reported in each paper (ranging from 5 to 12), and compared the Pearson correlation between the Ouija pseudotimes and the pseudotimes reported for each dataset (Fig. 2B). There was good agreement between the marker-based pseudotimes inferred using Ouija and the transcriptome-wide pseudotimes inferred using existing algorithms, with the correlation exceeding 0.75 in the majority of comparisons.

Noting that the correlation will not be 1 unless the algorithms are identical, we sought to compare Ouija’s correlation to transcriptome-wide pseudotime to the agreement of the transcriptome-wide pseudotimes with each other. We found large variability in the agreement between existing algorithms using transcriptome-wide pseudotimes, with correlations as high as 0.93 but as low as 0.61 (Supplementary Fig. S2). We found the marker-based Ouija pseudotimes have higher correlations to one of the transcriptome-wide algorithms than they have amongst each other in all but one of the datasets studied. On average, the correlation between Ouija’s marker based pseudotime with the transcriptome-wide pseudotimes was around 0.1 lower than the correlation amongst the transcriptome-wide pseudotimes, though given Ouija uses around 1–2% the number of input genes we believe this is a positive result that represents transcriptome-wide pseudotimes may be inferred using interpretable, parametric models on a small number of marker genes chosen a priori.

This equivalence of transcriptome-wide and marker-based pseudotimes is further confirmed by examining the qualitative fit of the marker genes across the different algorithms. For example, Figure 2C shows the posterior fit of the marker-based pseudotime for two marker genes from (Trapnell et al., 2014), correctly inferring the switch-like downregulation of ID1 and the upregulation of MYOG. Near identical behaviour is found using transcriptome-wide pseudotimes derived from DPT, Monocle and TSCAN (Fig. 2D). We note the low correlations of the marker-based Ouija pseudotimes with the transcriptome-wide fits for the Shin et al. dataset. Upon close inspection of the marker genes (Supplementary Fig. S3) we found that the expression of four of the marker genes (Aldoc, Apoe, Eomes, Sox11) were highly correlated (the switch times are similar) whilst Gfap and Stmn1 showed little variation over pseudotime. This meant that there was effectively only a single marker gene for this dataset—too few for reliable marker gene-based pseudotime inference.

4.2 Gene regulation timing from marker gene-based pseudotime

Having demonstrated Ouija can accurately recapitulate transcriptome-wide pseudotimes using just small marker gene panels, we next sought to show how it allows for marker-driven interpretable inference of such trajectories. We applied Ouija to a single-cell time-series dataset of human embryonic stem cells differentiating into definitive endoderm cells (Chu et al., 2016). The authors examined the expression of key marker genes over time and found nine to exhibit approximately switch-like behaviour (POU5F1, NANOG, SOX2, EOMES, CER1, GATA4, DKK4, MYCT1 and PRDM1) with a further two exhibiting transient expression (CDX1 and MSX2). We applied Ouija using noninformative priors over the behaviour parameters with no information about the capture times of the cells included.

The resulting pseudotime fit demonstrates we can understand single-cell pseudotime in terms of the behaviour of particular genes. Figure 3A shows a heatmap of the nine switch-like genes (top) and two transient genes (bottom), ordered by the posterior switch time of each gene. It can be seen that the early trajectory is characterized by the expression of NANOG, SOX2 and POUF51, which then leads to a cascade of switch-like activation of the remaining genes as the cells differentiate.

Fig. 3.

Open in new tab Download slide

Parametric models lead to pseudotimes centred around gene regulation timing. (A) An expression heatmap for the nine switch-like genes and two transient genes in the Chu et al. dataset, with genes ordered by the posterior mean of the switch time. (B–C) Posterior distributions over the switch times and peak times for the 11 genes, coloured by their up or down regulation along pseudotime. The horizontal error bars show the 95% highest probability density credible intervals. (D) A Bayesian hypothesis test can quantify whether the posterior difference between two regulation timings (either switch or peak time) is significantly different from 0, allowing us to determine whether a given gene is regulated before or after another along pseudotime

While transcriptome-wide pseudotime algorithms could provide similar heatmaps if the marker genes were known in advance, the key departure of Ouija is that we can quantitatively associate each gene with a region of pseudotime at which its regulation (switch time or peak time) occurs. This is illustrated in Figure 3B–C showing the posterior values for the regulation timing along with the associated uncertainty. In essence, Ouija allows us to order genes along trajectories as well as being able to order the cells, which provides insight into gene regulation relationships.

To approach such questions of gene regulation timings in a quantitative and rigorous manner we constructed a Bayesian hypothesis test to find out whether one gene is regulated before another given the noise in the data. If $t_{Gene A}^{(0)}$ and $t_{Gene B}^{(0)}$ are the regulation timings of genes A and B respectively, we calculate the posterior distribution $p (t_{Gene A}^{(0)} - t_{Gene B}^{(0)} | Y)$ ⁠, and if both the lower and upper bounds of the 95% posterior credible interval fall outside 0 we say the two genes are regulated at significantly different times. We applied this to the pseudotime fit in the Chu et al. dataset, the results of which can be seen in Figure 3D for a subset of genes. The model suggests that EOMES is downregulated before DKK4 and MYCT1 is downregulated after PRDM1. Furthermore, it suggests the switch-like downregulation of DKK4 occurs after the transient peak-time of CDX1. However, it suggests the difference in regulation timings of DKK4 and MYCT1 are not significantly different from zero, which could imply co-regulation.

4.3 Ouija is robust to gene behaviour misspecification

A potential disadvantage of our model is the requirement to pre-specify genes as having switch-like or transient behaviour over pseudotime, which may result in biased or erroneous pseudotimes. We noticed such an effect in the Li et al. (2016) dataset, where the authors pre-specified how they expected several marker genes to behave over pseudotime. Upon fitting the pseudotimes using Ouija, we noted that the genes Mef2c and Pik3r2 exhibited the correct upregulation over pseudotime (Supplementary Fig. S4A), but that Scd1 that was supposed to exhibit transient, peaking expression was effectively constant along the trajectory (Supplementary Fig. S4B).

We first asked whether this was a particular failing of Ouija or a result common to all pseudotime algorithms so fitted transcriptome-wide pseudotimes using TSCAN, Monocle 2 and DPT. We found remarkably low correlations between the different pseudotime algorithms (Supplementary Fig. S4C), with the highest correlations reported between Ouija using markers only and Monocle 2 using the full transcriptome. Furthermore, none of the pseudotime fits displays consistent nor expected behaviour for the set of marker genes (Supplementary Fig. S5).

We supplemented this with extensive simulations to discover wether Ouija is in general robust to gene behaviour misspecification. We simulated datasets where either 75 or 50% of the genes were switch-like (Supplementary Fig. S4D) for 8, 12, 16 and 24 genes with 100 replications for each situation, and re-inferred the pseudotimes using Ouija assuming all genes were switch-like. The results in Supplementary Figure S4D show with four switch-like and four transient genes Ouija still achieves a median correlation greater than 0.9 with the true pseudotimes, demonstrating Ouija is highly robust to misspecification of prior knowledge of gene behaviour.

It is further possible to identify errors in the prior belief of gene behaviour without having to explicitly fit a pseudotemporal trajectory. If a dataset contains a number of switch-like and transient genes, the switch-like genes will have high absolute correlation with themselves but low absolute correlation with the transient genes, which will in turn have high absolute correlation with themselves. This effect is exemplified in the Chu et al. dataset that contains nine switch-like and two transient genes. A hierarchical clustering of the absolute correlations across the genes reveals the transient genes clustering separately from the switch-like genes (Supplementary Fig. S6). Therefore, an investigator could corroborate their prior expectations through similar investigations.

4.4 Identifying discrete cell types along continuous developmental trajectories

We further investigated the single cell expression data from a study tracking the differentiation of embryonic precursor cells into haematopoietic stem cells (HSCs) (Zhou et al., 2016). The cells begin as haemogenic endothelial cells (ECs) before successively transforming into pre-HSC and finally HSC cells. The authors identified six marker genes that would be down-regulated along the differentiation trajectory, with early down-regulation of Nrp2 and Nr2f2 as the cells transform from ECs into pre-HSCs, and late down-regulation of Nrp1, Hey1, Efnb2 and Ephb4 as the cells emerge from pre-HSCs to become HSCs. The study investigated a number of distinct cell types at different stages of differentiation: EC cells, T1 cells (CDK45^– pre-HSCs), T2 cells (CDK45⁺ pre-HSCs) and HSC cells at the E12 and E14 developmental stages.

We therefore sought to identify the existence of these discrete cell types along the continuous developmental trajectory. As Ouija uses a probabilistic model and inference we were able to obtain a posterior ordering ‘consistency’ matrix (Fig. 4A) where an entry in row i column j denotes the empirical probability that cell i is ordered before cell j. Performing PCA on this matrix gives a rank-one representation of cell-cell continuity, which is then clustered using a Gaussian mixture model to find discrete cell states along the continuous trajectory (where the number of states is chosen such that the Bayesian information criterion is maximized).

Fig. 4.

Open in new tab Download slide

Pseudotime ordering and cell type identification of haematopoeietic stem cell differentiation. (A) Consistency matrix of pseudotime ordering. Entry in the ith row and jth column is the proportion of times cell i was ordered before cell j in the MCMC posterior traces. Gaussian mixture modelling on the first principal component of the matrix identified three clusters that are evident in the heatmap. (B) Confusion matrix for cell types identified in original study (columns) and Ouija inferred (rows). Ouija inferred cluster 1 largely corresponds to EC cells, cluster 2 corresponds to pre-HSC cells while cluster 3 corresponds to HSC cells. (C) HSC gene expression as a function of pseudotime ordering for six marker genes. Background colour denotes the maximum likelihood estimate for the Ouija inferred cell type in that region of pseudotime

Applying this methodology to the Zhou et al. dataset uncovered three metastable groups of cells corresponding to endothelial, pre-HSCs and HSCs respectively (Fig. 4B). Misclassifications within cell types (T1/T2 and E12/E14 cells) could be explained by examining a principal components analysis of the global expression profiles (Supplementary Fig. S7) which suggests that these cell types are not completely distinct in terms of expression. When examining the inferred pseudotime progression of each marker gene (Fig. 4C), these three metastable states corresponded to the activation of all genes at the beginning of pseudotime time, the complete inactivation of all the marker genes at the end of the pseudotime and a intervening transitory period as each marker gene turns off. Each metastable state clearly associates with a particular cell type with Nrp2 and Nr2f2 exhibiting early down-regulation and Nrp1, Hey1, Efnb2 and Ephb4 all exhibiting late down-regulation. Using this HSC formation system as a proof-of-principle it is evident that, if a small number of switch-like marker genes are known, it is possible to recover signatures of temporal progression using Ouija and that these trajectories are compatible with real biology.

To show the widespread applicability of this method we applied it to two further publically available datasets. Dulken et al. (2017) examined the trajectory of quiescent neural stem cells (qNSCs) as they differentiate into activated neural stem cells (aNSCs) and neural progenitor cells (NPCs). Applying Ouija’s clustering-along-pseudotime revealed seven distinct clusters (Supplementary Fig. S8; Supplementary Table S1) with clusters 1–2 corresponding to early and late qNSCs, cluster 3 defining the qNSC to aNSC transition, clusters 4–6 corresponding to early to late aNSCs and cluster 7 defining the aNSC to NPC transition. We similarly applied this method to the Chu et al. dataset of time-series scRNA-seq that identified 8 distinct clusters along pseudotime (Supplementary Fig. S9; Supplementary Table S2). Clusters 1–4 track the cells as the progress through the 4 stages from 0 to 36 h, while clusters 5–8 track the 3 stages from 36 to 96 h but with much more heterogeneity within each cluster, which is expected due to the longer time-scales considered.

4.5 Scalable pseudotime inference using TensorFlow

Finally, we wanted to consider a study composed of a large panel of putative marker genes to determine if Ouija could automatically identify genes satisfying its behavioural constraints. We identified a single-cell RNA-seq study (Kowalczyk et al., 2015) that examined variation between individual hematopoietic stem and progenitor cells from two mouse strains (C57BL/6 and DBA/2) as they age. Principal component analysis for each cell type and age showed a striking association of the top principal components with cell cycle-related genes (Fig. 5A), indicating that transcriptional heterogeneity was dominated by cell cycle status. They scored each cell for its likely cell cycle phase using signatures based on functional annotations (Reference Genome Group of the Gene Ontology Consortium, 2009) and profiles from synchronized HeLa cells (Whitfield et al., 2002) for the G1/S, S, G2 and G2/M phases.

Fig. 5.

Open in new tab Download slide

Cell cycle phase prediction. Principal component representation of hematopoietic stem cells coloured according to (A) the original cell cycle progression score (Kowalczyk et al., 2015) and (B) Ouija—cell cycle classes indicated are based on original study classifications. (C) Distribution of Ouija inferred pseudotime versus the original cell cycle classifications. (D) Estimated activation strengths for the 374 cell cycle gene panels. (E) Gene expression profile for 88 switch-like genes with cells ordered by pseudotime and (F) genes ordered by activation time

We investigated if Ouija could be used to identify cell cycle phase, treating the inferential problem as a continuous pseudotime process and assuming all genes as candidate switch genes. We applied Ouija to 1008 C57Bl/6 HSCs using 374 GO cell cycle genes that satisfied gene selection criteria used in the original study. This large number of genes and cells makes inference using Hamiltonian Monte Carlo (HMC) slow so we implemented a second version of Ouija (termed Ouijaflow) using the probabilistic programming language Edward (Tran et al., 2016) based on TensorFlow (Abadi et al., 2016). This performs fast approximate Bayesian inference using stochastic gradient variational inference (Supplementary Fig. S10).

The estimated pseudotime progression given by Ouija recapitulates the trajectory observed in principal component space (Fig. 5A). The estimated pseudotime distribution correlates well with the cell cycle phase categorization given in the original study (Fig. 5C). Furthermore, we identified 88 genes with large activation strengths indicating strong switching-on behaviour (Fig. 5D). Ordering the genes by activation time demonstrates a cascade of expression activation across these 88 genes over cell cycle progression with the quiescent (G₀) indicated by complete inactivation of all 88 genes (Fig. 5E, F). The explicit parametric model assumed by Ouija makes this gene selection and ordering process simple and quantitative compared to a non-parametric approach that would require some retrospective analysis or visual inspection.

5 Conclusion

We have developed a novel approach for pseudotime estimation based on modelling switch-like and transient expression behaviour for a small panel of marker genes chosen a priori. Our strategy provides an orthogonal and complementary approach to unsupervised whole-transcriptome methods that do not explicitly model any gene-specific behaviours and do not readily permit the inclusion of prior knowledge.

We demonstrate that the selection of a few marker genes allows comparable pseudotime estimates to whole transcriptome methods on real single cell datasets. Furthermore, using a parametric gene behaviour model and full Bayesian inference we are able to recover posterior uncertainty information about key parameters, such as the gene activation time, allowing us to explicitly determine a potential ordering of gene (de)activation and peaking events over pseudotime. The posterior ordering uncertainty can also be used to identify homogeneous metastable phases of transcriptional activity that might correspond to transient, but discrete, cell states. In summary, Ouija provides a novel contribution to the increasing plethora of pseudotime estimation methods available for single cell gene expression data.

Funding

KRC was supported by a doctoral studentship from the UK Medical Research Council and a postdoctoral fellowship from the Canadian Statistical Sciences Institute. CY was supported by the UK Medical Research Council [MR/P02646X/1 and MR/L001411/1].

Conflict of Interest: none declared.

References

Abadi

M.

et al. (

2016

) TensorFlow: large-scale machine learning on heterogeneous distributed systems.

arXiv preprint

.

Bendall

S.C.

et al. (

2014

)

Single-cell trajectory detection uncovers progression and regulatory coordination in human B cell development

.

Cell

,

157

,

714

–

725

.

Braun

R.

et al. (

2008

)

Identifying differential correlation in gene/pathway combinations

.

BMC Bioinformatics

,

9

,

488.

Campbell

K.R.

,

Yau

C.

(

2016

)

Order under uncertainty: robust differential expression analysis using probabilistic models for pseudotime inference

.

PLoS Comput. Biol

.,

12

,

e1005212

.

Campbell

K.R.

,

Yau

C.

(

2017a

)

Probabilistic modeling of bifurcations in single-cell gene expression data using a bayesian mixture of factor analyzers

.

Wellcome Open Res

.,

2

,

19.

Campbell

K.R.

,

Yau

C.

(

2017b

)

switchde: inference of switch-like differential expression along single-cell trajectories

.

Bioinformatics

,

33

,

1241

–

1242

.

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

Carpenter

B.

et al. (

2017

)

Stan: a probabilistic programming language

.

J. Stat. Softw

,

76

.

Google Scholar

OpenURL Placeholder Text

WorldCat

Chu

L.-F.

et al. (

2016

)

Single-cell rna-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm

.

Genome Biol

.,

17

,

173.

Cleary

B.

et al. (

2017

) Composite measurements and molecular compressed sensing for highly efficient transcriptomics. bioRxiv, 091926.

Dulken

B.W.

et al. (

2017

)

Single-cell transcriptomic analysis defines heterogeneity and transcriptional dynamics in the adult neural stem cell lineage

.

Cell Rep

.,

18

,

777

–

790

.

Haghverdi

L.

et al. (

2016

)

Diffusion pseudotime robustly reconstructs lineage branching

.

Nat. Methods

,

13

,

845

.

Hanchate

N.K.

et al. (

2015

)

Single-cell transcriptomics reveals receptor transformations during olfactory neurogenesis

.

Science

,

350

,

1251

–

1255

.

Heimberg

G.

et al. (

2016

)

Low dimensionality in gene expression data enables the accurate extraction of transcriptional programs from shallow sequencing

.

Cell Syst

.,

2

,

239

–

250

.

Homan

M.D.

,

Gelman

A.

(

2014

)

The no-u-turn sampler: adaptively setting path lengths in hamiltonian monte carlo

.

J. Mach. Learn. Res

.,

15

,

1593

–

1623

.

Google Scholar

OpenURL Placeholder Text

WorldCat

Ji

Z.

,

Ji

H.

(

2016

)

TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis

.

Nucleic Acids Res

.,

44

,

e117

.

Kalisky

T.

,

Quake

S.R.

(

2011

)

Single-cell genomics

.

Nat. Methods

,

8

,

311

–

314

.

Kharchenko

P.V.

et al. (

2014

)

Bayesian approach to single-cell differential expression analysis

.

Nat. Methods

,

11

,

740

–

742

.

Kowalczyk

M.S.

et al. (

2015

)

Single-cell rna-seq reveals changes in cell cycle and differentiation programs upon aging of hematopoietic stem cells

.

Genome Res

.,

25

,

1860

–

1872

.

Langfelder

P.

,

Horvath

S.

(

2008

)

Wgcna: an R package for weighted correlation network analysis

.

BMC Bioinformatics

,

9

,

559.

Li

J.

et al. (

2016

)

Systematic reconstruction of molecular cascades regulating gp development using single-cell rna-seq

.

Cell Rep

.,

15

,

1467

–

1480

.

Macaulay

I.C.

,

Voet

T.

(

2014

)

Single cell genomics: advances and future perspectives

.

PLoS Genet

.,

10

,

e1004126.

McCarthy

D.J.

et al. (

2017

)

Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R

.

Bioinformatics

,

33

,

1179

--

1186

.

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

McCurdy

S.

et al. (

2017

) Column subset selection for single-cell rna-seq clustering. bioRxiv, 159079.

Pierson

E.

,

Yau

C.

(

2015

)

Zifa: dimensionality reduction for zero-inflated single-cell gene expression analysis

.

Genome Biol

.,

16

,

241.

Qiu

P.

et al. (

2011

)

Extracting a cellular hierarchy from high-dimensional cytometry data with spade

.

Nat. Biotechnol

.,

29

,

886

–

891

.

Reference Genome Group of the Gene Ontology Consortium

. (

2009

)

The gene ontology’s reference genome project: a unified framework for functional annotation across species

.

PLoS Comput. Biol

.,

5

,

e1000431

.

Crossref

PubMed

WorldCat

Reid

J.E.

,

Wernisch

L.

(

2016

) Pseudotime estimation: deconfounding single cell time series.

Bioinformatics

,

32

,

2973

–

2980

.

Sander

J.

et al. (

2017

)

Impulsede: detection of differentially expressed genes in time series data using impulse models

.

Bioinformatics

,

33

,

757

–

759

.

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

Setty

M.

et al. (

2016

)

Wishbone identifies bifurcating developmental trajectories from single-cell data

.

Nat. Biotechnol

.,

34

,

637

–

645

.

Shapiro

E.

et al. (

2013

)

Single-cell sequencing-based technologies will revolutionize whole-organism science

.

Nat. Rev. Genet

.,

14

,

618

–

630

.

Shin

J.

et al. (

2015

)

Single-cell RNA-seq with waterfall reveals molecular cascades underlying adult neurogenesis

.

Cell Stem Cell

,

17

,

360

–

372

.

Stegle

O.

et al. (

2015

)

Computational and analytical challenges in single-cell transcriptomics

.

Nat. Rev. Genet

.,

16

,

133

–

145

.

Tegge

A.N.

et al. (

2012

)

Pathway correlation profile of gene-gene co-expression for identifying pathway perturbation

.

PLoS One

,

7

,

e52127.

Tran

D.

et al. (

2016

) Edward: a library for probabilistic modeling, inference, and criticism.

arXiv preprint arXiv

:1610.09787.

Trapnell

C.

(

2015

)

Defining cell types and states with single-cell genomics

.

Genome Res

.,

25

,

1491

–

1498

.

Trapnell

C.

et al. (

2014

)

The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells

.

Nat. Biotechnol

.,

32

,

381

–

386

.

Whitfield

M.L.

et al. (

2002

)

Identification of genes periodically expressed in the human cell cycle and their expression in tumors

.

Mol. Biol. Cell

,

13

,

1977

–

2000

.

Wills

Q.F.

,

Mead

A.J.

(

2015

)

Application of single cell genomics in cancer: promise and challenges

.

Hum. Mol. Genet

.,

24

,

R74

–

R84

.

Zhou

F.

et al. (

2016

)

Tracing haematopoietic stem cell formation at single-cell resolution

.

Nature

,

533

,

487

–

492

.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Associate Editor:

Download all slides

Month:	Total Views:
June 2018	113
July 2018	56
August 2018	79
September 2018	46
October 2018	46
November 2018	113
December 2018	380
January 2019	365
February 2019	155
March 2019	151
April 2019	199
May 2019	135
June 2019	110
July 2019	109
August 2019	102
September 2019	136
October 2019	138
November 2019	71
December 2019	64
January 2020	101
February 2020	99
March 2020	65
April 2020	52
May 2020	62
June 2020	64
July 2020	56
August 2020	59
September 2020	74
October 2020	74
November 2020	48
December 2020	44
January 2021	48
February 2021	45
March 2021	57
April 2021	48
May 2021	72
June 2021	46
July 2021	55
August 2021	60
September 2021	67
October 2021	69
November 2021	49
December 2021	51
January 2022	52
February 2022	74
March 2022	108
April 2022	61
May 2022	39
June 2022	45
July 2022	77
August 2022	55
September 2022	75
October 2022	53
November 2022	58
December 2022	39
January 2023	54
February 2023	55
March 2023	72
April 2023	46
May 2023	64
June 2023	45
July 2023	25
August 2023	29
September 2023	31
October 2023	48
November 2023	44
December 2023	54
January 2024	73
February 2024	38
March 2024	73
April 2024	27

Article Contents

A descriptive marker gene approach to single-cell pseudotime inference

Abstract

1 Introduction

2 Approach

3 Materials and methods

3.1 Overview

3.2 Input data normalization

3.3 Noise model

3.4 Mean functions

3.5 Inference

4 Results and discussion

4.1 Pseudotime inference from small marker gene panels

4.2 Gene regulation timing from marker gene-based pseudotime

4.3 Ouija is robust to gene behaviour misspecification

4.4 Identifying discrete cell types along continuous developmental trajectories

4.5 Scalable pseudotime inference using TensorFlow

5 Conclusion

Funding

References

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

Article Contents

A descriptive marker gene approach to single-cell pseudotime inference

Abstract

1 Introduction

2 Approach

3 Materials and methods

3.1 Overview

3.2 Input data normalization

3.3 Noise model

3.4 Mean functions

3.5 Inference

4 Results and discussion

4.1 Pseudotime inference from small marker gene panels

4.2 Gene regulation timing from marker gene-based pseudotime

4.3 Ouija is robust to gene behaviour misspecification

4.4 Identifying discrete cell types along continuous developmental trajectories

4.5 Scalable pseudotime inference using TensorFlow

5 Conclusion

Funding

References

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

This Feature Is Available To Subscribers Only