PseudoGA: cell pseudotime reconstruction based on genetic algorithm

Abstract Dynamic regulation of gene expression is often governed by progression through transient cell states. Bulk RNA-seq analysis can only detect average change in expression levels and is unable to identify this dynamics. Single cell RNA-seq presents an unprecedented opportunity that helps in placing the cells on a hypothetical time trajectory that reflects gradual transition of their transcriptomes. This continuum trajectory or ‘pseudotime’, may reveal the developmental pathway and provide us with information on dynamic transcriptomic changes and other biological processes. Existing approaches to build pseudotime heavily depend on reducing huge dimension to extremely low dimensional subspaces and may lead to loss of information. We propose PseudoGA, a genetic algorithm based approach to order cells assuming that gene expressions vary according to a smooth curve along the pseudotime trajectory. We observe superior accuracy of our method in simulated as well as benchmarking real datasets. Generality of the assumption behind PseudoGA and no dependence on dimensionality reduction technique make it a robust choice for pseudotime estimation from single cell transcriptome data. PseudoGA is also time efficient when applied to a large single cell RNA-seq data and adaptable to parallel computing. R code for PseudoGA is freely available at https://github.com/indranillab/pseudoga.

Step 2: After Step 1, the set of expression values for any particular gene has increasing linear trend along pseduotime trajectory if β j is taken positive. If we want j-th gene to have decreasing mean expression level with pseudotime, we modify the set of expression values for j-th gene as Y .j = reverse(Y .j ) where reverse(x 1 , x 2 , ..., x n ) = (x n , x (n−1) , ..., x 1 ). To include genes with both increasing and decreasing trend, we apply Y .j = reverse(Y .j ) for some genes and keep Y .j = Y .j for the rest.
Step 3: To construct a gene with approximate quadratic trend with respect to pseudotime, we modify Y .j from Step 2 in the following manner: [1, 3, 5, ..., (2n − 1), (2n), (2n − 2), ..., 2] To construct a gene with approximate cubic trend or sinusoidal trend with respect to pseudotime, we modify Y .j in the following manner: To generate genes that are not associated with pseudotime, we take Y .j as a random permutation of Y .j .
To include variety of trends in our final gene expression matrix, we consider four types of genes: genes with quadratic trend, genes with cubic trend, genes with linear trend from Step 2 that we keep unchanged in Step 3 and genes with no association with pseudotime. In scenarios I and II, we assume relatively high values of c j (c j ∼ N (1, 1 4 )) implying lower occurrence of zeros. In scenario III, relatively lower values of c j (c j ∼ N (0, 1 4 )) are considered to incorporate higher incidence of dropouts.
The expression values for the j-th gene corresponding to the ordering (i 1 , i 2 , ..., i n ), s .j is obtained by equating (s i 1 j , s i 2 j , ..., s inj ) = Y .j

Simulation with Splatter
Using the Bioconductor package Splatter (1), single cell RNA-seq data were generated using three different methods: PROSSTT (2), Splat (1) and PhenoPath (3). Under PROSSTT simulation method, the datasets of the specified size were generated with 100 steps keeping all other parameters as default.
Similarly, datasets with the specified number of features and cells were generated by Splat simulation method with 100 steps keeping all other parameters as default. Let i-th gene has the highest variance of expression values. The expression values of the i-th gene X i is changed into σ(X i ) where σ(.) represents a random permutation of (1, 2, . . . , n), n being the number of cells.
Similarly, simulations using PhenoPath with the specified size were performed with all default parameters. PhenoPath generates expression values comparable to log normalized expressions. To convert them into actual expression values, the following transformation is applied on expression vectors X i for the i-th gene: Y i = 10 X i −min(X i )−3 . Y i values for all 100 genes are used for estimation by different methods. The distribution of Y i can be thought of as left truncated unimodal log-normal, which is known to be a good approximation of normalized single cell expression data (4).

Operators in genetic algorithm
We explain different operators like mutation, recombination and selection that are used in our algorithm. We explain these using two chromosomal representations of eight objects as given in Figure S1. Mutation operator, when applied on one permutation, arranges the objects in the segment between two positions in reverse order ( Figure S2). We can also think of similar kinds of other mutation operators.
We apply recombination operator on two parental chromosomal configurations. This would produce two new offspring chromosomes that are different from parental chromosomes ( Figure S3).