SCTC: inference of developmental potential from single-cell transcriptional complexity

Abstract
Inferring the developmental potential of single cells from scRNA-Seq data and reconstructing the pseudo-temporal path of cell development are fundamental but challenging tasks in single-cell analysis. Although single-cell transcriptional diversity (SCTD), measured by the number of expressed genes per cell, has been widely used as a hallmark of developmental potential, it can misestimate differentiation states when the number of expressed genes does not decrease monotonically during development. In this study, we propose a novel metric called single-cell transcriptional complexity (SCTC), which draws on insights from economic complexity theory and takes into account the structural information of the scRNA-Seq count matrix. We show that SCTC characterizes developmental potential more accurately than SCTD, especially in the early stages of development, where cells typically have lower diversity but higher complexity than those in the later stages. Based on SCTC, we provide an unsupervised method for accurate, robust, and transferable inference of single-cell pseudotime. Our findings suggest that the complexity emerging from the interplay between cells and genes determines developmental potential, providing new insights into biological development from the perspective of complexity theory.


Negative Correlation between Odd-Order and Even-Order Cell Complexities
The plots illustrate the differences between consecutive odd-order and even-order complexities (kc,0-kc,1, kc,1-kc,2, kc,2-kc,3, and kc,3-kc,4) for (A) HND data, (B) ZEB data, (C) HSG data, and (D) MSG data. The negative correlation between odd-order and even-order complexities is evident across all four datasets.

Supplementary Figure S5
Threshold Order for Complexity Convergence
The recursion causes the cell complexities to collapse to the same value once the complexity order N exceeds a certain threshold Nth. The threshold order Nth at which this collapse occurs is determined for each dataset: (A) Nth = 16 for HND data, (B) Nth = 28 for ZEB data, (C) Nth = 52 for HSG data, and (D) Nth = 60 for MSG data.
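The collapse described in this caption can be reproduced on a toy matrix. The following is a minimal sketch, assuming the recursion follows the standard method of reflections from economic complexity (the caption itself does not state the formula). The function names `reflect` and `threshold_order`, the tolerance `tol`, and the 4 x 4 binary matrix are illustrative assumptions and are not taken from the study.

```python
def reflect(M, n_orders):
    """Cell complexities k_c,0 ... k_c,n_orders for a binary cell-by-gene matrix M.

    Assumed recursion (method of reflections):
      k_c,N = (1 / k_c,0) * sum_g M[c][g] * k_g,N-1
      k_g,N = (1 / k_g,0) * sum_c M[c][g] * k_c,N-1
    Every cell and every gene must have at least one nonzero entry.
    """
    n_cells, n_genes = len(M), len(M[0])
    kc0 = [sum(row) for row in M]  # diversity: genes expressed per cell
    kg0 = [sum(M[c][g] for c in range(n_cells)) for g in range(n_genes)]  # ubiquity
    kc, kg = kc0[:], kg0[:]
    history = [kc0[:]]
    for _ in range(n_orders):
        new_kc = [sum(M[c][g] * kg[g] for g in range(n_genes)) / kc0[c]
                  for c in range(n_cells)]
        new_kg = [sum(M[c][g] * kc[c] for c in range(n_cells)) / kg0[g]
                  for g in range(n_genes)]
        kc, kg = new_kc, new_kg
        history.append(kc[:])
    return history


def threshold_order(history, tol=1e-6):
    """First order at which the relative spread of cell complexities drops below tol."""
    for n, kc in enumerate(history):
        if (max(kc) - min(kc)) / max(kc) < tol:
            return n
    return None  # no collapse within the computed orders


# Toy binary expression matrix (rows: cells, columns: genes); made up for illustration.
M = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
history = reflect(M, n_orders=200)
n_th = threshold_order(history)
print("cell complexities collapse at order", n_th)
```

Because higher orders repeatedly average over neighbouring cells and genes, the spread across cells shrinks with N; the order at which it becomes negligible depends on the tolerance chosen, so a threshold found this way is only comparable across datasets when the same tolerance is used.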

Supplementary Figure S6
Comparison of Pseudotime Inference between kc,14 and CCI
Pseudotime inferred using the 14th-order cell complexity (kc,14, left column) versus the Cell Complexity Index (CCI, right column) on (A) HND, (B) ZEB, (C) HSG, and (D) MSG scRNA-seq datasets. The pseudotime orderings obtained by the two methods show high concordance across all four datasets, indicating that kc,14 provides pseudotime inference equivalent to that of CCI.

Supplementary Figure S7
Correlation Between Nth-Order Cell Complexity and Cell Complexity Index (CCI)
We present the Spearman Correlation Coefficients (SCCs) between various Nth-order complexities and CCI across four datasets: (A) HND, (B) ZEB, (C) HSG, and (D) MSG. The figure illustrates the consistency between CCI and the Nth-order complexity within a certain range of N values. When using the Nth-order complexity, it is critical to identify an appropriate N. One option is to stop the iterative computation once the SCC between the Nth-order complexity and CCI is sufficiently high (approaching 1) and to adopt the current N. Alternatively, a broader range of N values can be explored, and the N with the highest correlation with CCI can be chosen. Specifically, the N values corresponding to the maximum SCC for the four datasets are as follows: 14 for HND, 24 for ZEB, 30 for HSG, and 34 for MSG.
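The two selection strategies described in this caption can be sketched as follows. This is an illustrative sketch only: the helper names (`spearman`, `best_order`, `first_order_reaching`), the 0.99 cutoff, and all numeric values are made up for the example, and the Spearman coefficient is hand-rolled here to keep the snippet dependency-free.

```python
def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def best_order(kc_by_order, cci):
    """Strategy 2: scan a range of orders and keep the one with the highest SCC."""
    scc = {n: spearman(kc, cci) for n, kc in kc_by_order.items()}
    return max(scc, key=scc.get), scc


def first_order_reaching(scc, threshold=0.99):
    """Strategy 1: stop at the first order whose SCC reaches the threshold."""
    for n in sorted(scc):
        if scc[n] >= threshold:
            return n
    return None


# Hypothetical per-cell values for four cells (not from the study).
cci = [0.1, 0.4, 0.2, 0.9]
kc_by_order = {
    2: [1.0, 2.0, 3.0, 4.0],
    4: [1.0, 3.0, 2.0, 4.0],
    6: [2.0, 1.0, 3.0, 4.0],
}
n_best, scc = best_order(kc_by_order, cci)
n_stop = first_order_reaching(scc)
print(n_best, n_stop, scc)
```

The early-stopping variant avoids computing higher orders than necessary, whereas the scan over a range is more expensive but does not depend on a cutoff.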

Silhouette Coefficients of Gene Complexity across Complexity Orders
Silhouette coefficients measuring the consistency of gene complexity with developmental time at different complexity orders N, calculated on the scRNA-seq datasets: (A) HND, (B) HSG, and (C) MSG. Overall, the silhouette coefficient increases with the complexity order N, indicating that higher-order complexities distinguish developmental stages better.
We excluded dataset 30 (AT2/AT1 lineage (C1)) because we found that its time labels did not match the actual developmental stages. Additionally, we removed two datasets that overlapped with the Quasildr benchmark datasets (30).
Datasets 40-51 were obtained from the Quasildr benchmark datasets (30), which primarily derive from the work of Saelens et al. (12), with the addition of a few extra datasets. The Quasildr benchmark datasets contain three types of standards: gold, silver, and other. We selected 12 gold-standard datasets with a linear trajectory type and two other-standard datasets (dataset 50 and dataset 51) from model organismal single-cell developmental studies.
Although dataset 51 (briggs_2018) originates from the Quasildr benchmark datasets, its preprocessed format does not meet CytoTRACE's input requirements. We therefore obtained the raw data from Gene Expression Omnibus (GEO) under accession number GSE113074 (33).
Dataset 56 (Mouse Spermatogenesis), like the HSG and MSG datasets, is one of the mammalian spermatogenesis scRNA-seq datasets published by Shami et al. (23,32).
For duplicate datasets from different sources, we retained only the copy with the highest average Spearman Correlation Coefficient (SCC), calculated using both the SCTD and SCTC methods.
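The selection rule for duplicates can be written out as a short sketch. The record layout, the field names (`dataset`, `source`, `scc_sctd`, `scc_sctc`), and the SCC values below are hypothetical placeholders chosen for illustration; they are not results from the study.

```python
from collections import defaultdict


def keep_best_duplicates(records):
    """For each dataset, keep the copy with the highest mean of the SCTD and SCTC SCCs."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["dataset"]].append(rec)
    return [
        max(copies, key=lambda r: (r["scc_sctd"] + r["scc_sctc"]) / 2)
        for copies in groups.values()
    ]


# Made-up example: dataset "X" is available from two sources, dataset "Y" from one.
records = [
    {"dataset": "X", "source": "source_1", "scc_sctd": 0.50, "scc_sctc": 0.70},
    {"dataset": "X", "source": "source_2", "scc_sctd": 0.55, "scc_sctc": 0.75},
    {"dataset": "Y", "source": "source_1", "scc_sctd": 0.40, "scc_sctc": 0.60},
]
kept = keep_best_duplicates(records)
for rec in kept:
    print(rec["dataset"], rec["source"])
```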
The datasets sourced from the CytoTRACE benchmark had already undergone normalization and log1p transformation, so we did not perform any additional preprocessing on them. The datasets from Quasildr and NCG had only been normalized, so we applied the log1p transformation to them. All other data underwent both normalization and log1p transformation.
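The source-dependent preprocessing can be summarized as a small dispatch function. This is a sketch under stated assumptions: the note does not specify the normalization scheme, so per-cell library-size scaling to a fixed total (`target=1e4`) is assumed here, and the function names and source labels are illustrative.

```python
import math


def normalize(matrix, target=1e4):
    """Scale each cell (row) so its counts sum to `target` (assumed scheme)."""
    out = []
    for row in matrix:
        total = sum(row)
        scale = target / total if total else 0.0  # leave empty cells at zero
        out.append([v * scale for v in row])
    return out


def log1p_transform(matrix):
    """Apply log(1 + x) element-wise."""
    return [[math.log1p(v) for v in row] for row in matrix]


def preprocess(matrix, source):
    """Apply only the steps a dataset has not already received."""
    if source == "CytoTRACE":          # already normalized and log1p-transformed
        return matrix
    if source in {"Quasildr", "NCG"}:  # already normalized, log1p still needed
        return log1p_transform(matrix)
    return log1p_transform(normalize(matrix))  # raw counts: both steps


# Toy cell-by-gene matrix with a single cell (values made up).
raw = [[1, 3]]
print(preprocess(raw, "CytoTRACE"))
print(preprocess(raw, "Quasildr"))
print(preprocess(raw, "other"))
```

Keeping the per-source rule in one place guards against applying log1p twice to data that has already been transformed.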