CellGO: a novel deep learning-based framework and webserver for cell-type-specific gene function interpretation

Abstract Interpreting the function of genes and gene sets identified from omics experiments remains a challenge, as current pathway analysis tools often fail to consider the critical biological context, such as tissue or cell-type specificity. To address this limitation, we introduced CellGO. CellGO tackles this challenge by leveraging the visible neural network (VNN) and single-cell gene expressions to mimic cell-type-specific signaling propagation along the Gene Ontology tree within a cell. This design enables a novel scoring system to calculate the cell-type-specific gene-pathway paired active scores, based on which, CellGO is able to identify cell-type-specific active pathways associated with single genes. In addition, by aggregating the activities of single genes, CellGO extends its capability to identify cell-type-specific active pathways for a given gene set. To enhance biological interpretation, CellGO offers additional features, including the identification of significantly active cell types and driver genes and community analysis of pathways. To validate its performance, CellGO was assessed using a gene set comprising mixed cell-type markers, confirming its ability to discern active pathways across distinct cell types. Subsequent benchmarking analyses demonstrated CellGO’s superiority in effectively identifying cell types and their corresponding cell-type-specific pathways affected by gene knockouts, using either single genes or sets of genes differentially expressed between knockout and control samples. Moreover, CellGO demonstrated its ability to infer cell-type-specific pathogenesis for disease risk genes. Accessible as a Python package, CellGO also provides a user-friendly web interface, making it a versatile and accessible tool for researchers in the field.


Model details of CellGO
CellGO consists of two phases: (i) The modeling phase includes steps 1-3, developed for scoring cell type-specific activities of gene-term pairs to enable cell type-specific pathway analysis for a single gene, and (ii) the analysis phase includes steps 4-7, developed for comprehensive cell type-specific pathway analysis that works for a gene set.Detailed descriptions of each step are as follows.

Step 1: Processing of ontologies and extraction of subtrees
We used the python package goatools [3] v1.0.15 to import the GO graph into the workspace to make it operable and assigned genes to GO terms based on species-specific GO annotations.Then we pruned the species-specifically annotated GO graph according to the following steps: (i) A GO term with less than six annotated genes was deleted, and all genes within it were passed to its first-order parental terms.This procedure was performed from bottom to up to ensure a term is deleted only when it has no child terms.(ii) We deleted the Cellular Component (CC) branch due to its small number of terms.Only the Biological Process (BP) and Molecular Function (MF) branches were retained.After this step, the human-and mouse-specifically annotated GO graph retained 9,870 and 9,979 terms, respectively.(iii) For each non-leaf term (a term with at least one child term), we removed annotated genes within it which simultaneously be annotated in its child terms.(iv) For each non-leaf term, we removed all genes within it if more than 100 genes were annotated.This step avoids the excessive influence of a single term on the global graph.After the above processing of ontologies, we next extracted subtrees from the pruned GO graph.Concretely, for each leaf term with no more than 100 genes, we exported it and all of its parental terms, the parent-child relations between these terms, and the annotated genes of each term as a single file with a specified format.A total of 4,168 and 4,230 subtrees were extracted from the human-and mouse-specifically pruned GO graph, respectively.This part has been coded into a function to facilitate processing other ontology files.

Step 2: Training the VNN of each subtree
The VNN was initially inspired by Dcell [4], whose neural network architecture matched the hierarchical structure of the GO graph.The differences of the VNN between CellGO and Dcell are discussed in Supplementary Text S3.Given a subtree including interrelations between terms and annotated genes within each term, CellGO constructs and trains a VNN whose architecture mirrors the inter-pathway parent-child relations and intra-pathway gene annotations.We denote our input training dataset as D = {(X 1 , y 1 ), (X 1 , y 1 ),…, (X N , y N )}, where N is the number of samples (or cells).For each sample i, X i ∈R G denotes the gene expression, represented as a vector of continuous values on G genes, and y i ∈R C denotes the cell identity, represented as a one-hot vector of binary values on C classes (or cell types).y ic = 1 if sample i is annotated to class c, otherwise y ic = 0.If an annotated gene does not exist in the single-cell expression matrix, we pad the expression of this gene with zero.The multidimensional state of each term (or pathway) p, denoted by the output vector O i (p) , is defined by a nonlinear function of the contribution of its annotated genes g (p) (if present) and the states of all of its first-order children child (p) , concatenated in the input vector I i (p) : Here, Z i (p) is the contribution vector of annotated genes of p with length In equation ( 3), the parameters of Linear give the ability to capture the complex biological signals of p. Tanh and Batch Norm reduce the impact of extreme values and scale differences of weights to accelerate the training speed.We performed the training process by minimizing the objective function: Here, Loss is the cross-entropy loss function, and r is the top-level term of the subtree.The top-level term can only be one of 'biological_process' (GO:0008150) and 'molecular_function' (GO:0003674).Linear in equation ( 5) denotes the linear function transforming the multi-dimensional vector O i (p) into a vector of continuous values with length C. Every term is optimized to predict the cell identities by itself.The parameter α (=0.3) balances these two contributions.λ (=0.001) is an l2 norm regularization factor.We optimized the objective function using ADAM, and the batch size was selected as one-third of the sample number of the class with the minimum number.We trained each subtree-matched VNN separately using 15 epochs, given its strong learning ability and fast convergence, which also reduced overfitting.We downsampled classes with over 10,000 samples and oversampled classes with less than 500 samples at each epoch (Supplementary Table S1).

Step 3: Scoring cell type-specific activities of gene-term pairs
After achieving a well-trained VNN corresponding to a subtree v, CellGO scores the cell type-specific activities of gene-term pairs within it.Let gp∈GP (v) denote all gene-term paired annotation relations within v, and the raw score of a gene-term pair gp in class c is determined by S gp,c (v) = − change of sample number predicted as c by r when setting the expression of g in p to 0 (6) Intuitively, the top-level term r is the parent of all terms in the subtree except itself, which may better respond to the influence of the perturbation of g in p on the whole subtree.When the predicted number of c is decreased then this score is positive and vice versa.We noticed that the impact of the perturbation was transmitted upward along the network architecture to r, which simulated the transmission process of biological signals within the hierarchical functional structure of a cell.We next eliminated the class-specific background bias by This step allows these scores to reflect relative activity differences between gene-term pairs, and removes the internal bais of the VNN which may cause every score is extreme in a certain class.We next corrected these scores according to the classification accuracy of r: number of all samples (8) size factor (v) = max 0, accuracy r (v) − accuracy random 1 − accuracy random (9) S gp,c (v) ← S gp,c (v) * size factor (v) (10) Here, size factor (v) measures the predictive ability of r. size factor (v) = 1 if the classification accuracy of r equals 1, and size factor (v) = 0 if the accuracy of r is less than the accuracy of classifying all samples to the class with the maximum number.We performed this step because if genes annotated in v do not have any cell type-specific signatures, the VNN at this time is not predictable, hence these scores should be rejected.To reduce the time-consuming of the modeling phase, the classification accuracy of r is replaced by the classification accuracy for the training dataset rather than the accuracy based on five-fold cross-validation.Let V denote the collection of every subtree.Since a gp may appear in multiple subtrees, we next computed gene-term scores without consideration of v: v∈V (11) Let GP denote the collection of every gene-term pair.We further balanced the power of positive and negative scores by Lastly, we normalized these scores to the interval − 1, 1 by max abs S gp,c gp∈GP (13) So far, we have completed the estimation of cell type-specific gene-term scores.To facilitate storage and transfer, we exported them as a CSV file.We further defined the P-value of a cell type-specific gene-term score by P-value gp,c = g ' p ' ∈GP S gp,c < S g ' p ' ,c number of all gene-term pairs (14) We ran the above pipeline on 71 datasets (Supplementary Table S1) separately to obtain 71 CSV files for downstream analysis.Dataset 1, which includes 25,639 cells and six cell types, spent about 35 hours using one NVIDIA GeForce RTX 2080 Ti GPU.

Step 4: Assigning cell type-specific P-values to pathways given a gene list
Given a set of genes denoted by g∈G (u) .Let GP (p) denote the collection of all gene-term pairs which involve term p or any one of its child terms, and let G (p) denote all annotated genes of term p and all of its child terms.CellGO next computed cell type-specific pathway active scores (ctPASs) through summing gene-pathway paired active scores (GPPASs): Here, A gp,c denotes the cell type-specific GPPAS of gene g in pathway p, if p or its children contain g.A p,c denotes the ctPAS of p by summarizing GPPASs of genes in both G (u) and G (p) .The most active gene in a cell type is defined as the gene with the largest cell type-specific GPPAS.We noticed that a higher GPPAS indicating a more positive and larger contribution to pathway activation.We next generated approximate null distribution of ctPASs by randomly sampling B (default 100,000) gene sets G K (b) , ∀b∈ 1, …, B with K genes in each: Here, E g,c denotes the cell type-specific gene active score of gene g without consideration of p, a universal approximation of A gp,c in equation ( 15).E c,K denotes the cell type-specific active scores without consideration of p. Distribution c,K denotes the null distribution of ctPASs with consideration of gene number K. Let K p (u) denote the number of genes contained by both G (u) and G (p) , we assigned the P-value to A p,c (u) by In order to make A p,c (u) comparable between different pathways and cell types, we calculated the z-scored ctPAS as the output value of the ctPAS by Step

5: Analysis of cell-type level activation
We used the python package scipy [5] v1.5.4 to perform Kolmogorov-Smirnov (K-S) test to evaluate the cell type-level significance on a given gene list: , alternative = 'less' (22) Step 6: Cell type-specific pathway network analysis We screened cell type-specific active pathways by P-value p,c (u) ≤ ActiveP (default 0.01) and took them as seed nodes of random walk with restart (RWR) to explore high-affinity nodes in the pruned GO graph.Since BP and MF are two non-interconnected domains, we implemented RWR on the two domains separately.Let p∈P BP (c) denote active pathways of cell type c in BP, we then went on 25 travels starting from each of them, and each travel included ten steps.The first step of each travel was at the starting node.Each of the other nine steps had a 10% probability of returning to the starting and a 90% probability of randomly entering one neighbor (one parent or one child) of the current node.We then selected high-affinity nodes according to the times each node be traveled.The number of high-affinity nodes was defined as [1.1 * number of seed nodes].It should be noted that high-affinity nodes allow the inclusion of seed nodes.
Let p∈hP BP (c) denote high-affinity nodes (or pathways) of cell type c in BP, which composed the network of cell type-specific enriched pathways.We then divided them to determine pathway communities.Specifically, we used the python package graph-tool (DOI: 10.6084/m9.figshare.1164194.v14)v2.43 to partition these nodes according to the maximization of Newman's modularity [6].The graph-tool adopted an agglomerative heuristic to fit the stochastic block model to achieve the community partition.We generated 100 Monte Carlo (MC) partitions and aligned them with a common community labelling.
Here, equation ( 23) denotes the significance of community t of cell type c in BP.We defined communities with significance greater than − log 10 (ActiveP) (default 2) as active (or significant) communities.We further filtered out communities with a low number of pathways by the average number of pathways contained by significant communities.Prioritized top communities were selected based on significance rankings.In this study, we manually summarize the topic of a community to maximally cover the semantics of pathways within the community.It should be noted that all results of this part reported in this paper were based on BP because MF contained a low number of pathways.For the differentially expressed gene (DEG) list generated from the Conrow-Graham et al.Adnp KO dataset [7], we set ActiveP to 0.001 because there were more than 500 original active pathways in microglia.

Step 7: ORA-based pathway analysis in CellGO
CellGO included the function of conventional pathway enrichment analysis similar to gProfileR [8][9][10].Given a set of genes denoted by g∈G (u) , and let G (p) denote the collection of all annotated genes of a term p and all of its child terms.We next used the python package scipy [5] v1.5.4 to perform Fisher's exact test to evaluate the pathway-level significance on a given gene list: These results were output simultaneously with cell type-specific P-values.

Differences of the VNN between CellGO and Dcell
Compared with Dcell [4], the VNN in the modeling phase of CellGO has made the following changes and upgrades: (1) Dcell is designed to predict phenotype (such as the relative growth rate of yeast) from genotype (a binary vector indicating gene deletions).However, CellGO trains the VNN by inputting a single-cell expression matrix to predict identities of cells, then CellGO scores the cell type-specific activities of gene-term pairs by assessing the impact of gene expression perturbation within a pathway on cell identity prediction.(2) CellGO trains each subtree-based VNN individually and collects cell type-specific scores of all gene-term within it.(3) Compared to Dcell, CellGO uses more neurons to capture multi-functions of GO terms.(4) We optimized the network architecture of Dcell to reduce the VRAM consumption significantly.In CellGO, each GO term only inputs the expression of its annotated genes rather than the expression of all genes.(5) We implemented CellGO using the python package torch (https://github.com/pytorch/pytorch),while Dcell is implemented using the Torch7 library (https://github.com/torch/torch7).
A total of 71 datasets were curated and processed separately to make up the proposed database according to the following steps: (i) When the obtained value was read count or UMI count, it will be normalized by the NormalizeData function from the R package Seurat [40] v4.3.0 with default parameters to allow correction for the total number of reads per cell and log-transformation.Other values, such as CPM, RPKM, and TPM, were log2-transformed with pseudo-count one unless the expression data was already log-transformed.(ii) Quality control (QC) of cells was performed as described in the original study unless the obtained dataset was already QC-ed.(iii) Cells with uninformative cell type labels (e.g., 'unclassified') defined as outliers in the original study were excluded.We also removed unimportant and low-quantity cell types in the brain, such as T cells, endothelial cells, pericytes, and vascular leptomeningeal cells.We only retained cells of the control group.(iv) We manually aligned cell type labels annotated in the original study to the unified terminology to ensure consistency between datasets.61 out of the 71 datasets were re-annotated at major-type-level, including excitatory neurons, inhibitory neurons, radial glias, intermediate progenitor cells, etc.Other ten datasets, which included only one major cell type, were re-annotated at subtype-level, such as excitatory neuron subtypes, including L4 IT, L5/6 IT, L6b, etc. (v) Re-annotated cell types with the number of cells less than 100 were removed.For each dataset, we exported the re-processed expression matrix and the cell identity list as two CSV files and took them as input to the modeling phase of CellGO.Details about each dataset, such as references, original download URLs, species, anatomical areas, and included cell types, are reported in Supplementary Table S1.

One PBMC dataset
The human PBMC dataset [41] was collected from UCSC Cell Browser (https://cells.ucsc.edu/),and both the log-transformed UMI count data that has been QC-ed and pre-assigned cell labels were downloaded.We only retained B cells, CD4+ T cells, CD8+ T cells, Gamma-Delta T cells, monocytes, and natural killer cells.Details about the dataset are reported in Supplementary Table S1.

Identification of predefined cell type-specific pathways
We collected marker genes of five cell types from the McKenzie et al. study [42].These five cell types included neurons, oligodendrocytes, oligodendrocyte precursor cells, astrocytes, and microglia (Supplementary Table S2).The top 50 marker genes were selected for each cell type based on the reported rankings.We then took the marker genes of each cell type as input to perform conventional pathway enrichment analysis using CellGO.Predefined cell type-specific pathways were selected for each cell type according to the top 50 pathways ranked by P-values (Supplementary Table S3).

Collection of experimental DEG lists
We first considered five experimental datasets, including the bulk RNA sequencing data for both the experimental and control groups (Supplementary Table S6).We next screened the top 100 differentially expressed genes based on P-value rankings from the supplementary data of corresponding references.Differentially expressed genes were achieved by differential gene expression analysis between the experimental and control groups.For the Conrow-Graham et al.Adnp KO dataset [7], the supplementary data only reported up-regulated genes in the KO group.

Comparison with other tools
We compared our method with two existing methods: gProfileR [8][9][10] and scMappR [43].We applied these two methods to DEG lists from Smith et al. [44], Runge et al. [45], and Wang et al. [46].gProfileR takes a set of genes as input and returns P-values of pathways.We used the gprofiler function from the R package gProfileR v0.7.0 with default parameters to analyze the collected DEG lists and obtained FDR-corrected P-values.scMappR requires not only a DEG list but also the P-values and log-transformed fold changes of DEGs, the bulk expression matrix, and the single-cell gene signature matrix as input, then scMappR returns cell type-specific P-values of pathways.The single-cell gene signature matrix was generated based on dataset 47 using the seurat_to_generes and generes_to_heatmap functions from the R package scMappR v1.0.9 with default parameters.The adopted type of the single-cell gene signature matrix was 'OR' (odds ratios).The P-values and log-transformed fold changes matching the collected DEG lists were obtained from the supplementary data of corresponding references (Supplementary Table S6).The bulk expression matrices were collected from GEO (Data Availability).We next took them to the scMappR_and_pathway_analysis function with default parameters and obtained cell type-specific FDR-corrected P-values.It should be noted that we only retained pathways with the domain BP or MF.

Benchmarking cell type-specific functional inference using putative gold standard pathways
To more powerfully demonstrate the ability of CellGO to accurately infer cell type-specific pathways compared to scMappR, we constructed the putative gold standard cell type-specific pathways based on reported experimental validations.Specifically, for the DEG lists from Smith et al. [44], we first extracted the keywords from existing validation results.These keywords are 'ATP', 'mitochondrial', 'amino', 'amine', and 'metabolic'.Next, the gold standard astrocyte-specific pathway is defined as a pathway that semantically contains any one of these keywords.Lastly, we counted the number of active gold standard pathways per cell type in CellGO, as well as the number of significant gold standard pathways per cell type in scMappR.Notably, CellGO indicated the largest number of active gold standard pathways in astrocytes (Supplementary Figure S5A), aligning with previous experimental reports.However, scMappR showed that ExN, OPC, astrocytes, and microglia had the same number of significant gold standard pathways.
In the case of the DEG lists from Runge et al. [45], the keywords for gold standard ExN-specific pathways are 'migration', 'synapse', 'membrane potential', and 'ion'.CellGO indicated the largest number of active gold standard pathways in ExN (Supplementary Figure S5B), consistent with previous findings.However, scMappR showed that astrocytes owned the largest number of significant gold standard pathways.
For the DEG lists from Wang et al. [46], the keywords for gold standard microglia-specific pathways are 'immune', 'cell cycle', and 'catabolic'.CellGO indicated the largest number of active gold standard pathways in microglia (Supplementary Figure S5C).While scMappR also revealed microglia had the largest number of significant gold standard pathways, other cell types also enriched a set number of pathways.Overall, these results strongly demonstrate that CellGO more accurately predicts cell type-specific pathways than scMappR.

Application of CellGO on more gene knockout datasets
We used CellGO to analyze the differentially expressed genes generated by the other two experimental datasets [7,47].For the top 100 differentially expressed genes between Adnp KO and WT mice [7], CellGO found that microglia had the highest activation according to the number of active pathways (Supplementary Figure S6B).Noteworthily, top communities of microglia represented topics related to metabolic and immune system processes, stimulus responses, signal transduction, and reproductive processes (Supplementary Table S8).More specifically, organic substance metabolic process, immune response-inhibiting signal transduction, positive regulation of leukocyte differentiation, and reproductive process were found in these communities (Supplementary Figure S6B), with Hexb (a gene encoding the beta subunit of the lysosomal enzyme beta-hexosaminidase that breaks down fatty compounds and complex sugars), Lyn (a gene encoding the tyrosine protein kinase that involves in immune response-regulating signaling pathways), and Inpp5d (a gene encoding the protein that functions as a negative regulator of myeloid cell proliferation and survival) identified as the MAGs in these pathways.These findings are consistent with previous studies [7] showing that Adnp KO leads to pro-inflammatory and pro-phagocytic phenotype and a prominent increased number of microglia.
For the top 100 differentially expressed genes between neuron-specific Wwox knockdown and WT mice [47], CellGO discovered that oligodendrocytes showed the highest activation (Supplementary Figure S6C).Notably, top communities of oligodendrocytes represented topics related to cell development, metabolic processes, homeostatic processes, signaling, and ion transport (Supplementary Table S8).More concretely, glial cell development, cell differentiation, transmission of nerve impulse, and positive regulation of calcium ion transmembrane transport were found in these communities (Supplementary Figure S6C), with Plp1 (a gene encoding a transmembrane proteolipid protein that is the predominant component of myelin and plays a role in oligodendrocyte development and axonal survival) and Nfasc (a gene encoding the protein functions in neurite outgrowth, neurite fasciculation, and organization of the axon initial segment) predicted as the MAGs in these pathways.These findings align with a previous study [47] indicating that Adnp knockdown in neurons reduced oligodendrocyte maturation and induced hypomyelination and impaired axonal conductivity.These results again demonstrated that CellGO could be used effectively to infer functionally relevant cell types and cell type-specific pathways.

Application of CellGO on risk genes of systemic lupus erythematosus
In order to prove that CellGO can be used to analyze the risk genes of multi-locus genetic diseases in other systems except the nervous system, we used the human PBMC dataset [41] to run the modeling phase of CellGO, and leveraged 120 GWAS risk genes of systemic lupus erythematosus (SLE) [48] to try to elucidate the blood cell types relevant to SLE, as well as their cell type-specific pathogenic mechanisms.
Our analysis identified B cells as the PBMC cell type most associated with SLE, along with T cells showing a high degree of enrichment (Supplementary Figure S8A), consistent with the known knowledge that SLE is mainly characterized by the excessive production of pathogenic autoantibodies by the B cells [49,50], and T cells play a significant role inducing B cells to generate autoantibodies [49].The top communities of B cells were linked to topics related to environmental stimulus responses, protein kinase signaling, tolerance induction to self antigen, and macromolecule biosynthetic processes (Supplementary Figure S8B and Supplementary Table S10).These topics correspond to the abnormal secretion of autoantibodies and responses to antigens by B cells, and the known effect of protein kinase in enhancing B cell activation and secretion of immunoglobulin M , during SLE [49].Moreover, BANK1, the gene encoded the B cell-specific scaffold protein, was identified as the MAG of pathways in multiple communities (Supplementary Figure S8B).This finding aligns with its known role in promoting B cell activation and inflammation [51], representing the prominent features of SLE.In conclusion, our results once again demonstrated the ability of CellGO to understand complex multi-gene diseases.
11 Embedding a set of pathways to the semantic space Given a set of pathways denoted by p∈P (u) , where P represents the collection of every pathway p in the pruned GO graph.The probability of a pathway is defined as (25) Here, D p denotes the number of p and all of its children.r p denotes the top-level pathway of p.The top-level pathway can only be one of 'biological_process' (GO:0008150) and 'molecular_function' (GO:0003674).Let ρ min (p 1 , p 2 ) denote the minimum probability of the common ancestor pathways of p 1 and p 2 , we next used the Lin method [52] to determine the similarity between the two pathways: We computed the similarity for all pathway-pathway pairs in P (u) , which composed the similarity matrix ℝ U×U , where U is the number of pathways in P (u) .We next used the umap function from the R package umap [53] v0.2.9.0 with default parameters to reduce the length of the second dimension of the similarity matrix to two.Since BP and MF are two non-interconnected domains, we only mapped the BP pathways to the two-dimensional semantic space.
12 Hyperparameter selection of CellGO In the modeling phase of CellGO, there are four hyperparameters that involve in the neural network training.These hyperparameters include the learning rate, the l2 norm regularization factor, the training epoch number, and the balance coefficient α in equation (5).We referred to the hyperparameters adopted by Dcell [4] and directly set the l2 norm regularization factor to 0.001 and the balance coefficient α to 0.3.
For the other two hyperparameters, the learning rate and the training epoch number, we performed hyperparameter grid search based on three evaluation metrics: the classification accuracy, the reproducibility, and the running time.The larger the first two and the smaller the last one, the better.Specifically, the classification accuracy is defined as the median cell type classification accuracy of 4,168 subtree-based VNNs; the reproducibility is defined as the average value of pairwise Spearman's rank correlations of GTSs between replicates of the same cell type, and these correlations were calculated on gene-term pairs with GTSs ranked in the top 5% in either group, the replicate number is two; the running time is defined as the CPU time spent in the modeling phase.We trained CellGO on dataset 17 using nine different hyperparameter combinations (learning rate = 0.0001, 0.001, and 0.005; epoch number = 10, 15, and 20).We found that the classification accuracy increases with the learning rate, however, the reproducibility decreases with the learning rate (Supplementary Figure S10).To balance these two evaluation metrics, we chose the learning rate at 0.001.In addition, we discovered that the classification accuracy and the reproducibility slightly increase with the epoch number, whereas the running time moderately increases with the epoch number (Supplementary Figure S10).To balance these three evaluation metrics, we set the epoch number to 15.
In the analysis phase of CellGO, there is no hyperparameter when acquiring cell type-specific active pathways given a single gene or a gene set, when identifying the cell type most activated by the input gene set, and when selecting the most active genes within pathways.Nevertheless, there is one key hyperparameter used in pathway network analysis when constructing the network of cell type-specific active pathways.ActiveP is used to determine which pathways are active, and pathways with P-values no more than it will be considered as active pathways and will be sent to pathway network analysis.Ideally, a smaller ActiveP will increase the reliability of the active pathways, but will reduce the number of active pathways.Here, we give our empirical strategy to choose a suitable ActiveP: ActiveP equals to 0.01 when the number of active pathways is no more than 500, otherwise ActiveP equals 0.001.We used this strategy throughout this study and found it performed well.Finally, there are some unimportant hyperparameters during the pathway network analysis, such as the number of travels starting from a seed node and the number of steps per travel in the RWR algorithm.Despite the larger these parameters will lead to more confident results, they are sufficient to complete the required analysis.
13 CellGO's imperfect performance using standardized scRNA-seq data To investigate the analysis results of models trained on standardized scRNA-seq data, we inputted the standardized single-cell expression matrix of dataset 47 to the modeling phase, and other parameters were unchanged.This standardization procedure is performed by the ScaleData function from the R package Seurat [40] v4.3.0, using the normalized single-cell expression matrix (Supplementary Text S4).
Next, we applied CellGO to examine Atp1a2 and Neurod2 using single-gene analysis based on the model trained on standardized dataset 47.We found that astrocytes was the most affected cell type by Atp1a2 perturbation (Supplementary Figure S11A).In addition, pathways with the highest GTSs in astrocytes were related to ion homeostasis and stimulus responses (Supplementary Figure S11B and Supplementary Table S12).These above findings are consistent with the results based on normalized dataset 47 (Figure 3A-B).However, our single-gene analysis revealed Neurod2 primarily affected InN, inconsistent with the results using normalized data (Figure 3A) and the reported more functions of Neurod2 on ExN than InN [45].
When extending the analysis to the top 100 differentially expressed genes between astrocyte-specific Atp1a2 KO and WT mice, CellGO revealed the largest number of active pathways in astrocytes.However, some key experimentally verified pathways found by the normalized dataset 47, such as amine biosynthetic process and primary amino compound metabolic process (Figure 3C), were not active in astrocytes (Supplementary Figure S11C and Supplementary Table S12).For the analysis of the top 100 differentially expressed genes between Neurod2 KO and WT mice, CellGO identified ExN as the cell type with the highest activation.Nevertheless, one experimentally verified pathway found by the normalized dataset 47 (Figure 3D), behavior, was not active in excitatory or inhibitory neurons (Supplementary Figure S11D and Supplementary Table S12).Overall, these results suggested that analysis outcomes generated from the standardized scRNA-seq count data are worse than those from the normalized scRNA-seq count data.
10. Concatenate returns a single vector that is the concatenation of all input vectors.Let L O