IKAP—Identifying K mAjor cell Population groups in single-cell RNA-sequencing analysis

Abstract Background In single-cell RNA-sequencing analysis, clustering cells into groups and differentiating cell groups by differentially expressed (DE) genes are 2 separate steps for investigating cell identity. However, the ability to differentiate between cell groups could be affected by clustering. This interdependency often creates a bottleneck in the analysis pipeline, requiring researchers to repeat these 2 steps multiple times by setting different clustering parameters to identify a set of cell groups that are more differentiated and biologically relevant. Findings To accelerate this process, we have developed IKAP—an algorithm to identify major cell groups and improve differentiating cell groups by systematically tuning parameters for clustering. We demonstrate that, with default parameters, IKAP successfully identifies major cell types such as T cells, B cells, natural killer cells, and monocytes in 2 peripheral blood mononuclear cell datasets and recovers major cell types in a previously published mouse cortex dataset. These major cell groups identified by IKAP present more distinguishing DE genes compared with cell groups generated by different combinations of clustering parameters. We further show that cell subtypes can be identified by recursively applying IKAP within identified major cell types, thereby delineating cell identities in a multi-layered ontology. Conclusions By tuning the clustering parameters to identify major cell groups, IKAP greatly improves the automation of single-cell RNA-sequencing analysis to produce distinguishing DE genes and refine cell ontology using single-cell RNA-sequencing data.


Background
In single-cell RNA-seq analysis, clustering cells into groups and differentiating cell groups by marker genes are two separate steps for investigating cell identity. However, results in clustering greatly affect the ability to differentiate between cell groups. This interdependency often creates a bottleneck in the analysis pipeline in which researchers repeat these two steps many times by trying different clustering parameters to identify a set of cell groups that are more distinguishing and biologically relevant.

Findings
To accelerate this process, we develop IKAP -an algorithm identifying major cell groups that improves differentiating by systematically tuning parameters for clustering. We demonstrate that, without specifying any parameters, IKAP successfully identifies major cell types such as T cells, B cells, NK cells, and monocytes in two peripheral blood mononuclear cell (PBMC) datasets and recovers major cell types in a previously published mouse cortex dataset. In addition, those major cell groups identified by IKAP present more distinguishing marker genes compared with cell groups generated by different combinations of clustering parameters. Finally, we show that cell subtypes can be further identified by re-running IKAP within identified major cell types so that a cell ontology prototype can be automatically constructed using single-cell RNA-seq data.

Conclusions
IKAP can speed up single-cell RNA-seq analysis and cell type recognition by identifying major cell groups with distinguishing marker genes and can facilitate cell ontology curation using single-cell RNA-seq data. In single-cell RNA-seq analysis, clustering cells into groups and differentiating cell groups by marker genes are two separate steps for investigating cell identity. However, results in clustering greatly affect the ability to differentiate between cell groups. This interdependency often creates a bottleneck in the analysis pipeline in which researchers repeat these two steps many times by trying different clustering parameters to identify a set of cell groups that are more distinguishing and biologically relevant.

Findings
To accelerate this process, we develop IKAP -an algorithm identifying major cell groups that improves differentiating by systematically tuning parameters for clustering. We demonstrate that, without specifying any parameters, IKAP successfully identifies major cell types such as T cells, B cells, NK cells, and monocytes in two peripheral blood mononuclear cell (PBMC) datasets and recovers major cell types in a previously published mouse cortex dataset. In addition, those major cell groups identified by IKAP present more distinguishing marker genes compared with cell groups generated by different combinations of clustering parameters. Finally, we show that cell subtypes can be further identified by re-running IKAP within identified major cell types so that a cell ontology prototype can be automatically constructed using single-cell RNA-seq data.

Conclusions
IKAP can speed up single-cell RNA-seq analysis and cell type recognition by identifying major cell groups with distinguishing marker genes and can facilitate cell ontology curation using single-cell RNA-seq data.

Findings
Single-cell RNA-sequencing (scRNA) enables inquiry of cell identity based on single cell transcriptomics. To facilitate cell type characterization and recognition, computational methods have been developed for (i) clustering cells with similar transcriptomic profiles into groups and (ii) identifying a set of marker genes to differentiate those cell groups [1]. These two tasks have been treated independently although groups identified by clustering greatly determine the marker genes associated with each group. Because computing marker genes is often more resource intensive compared to clustering, we attempted to improve and accelerate biological interpretation by developing an algorithm to effectively identify the k major groups that produce distinguishing marker genes.
Despite the existence of well-performing scRNA clustering methods, identifying k groups remains challenging due to parameter specification [2]. Most (if not all) clustering methods require a parameter suggesting k and a list of features such as genes or principal components (PCs) for computing cell-to-cell similarity. The proper k is generally unknown a priori. Choosing a small k may mix more than one cell type in a group whereas choosing a large k would result in many subgroups of unclear biological significance. Both can complicate cell type recognition by producing uninformative marker genes. In addition, the choice of feature list can affect grouping quality which, in turn, affects its distinguishing power. Therefore, k and feature list selection often become a bottleneck in the scRNA analysis pipeline.
To address this issue, we propose an unbiased approach -called IKAP (Identifying K mAjor cell Population groups) -which identifies the k well-separated major groups likely to produce distinguishing marker genes in a scRNA dataset by systematically exploring the parameter space ( Figure 1 and Online Methods). IKAP is implemented on top of Seurat [3] -one of the most widely used scRNA analysis packages -in which clustering requires two user-specified parameters: resolution r that determines k (the higher r, the larger k) and the number of top principal components (nPC) as the feature list. Briefly, for a given nPC, IKAP initializes a set of kmax groups by setting a high r. To simulate the coarse-to-fine grouping process, two nearest groups are merged iteratively, generating kmax sets of groups with k = 1 to kmax. For each set, gap statistic is computed to measure the gap between the grouping with observed data and that with random data [4]. The gap often monotonically increases (at variable amount) as k increases from 1 to kmax indicating that splitting out each group somewhat contributes to the grouping moving away from randomness (Supplementary Figure 1). We reason that those k's that contribute more (i.e. yield large gap increase) might correspond to the set of k well-separated major groups. IKAP repeats this procedure for a range of nPCs. Finally, candidate sets grouped by clustering cells with different k's and nPCs are picked from those with larger gap increase. Among all candidate sets, the one with the lowest classification error is marked as the best using decision trees built from marker genes. IKAP can be run by default without specifying any parameter as we did for all experiments in this study and can potentially be tailored for scRNA clustering methods other than Seurat.
We tested IKAP on a peripheral blood mononuclear cell (PBMC) dataset of ~8K cells (denoted as PBMC_8K) from a healthy donor [5]. The best set (with k=7 and nPC=9; thus, abbreviated as PC9K7) and two alternative sets (PC16K8 and PC18K9 respectively) were reported. The major groups reported in PC9K7 were effectively aligned with known major cell lineages such as B cells,  Figure 8A). IKAP identified one candidate set with 8 major groups (PC13K8) ( Figure 2D). Six groups were consistent with the annotated cell types as evidenced by expressing marker genes annotated specific to those cell types ( Figure 2E) and high proportion of overlapping cells ( Figure 2F). For the 2 remaining groups, one was a subtype of microglia cells (group 3 in Figure 2D) characterized by Hexb (AUROC=1.0), Cst3 (0.98), and P2ry12 (0.98) and the other was the union of 3 annotated cell types, interneurons, pyramidal SS and pyramidal CA1 (group 7 in Figure 2D), characterized by Atp1a3 (0.93), Ndrg4 (0.94), and Stmn3 (0.93) ( Figure 2E). Compared with the 20 trial sets (generated in the same way described above), marker genes in PC13K8 yielded higher expression fold change and lower classification error (Supplementary Figure 8B). Interestingly, the number of marker genes with high AUROC remained high as more subtypes were identified (which was not true for PBMC datasets), suggesting transcriptomic differentiation in the mouse cortex cells was very finegrained. Overall, IKAP successfully recovered major cell types (rather than many subtypes) that were consistent with previous annotations and produced distinguishing marker genes.
Finally, a potential application of IKAP is to help redefine multi-layered cell ontology based on single-cell transcriptomics [7]. IKAP can define one layer by splitting a cell population into k groups. By recursively applying IKAP within each group in the upper layer, subgroups can be identified in deeper layers. To demonstrate this approach, we expanded two more layers for all 3 datasets by applying IKAP on their biggest major groups and the biggest resulting subgroups.
For the mouse cortex dataset, IKAP successfully recovered interneurons, pyramidal SS, and pyramidal CA1 by subdividing their union group (group 7 in Figure    PBMC_4K and PBMC_8K datasets. PBMC_4K and PBMC_8K were downloaded from the 10x Genomics website [5]. They were filtered and normalized using the R package Seurat [3]. We removed cells with less than 200 genes expressed or the unique molecular identifier (UMI) count of mitochondrial genes > 5% of the total UMI count. For each dataset, we regressed out the percentage of mitochondrial gene UMI count and the total UMI count from the normalized expression matrix and scaled the matrix using ScaleData function in Seurat. Finally, we got the expression matrix with 16,746 genes and 4,077 cells for PBMC_4K and the matrix with 18,408 genes and 8090 cells for PBMC_8K.
Mouse cortex dataset. The dataset was obtained from [6]. We normalized and scaled the expression matrix as we did for PBMC_4K and PBMC_8K but we did not filter out any cells in order to be consistent with the published work. In total, the expression matrix comprised 19,972 genes and 3,005 cells.

Competing interests
The author(s) declare that they have no competing interests.

Funding
This work was supported by the Intramural Program of the National Heart, Lung, and Blood Institute, National Institutes of Health. Grant number: 1ZICHL006228-02. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.     Shown on the left is the ontology with two layers (also shown in Figure 3D). The heatmap shows expression of top marker genes (ranked by expression fold change) of each subgroup. Rows are genes and columns are cells.

Supplementary Figure 10. An example of 2-layer T cell ontology proposed by IKAP for PBMC_8K.
Shown on the left is the ontology with two layers (also shown in Figure 3E)  Figure 1 for how to compute gap increase), the following steps were taken.
Step 0: filter entries by gap increases > mean + standard deviation.
Step 1: take the max gap increase across rows for each column (k) and record the corresponding (nPC, k).
Step 2: sort recorded (nPC, k)'s based on their corresponding gap increases.
Step 3: add the first (nPC, k), which is PC9K7, into the candidate list.
Step 4: remove the second (nPC, k), which is PC9K6, because its nPC (=9) is not larger than nPC of the candidate (=9) in the candidate list and neither is its k not larger than k of the candidate (=7) in the candidate list.
Step 5: add the third (nPC, k) into the candidate list because its nPC (=16) is larger than nPC of the candidate (=9) in the candidate list and so is its k.
Step 6: add the fourth (nPC, k) into the candidate list because its nPC (=20) is larger than all nPCs of the candidates (=9 and 16) in the candidate list and so is its k. Finally, PC9K7, PC16K8, and PC18K9 were selected as candidate sets for PBMC_8K.