Multi-resolution characterization of molecular taxonomies in bulk and single-cell transcriptomics data

Abstract As high-throughput genomics assays become more efficient and cost effective, their utilization has become standard in large-scale biomedical projects. These studies are often explorative, in that relationships between samples are not explicitly defined a priori, but rather emerge from data-driven discovery and annotation of molecular subtypes, thereby informing hypotheses and independent evaluation. Here, we present K2Taxonomer, a novel unsupervised recursive partitioning algorithm and associated R package that utilize ensemble learning to identify robust subgroups in a ‘taxonomy-like’ structure. K2Taxonomer was devised to accommodate different data paradigms, and is suitable for the analysis of both bulk and single-cell transcriptomics, and other ‘-omics’, data. For each of these data types, we demonstrate the power of K2Taxonomer to discover known relationships in both simulated and human tissue data. We conclude with a practical application on breast cancer tumor infiltrating lymphocyte (TIL) single-cell profiles, in which we identified co-expression of translational machinery genes as a dominant transcriptional program shared by T cells subtypes, associated with better prognosis in breast cancer tissue bulk expression data.


INTRODUCTION
As high-throughput transcriptomic assays become more efficient and cost-effective, they are being routinely integrated into large-scale biomedical projects (1)(2)(3)(4). Bulk gene expression profiling by RNA sequencing (RNAseq) has been widely adopted in multiple high-throughput genomics studies, the paramount example being The Cancer Genome Atlas (TCGA) data commons, which currently include 10,558 bulk RNA sequencing (RNAseq) profiles across 33 cancer types (https://portal.gdc.cancer.gov/). Furthermore, since its first published application in 2009 (5), the size of singlecell RNA sequencing (scRNAseq) studies has exploded, such that it is now commonplace for studies to generate tens of thousands of profiles (6). As the scale of these studies and the associated datasets increases, so does their utility as a resource from which biological information can be extracted through the application of machine learning approaches. Common deliverables of these types of analysis include the discovery and characterization of molecular subtypes, which are prevalent in both bulk and single-cell gene expression studies. For example, TCGA bulk expression data have been utilized to characterize subtypes of numerous cancers (7), including but not limited to: breast (8), colorectal (9), liver (10) and bladder cancer (11,12). Similarly, the characterization of molecular subtypes is a stan-dard component of the scRNAseq data analysis workflow, insofar as estimation and annotation of subpopulations of cells is one of the primary goals of the assay (13).
The general framework for subtype characterization can be summarized in two steps: (i) estimation of data-driven groups of observations via application of an unsupervised learning procedure, followed by (ii) annotation of each group based on the identification of distinct patterns of gene expression relative to other groups. While most approaches focus on discovering a 'flat' set of non-overlapping groups or subtypes, in this manuscript we present an alternative approach, devised to emphasize 'taxonomy-like' hierarchical relationships between observations to discover nested subgroups.
Whereas a wide range of unsupervised learning algorithms is available for the analysis of bulk gene expression data, the considerable sparsity of scRNAseq data has motivated the development of novel methods specifically tailored to the analysis of this type of sparse, high-dimensional data. Popular software packages, such as Seurat (14) and Scran (15), generate 'flat' clusters, in which a finite set of mutually exclusive cell types is estimated. In so doing, they fail to capture the 'taxonomy-like', hierarchical structure that may exist among subgroups of observations at multiple levels of resolution, driven by transcriptional signatures based on different factors, including but not limited to: shared lineage, cell state, pathway activity, or morphological origin. Complementary methods exist to model such relationships, such as Neighborhood Joining (16,17) and more recent single-cell trajectory inference approaches (18), which estimate 'pseudo-temporal' states of individual cells indicative of developmental progression. Given the stringent interpretation of such models, their suitability depends on the assumption that the measured similarity between neighbors of cell profiles arises from a distinct continuous progression of molecular activity. However, the relative similarity between cell profiles may be confounded by numerous factors, including: cell cycle, spatial patterning, cell stress, and batch effects (19). To overcome these shortcomings, one recent method, partition-based graph abstraction (PAGA) (20), was devised to model complex topologies by estimating a graph of 'high-confidence' connections between labeled cell types based on their shared nearest neighbors. This method has the advantage of being able to first identify disconnected subgraphs from which to model separate trajectories. Even so, a ubiquitous attribute of trajectory inference approaches, including PAGA, is that all distances between cell profiles are computed based on a single set of features, generated by selection filtering and/or dimensionality reduction (21), thus precluding the discovery of nested structures defined by distinct transcriptional programs shared by relatively few cells.
Hierarchical clustering (HC) algorithms at face value address the need for a multi-resolution representation of the relationship among observations, and while originally adopted for the analysis of bulk gene expression data (22), numerous packages have also been developed for scR-NAseq analysis, such as pcaReduce (23), ascend (24) and BackSPIN (25). However, since the number of possible subgroupings increases with the number of observations, robustly identifying such relationships can be challenging. As a result, tree-cutting methods are often applied, ultimately yielding a flat set of non-overlapping clusters. Furthermore, as with trajectory inference, the bottom-up nature of HC's sample aggregation procedure forces the use of the same set of genes/features to drive the agglomeration at all levels of the hierarchy. Finally, HC methods do not support the clustering of groups of samples, or group of cell profiles representing cell types in scRNAseq, nor the identification of an interpretable taxonomic hierarchy over these groups or cell types.
Here, we introduce K2Taxonomer, a novel taxonomy discovery approach and associated R package for the estimation and in-silico characterization of hierarchical subgroup structures in both bulk and single-cell data. An important feature of the approach is that it can analyze both individual samples as well as sample groups such as, but not limited to, those corresponding to scRNAseq cell types. The package employs a recursive partitioning algorithm, which utilizes repeated perturbations of the data at each partition to estimate ensemble-based K = 2 subgroups. For scR-NAseq analysis, K2Taxonomer utilizes the constrained kmeans algorithm (26) to estimate partitions of the data at the cell type level, while preserving the influence of each individual cell profile. A defining feature of the method is that each recursive split of the input data is based on a distinct set of features selected to be most discriminatory within the subset of samples member of the current hierarchy branch. This makes the approach quite distinct from both standard clustering algorithms and trajectory inference algorithms, and particularly apt to discover nested taxonomies. K2Taxonomer thus fills a methodological gap and provides a rigorous way to further resolve biological insights from clustered scRNA-seq data. In addition, the package includes functionalities to comprehensively characterize and statistically test each subgroup based on their estimated stability, gene expression profiles, and a priori phenotypic annotation of individual profiles. Importantly, all results are aggregated into an automatically generated interactive portal to assist in parsing the results.
In this manuscript, we assess the performance of K2Taxonomer for partitioning both bulk gene expression and scRNAseq data, using both simulated and publicly available data sets, and we compare it to agglomerative clustering procedures. For bulk gene expression data, performance is assessed in terms of unsupervised sorting of breast cancer subtypes and established genotypic markers, using breast cancer patient tumor tissue data from the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) (27) and the TCGA compendia. For scR-NAseq data, performance is assessed in terms of recapitulation of established relationships between 28 annotated cell types of the airway of healthy subjects (28). We conclude with a case study where we perform a K2Taxonomerbased analysis of breast cancer tumor infiltrating lymphocytes (TILs) profiled by scRNAseq (29). Our analysis significantly expands upon previously published results and identifies a phenotypically diverse subgroup of CD4 and CD8 cells, characterized by constitutive up-regulation of a subset of translation machinery genes. We further show that high expression of these genes in breast cancer tissue bulk expression is associated with better survival, supporting re-PAGE 3 OF 17 Nucleic Acids Research, 2021, Vol. 49, No. 17 e98 cent findings on the role of the translation machinery assembly in T cell activation (30,31), and demonstrate that this coordinated expression of the translation machinery is pervasive among T cell subpopulations to such an extent that the expression levels of these genes in bulk measurements of tumor tissue is predictive of the degree of immune infiltration. The complete suite of analysis results is accessible through an automatically generated and publicly accessible portal (https://montilab.bu.edu/k2BRCAtcell/).
While we focus on the analysis of transcriptomics data, we emphasize that our approach is applicable to other 'omics' data, such as those generated by high-throughput proteomics and metabolomics assays. Moreover, applications of K2Taxonomer are not limited to cancer-centric studies. For example, we applied an earlier prototype of K2Taxonomer to the analysis of a toxicogenomic study on the effects of environmental exposures on adipocyte activity, and the tool proved to be instrumental to the identification of chemical subgroups and the pathways contributing to either their deleterious or beneficial effects on energy homeostasis (32).

K2Taxonomer algorithm overview
K2Taxonomer implements a recursive partitioning algorithm that takes as input either a set of individual observations or a set of sample groups and returns a top-down hierarchical taxonomy of those samples or groups ( Figure  1). Here we summarize the K2Taxonomer algorithm. A detailed description of the method is provided in the following three subsections. To achieve robust model estimation, each partition is defined based on the aggregation of repeated partition estimations from distinct perturbations of the original set. Each of these partition estimates is created in three steps. First, a perturbation-specific data set is generated by bootstrapping features, i.e. sampling features from the original data set with replacement. Next, this perturbation-specific data set undergoes variability-based feature selection filtering. Finally, a K = 2 clustering algorithm is run, producing a perturbation-specific partition estimate. These three steps are repeated, generating a set of perturbation-specific partition estimates, which are aggregated into a cosine similarity matrix. The aggregate partition is then estimated based on a K = 2 tree cut following hierarchical clustering of the transformed cosine similarity matrix into a distance matrix, calculated as 1 -cosine similarity, with a user-specified agglomeration method.
The K2Taxonomer package supports both observationlevel and group-level analysis modalities, depending on whether the analysis end-points are single samples or groups of samples, respectively. In the group-level modality, the algorithm is applied to data sets where observations have a priori-assigned group labels, and the objective is to identify intermediate relationships between these groups. This functionality was specifically incorporated to enable partitioning and annotations of cell types estimated by scR-NAseq clustering algorithms, but it is applicable to any data set with group-level labels.
For further customization of analyses, the K2Taxonomer R package permits the use of user-specified functions for performing perturbation-specific partition estimates.

K2Taxonomer feature filtering
A distinguishing property of K2Taxonomer when compared to other methods, such as traditional agglomerative hierarchical cluster or trajectory inference (21) is the manner in which feature selection is implemented. Even in large studies of high-throughput data sets, the number of features is typically much larger than the number of observations. This generally requires filtering the data set prior to modeling in order to reduce variance and computational expense of model fit. One way to do this is through feature selection, in which features suspected to contain more information about the relationship between observations are chosen for down-stream analysis. For unsupervised learning, relative information estimation is commonly calculated via variability-based metrics. Assuming the amount of noise is consistent across features, these metrics will capture the relative magnitude of the signal of individual features. Two common choices are standard deviation (SD) and median absolute deviation (MAD), of which the former is more statistically efficient with a small sample size and the latter is more robust to outliers (33). Implementation of these feature selection techniques prior to modeling may be problematic when learning hierarchical models. The magnitude of variability-based metrics is influenced by the frequency of observations for which the signal-to-noise ratio is higher, such that the subset of features is more likely to capture broader relationships between larger sets of observations and less likely to capture relationships between smaller sets of observations. This can obscure important relationships within smaller sets of observations, as when evaluating a sub-group of samples in a hierarchical procedure. In addition, an appropriate choice of the number features to use for modeling is difficult to determine a priori and may be obscured by many factors, including: the number of subgroups, number of observations belonging to each subgroup, and the number of features distinguishing individual subgroups.
To overcome these challenges, K2Taxonomer produces a model fit for each partition independently, such that feature selection is only performed within the subgroup of observations being evaluated at a given step. In particular, at each recursive step, the objective of partition estimation is to split the data based only on the dominant relationship between two subgroups. Since the selected features need only capture one relationship, a much smaller subset of features will be sufficient to discover this partition. By default, K2Taxonomer uses the square root of the total number of features, which is used in a related albeit supervised learning method, random forests (34). In doing so, the percentage of filtered features is dependent on the total number of features. For example, if the data set consists of 1,000 or 10,000 features, K2Taxonomer will estimate partitions using 3.2% or 1.0% of the total number of features, respectively. The appropriateness of using the square root of the total number of features against fixed percentages is later assessed with simulation-based testing.  The K2Taxonomer package includes options to perform both SD and MAD based feature selection. In the case of group-level analysis, K2Taxonomer can perform F-statistic based feature selection based on the ratio of between-group to within-group variability implemented by the limma R package (35).

K2Taxonomer data partitioning
To estimate each partition, K2Taxonomer, performs feature-level bootstrap aggregation, similar to that of consensus clustering (36). More specifically, each data partition represents the aggregation of a set of partitions estimated from perturbations of the original data set in which features have been sampled with replacement. Feature selection and K = 2 clustering are independently performed within each perturbation-specific data set. The final partition estimate is calculated by aggregating the set of perturbation-specific partitions into a cosine similarity matrix (defined below), which further undergoes hierarchical clustering, followed by a K = 2 tree cut.
K2Taxonomer package implements separate clustering methods tailored to analysis of either observation-level and group-level data input. For observation-level data, the perturbation-specific partitions are estimated via hierarchical clustering of the Euclidean distance matrix, followed by a K = 2 tree cut. By default, Ward's agglomerative method is performed at this step because it has been shown to generally perform well compared to other hierarchical methods (22). For group-level data, perturbation-specific partitions are estimated via constrained K-means clustering (26). This algorithm performs semi-supervised clustering, in which group-level information is included as a pairwise 'must-link' constraint, preserving relationships between observations from the same group.
To assess the robustness of the partitioning of the aggregated results, hereby referred to as partition stability, as well as to facilitate interpretability, a cosine similarity matrix is computed, with each pairwise cosine similarity measurement functionally equivalent to the Pearson correlation of standardized variables.
Let an 'item' denote a single observation or group, depending on whether observation-or group-level analysis is being performed, respectively. The cosine similarity of two items is a measure proportional to the number of times across perturbation iterations that the two items are assigned to the same group in the perturbation-specific dichotomous partitions. It takes its maximum/minimum value when the two items are always/never assigned to the same group.
If we represent with '−1' and '1' the assignments of an item to one or the other group in a dichotomous partition, we can then represent, and compare, the complete set of assignments of any two items across p perturbation-specific partitions as the vectors where X i and X j represent the ith and jth item, respectively.
We can then define the cosine similarity of X i and X j as where X i · X j represents the dot product and ||X i ||||X j || represents the product of the Euclidean norms of X i and X j . Next, we prove the equivalence of the cosine similarity and Pearson correlation of two assignment vectors. The cosine similarity can be rewritten as follows: In the above derivation, since X i and X j only take values in {−1,1}, their dot product, X · Y, is equal to the difference between the number of iterations, Z, the two items are assigned to the same group, and the number of iterations, p − Z, the two items are assigned to different groups. Furthermore, the product of the Euclidean norms of X i and X j , Similarly, taking advantage of the relationship between Pearson correlation of standardized variables, r (), and Euclidean distance, d(), we have where we used the fact that the squared Euclidean distance of X i and X j , Furthermore, for X i − X j , the difference between mismatched adjacent elements is 2 and the difference between matched adjacent elements is 0. Therefore, (X i − X j ) 2 = 4Z. The function, 1 − 2Z p , is the Hamann similarity index (37). Consistent with Pearson correlation, the range of possible values for the Hamann similarity is between 1 and −1; these extremes occur if Z is equal to p and 0, respectively, indicating that X i and X j are either identical or fully dissimilar. Furthermore, if elements of X i and X j share 50% of their matching assignments, then Z = p 2 and the Hamann similarity is equal to 0, indicating a lack of a relationship between their perturbation-specific partition estimates. This is not true for the related phi similarity (38), another correlation metric for dichotomous variables, the calculation of which includes adjustment for the marginal distribution of X i and X j . In this case, the marginal distribution of X i and X j is irrelevant because the elements of X i and X j are only meaningful in relation to their matching assignments.

K2Taxonomer partition stability
To assess the robustness of partition estimates, indicating the consistency of the perturbation-specific partition results, we developed a partition stability metric, which is calculated using the eigen-decomposition of the matrix of pairwise cosine similarities, Q, of dimension, N, the number of items. The eigen-decomposition of Q satisfies where U is the matrix of eigenvectors corresponding with , the vector of rank-ordered eigenvalues Each eigenvalue is proportional to the 'variance explained' by each eigenvector, such that the cumulative sum of variance explained by the first k eigenvectors, v k , is given by In this context, the variance explained by eigenvectors captures the consistency with which pairs of items received the same or different assignments across perturbationspecific partitions. Therefore, we can summarize this consistency by evaluating the difference between the variance explained by the eigenvectors of the estimated cosine matrix and the variance explained by these eigenvectors if there was no consistency across perturbation-specific partition assignments. We denote this deviation as the partition stability, PS, calculated as the maximum difference between v k and k N , the null value corresponding to all items being linearly independent, The possible values for the partition stability range between 1 − 1 N and 0, with the former representing the case in which λ 1 = N, where every perturbation-specific partition is identical, i.e., all values of the cosine matrix are either −1 or 1. Conversely, a partition stability of 0 represents the case when the perturbation-specific partition assignments are random, i.e., all values of the cosine matrix are close to 0. The maximum value for a given partition is dependent on the number of items in the partition, approaching 1 when N is large, and equal to 0.5 when N = 2. Using the K2Taxonomer package, partition stability can be used to set stopping criteria for creating new partitions, thereby serving as a way to control the number of terminal subgroups without prior knowledge.
Finally, partition stability is used as a heuristic for calculating branch heights in dendrogram creation of the K2Taxonomer output. For a series of m partitions resulting in a given partition, z m , the branch height, h m , is where c is a constant added to ensure that the minimum height for a node is equal to 1.

K2Taxonomer R package functionalities
In addition to running the recursive partitioning algorithm, the K2Taxonomer R package provides functionalities for comprehensive annotation of the estimated subgroups, via subgroup-level statistical analyses, including: differential analysis, gene set enrichment analysis, and phenotypic variable testing. Differential analysis of gene expression is carried out using the limma R package, which is well-suited to the analysis of normally distributed data such as microarray gene expression, as well as log-transformed and normalized RNAseq data (35). Gene set enrichment analysis is carried out on a set of user-provided gene sets and implemented in two ways: over-representation analysis based on a hypergeometric test, and differential analysis of single-sample gene set projections scores based on the GSVA R package (39). Finally, phenotypic variable testing is carried out on user-provided variables labeling individual observations or groups, supporting both continuous and categorical variables. Testing of association between continuous variable and taxonomy subgroups can be performed based on the parametric Student's t-test or the nonparametric Wilcoxon rank-sum test, while categorical testing is carried out using Fisher's exact test. All subgroup-level statistical analyses are corrected for multiple hypothesis testing based on the FDR procedure (40). The full set of results are compiled into an interactive-web portal for exploration and visualization. Differential analysis comparisons are carried out at the partition-level, i.e., comparing only the two subgroups at a particular node. However, the web portal includes functionality for performing post-hoc differential analysis of any combination of user-selected subgroups.

Simulation-based performance assessment
The performance of K2Taxonomer was assessed in comparison to Ward's agglomerative method for recapitulating induced hierarchical structure in simulated data. See supplementary methods for a comprehensive description of the strategy implemented for data generation and performance assessment for observation-and group-level analyses.

Performance assessment using breast cancer primary tumor bulk gene expression data
K2Taxonomer was evaluated in its observation-level modality for its ability to recover the Pam50 subtypes, as well ER-, PR-, and HER2-status, and the aggregate three-gene genotype of ER-, PR-, and HER2-status, in the METABRIC and TCGA breast cancer datasets, independently (41,42). Distribution of these variables, as well as additional clinical variables for METABRIC and TCGA data sets are summarized in Supplementary Tables S1 and S2, respectively. See supplementary methods for a detailed description of these two data sets, as well as methods for acquisition and preprocessing. K2Taxonomer was also compared to two agglomerative clustering algorithms, Ward's and average. These specific methods were chosen because they have been previously shown to outperform other common agglomerative methods (22,43). Given the sensitivity of hierarchical clustering to the level of feature filtering, analyses included individual runs on four filtered data subsets of the total number of features: 100%, 25%, 10% and 5%, while K2Taxonomer was only run on the full set (i.e. 100%) of features. This should be kept in mind when comparing performances, since the best-performing pre-filtering level is not known a priori, and it is in general dataset dependent. For every pre-filtering level, the median absolute deviation (MAD) score was used for feature selection, and Euclidean distance was used to estimate observation-level distance. Performance was assessed as the entropy of each of the phenotypes (e.g., PAM50 labels), induced by the inferred sample sub-grouping, with lower entropy indicating 'purer' subgroups, hence better performance (44). The different methods were evaluated and compared by the relative decrease in entropy as the number of mutually exclusive clusters, K, increased from 2 to 8 based on tree cuts of the dendrograms produced by each model.

Healthy airway tissue scRNAseq gene expression analysis
K2Taxonomer was applied in its group-level modality to partition scRNAseq gene expression profiles of 28 estimated cell types from a publicly available data set of airway tissue of healthy patients (28), was evaluated against the known relationships among the included cell types, and was compared to the partitioning obtained by two agglomerative methods, Ward's and average, as well as PAGA, a graphestimation and trajectory inference algorithm (20), and ascend, a fully unsupervised scRNAseq hierarchical clustering algorithm (24). See supplementary methods for a detailed description of this data set, as well as methods for acquisition and preprocessing. For agglomerative methods, cell type-level data processing and feature selection were performed consistent with the results of group-level analysis of simulated data (see supplementary methods). As with agglomerative methods, PAGA was run on F-statistic-based pre-filtered feature sets at different percentages of the full 18,417 genes: 100%, 25%, 10% and 5%. Due to limitations of computational resources, ascend analyses excluded those based on 100% of genes. Moreover, PAGA and ascend were each applied using three sets of the top principal components: 10, 15 and 20, such that a total of 12 and 9 unique PAGA and ascend models were estimated, respectively.
Analyses with ascend were carried out using the ascend (v0.99.69) R package. The ascend-based hierarchical models were estimated from the component-based dimensionality reduced data sets using the 'runCORE()' function. Given that, unlike K2Taxonomer and PAGA, ascend can only be carried out as a fully unsupervised analysis without grouplevel input, concordance of these models with those of other methods was evaluated based on purity-based assignment of cell type labels to individual subgroups of each ascend model. Cell types were assigned to individual partitions based on two criteria: first, if at least 75% of the profiles of a given cell type were assigned to the subgroup and, second, if the total number of cell profiles across individual cell types that met the first criteria made up at least 75% of the total cell profiles in the subgroup.

Breast cancer immune cell scRNAseq gene expression analysis
Publicly available scRNAseq gene expression of raw counts from immunocytes of two TNBC patients was obtained from GEO, accession number GSE110938 (29). The data was processed in accordance with the original manuscript (29), recapitulating the reported 5,759 individual cells, 4,844 and 915 from either sample, with 15 623 genes passing QC criteria, selection of 1675 highly variable genes, and 10 latent variables estimated by ZINB-WaVE (v1.8.0) (45). To enable exploration of the data at finer resolution, clustering of the latent variables with Seurat (v1.3.4) was modified by setting the 'resolution' argument of 'FindClusters()' to 1.1, rather than the default, 0.8 (46). This resulted in 13 estimated cell clusters. Of the 10 cell clusters reported in the original manuscript, two cell clusters, 'CD4+ FOXP3+' and 'CD4+ IL7R+', were further split into three and two individual clusters, respectively (Supplementary Table S3).
K2Taxonomer was applied in its group-level modality to partition these 13 estimated cell subtypes based on the normalized count matrix estimated by ZINB-WaVE. According to the developers, ZINB-WaVE, normalized count estimates are not recommended for differential analysis (45), hence differential analysis was performed based on drop-out imputed and batch-corrected normalized counts estimated using the bayNorm (v1.4.14) R package (47). Pathway-level analysis was carried out using Reactome gene sets downloaded from mSigDB (v7.0) (48). Signatures of up-regulated genes were derived from each subgroup based on their FDR corrected P-value (FDR < 1e−10) and minimum subgroup-specific expression, (mean[log 2 counts] > 0.5), then restricted to a maximum of 50 genes.
To investigate the concordance between K2Taxonomer estimated model and that estimated by a fully unsupervised scRNAseq hierarchical clustering method, we ran ascend on the same ZINB-WaVE reduced dimension data set that was used as input for Seurat. Subgroup-level cell cluster assignments in the resulting ascend estimated model was carried out using the same criteria employed for the healthy airway tissue scRNAseq analysis.
To validate the clinical relevance of signatures of TILs derived by K2Taxonomer, we performed survival analysis based on gene signature projection scores, as well as on selected genes in the METABRIC breast cancer primary tumor gene expression data set. Gene set projection was carried out using GSVA (39). Multivariate Survival analysis was performed using Cox proportional hazards tests. All models included age and Pam50 subtype as covariates. To account for possible confounding effects of inflammation and proliferation, we generated separate patient-level activity scores for each, using a gene set projections of published signatures of deleterious breast cancer inflammation markers (49) and breast cancer proliferation (50) (Supplementary Table S4).

RESULTS
We report here our extensive evaluation of K2Taxonomer on both simulated and real data. We evaluated the method's performance both on data where the analysis end-points were single samples (observation-level), and groups of samples (group-level). The latter corresponds to scenarios where the goal is to define a taxonomy over sample groups, such as cell types in single-cell experiments, or chemical perturbations profiled in multiple replicates (51).

K2Taxonomer discovers hierarchical taxonomies on simulated data
We first evaluated K2Taxonomer's capability to recapitulate hierarchical relationships induced in simulated data, as measured by the Baker's gamma coefficient estimate of similarity between the structure of two dendrograms, where the structure of each dendrogram is quantified by a matrix based on the number of partitions separating each pair of leaves (52). As a term of reference, we compared K2Taxonomer's performance to Ward's agglomerative method.
For observation-level analysis, K2Taxonomer demonstrated robust performance for moderate levels of background noise, e.g., standard deviations equal to 0.5 and 1.0, regardless of the proportion of features with signal or the number of terminal clusters (Figure 2a, Supplementary Table S5). For higher levels of background noise, e.g., standard deviations equal to 2.0 and 3.0, the performance of K2Taxonomer was more dependent on either parameter, performing better with more features with signal and fewer terminal clusters. K2Taxonomer significantly outperformed Ward's method in 221 out of the 400 combinations of parameters tested (FDR < 0.05), while Ward's performed better for 17 combinations (Figure 2A, Supplementary Table S5). Furthermore, the differences between Baker's gamma coefficients for the 221 results significantly in favor of K2Taxonomer were generally larger than the 17 results significantly in favor of Ward's method, with median differences of 0.14 and 0.03, respectively. In general, K2Taxonomer significantly outperformed Ward's method when the background noise, the number of terminal groups and percent features with signal increased.
Remarkably, for group-level analysis K2Taxonomer outperformed Ward's method for all 400 combinations of variables tested (Figure 2b, Supplementary Table S6).  Table S7), the square root outperformed larger percentages when the number of features was large, and outperformed smaller percentages when the number of features was small.

K2Taxonomer accurately sorts breast cancer subtypes without pre-filtering of features
We evaluated K2Taxonomer's ability to sort Pam50 subtypes, ER-status, PR-status, and HER2-status from bulk gene expression data from METABRIC and the TCGA BRCA bulk gene expression data, separately. A fourth variable, defined by the Cartesian product of ER-status, PRstatus and HER2-status was also assessed. Performance was assessed in terms of the decrease in entropy as the number of cluster estimates, K, increased from 2 to 8 ( Figure 3A and B). We also compared K2Taxonomer's performance to two agglomerative clustering methods, Ward's and average. Since standard hierarchical clustering is sensitive to the level of feature filtering, the comparison was repeated for multiple pre-filtering levels.
In general, K2Taxonomer accurately segregated the known subtypes and phenotypes, performing as well or better than either method ( Figure 3B, Supplementary Tables S8 and S9). When applied to the METABRIC data, K2Taxonomer analysis yielded the lowest entropy score compared to all other methods for K = 3 and higher with few exceptions. Other methods produced similar entropy measurements at selected higher levels of K. For example, Ward's method resulted in similar entropy scores for Pam50 subtypes and HER2-status at K = 4 and K = 5, respectively, but for different pre-filtering levels, 5% and 100%, respectively. When applied to the TCGA BRCA data, the difference in performance was less pronounced. K2Taxonomer resulted in the lowest entropy score for Pam50 scores, genotype, ER-and PR-status for K = 4. Ward's method at 5% pre-filtering level produced the smallest entropy score for HER2-status for K = 4.
It should be emphasized that the pre-filtering level to be used with hierarchical clustering is not known a priori, and it would thus preclude us in practice from selecting the level yielding the best results shown in the above comparison. Figure 3. Breast cancer subtyping performance assessment of bulk gene expression data. Comparison of sorting of breast cancer Pam50 subtypes and genotypes (ER-, PR-and ER-status) for two bulk gene expression data sets, METABRIC and TCGA. An aggregate, three gene genotype status was also included by combining the individual genotypes. Performance was assessed based on reduction of entropy as the number cluster estimate increased based on tree cutting. K2Taxonomer was only run on the full set of features, while either agglomerative method, average and Ward's, were run on three additional subsets of the data. (A) Illustration of the results generated by K2Taxonomer and Ward's method for the METABRIC dataset. These results reflect Ward's method run on 5% of the total number of features, which demonstrated the best performance among agglomerative methods. (B) Entropy measurements for each method as K increased across the METABRIC and TCGA data sets.

K2Taxonomer accurately identifies and organizes subgroups of shared progenitors and epithelial cells from healthy airway scRNAseq cell clusters
To assess the capability of K2Taxonomer to recapitulate biologically relevant subgroupings of cell types estimated from scRNAseq data, we ran group-level analysis using 29 cell types estimations assigned to 77,969 cells of airway tissue from 35 samples across 10 healthy subjects and multiple locations ( Figure 4A, B) (28). In addition to agglomerative methods, we evaluated K2Taxonomer's performance against that of partition-based graph abstraction (PAGA) (20), which has been shown to outperform similar methods, especially for analyzing large-scale scRNAseq data sets (18).
K2Taxonomer was remarkably accurate in capturing the higher-order organization of the 28 cell types. The first partition separated all epithelial cell subtypes from nonepithelial cell types ( Figure 4B). Further partitioning of the 17 epithelial cell subtypes yielded five subgroups, characterized by shared morphology, labeled as 'Ciliated', 'Basal', 'Submucosal', 'Brush-like' and 'AE' (Alveolar Epithelium). The 'Ciliated' subgroup was comprised of differentiated multiciliated cells and their precursor, deuterosomal cells (28). The 'Basal' subgroup was comprised of epithelial cell progenitors, basal and cycling basal cells, as well as epithelial cell intermediary, suprabasal cells (53). The 'Brushlike' subgroup was comprised of three rare airway epithelial cell types: brush, ionocyte, and pulmonary neuroen-docrine cells (PNECs), as well as their likely shared 'brushlike' precursor cells (54), as suggested in the original publication of these data (28). Further partitioning of the 11 nonepithelial cell types yielded two main subgroups characterized by shared progenitor cells: immune (55) and mesenchymal stem cells (56), with the immune cell subgroup also including endothelial cells. Further partitions of the immune cell subgroup included progeny of monoblasts: monocytes, dendritic cells, and macrophages (57), followed by endothelial cells separated from all non-monoblast progeny immune cells subtypes in the adjacent subgroup.
In contrast, agglomerative hierarchical clustering of these cell clusters, even if evaluated at multiple F-statistic-based pre-filtering levels, yielded significantly different results poorly reflective of the known taxonomic cell type organization ( Figure 4C, Supplementary Figure S2). While the mesenchymal stem cell subgroup, comprised of fibroblasts, smooth muscle, and pericytes, was identified by Ward's method, and while there were other instances of concordant subgroups, none of these consisted of more than two cell types.
PAGA-based trajectory analysis (20) of the cell types performed better than agglomerative clustering, and demonstrated both improvements and drawbacks compared to K2Taxonomer. (Figure 4D, Supplementary Figure S3). Figure 4D shows what we selected to be the 'best' PAGA result for runs on a grid of F-statistic-based pre-filtering levels (25% of genes) and principal component-based dimensionality reduction size (15 principal components). In this case, PAGA recapitulated four subgroups identified by K2Taxonomer as disconnected subgraphs: 'Immune', 'Epithelial', 'Submucosal', and 'Mesenchymal'. Unlike K2Taxonomer, this model accurately segregated endothelial cells as its own disconnected vertex. However, all PAGA models failed to include basal cells in the epithelial cell subgraph ( Figure 4D, Supplementary Figure S3). Furthermore, PAGA segregated the histologically separated submucosal cells lines: SMG goblet and serous, from the other epithelial cell lines. The most apparent difference between the K2Taxonomer and PAGA results is the high connectedness of the PAGA subgraphs, especially considering that these are all 'near perfect-confidence' connections. As a result, subgroup relationships within these subgraphs are difficult to distinguish, and approaches to estimate tree-like graphs, such as minimum spanning tree algorithms, yield multiple equivalent solutions. Finally, the PAGA results varied based on the feature pre-filtering level and the number of principal components, most notably in the estimations of connections between immune cells and epithelial cells (Supplementary Figure S3).
While models generated by ascend-based hierarchical clustering were also able to recapitulate some of the subgroups, we did not observe any improvements compared to K2Taxonomer or PAGA (Supplementary Figure S4). Whereas the first K2Taxonomer partition segregated all epithelial cell types, each of the ascend models separated multiciliated cells from all other cell types at the first split. Moreover, in six out of nine of these models, the multiciliated cells did not subgroup with deuterosomal cells. Like PAGA, the structure and subgroups identified by individual ascend models varied widely based on feature pre-filtering level and number of principal components, such that a 'best' model wasn't discernable. For example, the ascend model run with 25% of genes and 10 principal components was the only model capable to identify a purely 'Immune' cell subgroup, but did not include the 'Basal' or 'Monoblast' subgroups identified by some of the other models. In terms of interpretability the ascend models presented additional challenges. As expected, individual profiles with the same cell type label were scattered throughout the ascend models, such that a purity-based heuristic approach was used to assign cell type labels at individual partitions. As a result, in most cases individual cell type labels 'dropped-out' of models before segregating in cell type-specific subgroups. Moreover, this assignment procedure required considerable ad hock manipulation of the objects generated by the ascend R package.

PAGE 11 OF 17
Nucleic Acids Research, 2021, Vol. 49, No. 17 e98 K2Taxonomer identifies subgroups of TILs characterized by differential regulation of TNF signaling, translation and mitotic activity from BRCA tumor scRNAseq cell clusters We performed K2Taxonomer analysis on scRNAseq data of 13 TIL cell clusters reflecting further subdivision of the 10 cell types reported in the original study (29). The higher resolution was achieved by reproducing the reported methods (29) with the exception of selecting a higher resolution parameter when performing clustering with Seurat (14) (Supplementary Table S3). The full set of results is also available through the interactive portal at https://montilab.bu. edu/k2BRCAtcell.
We compared the subgroup discovery of this model to that of ascend (24) (Supplementary Figure S5). The models differed substantially. Notably, unlike K2Taxonomer, ascend-based models did not segregate monocytes from T cell subtypes in the first partition and segregated mitotic Treg cells, CD4+ FOXP3+ (3), from the two other Treg cell subtypes in the second partition. Moreover, the ascendbased model segregated individual profiles of CD4+ Trm cells, such that this cell type 'dropped-out' out of the model early in partitioning.

Confounding effects of inflammation and proliferation on the association between tumor infiltrating cell activity and patient survival
To assess the clinical relevance of K2Taxonomer annotation of single-cell immune cell subgroups, we performed survival analysis, via Cox proportional hazards testing, modeling the relationship between K2Taxonomer subgroup gene signature scores and patient survival in the METABRIC breast cancer bulk gene expression data set.
For these models, we examined two possible sources of confounding factors. First, inflammation has a welldescribed paradoxical role in breast cancer progression (58), such that the content of different subpopulations of lymphocytes has been associated with both better and worse prognosis (59). Given the physiological similarities between different lymphocyte subtypes (60), we hypothesized that expression patterns associated with tumor-promoting inflammation could mask those associated with tumorsuppressing TILs subsets. Second, we hypothesized that the signatures of the two mitotic T cell subgroups were similar enough to signatures of proliferative activity in nonimmune tumor cells to result in a spurious association between T cell mitosis and worse prognosis. To assess and correct for these confounding effects, multivariate survival models were run without and with the inclusion of inflammation and proliferation scores as individual covariates. These patient-level scores were estimated by projecting published signatures of 'bad' inflammation (49) and proliferation (50), each of which had been previously reported to be associated with poor prognosis in breast cancer.
The results of each of these analyses are summarized in Figure 5A. The full set of survival results for unadjusted and adjusted models, including the genes belonging to each subgroup signature are reported in Supplementary Table S12. Controlling for inflammation and proliferation scores increased the overall significance of the association between subgroup-driven signatures of TILs and improved survival (hazard ratio < 1, FDR < 0.05). Furthermore, signatures of two cell subgroups, 'CD8+ mit. Trm' and 'Treg mit.', characterized by increased cell cycle activity ( Figure 5B), were associated with worse patient survival in models ignoring inflammation and proliferation scores, but were subsequently statistically insignificant in models including these covariates, likely reflecting the effect of confounding by proliferation activity ( Figure 5A). This is further illustrated in Figure 5E, which shows the 95% confidence intervals of hazard ratios of 'marginal' inflammation and proliferation models (left-most), as well as the confidence intervals of hazard ratios of select subgroups of cell subtypes, unadjusted and adjusted for inflammation and proliferation. Controlling for inflammation and proliferation allowed us to disentangle the contribution to survival of different components. For example, in the 'CD8+ mit. Trm' subgroup, we observed that the 'CD8+ mit. Trm' signature score was highly associated with worse patient survival in the unadjusted model, but the association became insignificant in the full model adjusted for proliferation and inflammation. On the other hand, there were instances where the hazard ratio achieved or improved significance (i.e., patient survival was significantly better) only after controlling for inflammation and proliferation in the full adjusted model, as observed in the 'Trm All' subgroup and, to a lesser extent, in the 'Trans-lation+' subgroup ( Figure 5E).

High expression of TNFRSF4, a marker for Treg cell activity is associated with worse survival when adjusting for CCL5 expression
TNFRSF4 and CCL5 were found to be the top two markers constitutively up-and down-regulated, respectively, within Treg subgroups, with TNFRSF4 the top marker further dis- criminating between the two non-mitotic Treg subgroups, Treg TNFRSF4+ and Treg RGS1+ ( Figure 5A, C, D). Furthermore, their expression was highly correlated in the METABRIC data set (rho = 0.66, P-value = 3.2e−245) ( Figure 5F), supporting a pattern of co-expression within TIL microenvironments. To assess whether TNFRSF4 and CCL5 expression levels could serve as markers for immunosuppressive activity of Treg cells, we performed survival analysis of each gene modeled separately and in a combined model ( Figure 5G). When modeled separately, the expression of TNFRSF4 is not associated with patient survival (Pvalue = 0.32), while CCL5 is associated with better patient prognosis (P-value = 1.94E−4). However, in the combined model both genes are associated with patient survival, with TNFRSF4 associated with worse patient survival (P-value = 0.015).
Taken together these results indicate that, in bulk gene expression data, markers of Treg cell activity are highly correlated with markers of overall tumor immune infiltration, confounding associations between expression of these markers and patient survival.

Up-regulation of specific translation genes characterizes a subgroup of TILs and is associated with better survival prognosis, independent of inflammation activity
The 'Translation+' subgroup was a notable instance where the subgroup-specific signature projection was associated with better patient survival, regardless of adjustment for inflammation and proliferation ( Figure 5A, E). To assess the extent to which up-regulation of translation-specific genes in this subgroup associated with better patient prognosis, we ran separate survival analysis for each of the 112 genes from the Reactome Eukaryotic Translation Initiation gene set, which were shared between the single-cell BRCA gene set and METABRIC data set. Of the 112 genes, 61 were upregulated in the 'Translation+' subgroup (FDR < 1E−5), including 26 genes within the top 50 marker 'Translation+' subgroup signatures ( Figure 5H, I). The full set of survival results for these 112 genes are shown in Supplementary Table S13. The test statistics derived from single gene Cox proportional hazards models were negatively correlated with the corresponding genes' test statistics of their up-regulation in the 'Translation+' subgroup (rho = −0.23, P-value = 0.014) ( Figure 5I). Furthermore, seven of the 112 genes were associated with better patient survival (FDR < 0.1). All of these genes were significantly up-regulated in the 'Translation+' subgroup. Of these seven genes, RPL36A had the minimum 'Translation+' subgroup-associated test statistic (FDR = 1.34e−25) and two (RPS28 and RPS27) were members of the top 50 markers, comprising the 'Trans-lation+' subgroup signature. RPS27 was the top translation gene associated with the 'Translation+' subgroup (FDR = 9.6e−246).
In summary, we employed K2Taxonomer to characterize the up-regulation of translational machinery as the dominant transcriptional program shared by a diverse subgroup of TILs. These findings informed additional analyses, which demonstrated that association between expression of translational machinery genes and better patient survival tracked with their over-expression in a subgroup of TILs. Taken together these findings suggest that the up-regulation of specific translational machinery genes is widespread across TILs, serving as a predictor of the level of immune infiltration in breast cancer tissue from bulk gene expression data.

DISCUSSION
In this manuscript, we presented extensive assessment and practical applications of K2Taxonomer, a novel unsupervised recursive partitioning algorithm for taxonomy discovery in both bulk and single-cell high-throughput transcriptomic profiles. An important distinctive feature of the algorithm is that each partition is estimated based on a feature set selected to be most discriminatory within that partition, thus permitting the use of large sets of features to be used as input, without pre-filtering or dimensionality reduction approaches. Additionally, to minimize generalization error, each partition is based on an ensemble (61) of partition estimates from repeated perturbations of the data. The adoption of an ensemble approach also makes it possible to compute a stability measure for each partition, which can be used to assess the robustness of each partition, as well as a stopping criterion for limiting the number of subgroup estimates. Finally, to facilitate comprehensive exploration of the results, as well as to share results for independent interrogation, the K2Taxonomer R package includes functionalities to automatically generate interactive web-portals. One of these portals was utilized extensively to annotate subgroups as part of our analysis of breast cancer TILs in scRNAseq data and is publicly available (https://montilab.bu.edu/k2BRCAtcell/).
Our extensive evaluation and benchmarking of the methods on simulated data showed K2Taxonomer's high accuracy, its superior performance when compared to representative other methods, and its capability to (re)discover known nested taxonomies. As we have shown in its multiple applications, K2Taxonomer may be applied in a fully unsupervised mode to partition individual-level data, or it can take group-level labels as input to estimate inter-group relationships among the known groups. In the group-level settings, K2Taxonomer is not directly comparable to fully unsupervised 'flat' or hierarchical clustering methods. In our simulations, the comparison to Ward agglomerative clustering was attained by averaging the profiles within each simulated 'group' and by then clustering these 'average profiles.' Side-by-side comparison with any of the single cell algorithms (23-25) on simulated data was thus deemed not necessary, since their main difference from standard agglomerative clustering is in their handling of the highly sparse single cell profiles, a need that is eliminated when dealing with the afore mentioned 'average profiles. ' In the group-level analysis, partition estimates are based on the constrained K-means algorithm (26), which estimates clusters at the level of known group labels. This approach is perfectly suited to the downstream analysis of scRNAseq data, following the estimation of mutually exclusive cell types using scRNAseq clustering methods such as Seurat (14) or Scran (15). It thus provides a unique tool for the analysis of ever-growing single cell data repositories such as the Human Cell Atlas (62) Program (HuBMAP) (63) among others. By preserving the single observation information within each group, and by thus being able to tailor the feature set to each of the groups, we expect our approach to outperform methods in which group-level information is summarized into single statistical measures. This conclusion is supported by our simulation analysis, where K2Taxonomer was shown to significantly outperform Ward's agglomerative method based on grouplevel test statistics. Even when adopted for observation-level analysis, where inference was performed on the full set of individual observations, K2Taxonomer was still shown to significantly outperform standard agglomerative methods, on both simulated and real data, although not to as large an extent.
In our analysis of healthy airway cell types' annotation (28), we employed K2Taxonomer to (re)discover subgroups of cell types characterized by shared lineage. Remarkably, our analysis accurately recapitulated the known taxonomic structure relating the different cell types to an extent not matched by the other methods evaluated. This example illustrates a prototypical use of the tool: in those cases where a data set and its associated cell type estimations are publicly available, K2Taxonomer facilitates their immediate repurposing for additional insight and discovery. This is a defining advantage over fully unsupervised scRNAseq hierarchical clustering methods, such as ascend (24), which cannot make use of prior labeling of the cell profiles for subgroup discovery. Subgroup annotation from ascend analysis of these data was complicated by the fact that cell profiles belonging to the same cell type label were split across multiple partitions, and often to an extent that it made it challenging to assign a representative cell type label to individual subgroups. Moreover, even for annotated subgroups, none of the ascend models were able to recapitulate taxonomic relationships between cell types to the same level as the K2Taxonomer model. This was also true when comparing the K2Taxonomer and ascend models generated from the breast cancer TILs scRNAseq data set.
It is important to emphasize that in many data sets continuous lineage trajectories are non-existent or obscured by phenotype-driven inter-group transcriptional relationships. While K2Taxonomer cannot identify precursor relationships between cell types, the strategy of pairing recursive partitioning with local feature selection allows the discernment of relative relationships between groups rather than 'all-or-nothing' connections as in graph-based trajectory models. The advantages of this strategy are exemplified by the analysis of the healthy airway, in which the data included samples of multiple individuals and airway locations. PAGA analysis of these data produced highly-connected graphs with no decipherable trajectories or subgroups beyond that of disconnected subgraphs. On the other hand, K2Taxonomer recapitulated clear lineagedriven hierarchies of airway cell subgroups. While the subgrouping of endothelial cells as a subset of immune cells does not reflect the expected hierarchy, K2Taxonomer generated this model without parameterization. The 'best' PAGA model had the unfair advantage of having been chosen from a set of distinct models generated from combinations of feature selection and dimensionality reduction pa-rameters. In practice, the a priori choice of the optimal values of these parameters is challenging.
Our extensive analysis of single-cell data from breast cancer TILs showcased the incorporation of K2Taxonomer in an advanced in-silico study that yielded significant novel insights. In contrast to bulk gene expression, which captures average expression across all cells, identifying dominant transcriptional programs driving phenotypic similarities between subgroups of cell populations offers additional insights to deconvolute the cellular microenvironment of these samples beyond their individual transcriptional signatures. Molecular convergence of cells of disparate lineages is exemplified by subpopulations of CD8+ and CD4+ T cells, each of which exists in various functional states as naïve, effector, and memory subpopulations (64). Importantly, our K2Taxonomer-based analysis showed that concordant subpopulations of CD8+ and CD4+ T cells share transcriptional signatures that may outweigh those arising from their shared lineage. For example, both CD8+ Trm and CD4+ Trm cells have been reported to express surface molecules, CD69 and CD11a (65). Concordantly, CD8+ Trm and CD4+ Trm cells were segregated into a common subgroup by K2T, demonstrating the relative dominance of their shared transcriptional activity. Projection of the expression signature of Trm cell subgroups was associated with better survival in the METABRIC data set. Past studies focusing on CD8+ Trm cell markers have reported similar findings (29,66).
Unlike Trm cells, the presence of immune-suppressing Treg cells in the microenvironment has been associated with poor prognosis in breast cancer (67)(68)(69). After identifying TNFRSF4 as heterogeneously expressed across the Treg cell subgroup, we showed that TNFRSF4 expression was associated with worse patient survival in the METABRIC data set when adjusted for CCL5 expression, which was downregulated among all Treg cells. This supports previous findings that TNFRSF4, also known as OX40, is a marker of high Treg cell immunosuppressive activity (70,71). The high level of co-expression of TNFRSF4 and CCL5 in the METABRIC data set suggests that either gene is associated with immune infiltration in breast cancer tumors. Additionally, this provides a resolution as to why projections of the signature of the Treg cell subgroup were associated with better patient survival, while the signature of the Treg cell subset characterized by high TNFRSF4 expression, was not. These results are consistent with previous studies establishing the ratio between Treg and CD8+ T cell abundance as a prognostic marker of breast cancer that reflects immune inhibitory function of Treg cells (72). Moreover, these results strongly suggest a prognostic value for markers that capture the degree of immune inhibitory activity of Treg cell populations.
Finally, K2Taxonomer identified a diverse subgroup of breast cancer TILs characterized by consistent upregulation of translational genes. Increased ribosomal biogenesis has been previously implicated in increased tumorigenesis (73)(74)(75)(76), but has only recently been implicated in T cell activation (30) and expansion (31). Unlike the majority of other subgroups, the signature of the T cell subgroup overexpressing translational machinery genes was associ-ated with better patient survival in METABRIC patients regardless of adjustments for inflammation (49) and proliferation (50) signatures. Furthermore, the association of the expression of specific translational genes with better patient survival was significantly correlated to their overexpression in this T cell subgroup. These results suggest that overexpression of these T cell-specific translational genes is not masked by tumor-specific gene expression and is therefore indicative of CD4+ and CD8+ T cell tumor infiltration.
In summary, K2Taxonomer demonstrated a remarkable ability to discover biologically relevant taxonomies when applied to the analysis of both bulk gene expression and scRNAseq data and to outperform standard agglomerative methods. In multiple practical applications, we showcased the versatility of K2Taxonomer to analyze scRNAseq data toward the characterization of genes and pathways distinguishing specific subgroups, thereby generating hypotheses that were then in-silico validated in independent bulk gene expression data. As noted, while we here focused on the analysis of transcriptomics data, the proposed approach is equally applicable to other bulk and single-cell 'omics' data, such as those generated by high-throughput proteomics and metabolomics assays.
The data that support these findings are publicly available and were accessed from several repositories. The bulk microarray gene expression and subject data from the METABRIC breast cancer study is available through the cBioPortal, https: //www.cbioportal.org/study/summary?id=brca metabric. The bulk RNAseq gene expression and subject data from the TCGA BRCA project is available through the Genomic Data Commons, https://gdc.cancer.gov/access-data/. The single-cell RNAseq gene expression and subject data of healthy airway tissue is available through the UCSC Cell Browser page, https://www.genomique.eu/cellbrowser/ HCA/?ds=HCA airway epithelium. Finally, the single-cell RNAseq gene expression and subject data of breast cancer immune cells is available in the Gene Expression Omnibus, GSE110938.

SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.