PhenoComb: a discovery tool to assess complex phenotypes in high-dimensional single-cell datasets

Abstract Motivation High-dimensional cytometry assays can simultaneously measure dozens of markers, enabling the investigation of complex phenotypes. However, as manual gating relies on previous biological knowledge, few marker combinations are often assessed. This results in complex phenotypes with the potential for biological relevance being overlooked. Here, we present PhenoComb, an R package that allows agnostic exploration of phenotypes by assessing all combinations of markers. PhenoComb uses signal intensity thresholds to assign markers to discrete states (e.g. negative, low, high) and then counts the number of cells per sample from all possible marker combinations in a memory-safe manner. Time and disk space are the only constraints on the number of markers evaluated. PhenoComb also provides several approaches to perform statistical comparisons, evaluate the relevance of phenotypes and assess the independence of identified phenotypes. PhenoComb allows users to guide analysis by adjusting several function arguments, such as identifying parent populations of interest, filtering of low-frequency populations and defining a maximum complexity of phenotypes to evaluate. We have designed PhenoComb to be compatible with a local computer or server-based use. Results In testing of PhenoComb’s performance on synthetic datasets, computation on 16 markers was completed in the scale of minutes and up to 26 markers in hours. We applied PhenoComb to two publicly available datasets: an HIV flow cytometry dataset (12 markers and 421 samples) and the COVIDome CyTOF dataset (40 markers and 99 samples). In the HIV dataset, PhenoComb identified immune phenotypes associated with HIV seroconversion, including those highlighted in the original publication. In the COVID dataset, we identified several immune phenotypes with altered frequencies in infected individuals relative to healthy individuals. Collectively, PhenoComb represents a powerful discovery tool for agnostically assessing high-dimensional single-cell data. Availability and implementation The PhenoComb R package can be downloaded from https://github.com/SciOmicsLab/PhenoComb. Supplementary information Supplementary data are available at Bioinformatics Advances online.


Introduction
Flow and mass cytometry assays are able to measure dozens of markers in hundreds of thousands of cells per sample at single-cell resolution (Chattopadhyay et al., 2008), enabling the characterization of complex phenotypes and powering new insights. However, traditional, hypothesis-driven, manual gating analysis of cytometry data relies on prior knowledge of the researcher, enabling only incremental advances. In contrast, agnostic approaches to the analysis of cytometry data are powerful discovery tools that more fully utilize the high-dimensional data and are not limited to a priori hypotheses. However, many agnostic approaches, such as dimension reduction [e.g. Uniform Manifold Approximation and Projection (UMAP) (Becht et al., 2019)] and clustering tools [e.g. Phenograph (Levine et al., 2015)], fail to capture discrete phenotypes, limiting interpretability, the ability to do statistical analysis and downstream isolation of cells (e.g. flow sorting) for future experiments.
An alternative agnostic approach is to exhaustively identify discrete phenotypes in the flow data by assigning discrete states for each of the measured markers and counting the cells that have the combination of marker states of a given phenotype. For example, if two markers (A and B) are measured and discretized into positive (þ) and negative (À) states, an exhaustive combination of markers and states would result in the following phenotypes: AþBþ, AþBÀ, AÀBþ and AÀBÀ. Additionally, the partial combinations of markers could also be considered: Aþ, AÀ, Bþ and BÀ. By visiting all possible marker/state combinations, the cell counts could be then used to assess which phenotypes are relevant given a comparison of interest.
Similar approaches were shown to be effective, unveiling novel phenotypes in the context of cancer therapy (Woods et al., 2020) and infectious diseases (Aghaeepour et al., 2012). However, exhaustive exploration in the phenotypic landscape can rapidly get computationally expensive to perform as the number of markers increases. This is because the number of possible phenotypes increases exponentially, such as ðn þ 1Þ m À 1, with the number of markers (m) and states (n). Previous approaches could only tackle a limited number of markers, had long run times, and are no longer supported/available. Thus, new and more powerful bioinformatic tools are needed to perform this kind of analysis.
Here, we present PhenoComb, an efficient tool for the exploration of complex phenotypic landscapes. It is available as an R package and implements methods to (i) convert continuous scale data for marker expression into discrete states, (ii) count cells for all possible phenotypes present in the input dataset, (iii) perform a variety of statistical comparisons to evaluate the relevance of those phenotypes based on cell frequency of the samples and (iv) identify relevant phenotypes that are independent from each other. PhenoComb takes advantage of parallel computing and is implemented in a memory-safe manner, overcoming limitations on the number of markers from previous approaches. It can process datasets with virtually any number of markers, only being limited by computing time and disk space for the outputs.

Package implementation and features
PhenoComb is available as an R package with some functions implemented in Cþþ for efficiency, but only available through R's interface. There are two coding workflows available that implement the analysis illustrated in Figure 1: a local workflow, designed for ease of use, holding all objects in memory, being easily accessible for the user; and a server workflow, designed to handle large datasets which cannot be held in memory. In both implementations, PhenoComb takes advantage of parallel computing to increase performance where steps of the analysis are mutually independent. Tutorials on how to install and use PhenoComb are available at the GitHub repository page: https://github.com/SciOmicsLab/PhenoComb.
The package also provides several ways to filter data and tune the analysis. For example, cells can be filtered out if intensity values are over a certain threshold, called Out of Bound, to reduce noise. When counting cells for all phenotypes, the user can filter phenotypes by the minimum number of cells on a fraction of samples, by parent phenotype, and also set the maximum phenotype length desired. For the statistical tests, the output can be filtered by parent population and P-value. PhenoComb also provides some convenience tools to deal with big outputs, such as finding specific phenotypes in files.

Combinatorial phenotypes computation
Given a dataset with m markers' signal intensity measured, each continuous measurement can be discretized into a finite number n of states through nÀ1 provided thresholds obtained by manual gating. These states can be called, for example, negative (À) and positive (þ) if one considers only two states, or negative (À), low (þ) and high (þþ) if three states are considered. The number of discrete states can be any number !2, and there can be a different number of states for each marker.
Considering that all m markers can be discretized into n states, there are n m possible combinations of marker states. Each different combination will henceforth be referred to as phenotypes. Additionally, we can consider phenotypes that are neutral for one or more markers, increasing the total number of possible phenotypes to ðn þ 1Þ m À 1.
To illustrate the combinations between markers and states, Figure 1a shows all possible combinations between three markers (A, B and C) each one having two possible states (À and þ). Each row indicates the number of markers for the phenotype, henceforth called length, as a consequence of considering phenotypes neutral to certain markers.
PhenoComb provides a method to count the number of cells in the dataset that have a given phenotype for all possible phenotypes across all samples. This is done efficiently by first counting cells for all full-length phenotypes present in the data, and then generating all other phenotypes with neutral states by summing up the cells already counted in the full-length ones. Since the computation of each neutral-state combination is independent from each other, this step can take advantage of parallel computation to increase performance.

Evaluation of generated phenotypes
The statistical evaluation of a phenotype based on its cell counts across samples can be done in several ways. PhenoComb provides three types of statistical comparisons: two group comparison using the Mann-Whitney U-test, correlation with a given variable using Kendall's rank correlation, and a time-to-event (survival) analysis test using Cox Proportional Hazards Model. Since the total number of cells can vary across samples, all statistical tests are performed on cell frequencies, being the cell count of the phenotype divided by the total number of cells the sample contains. If phenotypes are filtered for a given parent phenotype, the frequencies will be calculated based on the parent phenotype's cell count instead of the total number of cells. The phenotypes can then be filtered by respective P-values as represented in bold font in Figure 1a.
While an exhaustive evaluation of marker combinations can create a large number of phenotypes, these phenotypes are not independent of one another. For example, in a two-marker dataset, the phenotype Aþ is related to the phenotypes AÀ, AþBÀ and AþBþ. Given the interrelatedness of phenotypes, when performing statistical comparisons, the type I error rate (i.e. false positives) is not a product of the number of comparisons. Therefore, traditional methods for correcting for multiple comparisons are not appropriate. However, an inflated type I error rate (i.e. false positives) will still occur. PhenoComb does not adjust for multiple comparisons and should be used with that caveat.

Identification of independent phenotypes
As mentioned in the previous section, phenotype cell counts can be dependent on each other. In large datasets, it can be hard to distinguish the phenotypes that are the actual drivers of statistical/biological relevance. PhenoComb provides a network-based algorithm aimed at finding the driver phenotypes.
Given a number P of phenotypes, a phenotypic similarity (ps) score matrix S PÂP is calculated based on the number of markers each pair of phenotypes share, such as: where ' is the length of a given phenotype. The similarity matrix S can be understood as a symmetric adjacency matrix, thus representing an undirected network where each node is a phenotype, and the links between them are weighted by the ps score given in Equation (1). An example of such network is depicted in Figure 1b. Next, n sub-networks are generated by pruning the original network from its edges by applying a set increasing thresholds t 1 ; t 2 ; . . . ; t n with The pruning is performed by applying the following function to the matrix S: For each sub-network, we apply the Louvain clustering algorithm (Blondel et al., 2008), which automatically finds the optimal number of clusters and labels the nodes accordingly, as illustrated in Figure 1c. A representative phenotype is then selected from each cluster by ranking them according to the absolute value from the statistic provided by the chosen comparison. For group comparison, the ranking value used is the jeffect sizej Ã jlog 2 ðfold changeÞj. We use this formula to avoid artifacts such as only one sample with an unexpected high frequency driving the fold change. For the correlation test, the correlation metric is used. The Cox Proportional Hazards coefficient is used if this method was chosen. All values used for ranking are normalized for the range. Finally, the representative phenotypes are ranked by a confidence score based on the frequency that the given phenotype shows up as representative in the n sub-networks, as depicted in Figure 1d.

Synthetic datasets
To better evaluate PhenoComb's performance, we generated a set of synthetic datasets simulating several combinations of number of markers, maximum phenotype length, number of cells and number of samples. Since they were only used to test the performance of the combinatorial phenotypes computation, all full-length phenotypes for all samples have a similar number of cells. The code to generate such datasets is available on the aforementioned GitHub repository. All computations were performed on a 28-core, 56-thread, Intel Xeon W-3275M 2.50 GHz with 1 TB of memory. However, the memory requirements for PhenoComb are approximately four times the size of the fcs input file.

The HIV dataset
This dataset, previously described in Weintrob et al. (2008), was obtained from FlowRespository.org, flow repository ID FR-FCM-ZZZK. The dataset consisted of 466 peripheral blood mononuclear cells (PBMC) samples obtained from human immunodeficiency virus (HIV)-infected military personnel that were evaluated by flow cytometry for 12 markers (CD3, CD4, CD8, CD45RO, CD27, CD28, CD57, CCR5, CCR7, CD127, Ki67 and CD14) and a viability dye. Samples contained a mean of 393 315 events. Individual sample fcs files were gated to exclude debris (foward scatter area versus side scatter area). Samples with <5000 cells were not considered, leaving a total of 421 samples to be analyzed. Dead cells and CD14þ cells were removed through a dump channel gate. Samples were then concatenated and each marker gated as in Supplementary Figure S1A. These analyses were performed on FlowJo 10.7 software. The fluorescence intensity coordinates of gates (i.e. threshold differentiating positive and negative populations for each marker) were recorded in a csv file for reference in PhenoComb function arguments.

The COVIDome dataset
This dataset, previously described in Sullivan et al. (2021), was also obtained from FlowRespository.org, flow repository ID FR-FCM-Z367. The dataset consisted of 99 PBMC samples obtained from The network has its edges filtered by several ps thresholds and a clustering algorithm is applied for each one. For each cluster identified, the phenotype with the highest log 2 fold-change is selected as the representative one. Clusters are represented by different colors and representative phenotypes are highlighted with a green border. (d) A confidence score for each phenotype is then computed as the frequency of its selection as a representative throughout the networks COVID-infected individuals (n ¼ 69) and COVID-negative individuals (n ¼ 30), which were evaluated for 40 markers by CyTOF mass cytometry. Samples contained a mean of 390 059 events. The publicly available fcs files were previously gated to remove anomalies and dead cells. Sample fcs files were concatenated, each marker gated as in Supplementary Figure S1B, and the intensity coordinates of gates were recorded in a csv file.

flowType comparison
flowType (Aghaeepour et al., 2012) version 2.25.0 was installed using R version 4.0.5 with BiocManager version 3.11. The aforementioned synthetic dataset was assessed by flowType for marker combinations of length 4 through 11, each being repeated three times and the mean runtime assessed.

Performance benchmark
To evaluate PhenoComb's performance, we simulated a set of 48 synthetic datasets considering different numbers of markers, cells, samples, maximum phenotype length and number of threads used to compute. These datasets represent the worst-case scenario for each configuration of parameters since all possible combinations of fulllength phenotypes will be present in them. That is unlikely to happen in real datasets because of biological restrictions and limited number of sampled cells. To illustrate the magnitude of the analysis, Figure 2a shows the number of expected computed phenotypes in the synthetic datasets for a given number of markers. It shows the exponential nature of the combinatorial problem. Also, the number of phenotypes in HIV and COVIDome datasets (discussed later) is shown in purple, confirming that the number of phenotypes present in real datasets is lower than in the theoretical ones, despite some markers in the real datasets having more than two states. For each of the experiments shown in Figure 2b-f, we varied one parameter and fixed the remaining parameters as follow: 15 markers, 40 000 cells per sample, 10 samples, 50 threads and the maximum phenotype length as the number of markers unless otherwise stated. All datasets were assessed with two states for each marker. All analysis was performed using the server workflow since the local workflow only allows the processing of small datasets, which are usually computed in a matter of a few minutes. The experiments shown in Figure 2 account for the combinatorial phenotype counting step, which is the most impacted by the number of markers and states.
The parameter that impacts PhenoComb's performance the most is the number of markers present in the dataset. This was expected since the number of possible phenotypes grows exponentially with the number of markers. Figure 2b shows that the computational complexity growth is super-exponential.
PhenoComb allows a decrease in the maximum length of phenotypes to mitigate this problem when very large phenotypes are not of interest. This feature can also decrease exponentially the computation time as shown in Figure 2c. Nevertheless, computation times for real datasets are almost always lower than the synthetic ones because of the lower number of present phenotypes, even though the number of samples and the cell counts per sample were higher in the real datasets used versus the synthetic datasets. The only case where the real dataset is higher than the synthetic is for the HIV data because of its high number of samples (n ¼ 421).
The total number of cells in the dataset had little impact on the cell counting process once they are all transformed to single numbers when calculating the full-length phenotypes, a relatively computationally inexpensive process (Fig. 2d). The number of samples present in the dataset increase linearly the time to compute the phenotypes as shown in Figure 2e. Lastly, PhenoComb can take advantage of parallel computing, reducing almost exponentially the computing time as the number of threads used is increased (Fig. 2f).
To assess the relative performance of PhenoComb, flowType was used to assess the synthetic datasets. As before, 10 samples with 40 000 cells each were assessed using 50 threads. Marker combinations of length 4 through 15 were assessed. The mean runtimes from three independent runs for each marker length are shown by the orange points in Figure 2b. Also, it's worth noticing that the times measured for flowType do not account writing results to a file, which are included for PhenoComb.

Validation of phenotypes identified in previous approaches
To test PhenoComb's ability to identify biologically meaningful phenotypes, we used a previously published, publicly available flow cytometry dataset evaluating T-cell immunophenotypes associated with HIV seroconversion (Aghaeepour et al., 2012). PhenoComb was utilized to discretize markers and assess all marker combinations. Marker combinations for CD3þ population were then assessed for associations with survival time following seroconversion. A total of 9754 phenotypes with an unadjusted P-value of <5:0E À 7 were identified. We specifically focused on phenotypes identified in the original publication (Table 1). Of the three phenotypes highlighted in the original publication as being significantly correlated with time to seroconversion, PhenoComb identified two as also statistically significant.
We also benchmarked PhenoComb's performance in assessing this dataset. For the 12 markers assessed, a total of 220 170 phenotypes were identified using a minimum cell count threshold of 100 in 50% of samples out of a theoretical maximum of 531 440 phenotypes (Fig. 2a). Computation time was 2.35 min (Fig. 2b).
Phenotypes with a confidence score of 1 are reported in Table 2 and graphs of the data are shown in Figure 3. Both the CD45þCD3ÀCD19ÀCD56ÀCD14À and the CD3þCD19þ population had elevated frequencies of cells expressing the proliferation marker Ki67 and high levels of the activation marker/ectoenzyme CD38 (Quarona et al., 2013) in COVID-positive samples (Fig. 3ac). In T cells, decreased CD4 cells expressing CD45RA and negative for expression of CCR7 and CD25 were observed in COVIDpositive samples relative to COVID-negative (Fig. 3d). In NK cells, elevated frequencies of cells expressing the proliferation marker Ki67 were seen in COVID-positive samples (Fig. 3e), and elevated frequencies of cells negative for Ki67 were seen in COVID-negative samples (Fig. 3f). Monocytes also had elevated frequencies of Ki67expressing cells in COVID-positive samples (Fig. 3g). Increases in frequencies of the monocyte parent lineage expressing the dendritic cell marker CD141 (Fig. 3h) and, separately, CD11b (Fig. 3i) were also observed in COVID-positive samples.
For each of the lineages assessed, we also determined the total number of phenotypes present and the total runtime needed to compute. These data are plotted in Figure 2a and b. The following number of phenotypes were identified for the corresponding lineage (minimum cell count 100 in 50% of samples): CD45þCD56þ, 485 832 phenotypes; CD45þCD14þ, 1 537 542 phenotypes; CD45þCD3ÀCD19ÀCD56ÀCD14À, 28 219 848 phenotypes;  PhenoComb: Assess complex phenotypes

Discussion
Herein, we demonstrated that PhenoComb is an efficient and effective discovery tool for agnostically interrogating high-dimensional cytometry data. The package pipeline is carried out over four steps: (i) discretizing marker expression based on user-defined thresholds, (ii) assessing the frequencies of all maker combinations present in the dataset for each sample, (iii) performing a statistical comparison of the phenotype frequencies and (iv) network/clustering analysis to identify representative independent phenotypes from those identified as significantly different. While we presented a pipeline analysis of data using PhenoComb, dataframes/csv files are generated at intermediate points that can be used for further interrogation inside and outside of PhenoComb. For example, the frequency of all identified phenotypes can be accessed to further interrogate specific phenotypes of interest, even if they are not identified as different in comparisons or an independent phenotype in clustering. As PhenoComb utilizes thresholds to discretize data, inputs from separate experiments need to be assessed independently. This is consequence of the different thresholds for markers across experiments, except in cases of standardization (e.g. using rainbow beads to standardize voltages). As a result, PhenoComb's utility as a direct pipeline for use on datasets with batch differences is limited. However, using the product of intermediate steps, workarounds are possible (e.g. combining csv files generated from separately discretized experiments and then assessing frequencies).
To the authors' knowledge, no tools similar to PhenoComb in function are publicly available. The package flowType is the closet in function but is no longer actively supported. The comparison of PhenoComb and flowType on synthetic data highlights flowType's limitation of assessing only relatively short maximum marker lengths. While flowType was used to assess marker combinations only up to a maximum of length 11, the theoretical time necessary Fig. 3. Identified phenotypes. Phenotypes differing between COVID-positive and COVID-negative samples identified as independent with a confidence score of 1 are shown for the following parent lineages: (a, b) CD45þCD14ÀCD19ÀCD3ÀCD56À, (c) CD45þCD19þ, (d) CD45þCD3þ, (e, f) CD45þCD56þ, (g-i) CD45þCD14þ. Each panel shows the corresponding difference of means, the 95% confidence intervals and associated P-values for a 15 marker long phenotype is estimated at 42 days. At all marker combination lengths assessed, PhenoComb had shorter runtimes of several orders of magnitude. It is important to note that flowType, to the best of our knowledge, is limited to single-core use. As flowType operates in memory, it is also limited by available RAM. Conversely, PhenoComb takes advantage of multi-core processors and is not limited by available RAM. The presented analyses utilized two-state (i.e. 6) or three-state (i.e. negative, low, high) discretization of markers in datasets, but PhenoComb is theoretically compatible with a large number of discretely defined states. Defining the thresholds and the number of states is determined by the user, as it is in manual analysis. The number of states should be guided by visual analysis and background knowledge for markers. For example, CD4 is expressed by monocytes (and other innate cell lineages) in addition to T cells, with monocytes having a 'low' expression state and T cells having a 'high' expression state. Hence, thresholds for CD4 should define three states if the staining index is adequate to discriminate. Caution should be taken not to define more states than are supported by the biology of the marker and the quality of staining. While measurements for markers in cytometry are continuous, much of the variation is technical in nature rather than biological.
While we generated synthetic datasets to assess the computational time necessary, real datasets take significantly less time to calculate than the theoretical. As mentioned before and shown in our analysis of the HIV and COVIDome datasets, this is due to many of the theoretical marker combinations being absent in real data. This is the consequence of a limited number of cells being interrogated and/or biological limitations to co-expression/exclusion of markers. With a large enough cell count per sample, an interesting question that PhenoComb is capable of answering is how many of the theoretical marker combinations are present. In other words, PhenoComb could be applied to identify if and what phenotypes are absent. Likewise, PhenoComb could be used to interrogate highly correlated markers, which then could be used to inform more efficient future analyses.
Using PhenoComb to assess the HIV and COVIDome datasets, we demonstrated it was able to process large, real-world datasets in a relatively short amount of time. For example, in the COVIDome dataset analysis of the CD45þCD14þ lineage, PhenoComb was able to assess over 1.5 million phenotypes derived from 16 markers across 99 samples in under 19 min. However, given the superexponential increase in computation time, increases in the maximum length of phenotypes can result in analyses requiring extended amounts of time. Illustrating this, the CD45þCD3þ lineage, also from the COVIDome dataset, took 50 h to assess 26 markers. While for specialized and/or occasional use this may be an acceptable amount of time, often it is impractical. This can be mitigated through tuning functional arguments in PhenoComb, including filtering out low-frequency populations or limiting the maximum length of phenotypes to less than the total number of markers interrogated.
In interrogating the HIV dataset, PhenoComb identified two of the three phenotypes associated with time to seroconversion highlighted in the original publication. However, there was variation between the P-values, Cox Proportional Hazard coefficients and cell frequencies between PhenoComb and the original publication. This likely reflects the differences in statistical tests utilized and differences caused by the subjective nature of gating markers. This likely also explains why PhenoComb did not identify one of the three original publication phenotypes as significant.
In the COVIDome dataset, PhenoComb identified Ki67þ phenotypes in the NK cell, monocyte and undefined other immune cell (i.e. CD45þCD3ÀCD19ÀCD56ÀCD19À) lineages elevated in peripheral blood samples from COVID-positive individuals. Increases in B cells and undefined other immune cells expressing high levels of CD38 were also observed. These phenotypes highlight the proliferation and activation of immune cells as elevated in COVID infection. These phenotypes are associated with an active immune response, as would be expected in an infected individual. In T cells, a CD4þþCD45RAþCCR7ÀCD25À phenotype was decreased in COVID-positive individuals. A CD45RAþCCR7À T-cell phenotype is canonically associated with an effector or 'effector memory re-expressing RA (TEMRA)' phenotype (Ahlers and Belyakov, 2010;Restifo and Gattinoni, 2013). A decrease in effector T cells in COVID-infected individuals is counter-intuitive; however, the lack of CD25 expression suggests that these cells are not activated (Caruso et al., 1997) and hence not truly effector cells.
It is important to note that for all statistical tests, P-values are not corrected for the number of comparisons. This is because each phenotype assessed is not an independent observation of the data. Instead, a large number of phenotypes are dependent on each other, making it difficult to estimate multiple comparison adjusted P-values. Inflation of type I errors is partially mitigated by the network/clustering approach that uses metrics other than P-values to rank their importance. Regardless, identified phenotypes should be treated as 'discoveries', requiring further validation.
PhenoComb is a powerful tool to add to the cytometrist's analysis toolset. By taking full advantage of the single cell, high-dimensional data generated in cytometry assays, PhenoComb empowers exploratory data analysis and is able to highlight phenotypes for further characterization. Its implementation is also flexible, making it adaptable to work with various single-cell data types.