scCross: efficient search for rare subpopulations across multiple single-cell samples

Abstract Motivation Identifying rare cell types is an important task to capture the heterogeneity of single-cell data, such as scRNA-seq. The widespread availability of such data enables to aggregate multiple samples, corresponding for example to different donors, into the same study. Yet, such aggregated data is often subject to batch effects between samples. Clustering it therefore generally requires the use of data integration methods, which can lead to overcorrection, making the identification of rare cells difficult. We present scCross, a biclustering method identifying rare subpopulations of cells present across multiple single-cell samples. It jointly identifies a group of cells with specific marker genes by relying on a global sum criterion, computed over entire subpopulation of cells, rather than pairwise comparisons between individual cells. This proves robust with respect to the high variability of scRNA-seq data, in particular batch effects. Results We show through several case studies that scCross is able to identify rare subpopulations across multiple samples without performing prior data integration. Namely, it identifies a cilium subpopulation with potential new ciliary genes from lung cancer cells, which is not detected by typical alternatives. It also highlights rare subpopulations in human pancreas samples sequenced with different protocols, despite visible shifts in expression levels between batches. We further show that scCross outperforms typical alternatives at identifying a target rare cell type in a controlled experiment with artificially created batch effects. This shows the ability of scCross to efficiently identify rare cell subpopulations characterized by specific genes despite the presence of batch effects. Availability and implementation The R and Scala implementation of scCross is freely available on GitHub, at https://github.com/agerniers/scCross/. A snapshot of the code and the data underlying this article are available on Zenodo, at https://zenodo.org/doi/10.5281/zenodo.10471063.


B Solver description
The main principle of the scCross solver is to use a beam search to efficiently find a first approximate solution, which serves as initialization to a local search algorithm.The beam search is modified to take advantage of the sample identity of each cell, so as to ensure diversity in the search and avoid getting stuck in a region of the search space that is relative to only one sample.Indeed, an optimal solution to the scCross optimization problem is expected to contain many cells from different samples.However, the beam search starts from clusters containing only a few cells.Therefore, the δ i factor gives a large penalty to all solutions, canceling out its effect during this early phase.The beam search might thus be drawn to a search space region relative to only one sample and get stuck there.
The first adaptation consists in adding an extra initialization step, which aims to find a small solution present in as many samples as possible by combining pairs of cells from different samples.It starts by selecting the 100 best pairs of each sample, which form new surrogate variables for a search where only pairs from different samples are combined (i.e.solutions of this initialization step contain at most 2 cells from the same sample).The best solutions from this initialization step form the starting point of the actual beam search, which explores the search space level by level by adding one cell to solutions from the previous level.
While the initialization step guarantees that the beam search starts with a cluster of cells coming from at least two different samples, the beam search could in some cases only add cells coming from one sample.To avoid filling the beam only with solutions related to one sample, the solver maintains one beam for each sample during the first levels of the beam search.This ensures a larger diversity of solutions if the search is temporarily dominated by sample-specific ones.In the long run, cross-sample solutions will be favored thanks to the effect of δ i , and all beams will tend to contain the same solutions (thus going back to a regular beam search).
Finally, the solution of the beam search serves as initialization for a simulated annealing algorithm, which will stochastically explore its neighborhood in order to improve the quality of the solution and produce the final scCross result.
Figure A1 shows the execution time of scCross evolves linearly with respect to the number of cells.

Figures A2 to A4
show the influence of the values of the κ and µ parameters on the number of cells and genes included in the scCross solution for the cases studies reported in sections 3.3 to 3.5 of the main manuscript.
The general trend that can be observed in these figures is that the number of genes in inversely correlated to  .

05
.10 .10 .20 .20 .01-.20 .01-.20  The results are reported here for µ = 20%, as the default value of 10% yields a solution with only two genes (for the first bicluster).This indicates that the constraint on the maximum number of negative values allowed in the bicluster is probably too strict, motivating the fact to increase the value of µ.
the value of κ.Indeed, whenever κ is increased, the number of genes in the solution decreases.This makes sense given the definition of the scCross optimization problem, as it inflicts a larger penalty on out-of-cluster expression and therefore only keeps the most deferentially expressed genes.
One should note that tuning this κ parameter didn't change the biological interpretation of the biclusters in any of our experiments.One might therefore choose the value of κ based on the number of genes that is deemed appropriate for visualization purposes and/or biological interpretation (e.g. using a Gene Onthology analysis).For instance, we here chose to increase the value of κ whenever the number of genes returned with the default κ value is significantly higher than 60, so as to be able to display the entire bicluster in the figures below (A8 to A15).A Gene Onthology analysis was easily conducted with the obtained number of genes (ranging from 17 to 77 genes).
The µ parameter controls the maximum proportion of negative values that is allowed within the bicluster.By default, it is set to 10%.Increasing this parameter (for instance to 20%) naturally leads to more cells and genes being included in the solution.Indeed, the number of genes increases as genes that contained more negative values (e.g. have higher dropout rates) can now be included.We therefore suggest to increase the value of µ whenever the number of genes returned with µ = 10% is very low (the tabula muris case study gives an example as only two genes are included in the solution).Increasing µ also leads to an increase in the number of cells, as cells that express a significant part of the genes identified with µ = 10% may now be included.We therefore suggest to increase µ whenever a visual inspection of the results shows that some out-of-cluster cells express a significant part (for instance 2/3) of the marker genes.

D Gene prefiltering
The focus of the scCross method is to identify rare subpopulations of cells shared across different samples.However, the scCross optimization problem (equations ( 1)-( 4) in the main manuscript) could in principle identify a large cell population whenever many genes are expressed in nearly all cells.In the limit, all cells could be part of the selected bicluster (J = C) and the second term of equation ( 1) would vanish.scCross therefore filters out initially any gene expressed in more than x% of the cells.This initial filtering is motivated by the search of gene markers of specific subpopulations rather than generic markers of high expression throughout the cell population.
Such a threshold makes sense as the the distribution of genes, in terms of the number of cells they are expressed in, generally follows a power law (figure A5).One typically sets a threshold corresponding to the beginning of the long tail of the distribution.By default, a threshold of x = 25% of the cells is used, which yields good results in all of our experiments as the vast majority of genes clear this bar.Yet, other thresholds could be considered.
Figure A5a shows the genes kept after a 25%, a 33%, and a 50% threshold on the GARP+ Tregs data (see section 3.1 of the main manuscript).Table A1 reports the results of scCross after applying the different thresholds.The same subpopulation of 31 cells is identified in all cases.Only the number of genes vary as some genes that didn't pass the initial filtering are included when increasing the threshold.Yet, these genes have little influence in the solution: the result after a 50% threshold contains 60% more genes than the one obtained after a 25% filtering, but the objective value is increased by only 7%.This makes sense as these genes will have more out-of-cluster expression, which will be penalized by the second term of equation ( 1) and lead to a low contribution to the objective value.Changing the filtering threshold therefore has little influence on the identified cell subpopulation.
Rather than using a fixed threshold, one could take advantage of the multiple run strategy to eliminate large biclusters.Indeed, a first run of scCross could be used to first identify any large bicluster present in the data.One would then filter out the corresponding genes, which are representative of this large subpopulation of cells, to perform a second run of scCross to search for a rare subpopulation of cells.In the limit, one could relax the constraint on the proportion of negative values allowed in the bicluster by setting µ = 100% in this first run to identify the largest possible bicluster.This would be analog to the max-sum submatrix approach (Branders et al., 2019), whose purpose is precisely to identify large biclusters of cells and genes.On the GARP+ Tregs data, this leads to a bicluster containing 4047 genes and all cells.Removing these genes roughly corresponds to a 33% threshold, both in terms of the genes kept (figure A5b) and the subsequent scCross result (fourth line of table A1).Given the computational overhead of performing multiple runs, we favor the pragmatic approach of defining a fixed threshold.
As an alternative to simply filtering out genes, one could use other types of data normalization to search for other kinds of patterns.For instance, considering the opposite of the original expression values would lead to select genes that are underexpressed in the selected cells.Other data normalizations can also lead to other interesting patterns (e.g. the genes departing the most from the median expression values for each cell).We evaluate the first approach on the GARP+ Tregs data by inverting the sign of the expression values of all genes expressed in more than 50% of the cells.Yet, this results in the exact same solution as the one obtained when

E Gene Onthology functions
Tables A2 to A7 report the results of the Gene Ontology analysis for the subpopulations identified by scCross in the different case studies.This analysis has been performed using the clusterProfiler R package.For each one, the 20 most significantly enriched GO terms are reported.A functional interpretation of these terms is included in the main manuscript.

F Cilia Carta genes
Tables A8 and A8 indicate which of the genes characterizing the cilium subpopulation identified in the NSCLC DTCs data have been identified in the Cilia Carta compendium.

G Additional heatmaps
This section contains several heatmaps which augment the ones present in the main manuscript.
Section 3.1: breast cancer GARP+ Treg data 3b in the main manuscript, which only presents genes associated to the 8th bicluster found by MicroCellClust.Figure A6 shows 10 genes for each cluster identified by MicroCellClust.
• • Figure A11 corresponds to the second cluster displayed in Figure 5 of the main manuscript.
• Figure A12 corresponds to the third cluster displayed in Figure 5 of the main manuscript.
Section 3.5: tabula muris • Figure A13 corresponds to the first cluster displayed in Figure 6 of the main manuscript.
• Figure A14 shows the second result of scCross when using the same parameters (κ = 0.044 and µ = 0.2) as for the first result (Figure A13).
• Figure A15 shows the second result of scCross with κ = 0.5, which corresponds to the second cluster of Figure 6 of the main manuscript.

Figure A1 :
Figure A1: Execution time of scCross when sampling datasets of various sizes from the NSCLC DTCs dataset.This experiment was conducted on a MacBook Pro laptop (Mac OS 14.3.1,Apple M1 pro CPU, 16GB RAM).

Figure A2 :
Figure A2: Number of cells and genes identified by scCross in the NSCLC DTCs data in function of the κ parameter (values indicated next to the dots), for two different values of the µ parameter.

Figure A3 :
Figure A3: Number of cells and genes for the three biclusters identified by scCross in the pancreas data in function of the κ parameter, for two different values of the µ parameter.

Figure A4 :
Figure A4: Number of cells and genes for the two biclusters identified by scCross in the NSCLC DTCs data in function of the κ parameter.The results are reported here for µ = 20%, as the default value of 10% yields a solution with only two genes (for the first bicluster).This indicates that the constraint on the maximum number of negative values allowed in the bicluster is probably too strict, motivating the fact to increase the value of µ.

Figure A5 :
FigureA5: Histogram of the genes, in function of the number of cells they are expressed in, for the GARP+ Tregs data.Only genes expressed in at least two different cells are considered.a Genes kept after a 25% (blue), 33% (purple), and a 50% (orange) threshold.b In red, the genes forming part of the scCross solution with µ = 100%, which would be removed in subsequent runs to identify rare subpopulations of cells.
Figure A7 corresponds to Figure 3c of the main manuscript.It shows the result of scCross with all selected genes displayed.Section 3.3: non-small cell lung cancer dissociated tumor cells data • Figure A8 corresponds to Figure 4a of the main manuscript.It show the result of scCross with all selected genes displayed.• Figure A9 corresponds to Figure 4b of the main manuscript.It show the result of scCross on the MNN integrated data.Section 3.4: pancreas data • Figure A10 corresponds to the first cluster displayed in Figure 5 of the main manuscript.

Figure A12 :
Figure A12: Third result of scCross on the pancreas data.Only the 2000 cells with highest objective value are displayed (out of 11 641).

Table A1 :
Results of scCross on the GARP+ Tregs data depending on the chosen preprocessing.

Table A2 :
GO terms for the subpopulation identified in the NSCLC DTCs data (section 3.3 of the main manuscript).

Table A4 :
GO terms for the second subpopulation identified in the pancreas data (section 3.4 of the main manuscript).

Table A5 :
GO terms for the third subpopulation identified in the pancreas data (section 3.4 of the main manuscript).

Table A6 :
GO terms for the first subpopulation identified in the tabula muris data (section 3.5 of the main manuscript).

Table A7 :
GO terms for the second subpopulation identified in the tabula muris data (section 3.5 of the main manuscript).Figure A6: 10 genes with highest contribution to the objective for each cluster returned by MicroCellClust on the breast cancer data.Out of cluster cells are not displayed.Figure A8: scCross result on the NSCLC DTCs dataset.Only the 1000 cells with highest objective value are displayed (out of 26 027).Figure A9: scCross result on the NSCLC DTCs dataset after MNN integration.Only the 1000 cells with highest objective value are displayed (out of 26 027).Figure A10: scCross result on the pancreas data.Only the 1000 cells with highest objective value are displayed (out of 11 641).Second result of scCross on the pancreas data.Only the 500 cells with highest objective value are displayed (out of 11 641).