A cell-level discriminative neural network model for diagnosis of blood cancers

Abstract Motivation Precise identification of cancer cells in patient samples is essential for accurate diagnosis and clinical monitoring but has been a significant challenge in machine learning approaches for cancer precision medicine. In most scenarios, training data are only available with disease annotation at the subject or sample level. Traditional approaches separate the classification process into multiple steps that are optimized independently. Recent methods either focus on predicting sample-level diagnosis without identifying individual pathologic cells or are less effective for identifying heterogeneous cancer cell phenotypes. Results We developed a generalized end-to-end differentiable model, the Cell Scoring Neural Network (CSNN), which takes sample-level training data and predicts the diagnosis of the testing samples and the identity of the diagnostic cells in the sample, simultaneously. The cell-level density differences between samples are linked to the sample diagnosis, which allows the probabilities of individual cells being diagnostic to be calculated using backpropagation. We applied CSNN to two independent clinical flow cytometry datasets for leukemia diagnosis. In both qualitative and quantitative assessments, CSNN outperformed preexisting neural network modeling approaches for both cancer diagnosis and cell-level classification. Post hoc decision trees and 2D dot plots were generated for interpretation of the identified cancer cells, showing that the identified cell phenotypes match the cancer endotypes observed clinically in patient cohorts. Independent data clustering analysis confirmed the identified cancer cell populations. Availability and implementation The source code of CSNN and datasets used in the experiments are publicly available on GitHub (http://github.com/erobl/csnn). Raw FCS files can be downloaded from FlowRepository (ID: FR-FCM-Z6YK).


Introduction
Challenges in the diagnosis and prognosis of blood cancers partially lie in the phenotypic heterogeneity of cancer cells.Even within a leukemia subtype, leukemic cells may be derived from slightly different stages of the normal cell developmental trajectory and therefore express different marker proteins, resulting in phenotypic heterogeneity within and between patient samples.To characterize the cellular phenotypic heterogeneity, single-cell assays are essential.In clinical laboratories, complete blood cell count (CBC), cytogenetics for identifying chromosomal numerical and structural abnormalities, microscopy of bone marrow biopsy, cytology of cerebrospinal fluid, and flow cytometry (FCM) of peripheral blood and bone marrow aspirates are commonly used for leukemia and lymphoma diagnosis.Among these assays, FCM is the most mature single cell analysis technology, supporting identification and quantification of cell surface and intracellular proteins on individual cells.Compared with other single cell assays, FCM is rapid, cheap, and sensitive for detecting and monitoring phenotypic differences in cancer cells.Besides profiles of cellular marker expressions, proportions of the cancer cells within a specimen (i.e.cancer burden, an important measure for optimizing treatments and prognosis) can also be quantified from the analysis of FCM data.As a result, FCM immunophenotyping is routinely used in diagnosis and prognosis of blood and lymphoid cancers (Diamond et al. 1982, Terstappen et al. 1992, Stetler-Stevenson and Braylan 2001, Stetler-Stevenson et al. 2001, Weir and Borowitz 2001, Kaleem et al. 2003, Kern et al. 2003, Freeman et al. 2013, Chiaretti et al. 2014, Rawstron et al. 2018).It is also widely used to identify abnormal cell populations in association with nonneoplastic diseases including asthma, allergy, and autoimmunity (Ebo et al. 2005, Boumiza et al. 2005, Wei et al. 2007, Smyth et al. 2010, Wolff et al. 2010, Lazarus et al. 2012, Irvin et al. 2014).
Due to significant advances in cytometry instrumentation and reagent technologies since the 2000s (Nolan and Condello 2013, Spitzer and Nolan 2016, Park et al. 2020, Jaimes et al. 2022), applications of cutting-edge machine learning (ML) approaches began to emerge in the recent decade for addressing the increasing volume and complexity in this high-content cytometry data (Mair et al. 2016, Saeys et al. 2016, Hu et al. 2021).Automated gating analysis (autogating), which identifies cell populations using unsupervised clustering methods or recapitulates the manual identification gating process using supervised learning, in either the original or transformed feature space (Lo et al. 2008, Pyne et al. 2009, Qian et al. 2010, Zare et al. 2010, Aghaeepour et al. 2011, Qiu et al. 2011, Ge and Sealfon 2012, Amir et al. 2013, Finak et al. 2014, Naim et al. 2014, O'Neill et al. 2014, Shekhar et al. 2014, Levine et al. 2015, Malek et al. 2015, Van Gassen et al. 2015, Samusik et al. 2016, Li et al. 2017, Lee et al. 2018, Lux et al. 2018, Meehan et al. 2019), represents the largest category of these methods.To use these auto-gating methods for diagnosis, a separate disease classification step is required.The accuracy of the classification relies on the autogating step to identify cell populations in a complete and accurate way, which can be challenging for blood cancer applications.A second category of methods focuses on identifying cell-based biomarkers from the FCM data by extracting statistical features from cell-level expression patterns and comparing them between samples (Bruggner et al. 2014, Lun et al. 2017, Hu et al. 2018b, Weber et al. 2019, Greene et al. 2021, Yue et al. 2022).The identified biomarkers are selected to be statistically different between cohorts but may not be biologically meaningful cell populations with distinct phenotypes.These methods focus on identifying cohort-level differences and are not designed for or dependent on the accurate identification of (cancer) cells in each individual sample.A third category of methods makes use of representation learning models, such as neural networks, to bypass the feature extraction step.Some of these methods are shown to be effective in predicting the cancer diagnosis (Ko et al. 2018, Monaghan et al. 2022) but the predicted diagnosis is difficult to interpret and validate without identifying the cancer cells themselves.Another group of methods in this category has been applied to nonneoplastic diseases for predicting the sample diagnosis while identifying the diagnostic cell populations (Arvaniti and Claassen 2017, Hu et al. 2018a, Hu et al. 2020).However, the identified cell populations were not validated due to the lack of cell-level labels and it remains unclear whether these methods can effectively identify cancer cell populations with phenotypic heterogeneity.
Here, we define the problem to be solved as follows.Given a set of preexisting clinical FCM samples with diagnostic labels, with cancer burden being optionally available (as identified by expert manual or automated gating analysis), can we predict the diagnosis label of a new FCM sample from the same reagent panel and identify the diagnostic cells simultaneously, while retaining interpretability of the identified cancer cells.Addressing this problem requires the simultaneous optimization of cell population identification (biomarkers) and sample-level classification (diagnosis).In previous work (Ji et al. 2020), we showed that the simultaneous optimization can be achieved for diagnosis of chronic lymphocytic leukemia (CLL), by adapting gradient descent optimization for identifying a global optimal gate to maximize classification accuracy.In this article, we hypothesize that a density-based discriminative point set model using backpropagation can address the simultaneous optimization, without requiring initial gating.
Specifically, we developed an end-to-end differentiable representation learning approach-Cell Scoring Neural Network (CSNN)-that learns the density distribution of the cellular expressions on all markers and makes diagnostic predictions based on aggregating cell-level scores into sample-level predictions.In parallel, the sample-level information, including the diagnostic labels and the density patterns, is backpropagated to the cell level for identifying the diagnostic cell population(s).Based on the possible availability of cancer burden information, we developed two versions of CSNN: CSNN-Class (classification), which requires only diagnostic labels in the training data and CSNN-Reg (regression), which makes use of the additional sample-level cancer burden information to improve the prediction at the single cell level.We applied both CSNN modeling methods on two independent datasets for diagnosis of CLL (provided by University of California, San Diego) and B-ALL (B-cell lymphoblastic leukemia, provided by Stanford University).We compared the performance of the resulting models with two relevant representative deep learning modeling approaches, CellCNN (Arvaniti and Claassen 2017) and DeepCellCNN (Hu et al. 2020), assessing their clinical utilities regarding: (i) accuracy of diagnostic prediction, (ii) interpretability of the identified leukemic cell populations and their phenotypic heterogeneity, and (iii) accuracy of the identified cancer burden.To confirm that the CSNNidentified diagnostic cell populations are those that differ between the cancer and noncancer samples, we designed and performed independent data clustering analysis to identify the clusters of cells that can only be found from the cancer samples and compared them with the CSNN-identified cancer cells.To interpret the CSNN-identified phenotypic heterogeneity of the cancer cells, we constructed post hoc decision trees based on the predicted cell-level labels, enumerated all leukemic cell phenotypes along the tree paths, and compared them with the known cancer endotypes observed clinically in patient cohorts.The leukemic cell populations identified in individual samples are then visualized in traditional 2D dot plots for straightforward hematopathology review.

Notation
We consider N individuals (e.g.patients) where each individual i; 1 i N, is represented by an FCM sample X i .Each sample X i consists of a set of multi-dimensional vector measurements, where each vector x i;j in the sample corresponds to a single cell, i.e.X i ¼ fx i;1 ; . . .; x i;j ; . . .; x i;ni g, where j is an index of cells in sample X i .n i indicates the number of cells in sample X i , since in flow cytometry the number of cells can vary across samples i.Each dimension of vector x i;j corresponds to the expression of an FCM marker.
We assume a target is available for each subject i, provided (e.g.) by human experts based on manual evaluation of the FCM sample X i .In this article, we will consider two different types of targets.The first type of target is a real-valued target y i , taking values between 0 and 1, indicating the disease burden for sample i, i.e. the proportion of cells that are estimated to be pathogenic for that sample, with y i ¼ 0 for healthy samples and y i > 0 for samples diagnosed with the disease condition.The second type is a binary label y L i , taking the value 0 or 1, indicating a disease diagnosis for sample i, and where y L i ¼ 1 indicates disease presence for sample i.
At the cell level, let z i;j be a cell-level binary variable where z i;j ¼ 1 indicates that the cell is pathogenic and z i;j ¼ 0 indicates that a cell is nonpathogenic (present in healthy conditions).We will assume that cell-level labels are not available in the training data, i.e. that the z i;j value for cell j for patient i is unknown.
An important aspect of our overall approach is to be able to predict, for patient i, both the cell-level binary variables z i;j and the sample-level disease burden y i or disease diagnosis y L i .More specifically, we estimate both: 1) cell-level scores s i;j , in the form of conditional probabilities s i;j ¼ Pðz i;j ¼ 1jx i;j Þ, i.e. the probability that a particular cell is pathogenic, given marker measurements x i;j ; and 2) sample-level predictions of (i) real-valued burdens y i , or (ii) probability of disease Pðy L i ¼ 1jX i Þ, where these predictions are a function of the cell-level scores s i;j .

A cell-scoring neural network for disease prediction
Our goal is to construct a predictive model that takes a set of cell-level vectors for a sample, X i ¼ fx i;1 ; . . .; x i;j ; . . .; x i;n i g, and produces a sample-level prediction of y i (either a realvalued burden or a probability of a binary label).A challenge in this context is that predictive modeling techniques in statistics and machine learning typically assume a fixeddimensional vector representation as input to a model, rather than sets of vectors X i of varying sizes across i.This is the case with flow cytometry, as the number of recovered cells can be variable along subjects.
To handle this issue, we use the following two-step approach (Fig. 1).In the first step we map each cell-level vector x i;j to a scalar-valued cell-level conditional probability score s i;j ¼ Pðz i;j ¼ 1jx i;j Þ where the mapping s i;j ¼ sðx i;j ; /Þ has learnable parameters /, parametrized via a feedforward neural network, which we refer to as a Cell Scoring Neural Network (CSNN).Using this cell-level mapping, each sample X i ¼ fx i;1 ; . . .; x i;j ; . . .; x i;n i g can then be represented by a set of cell-level scores S i ¼ fs i;1 ð/Þ; . . .; s i;j ; . . .; s i;ni ð/Þg, where each score indicates how likely it is that a particular cell i, j is pathogenic.Note that the cell-level scores s i;j ð/Þ depend implicitly on the cell-level data vectors x i;j ; we suppress this dependence on x i;j in the notation for simplicity.
In the second step, to predict real-valued burden targets y i , we aggregate the cell-level scores by averaging, i.e.
j¼1 sðx i;j ; /Þ, representing an estimate of disease burden for sample X i .
To predict the probability of a binary label y L i we define Pðy L i ¼ 1jX i Þ to be a logistic function, i.e.
where a and b are learnable parameters of the logistic function.
Note that the two types of predictions have different interpretations.The burden prediction y i can in practice take values quite close to 0 for patients who have a disease diagnosis (e.g. a patient could have as few as 0.01% pathogenic cells and still have the disease).On the other hand, the conditional probability estimate, Pðy i ¼ 1jX i Þ, can be interpreted as having a threshold at 0.5, i.e. if Pðy i ¼ 1jX i Þ > 0:5 then individual i is more likely to have the disease than not (and vice versa).The logistic parameters allow for accommodation of this difference between predicting burden level and predicting likelihood of disease presence.
A key feature of our approach is that we use information at the sample-level (the real valued burdens y i or the binary labels y L i ) to learn the mappings for the cell-level scores (the s i;j 's).In particular, for real-valued burden targets we pursue a regression approach and minimize a weighted mean-square error loss function: where k > 1 is a hyperparameter that upweights the second term to encourage the model to push the predictions (burden estimates) for healthy individuals to be close to 0. For binary labels y L i 2 f0; 1g, we estimate the parameters by minimizing the standard binary cross-entropy objective function used in classification modeling (Hastie et al. 2001): We learn / for L MSE (and simultaneously, /, a and b for L LL ) by using standard gradient descent optimization methods.In what follows, we refer to the first approach above (with realvalued burdens and squared error functions) as CSNN-Reg (for cell scoring neural network regression), and the second approach (with binary labels and log-loss functions) as CSNN-Class (for cell scoring neural network classification).
To represent the cell-level mappings, x i;j !s i;j ð/Þ, for both CSNN-Reg and CSNN-Class, we use a flexible function approximator in the form of a multi-layer feedforward neural network.In particular we use a ReLU activation function in the intermediate hidden layers and a sigmoid (softmax) function, gðzÞ ¼ 1=ð1 þ expðÀzÞÞ as the activation function at the output layer so that the model's output per cell is constrained to lie between 0 and 1.Additional details on network architectures and hyperparameter settings for optimization are provided in Supplementary Appendix S1.
CellCNN (Arvaniti and Claassen 2017) and DeepCellCNN (Hu et al. 2020) are two existing methods that are comparable to a CSNN since they use neural networks that produce multiple scores per cell, which are then aggregated for sample-level classification.In these methods, the scoring neural network is replaced by a neural network that maps every cell to a vector in a latent space, rather than a single score.These vectors are then averaged together, into a final feature vector which is an abstract representation of the sample.This feature vector serves as a summary of the samples, but is difficult to interpret by humans.
In contrast, a distinct advantage of the interpretable celllevel scores produced by the CSNN models is that for any cell, for a particular sample, we can explore and interpret where pathogenic cells are located in marker space, e.g. by visualizing and highlighting what regions in marker space have scores s i;j ¼ sðx i;j ; /Þ above a particular threshold.In the case Cell scoring neural network of the logistic method, the score is defined as s i;j ¼ signðbÞsðx i;j ; /Þ.We discuss how the scores of individual cells are related to the concept of manual "gating" in Supplementary Appendix S2 and we also provide illustrations of how this can support biologically meaningful discovery with real FCM datasets later in the article.

Initializing CSNN models using density estimates
The quality of the learned CSNN models can be improved by using information related to the densities in marker space to initialize the models.We begin by generating initial estimates of cell-level scores s 0 i;j for each cell j in each sample i in the training data.Consider first the case of real-valued y i targets (i.e.sample burdens).For samples with y i ¼ 0, all cell-level scores are zero by definition, i.e. s 0 i;j ¼ 0, assuming that all cells in healthy samples are nonpathogenic.For samples with a disease diagnosis (y i > 0) Bayes' rule is used: We can estimate each of the three terms on the right-hand side from the training data as follows.The term in the denominator, Pðx i;j jy i > 0Þ, is the marginal probability density function (PDF) of marker measurements for sample i, given that sample i has a disease diagnosis: this PDF can be estimated straightforwardly using kernel density estimation (KDE).The first term in the numerator, Pðx i;j jz i;j ¼ 0; y i > 0Þ is the probability density for nonpathogenic cells in a sample with a disease diagnosis.We can approximate this by assuming that Pðx i;j jz i;j ¼ 0; y i > 0Þ ¼ Pðx i;j jz i;j ¼ 0Þ ¼ Pðx i;j jz i;j ¼ 0; y i ¼ 0Þ, i.e. that the PDF of nonpathogenic cells in marker space is the same in both positive and negative samples.We further assume that the density Pðx i;j jz i;j ¼ 0; y i ¼ 0Þ does not vary from sample to sample, allowing us to pool all x i;j measurements from all the negative samples (which have y i ¼ 0 and z i;j ¼ 0 by definition) and again use KDE to estimate this density.The second term in the numerator, Pðz i;j ¼ 0jy i > 0Þ, is equal to 1 À y i , under the assumption that y i (the sample burden) corresponds to the fraction of cells in sample i that are pathogenic.For the situation where the training data only contains binary labels y L i 2 f0; 1g, i.e. the CSNN-Class model, we again estimate scores s 0 i;j via Bayes rule, but using additional approximations for each of the three required terms in the absence of known real-valued burdens y i .
Finally, given scores s 0 i;j from the density-based approach above (for all cells for all samples in the training dataset), a feedforward neural network, parametrized by weights / 0 , is trained to create an initial cell-level model that can approximate the density-based scores via the neural network.The trained weights / 0 of this neural network are then used to initialize the weights in training of an end-to-end CSNN network (CSNN-Reg or CSNN-Class, using L MSE and L LL respectively as described earlier).We found in practice that this density-based initialization significantly improves the quality of the final sample-level disease predictions.Full details on KDE methods, approximations for binary y i labels, and training of the initial network, are provided in Supplementary Appendix S5.
Note that one could in principle use the density-based approach alone to build a sample-level prediction model, by using the density-based models from the training to generate scores s 0 i;j per cell for a new sample.A prediction for y i could then be based, e.g. on thresholding the sum of cell-level scores P j s 0 i;j .However, a purely density-based approach may be sensitive to modeling assumptions, whereas the sample-level discriminative training of the CSNN can allow the model to further tune the initial parameters / 0 to produce scores s i;j that are directly optimized for robust prediction of burden or likelihood of disease.
3 Experimental results

Datasets
Two independent FCM datasets were used in evaluating the performance of the CSNN modeling methods.The first dataset (DS1) was provided by the University of California, San Diego (UCSD) Center for Advanced Laboratory Medicine (CALM) diagnostic lab that was collected and analyzed for the identification of Chronic Lymphocytic Leukemia (CLL) cases using their standard diagnostic protocol.DS1 includes FCM data from 288 subjects-186 diagnosed as CLL and 102 judged to be non-CLL by the hematopathologist.For each subject, two reagent panels, PB1 and PB2 were used in the clinical FCM assay on peripheral blood (PB) samples.Each panel contained antibodies for the detection of 10 markers (fluorescence parameters): PB1: CD3, CD5, CD10, CD19, CD22, CD38, CD43, CD45, CD79b and CD81; PB2: Anti-Ig-lambda, Anti-Ig-kappa, CD5, CD7, CD19, CD20, CD23, CD38, CD49d and FMC-7.Datasets from each panel also included 6 scatter parameters: forward scatter (FSC)area(A)/height(H)/width(W) and side scatter (SSC)-A/H/W.
The second FCM dataset (DS2) was provided by the diagnostic lab at Stanford University (Stanford) for the identification of leukemic cells (blasts) of B-cell Acute Lymphoblastic Leukemia (B-ALL) using their standard diagnostic protocol.The samples in DS2 are bone marrows, which consist of both diagnostic samples and samples collected from patients who have received CD19-targeted CAR (chimeric antigen receptors) T-cell therapy.DS2 includes FCM data from 178 subjects: 50 diagnosed as B-ALL and 128 judged to be non-B-ALL.The FCS files of DS2 are from one reagent panel detecting nine markers: CD66b, CD22, CD19, CD24, CD10, CD34, CD38, CD20, CD45, and two scatter parameters (FSC/SSC).
Research on both datasets was approved by Institutional Review Boards of the respective institutions (UCSD and Stanford).Both datasets have been fully de-identified before being transferred and analyzed using the proposed neural networks.Diagnostic labels of DS1 samples were provided by UCSD.Cancer burdens for each DS1 sample were obtained by applying the DAFi automated gating analysis (Lee et al. 2018) on the de-identified FCS files, following the gating strategy used in the diagnostic lab (Supplementary Fig. S1).Specifically, DAFi examined the property of all markers, focusing on identifying the CD5 þ CD19 þ CD10 À CD79b dim CLL cells in PB1 by data clustering analysis, according to our previous study (Scheuermann et al. 2017).For DS2, both the diagnoses and the cancer burdens were provided by Stanford from expert manual gating analysis.

Training and testing sets
To assess the performance of the ML modeling approach developed, each dataset was divided into separate subsets for training (hyperparameter tuning and model training) and testing.For each dataset, the training sets consisted of initial batches of training samples provided by the clinic (183 samples for CLL, 60 for B-ALL), and the test sets consisted of batches of samples for different patients for each dataset that were provided later in time by the clinic (105 samples for CLL, and 118 for B-ALL).For both models, using the 183 and 60 training samples, respectively, hyperparameter values were optimized based on average test set performance for 5-fold cross validation runs, where the cross-validation partitions were stratified by class.

Quantitative assessment of classification accuracy
The performance of CSNN-Class and CSNN-Reg methods on the CLL and B-ALL datasets were compared with with two machine learning methods recently reported in the literature, CellCNN and DeepCellCNN.For each method hyperparameter optimization was performed by learning parameters on the training subsets and selecting the best model parameter setting based on the area under the receiving operating characteristic curve (AUROC) obtained with the validation subsets.We then retrained the best performing model on a sample containing all the samples in the training set and validation set and determined the final performance on unseen test subsets.
Hyperparameters evaluated for all models included the learning rate and an architecture search, with specific searches for w (CSNN-Class), k (CSNN-Reg) and the dropout rate (CellCNN).The specific grid values tested for these can be found in Supplementary Appendix S1.Using the best hyperparameters for each model, we then trained 20 more models of each type with different random initializations to measure their variance with respect to initialization.In order to filter out the models that were initialized incorrectly, we ran each training loop with five restarts and evaluated them with the testing set.We then picked the best run out of those 20 to report the ROC graphs in Fig. 2. Overall, the proposed CSNN-Reg and CSNN-Class produce higher AUROC scores than either CellCNN or DeepCellCNN, except in B-ALL where CellCNN beats CSNN-Reg on B-ALL by 0.2%.CSNN-Reg is superior to all methods on both datasets, including the K-means approach described in Supplementary Appendix S4, indicating that the sample-level burden information (as used by this method) carries additional value beyond sample-level binary labels (as used by the other three methods).In addition, both of the CSNN models are more robust than the other methods in that they have lower variance (than the other methods) across weight initializations during model training.
In addition to performance on sample-level diagnosis, the CSNN-Reg method was also assessed for its ability to determine leukemic cell proportions on both the B-ALL and CLL datasets.Excellent correlation between the predicted proportions and the proportions reported by the diagnostic lab is observed for the CLL dataset (Fig. 3c and d).A similar correlation between predicted and reported leukemic cell proportions was observed for the B-ALL dataset (Fig. 3a and b).However, it should be noted that the set of samples that were randomly picked for testing did not contain any samples with proportions in the 0.2-0.8range and therefore the performance of CSNN-Reg on the B-ALL dataset in this range is determined by interpolation.Ablation tests were conducted to evaluate whether the method performance is improved by the density difference initialization and the post initialization finetuning.Results of the ablation tests can be found in Supplementary Appendix S3.

Biological interpretation of the identified cancer cells
To confirm the cell phenotypes identified by the CSNN and baseline methods, we highlight the CSNN-identified pathologic cells on the key 2D dot plots for visual examination and interpretation.We use the term "pathologic" to refer to cell populations that are related to the leukemic state, which can include both the leukemic cells themselves and any reactive "normal" cell population elicited by the presence of a leukemia in the patient that could be equally diagnostic and prognostic.Visual examination of the distributional shapes of antigen expression and locations of the identified pathologic cell populations in 2D dot plots was used to determine: (i) if the CSNN-identified pathologic cell populations are found in the same regions of the UMAP plots across positive samples that are absent in the negative samples, (ii) if the location of the CSNN-identified pathologic cell populations in the original 2D dot plots matches the known leukemic cell phenotypes, and (iii) if all known leukemic cell populations seen in the original 2D dot plots are successfully identified by the CSNN models.
Figure 4a and b shows the pathologic cells (in yellow) identified in the CLL dataset by the CSNN-Reg model from four representative samples (results for all CLL positive samples can be found in Supplementary Fig. S2) in both UMAP and original 2D dot plots.The CLL cell populations are highlighted across key 2D dot plots that cover all the important surface protein markers used in the reagent panel (Fig. 4b).This representative sample set consists of a negative case (row #1, Fig. 4b), positive cases with (row #2) and without (row #3) normal B cells, as well as CD38-negative (rows #2-3) and a mixture of CD38-negative and CD38-positive (row #4) CLL cases to illustrate within and between sample the CLL phenotypic heterogeneity.Many studies have previously reported the important role of CD38 in CLL prognosis (Mainou-Fowler et al. 2004, Matrai 2005, Pittner et al. 2005, Schroers et al. 2005).Figure 4a and b clearly shows the capability and accuracy of the proposed CSNN model for Figure 1.Sample-level predictions of real-valued burdens y i from cell-level scores.Each marker vector is evaluated by the CSNN and given a score sðx i;j ; /Þ that corresponds to P ðz i;j ¼ 1jx i;j ; /Þ.For the CSNN-Reg model the average of these scores, s i , becomes the real-valued burden prediction for sample i.For the CSNN-Class model, sample-level probabilities P ðy L i ¼ 1jX i Þ, for binary labels y L i , are generated using a logistic transformation of s i .
Cell scoring neural network identifying these important CLL phenotypic endotypes.Without being informed that the typical CLL phenotype is CD5 þ CD19 þ , the major pathologic cell populations identified by the CSNN-Reg were found to be CD5 þ CD19 þ .Without using clustering analysis to define cell populations up front, CSNN still successfully identified the cell populations with natural antigen expression distributional shapes and did not mix the normal CD19 þ B cells with the abnormal CD5 þ CD19 þ CLL cells (row #2, Fig. 4b), which can be difficult for traditional gating methods to cleanly separate.We also observed that the CSNN models do not require a prefiltering step needed by some existing methods to filter out debris/dead cells/doublets, such that any novel cell subsets in individual samples that differ between the CLL and non-CLL cohorts can be potentially identified and used for classification.
The same visual assessment performed on the B-ALL dataset, highlighting the CSNN-identified pathologic cells from four representative samples (Fig. 4c and d), including a negative sample (row #1, Fig. 4d), a typical CD19 þ CD34 þ B-ALL sample (row #2), an atypical CD19 int CD34 À B-ALL case (row #3), and a B-ALL sample with at least three subtypes of blasts (row #4).Results from all B-ALL positive samples on CD19 versus CD34 can be found in Supplementary Fig. S3.Expression of CD19, CD34, CD10, and CD20 in the B-ALL samples illustrates the phenotypic heterogeneity observed in the B-ALL blasts.These B-ALL phenotypic endotypes can be extremely challenging to identify using a single gate in manual gating analysis due to this sample-to-sample heterogeneity.
The CSNN model identified these phenotypically heterogeneous B-ALL blasts at the single cell level in a data-driven fashion, by comparing all of the individual samples in the B-ALL positive and B-ALL negative cohorts, without requiring cell-level labels in the training data.The UMAP visualization (Fig. 4c) confirms that these blasts are only found in the B-ALL positive samples but absent in the negative sample.

Qualitative assessment and visual comparison of the results identified across competing methods
The results of the four different methods (CellCNN, DeepCellCNN, CSNN-Reg, and K-means) were visualized and compared for their identification of cancer cell phenotypes (not quantitatively but qualitatively).The K-means clustering approach is an ad hoc approach developed by us to identify cell clusters that can only be found in the cancer samples (method design can be found in Supplementary Appendix S4), whose results are more interpretable than the other two black-box baseline methods.The complete set of 2D dot plots for visual comparisons of results of CSNN-Reg with the two baseline methods on all CLL and B-ALL samples can be found in Supplementary Appendix S6. Figure 5 shows the results of CellCNN (row #1), DeepCellCNN (row #2), CSNN-Reg (row #3), and K-means (row #4) on selected CLL (case #22936, Fig. 5a) and B-ALL (case #134, Fig. 5b) cases across different 2D dot plots (columns).The 2D plots of the baseline methods were generated using a probability cutoff ¼ 0, because a cell should not be counted as a nonleukemic cell if it increases the likelihood of the sample being positive, and vice versa.Figure 5 shows that both CellCNN and DeepCellCNN identified only a small number of leukemic cells using the probability cutoff ¼ 0 (or wrong phenotypes, as shown in Supplementary Fig. S6), indicating that the probability values output by the baseline methods cannot be directly used to predict whether a cell is leukemia-related or not, which is not too surprising, given that these two baseline methods were not designed for cell-level classification for blood cancer diagnosis.

Interpretation of the neural network classification model using decision trees
In order to understand how CSNN was able to identify the heterogeneous leukemia-related cell subsets in individual samples, all B-ALL samples were pooled to construct a global decision tree to illustrate the classification paths of the cells in the pooled sample, based on the CSNN-output labels at the single cell level.Trees were then generated for each individual sample by calculating the cell-level statistics of the sample following the classification structure of the global tree, which preserved the tree layout for result comparison and interpretation across individual samples.
Using decision trees to interpret neural network analysis results is not new and has been discussed previously (Frosst andHinton 2017, Hu et al. 2020).However, Fig. 6 shows that a tree-based classifier can be adapted for not only interpreting the sample-level classification but also illustrating the phenotypic heterogeneity of B-ALL cells in individual patient samples.Three representative B-ALL positive samples were selected for visualizing the tree-based classification paths derived from the CSNN-identified cell-level labels side by side with the 2D dot plots that highlight these B-ALL cells (Fig. 6): case #134 (top left) is a CD19 À B-ALL example, case #211 (top right) is a CD19 þ B-ALL example, and case #166 (bottom) is an example that contains a mixture of both CD19 negative and positive leukemia-related cells.Each node in the tree corresponds to a protein marker used in the reagent panel.The root of the tree is automatically selected during the tree construction process as the most informative feature for classification.In the pooled B-ALL sample, CD19 was identified as the root of the tree, which matches with our understanding of the patient cohort, in which some have been treated with the CD19-targeted CAR-T therapy and therefore the leukemic cells that have remained following therapy have lost expression of CD19.Our decision tree model, derived from the CSNN output at the single cell level, successfully identified the two major B-ALL subtypes in the patient cohort: naive CD19 þ B-ALL and treatment-related CD19 À B-ALL.
Indeed, each path in the tree-based model leading to a B-ALL positive leaf node corresponds to a distinct B-ALL phenotype, potentially defining B-ALL endotypes (tree paths highlighted in red, Fig. 6).In case #134, while all of the B-ALL cells are CD19 negative, the CD19 À B-ALL cell phenotypes can be further subdivided based on CD10, CD34, and CD66b expression.For case #121, while all of the B-ALL cells are CD19 positive, they can be further subdivided based on CD45 and CD38 expression.The B-ALL cells in case #166 consist of three major subtypes: Navigating along the decision tree provides an exploratory capability of identifying both known and novel leukemia-related cell phenotypes in individual samples, in a data-driven exhaustive way.
Finally, each tree path can also be interpreted as a manual gating strategy, with the marker expression cutoff values identified at the tree nodes defining the gating boundaries in the original marker space.In case #134 (Fig. 6a), 1266 on CD19 (highlighted in black dotted line) is the cutoff for dividing the cells into CD19 þ and CD19 À .Visualization of the 2D dot plots confirms that the CD19 cutoff follows the natural boundary of the CD19 expression distribution.Similarly, the cutoffs of 2338 on CD10 and 938 on CD66b for case #134   Cell scoring neural network global cutoff.The interpretation of these results is 2-fold.First, the cell-level labels output by CSNN can be used to derive an accurate global cutoff for visual interpretation and validation.Second, the sample-level CSNN results provide for sample-specific cell classification that may deviate from the global cutoff identified in the pooled decision trees.This suggests that it could be very challenging to identify the B-ALL cell populations across all samples using traditional manual gating analysis.The same phenomenon was observed in case #166 on CD19 where the data and plots clearly show that there exists phenotypic heterogeneity of B-ALL within and between individual samples.While a global cutoff can be precisely identified based on the pooled data, the leukemic cells in individual samples needed to be identified using a "local gate" as predicted by CSNN.In clinical practice, the cutoffs and 2D dot plots output by CSNN along with the tree-based classification paths can be combined and converted into manual gating strategies for explainable validation by hematopathologists.

Additional findings
When examining 2D dot plots across individual samples, we noticed that CSNN was able to identify the leukemia-related cells with distinct atypical phenotype, which could be useful for cancer precision medicine.Supplementary Figure S4 shows B-ALL sample #40 side by side with other five other B-ALL samples on plots of CD34 versus CD38, in which few typical CD34 þ B-ALL cells are observed in Sample #40.However, the hematopathology report listed that the cancer burden of Sample #40 is 86.19%.Supplementary Figure S5 compares sample #40 with the typical B-ALL case #211, where CSNN identified an atypical CD19 int CD34 À leukemiarelated cell population from sample #40 at 83.8%, matching the hematopathology review result.cellCNN and DeepCellCNN could not identify this cell subset in Sample #40.
Another finding from reviewing the 2D dot plots is that CSNN was able to capture cell populations with natural protein expression distributions, similar to what unsupervised clustering analysis can do.In contrast, manual gating analysis, decision-tree classification, and statistical biomarker identification methods often involve abrupt expression cutoffs that do not reflect natural expression gradients.It is important to note that the models produced by CSNN do not generate or rely on any geometric shape of gates but identify the leukemic cell populations as continua.The CSNN models can identify multiple fine-grained (hyper)regions that differ between the cancer and noncancer cohorts, which allow the identification of complex classification patterns not easily captured through sequential gating methods.
A third finding is that the experiment results of CSNN show clear improvements in tagging cell-level labels, as it explicitly model whether each cell contributes to a sample-level classification as having leukemia or not.In contrast, other approaches such as DeepCellCNN (Hu et al. 2020) define the label of a cell as a product of its classification, by amplifying the cell to be 5% of a sample, followed by calculating the difference in the classification likelihood, resulting in a less robust heuristic, as cancer heterogeneity cannot be explained through amplifying the same single cell.

Model limitations and future extensions
A challenge that arises when evaluating the performance of these algorithms is the lack of ground truth annotations at the cell level.Although manual gating analysis can generate cell type labels of individual cells, they are only for known cell types and their precise accuracy is questionable due to the subjective manual operation.The only reliable evaluation metric is the classification error of the whole sample, based on the diagnostic labels of each sample.Therefore, to assess cell level classification we rely on visual examination of the cell populations on the original 2D plots in order to confirm that they match the known leukemic cell phenotypes.To improve this situation, we designed an independent data clustering approach (Supplementary Appendix S4) to identify the cell clusters that can only be found in the cancer samples.This allows us to compare and qualitatively confirm that the pathologic cells identified by CSNN are leukemia related.As CSNN is a probabilistic model, the model assigns a probability to each cell of being leukemia related.For nonleukemic cells, the probability values can be extremely low, usually around 1%; however, they are seldom equal to 0.
As the probability of a cell being a leukemic cell is usually nonzero, the above equation will return a nonzero value, even when the model has not found any cells that are likely to be related to the leukemia.In this case, a small number of cells may be classified as leukemic using a discrete threshold even for a noncancer sample, as long as the diagnosis of the sample is correctly predicted by the model.
As the focus is on minimizing a global objective, the training algorithm might overlook small ( < 1%) populations of cells in favor of correctly classifying the majority of the samples.This issue prevents the model from identifying specific cell populations that are found in only a small number of samples.Similarly, the model will have difficulty classifying samples that have very small numbers of cancer cells, e.g.minimum residual diseases (MRD), especially if these MRD samples are not included during training.A localized training loss objective designed specifically for identifying MRD samples could help solve this problem.The proposed model could also benefit from identifying and modeling cell populations as a hierarchy.Then each cell population, instead of individual cells, could be scored.Most machine learning models are discriminative, without modeling the cell populations in a generative way, as such, can be hard to interpret.By separating each cell into a cluster and then classifying these clusters individually, both performance and interpretability of the model may improve.
An immediate next step is discrepancy analysis.For discrepant predictions for sample diagnosis and cancer burden, we will need to plot the identified leukemic cells on the original sequential gating paths for hematopathology review.For each false positive case, we plan to investigate whether the subject eventually develops leukemia at a later time point when further clinical data are available and approved for research use.
There are also potentially useful directions for future work related to feature dimensionality and feature selection.Our proposed approach is currently well-suited to the relatively low-dimensional marker configurations of traditional flow cytometry datasets.However, the high-dimensional nature of modern single-cell assay technologies, such as scRNA-seq, will likely require extensions of our approach that can incorporate lower-dimensional latent embeddings of highdimensional measurement spaces.Even with low-dimensional sets of markers there is practical motivation to extend the proposed approach to allow for automated identification of a minimal set of highly informative markers (minimal panel) that are sufficient for sample-level diagnosis.The identification of a minimal panel needed for machine diagnosis will provide insights for improving the design of the reagent panels.In addition, extending and assessing the proposed modeling approach for applications to diagnosis of nonneoplastic samples is also an interesting future work.

Conclusion
The most important feature of the CSNN model is its capability of simultaneously predicting the diagnosis of a sample and identifying pathologic cells, even with phenotypic heterogeneity.Existing machine learning methods for FCM data analysis either are not designed for blood cancer diagnosis or do not identify and validate the cancer cells from individual samples for result interpretation.In order to demonstrate this capability and assess the performance of CSNN, we designed a suite of interpretation and validation approaches for comparing the CSNN results to independent clustering analysis, known patient endotypes, diagnostic labels, and expert-identified cancer burden, in addition to hematopathology review of the CSNN-identified cancer cells on original 2D dot plots.Using two independent experiments on CLL and B-ALL, we showed the superiority of CSNN over the existing representative neural network modeling approaches for blood cancer diagnosis.The proposed neural network model is generally applicable to other types of discriminative single cell data analysis.

Figure 2 .
Figure 2. Model performance of sample-level classification: (a, b) the average of the testing scores for each model class on the B-ALL (a) and CLL (b) datasets, with the error bars representing the standard deviation between the N testing scores.(c, d) The ROC curve for the model with the highest scoring training AUC for each model class out of the N tests on the testing set of the B-ALL (c) and CLL (d) datasets.

Figure 3 .
Figure 3.Comparison of machine learning-and manual gating-derived leukemic cell proportions-using the best finetuned version of CSNN-Reg, the proportion of leukemic cells produced by the best finetuned version of CSNN-Reg and expert manual gating (corresponding to y i used during training) are compared the B-ALL (a, b) and CLL (c, d) training (a, c) and testing (b, d) data subsets.

Figure 4 .
Figure 4. Visual assessment of the identified pathologic cells by CSNN-Reg-pathologic cells (colored lighter [yellow]; nonpathologic cells are colored darker [blue]) identified in four representative samples from the CLL (a, b) and B-ALL (c, d) datasets.The UMAP embedding is generated by pooling all the samples from each dataset together, before individual samples are plotted.Pathologic cells in the B-ALL samples, as shown on the UMAP visualization (c), have higher phenotypic heterogeneity than those found in the CLL (a), which is consistent with the existing knowledge.The pathologic cells in different samples are found in the same UMAP regions, which are absent in the negative samples.This indicates that the CSNN results are both robust and interpretable.The identified pathologic cell phenotypes are also consistent with the existing knowledge: typical CLL cells areCD5 þ CD19 þ (b), while typical B-ALL cells are CD19 þ CD34 þ (d).The CSNN model produced natural shapes of the identified cell clusters, without using clustering analysis.The identified phenotypic heterogeneity of the B-ALL blasts can also be seen within and across the samples.

Figure 5 .
Figure 5. Visualization of pathologic cells identified by the different modeling methods-pathologic cells identified from positive CLL case #22936 (a) and B-ALL case #134 (b) by CSNN-Reg (row #3) versus two baseline methods CellCNN (row #1) and DeepCellCNN (row #2) and an independent data clustering analysis using K-means (row #4) are colored lighter (yellow)with the rest of the cells colored darker (blue).CSNN-Reg and K-means identified consistent cancer cell phenotypes.Neither CellCNN nor DeepCellCNN could identify the correct set of leukemic cells under their default settings.

Figure 6 .
Figure 6.Neural network model interpretation using decision trees.2D dot plots show the important tree nodes and highlight the leukemic cells identified and the expression cutoffs of the corresponding markers.Heterogeneity of B-ALL can be clearly seen in (a) CD19 À B-ALL found in sample #134, (b) CD19 þ B-ALL found in sample #211, and (c) both CD19 À and CD19 þ B-ALL can be found in sample #166.The black dotted lines in the 2D plots indicate the marker expression cutoffs identified by the decision tree classifier.Conceptually, each path highlighted in with a thicker line can be thought of as corresponding to a traditional manual gating sequence with marker expression cutoffs determined in a data-driven manner.