DenVar: density-based variation analysis of multiplex imaging data

Abstract Summary Multiplex imaging platforms have become popular for studying complex single-cell biology in the tumor microenvironment (TME) of cancer subjects. Studying the intensity of the proteins that regulate important cell-functions becomes extremely crucial for subject-specific assessment of risks. The conventional approach requires selection of two thresholds, one to define the cells of the TME as positive or negative for a particular protein, and the other to classify the subjects based on the proportion of the positive cells. We present a threshold-free approach in which distance between a pair of subjects is computed based on the probability density of the protein in their TMEs. The distance matrix can either be used to classify the subjects into meaningful groups or can directly be used in a kernel machine regression framework for testing association with clinical outcomes. The method gets rid of the subjectivity bias of the thresholding-based approach, enabling easier but interpretable analysis. We analyze a lung cancer dataset, finding the difference in the density of protein HLA-DR to be significantly associated with the overall survival and a triple-negative breast cancer dataset, analyzing the effects of multiple proteins on survival and recurrence. The reliability of our method is demonstrated through extensive simulation studies. Availability and implementation The associated R package can be found here, https://github.com/sealx017/DenVar. Supplementary information Supplementary data are available at Bioinformatics Advances online.

In the first two simulation setups, we replicated the characteristics of the mIHC lung cancer dataset discussed in Section 3.1. In Figure (3) of the main text, we showcased the mean of HLA-DR densities of the subjects from the two clusters identified by JSD-based clustering method. We noticed that these mean densities can be well approximated using Beta distributions [1] with different sets of parameters (α, β). To find out the sets of parameters (α, β) that would approximate the mean densities of the two observed clusters, we considered the following strategy. To replicate the mean density of cluster 1, we first computed its empirical mode, denoted by m 1 . We wanted to find a set of parameters, (α 1 , β 1 ) such that Beta(α 1 , β 1 ) satisfied two properties, a) it had the same mode, and b) its density function was close to the empirical one. To equate the modes we considered the equation, m 1 = α 1 −1 α 1 +β 1 −2 . The value of α 1 was thus fixed for given values of β 1 and m 1 . We considered different values of β 1 and chose the one for which the simulated density appeared to be closest to the real density.
We repeated the above steps for replicating the mean density of cluster 2 as well.
The modes of the mean densities of clusters 1 and 2 were respectively, m 1 = 0.0039 and m 2 = 0.0176. The mean density of cluster 1 was well approximated by a Beta distribution with α 1 = 2.17, β 1 = 300 and the mean density of cluster 2 by a Beta distribution with α 2 = 1.78, β 2 = 45. Note that the first density had a distinctively higher peak and a narrower tail compared to the latter. Refer to Figure S1 to check how well the real and simulated densities agreed. In simulation setups (4.1) and (4.2), Beta(2.17, 300) and Beta (1.78, 45) were used respectively.
In simulation setup (4.1), two groups of subjects were considered. For a subject from the group 1, the marker expression in every cell was simulated from Beta(2.17, 300). For a subject from the group 2, the marker expression was simulated from Beta(x, 300) where x was such that the mode of this distribution was higher than 0.0039 by a percentage of l i.e., x satisfied the equation, In simulation setup (4.2), again two groups of subjects were considered. For a subject from the group 1, the marker expression in every cell was simulated from Beta(1.78, 45). For a subject from the group 2, the marker expression was simulated from Beta(x, 45) where x was such that the mode of this distribution was higher than 0.0176 by a percentage of l i.e., x satisfied the equation, In  3 Details of the simulation setup (4.3) Next, we devised a simulation where the true values of the thresholds: (t 1 , t 2 ) were known.
And the data generation process was dependent on those. Recall that t 1 controls how we define a cell to be positive for a marker and t 2 controls how we cluster the subjects into two groups. We again considered two groups with respectively N 1 = 60 and N 2 = 40 subjects each with n cells. We wanted the subjects in group 1 to have t 2 % positive cells and the subjects in group 2 to have more than t 2 % positive cells. We describe the process of simulating the marker data of the non-positive cells first. For subjects in group 1, we made sure that they had (100 − t 2 )% non-positive cells by randomly choosing n(100 − t 2 )/100 cells out of the total of n. Let I denote the set of indices of those non-positive cells for subject j. Next, the marker data of i-th cell from set I, X ij was simulated from Beta(2.17, 300).
X ij can be thought of as X kij from the methods section and we dropped the index k for simplicity. Once, all the X ij 's were generated, the values were scaled to be in the range (0, t 1 ) using the transformation, X * ij = X ij −min i∈I X ij max i∈I X ij −min i∈I X ij t 1 for i ∈ I. Next, we describe the process of simulating the marker data of the positive cells. The marker data of the positive cells were again simulated from Beta(2.17, 300) and a constant of t 1 was added, X * ij = max{X ij + t 1 , 1}, i ∈ I c . Thus, we were able to generate the whole cell-level data of a subject j from group 1, X * ij for i ∈ {1, . . . , n} making sure there were only t 2 % cells having marker expression more than t 1 . For subjects from group 2, a similar procedure as above was considered by replacing t 2 with a randomly chosen t * 2 ∈ (t 2 , 1].

Confidence interval of ARI in simulation setups (4.1) and (4.2)
Tables S1 and S2 respectively list the confidence interval of ARI across 100 replications in the simulation setups (4.1) and (4.2).

A real data example showing problems with binarizing marker expression
In Figure S2, we show five images each of the TME of two different subjects from the mIHC lung cancer dataset. In every image, the HLA-DR marker expression is plotted across the cells at their xy co-ordinates. Notice that in all the images from subject 2, the marker expression is mostly close to zero, whereas in some of the images from subject 1, the marker expression is noticeably high. Thus, these two subjects should ideally be distinguishable if the difference is quantified in terms of their overall marker expression. In Figure S3, we binarize these images meaning that we declare a cell to have marker expression one or zero based on its true expression being more or less than a chosen threshold of 0.55. We observe that subject 1 now has mostly zero marker expression bar just two cells and thus, becomes nearly indistinguishable from subject 2. This was probably the reason why subjects 1 and 2 were both assigned to be in MHC-II low group by the thresholdingbased method, whereas our method could identify their differences and assigned them into separate clusters. This example shows how discarding important marker information (by binarizing) can potentially result in an overall loss of power.