Compositional Data Analysis using Kernels in mass cytometry data

Abstract

Motivation: Cell-type abundance data arising from mass cytometry experiments are compositional in nature. Classical association tests do not apply to compositional data due to their non-Euclidean nature. Existing methods for the analysis of cell type abundance data suffer from several limitations for high-dimensional mass cytometry data, especially when the sample size is small.

Results: We propose a new multivariate statistical learning methodology, Compositional Data Analysis using Kernels (CODAK), based on the kernel distance covariance (KDC) framework, to test the association of cell type compositions with important predictors (categorical or continuous) such as disease status. CODAK scales well for high-dimensional data and provides satisfactory performance for small sample sizes (n < 25). We conducted simulation studies to compare the performance of the method with existing methods of analyzing cell type abundance data from mass cytometry studies. The method is also applied to a high-dimensional dataset containing different subgroups of populations, including Systemic Lupus Erythematosus (SLE) patients and healthy control subjects.

Availability and implementation: CODAK is implemented in R. The code and the data used in this manuscript are available at http://github.com/GhoshLab/CODAK/.

Contact: prudra@okstate.edu

Supplementary information: Supplementary data are available at Bioinformatics Advances online.


1. Aitchison Distance (AD) (default)
2. Bray-Curtis Distance (BCD)
3. Euclidean Distance (ED)

CODAK using these distances is denoted by CODAK, CODAK-BC and CODAK-ED, respectively, in our plots. In each case, we allow the user to choose the kernel using Gower's method (default) as in Equation 3 or the Gaussian kernel as in Equation 6. Based on our empirical results, we suggest using CODAK with the default choices (see Figures 3, S2 and S6).
For the predictor variable, CODAK allows the user to choose either the linear kernel (LK) or the Gaussian kernel (GK). Based on our empirical results, we suggest using the LK (see Figure S7).
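As an illustration of these distance choices, the following minimal sketch computes the Aitchison distance between compositions and derives a kernel from a pairwise distance matrix. This is not the package implementation (Equations 3 and 6 are not reproduced here); the Gower centering and Gaussian forms below are the standard constructions, and all function names are ours.

```python
import numpy as np

def aitchison_distance(p, q):
    """Aitchison distance between two compositions (no zeros allowed):
    the Euclidean distance between their clr transforms."""
    clr_p = np.log(p) - np.mean(np.log(p))
    clr_q = np.log(q) - np.mean(np.log(q))
    return float(np.sqrt(np.sum((clr_p - clr_q) ** 2)))

def gower_centered_kernel(D):
    """Gower-centered kernel from a pairwise distance matrix D:
    K = -0.5 * J D^2 J, with J the centering matrix."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    return -0.5 * J @ (D ** 2) @ J

def gaussian_kernel(D, sigma):
    """Gaussian (RBF) kernel from pairwise distances."""
    return np.exp(-(D ** 2) / (2 * sigma ** 2))
```

Note that the Aitchison distance is scale invariant (multiplying a composition by a constant does not change its clr transform), which is why the closure operation can be ignored when computing it.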
Three choices of permutation methods are available when adjusting for covariates using the CODAK-alr method. These are:
1. Shuffle residuals (KC) (Kennedy and Cade, 1996): Residualize both Y and X with respect to the covariates and permute the residuals.
2. Shuffle Y (SY) (Manly, 2018): Shuffle the rows of the Y matrix while keeping the relationship between X and Z intact.
3. Freedman and Lane's method (FL) (Freedman and Lane, 1983): Residualize the data with respect to the covariates, permute the residuals, add the estimated covariate effects back, and fit the model again.
See Anderson and Legendre (1999) for a detailed description of each permutation method and the subtle differences between them. It is well known that the SY method is less stable than the other two methods, and it can perform poorly in the presence of outliers (Anderson and Legendre, 1999; Winkler et al., 2014). Therefore, we recommend using one of the other two methods. Our empirical results show that the FL method has the best performance since it controls the type-I error well. The KC method has higher power, but it is also slightly anticonservative (Figure 5, Figure S4).
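The differences between the three permutation schemes can be illustrated with a schematic sketch for a univariate response. CODAK itself operates on kernel statistics, so this is only meant to show how KC, SY and FL construct the permuted data; the statistic, function names, and setup here are illustrative.

```python
import numpy as np

def _resid(M, Z):
    """Residuals of M after least-squares projection on Z
    (Z is assumed to include an intercept column)."""
    beta, *_ = np.linalg.lstsq(Z, M, rcond=None)
    return M - Z @ beta

def _stat(y, x, Z):
    """Partial association statistic: |corr| of covariate-residualized y and x."""
    ry, rx = _resid(y, Z), _resid(x, Z)
    return abs(np.corrcoef(ry, rx)[0, 1])

def perm_pvalue(y, x, Z, method="FL", B=999, rng=None):
    """Permutation p-value under one of the three schemes (KC, SY, FL)."""
    rng = np.random.default_rng(rng)
    obs = _stat(y, x, Z)
    ry = _resid(y, Z)
    yhat = y - ry                     # fitted values under the reduced model
    count = 0
    for _ in range(B):
        perm = rng.permutation(len(y))
        if method == "SY":            # shuffle rows of y directly
            ystar = y[perm]
        elif method == "KC":          # permute residuals of y
            ystar = ry[perm]
        else:                         # FL: permuted residuals + covariate fit
            ystar = yhat + ry[perm]
        count += _stat(ystar, x, Z) >= obs
    return (count + 1) / (B + 1)
```

In this sketch, FL restores the estimated covariate effects before re-fitting, which is what gives it its good type-I error control in the comparisons above.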

S2 Additional Discussion on the relationship between the R² measures
As discussed in Section 2.6, the R² provided by MiRKAT (Zhao et al., 2015) is the square of the dcor measure (Equation 15) in CODAK when using the same kernel for the compositional data and the LK for the predictor. The R² provided by PERMANOVA (Anderson, 2001) is not equivalent to these. Suppose we use the LK for the predictor and the AD kernel for the compositions. The kernel matrix K for the compositions is obtained using Equation 3. Note that the R² measure for PERMANOVA is given by (Anderson, 2014)

    R²_PERMANOVA = tr(H_X K H_X) / tr(K),

where H_X is the hat matrix defined using X. On the other hand, the dcor for CODAK satisfies

    dcor²(X, P) = tr(K L) / sqrt( tr(K K) tr(L L) ),

where L is the centered kernel matrix for the predictor. Although it can be shown that the two expressions become equivalent when P is one-dimensional, this is not useful in our case since the compositional profile P is not one-dimensional here. We provide an empirical comparison of the R² measures from the 3 methods (CODAK, MiRKAT-AD, PERMANOVA-AD) using simulations in the bottom panels of Figure S1. The top panels of this figure show the equivalence of the three methods in the no covariate case, confirming the theoretical results in Hua and Ghosh (2015) and Pan (2011). The near perfect association in the top panels shows that the methods are equivalent in these cases and the only difference is likely due to numerical computation. The simulations were conducted using the simulation scheme used to produce Figure 3.
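The one-dimensional equivalence mentioned above can be checked numerically. The sketch below assumes the standard forms R² = tr(H G H)/tr(G), with G the Gower-centered squared-distance matrix, and dcor² = tr(K L)/sqrt(tr(K²) tr(L²)) with linear kernels; these are our reconstructions of the quantities discussed here, not code from the package.

```python
import numpy as np

def gower_center(D):
    """Gower centering of a squared distance matrix: -0.5 * J D^2 J."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    return -0.5 * J @ (D ** 2) @ J

def permanova_r2(p, X):
    """tr(H G H) / tr(G): G is the Gower-centered distance matrix of the
    (here one-dimensional) response and H the hat matrix of X."""
    D = np.abs(p[:, None] - p[None, :])   # Euclidean distance for 1-D p
    G = gower_center(D)
    H = X @ np.linalg.pinv(X.T @ X) @ X.T
    return np.trace(H @ G @ H) / np.trace(G)

def kernel_dcor2(p, x):
    """Squared kernel distance correlation with linear kernels:
    tr(K L) / sqrt(tr(K^2) tr(L^2)) on double-centered kernel matrices."""
    n = len(p)
    J = np.eye(n) - np.ones((n, n)) / n
    K = J @ np.outer(p, p) @ J
    L = J @ np.outer(x, x) @ J
    return np.trace(K @ L) / np.sqrt(np.trace(K @ K) * np.trace(L @ L))
```

For a single continuous predictor with an intercept, both quantities reduce to the squared Pearson correlation between p and x, which is the equivalence referred to in the one-dimensional case.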

S3 Computation Time
We compared the computation time of each of the commonly used methods for testing cell type abundance in mass cytometry on a single machine with an i7-10700 CPU @ 2.90 GHz and 32 GB RAM. For each method, we ran it 100 times on a typical data set as obtained from our simulation schemes, using 1000 permutations. The run times are reported in Table S1.

[Figure S1 caption: Comparison of CODAK, PERMANOVA and MiRKAT in the no covariate case from simulated data sets. AD kernel and linear kernel were used for the compositions and the predictor variable, respectively. Black dots are from the null scenario (no association), and red dots are from the alternative (some association). The near perfect association in the top panels and the bottom left panel shows that the methods are equivalent in these cases and the only difference is likely due to numerical computation.]

S4.1 Simulation details
We conducted the following simulation studies besides the ones reported in the main manuscript.
S(i) Continuous predictor (Figure S2): This is identical to scenario (i) in the main manuscript (Figure 3) except that the predictor variable is a continuous variable simulated from a U(0, 1) distribution. For comparability, the effect sizes in cases (a)-(d) are simulated such that the mean and variance of βX remain the same as in the binary predictor case, where β is the q-dimensional vector of effect sizes.
S(ii) Binary predictor with larger random effects variance (Figure S3): We also wanted to explore the effect of a larger random effects variance. This simulation is identical to scenario (i) in the main manuscript (Figure 3) except that the standard deviation of the random effects term is now assumed to be 0.5.
S(iii) Continuous predictor, binary covariate (Figure S4): This is similar to scenario (ii) in the main manuscript (Figure 5) except that the predictor is now considered to be continuous, as in S(i) above.
S(iv) Binary predictor, binary covariate, repeated measures with larger dependence (Figure S5): This is similar to scenario (iii) in the main manuscript (Figure 6) except that the dependence across the repeated measures is now larger (random effect SD = 0.3 instead of 0.1).
S(v) Comparison of CODAK (with AD) using the default kernel vs the Gaussian kernel, binary predictor (Figure S6): The simulation scheme is similar to scenario (i) in the main manuscript (Figure 3).
S(vi) Comparison of CODAK using the linear kernel vs the Gaussian kernel for the predictor, continuous predictor (Figure S7): The simulation scheme is similar to scenario S(i) above.
S(vii) Comparison of CODAK-ED vs the original distance correlation (dcor), continuous predictor (Figure S8): The simulation scheme is similar to scenario S(i) above.
S(viii) Comparison of statistical power (binary predictor) for data with no zeros and data with some zeros, each of which is replaced by a pseudocount of 1 (Figure S9).
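The exact simulation schemes follow the main manuscript. As a rough, hypothetical sketch of a continuous-predictor scenario in the spirit of S(i), one could generate logistic-normal cell-type proportions driven by x ~ U(0, 1) with a random-effects term; the flat baseline, effect sizes, and function below are illustrative, not the settings used in the paper.

```python
import numpy as np

def simulate_compositions(n=25, q=10, beta=None, re_sd=0.1, rng=None):
    """Sketch of a continuous-predictor simulation: logistic-normal
    proportions with a linear predictor effect on the log scale and a
    random-effects term of standard deviation `re_sd`."""
    rng = np.random.default_rng(rng)
    if beta is None:
        beta = np.zeros(q)                      # null case: no association
    x = rng.uniform(0.0, 1.0, size=n)           # continuous predictor
    base = np.log(np.full(q, 1.0 / q))          # illustrative flat baseline
    eta = base + np.outer(x, beta)              # effect of x on log scale
    eta += rng.normal(0.0, re_sd, size=(n, q))  # random effects
    P = np.exp(eta)
    P /= P.sum(axis=1, keepdims=True)           # closure: rows sum to 1
    return x, P
```

Scaling `beta` controls the mean and variance of βX, which is how the effect sizes were matched to the binary predictor case in S(i).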

S4.2 Results of the additional simulations
The comparisons of the size and power for the methods under consideration for simulations S(i)-S(viii) are shown in Figures S2-S9, respectively. The diffcyt methods are not available for the continuous predictor cases. The findings from Figure S2 and Figure S4 are similar to the simulations in the main manuscript.
The advantage of CODAK over the other two methods is slightly less when a larger random effects variance is used (Figure S3).
The CODAK-sk method is still the most robust among all methods against the violation of the independence assumption in Figure S5 when the violation is stronger, but it gradually becomes anticonservative as the dependence in the repeated measures increases. Figure S6 shows that the default kernel and the Gaussian kernel using AD have similar performance, with the default kernel performing slightly better. Figure S7 shows that the linear kernel performs better than the Gaussian kernel for measuring the predictor similarities. Figure S8 shows that CODAK-ED and the original dcor, although not equivalent, have near identical performance. However, both have poor statistical power and are not recommended for compositional data. Figure S9 shows that there is a small amount of power loss from replacing the zeros by pseudocounts when compared to the situation with no zeros.

[Figure caption fragment: The odds ratio method uses the largest values of the sample odds ratio. The odds ratio method is the 'true model' in the sense that the 'true top five' are determined using the true odds ratios.]
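The zero-replacement step compared in Figure S9 can be sketched as follows. This is a simple pseudocount rule (zeros replaced by 1, as stated in S(viii)); the exact preprocessing pipeline in the paper may differ, and the function name is ours.

```python
import numpy as np

def replace_zeros_with_pseudocount(counts, pseudo=1):
    """Replace zero cell counts by a pseudocount before forming
    proportions, so that log-ratio transforms (alr/clr) are defined."""
    counts = np.where(counts == 0, pseudo, counts)
    return counts / counts.sum(axis=1, keepdims=True)
```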

S5 A second data analysis example
This analysis is based on data collected from an ongoing study conducted by the authors (study 2). Peripheral whole blood samples were collected from SLE patients and age- and sex-matched controls. Peripheral blood samples were fixed (and red blood cells lysed) either immediately after collection (T0) or after incubation at 37°C with a protein transport inhibitor cocktail (T6). Lysed/fixed cells were stored at −80°C and were thawed on the day of barcoding and staining. To decrease technical variability, palladium isotopes were used in different combinations for mass tag barcoding of separate samples, which were pooled in sets of 20 and surface stained in a single tube with a metal-labeled antibody panel. The barcoding methodology was adapted from Zunder et al. (2015). Protocols for intracellular cytokine staining (ICS) assays were adapted from previous studies (O'Gorman et al., 2017). Individual samples were barcoded to reduce technical variability, and data were also batch adjusted (Schuyler et al., 2019). Clinical data, including clinical laboratory parameters (complete blood cell counts, autoantibodies, chemistry panels, etc.), clinical disease manifestations identified on physical exam, imaging studies, and disease severity scores (SLE disease activity index, SLEDAI), were recorded for every participant at every study timepoint (Bombardier et al., 1992; Touma et al., 2011). Nephritis diagnosis, to be used as a covariate, was based on renal biopsy pathology evaluation (Schwartz et al., 2014).
Although questions similar to study 1 are relevant for this study too (e.g. the SLE vs healthy control comparison), in this paper, for the purpose of demonstrating the application of CODAK with a continuous predictor, we focused on the association between the cell-type abundance and the SLEDAI score, a continuous predictor. Since we are interested in disease activity (the SLEDAI score), we only used the data from the SLE patients at T0. Application of CODAK to these data resulted in a p-value of 0.2794, indicating no significant association of the cell type compositions with the SLEDAI score. We also conducted the analysis adjusting for the binary covariate indicating whether an SLE patient had nephritis. CODAK-sk and CODAK-alr both resulted in statistically non-significant outcomes (p-values = 0.3914 and 0.4298, respectively).

S6 Proof of the result in Equation 18
To show: the result stated in Equation 18.

Proof. Note that for this proof we can ignore the closure operator in the expression P_{−c} = C(P_1, P_2, ..., P_{c−1}, P_{c+1}, ..., P_q) due to the scale invariance property of the Aitchison Distance (AD). It is easy to see that the AD between the two compositions P^(i) and P^(j) can be written as

    d(P^(i), P^(j)) = sqrt( sum_{r=1}^{q} (V_r − V̄)² ),  where V_r = ln(P_ir / P_jr),

by realizing that

    V̄ = (1/q) sum_{r=1}^{q} ln(P_ir / P_jr) = ln( g(P^(i)) / g(P^(j)) ).
In the light of this, Equation S3 becomes

    argmin_c sum_{r ≠ c} (V_r − V̄_{−c})²
      = argmin_c [ sum_{r=1}^{q} (V_r − V̄)² − (V_c − V̄)² − (V_c − V̄)² / (q − 1) ],

where V̄_{−c} denotes the mean of the V_r over r ≠ c. Note that the two rightmost terms in this last equation are both increasing functions of |V_c − V̄|. This completes the proof.
S7 Proof of the fact that alr and clr methods lead to the same result for covariate adjustment

Let us use the notations Y_alr and Y_clr to denote the alr-transformed and clr-transformed compositional vectors. Therefore,

    Y_alr = ( ln(P_1/P_q), ln(P_2/P_q), ..., ln(P_{q−1}/P_q) ),

and

    Y_clr = ( ln(P_1/g(P)), ln(P_2/g(P)), ..., ln(P_q/g(P)) ) = ( Y_alr + k 1'_{q−1}, k ),

where k = ln(P_q / g(P)).
In matrix notation, for n observations,

    Y_clr = ( Y_alr + k 1'_{q−1}, k ),

where k = (k_1, ..., k_n)' with k_i = ln(P_iq / g(P^(i))). Since residualization with respect to the covariates is a linear operation applied column-wise, the residualized matrices can be written as

    (I − H_Z) Y_clr = ( (I − H_Z) Y_alr + (I − H_Z) k 1'_{q−1}, (I − H_Z) k ),

so the residualized clr matrix is related to the residualized alr matrix through the same transformation that relates Y_clr to Y_alr. Hence the proof.
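The structural relationship between the alr and clr coordinates (each clr column is the corresponding alr column plus k, and the last clr column is k itself) can be verified numerically. The helper names below are illustrative.

```python
import numpy as np

def alr(P):
    """alr transform with the last component as reference (rows = samples)."""
    return np.log(P[:, :-1] / P[:, -1:])

def clr(P):
    """clr transform: log of each part over the row geometric mean."""
    g = np.exp(np.log(P).mean(axis=1, keepdims=True))
    return np.log(P / g)

def clr_from_alr_error(P):
    """Max error of the identity Y_clr = (Y_alr + k 1', k),
    with k_i = ln(P_iq / g(P_i))."""
    Yalr, Yclr = alr(P), clr(P)
    k = np.log(P[:, -1:] / np.exp(np.log(P).mean(axis=1, keepdims=True)))
    rebuilt = np.hstack([Yalr + k, k])
    return float(np.abs(rebuilt - Yclr).max())
```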