BannMI deciphers potential n-to-1 information transduction in signaling pathways to unravel the message of intrinsic apoptosis

Abstract

Motivation: Cell fate decisions, such as apoptosis or proliferation, are communicated via signaling pathways. These pathways are heavily intertwined and often consist of sequential interactions of proteins (kinases). Information integration takes place on the protein level via n-to-1 interactions. A state-of-the-art procedure to quantify information flow (edges) between signaling proteins (nodes) is network inference. However, edge weight calculation typically refers to 1-to-1 interactions only and relies on mean protein phosphorylation levels instead of single cell distributions. Information theoretic measures such as the mutual information (MI) have the potential to overcome these shortcomings but are still rarely used.

Results: This work proposes a Bayesian nearest neighbor-based MI estimator (BannMI) to quantify n-to-1 kinase dependency in signaling pathways. BannMI outperforms the state-of-the-art MI estimator on protein-like data in terms of mean squared error and Pearson correlation. Using BannMI, we analyze apoptotic signaling in phosphoproteomic data of cancerous and noncancerous breast cell lines. Our work provides evidence for cooperative signaling of several kinases in programmed cell death and identifies a potential key role of the mitogen-activated protein kinase p38.

Availability and implementation: Source code and applications are available at https://github.com/zuiop11/nn_info and can be installed via pip as the Python package nn-info.


Example
As stated in the introduction, let $X = \{X_1, X_2, \ldots, X_n\}$, $X_i = (X_{i,1}, X_{i,2})$, be an n-sized sample of 2D uniform, independent and identically distributed (iid) random variables with no componentwise dependencies. Further, let $Z_i = \varphi_2(X_i) + Y_i$, where $\varphi_2$ is the probability density function (pdf) of a centered 2D Gaussian distribution $N_2(0, \Sigma_2)$, and $Y_i \overset{iid}{\sim} N(0, 0.01)$, $i = 1, \ldots, n$. Figure 1 displays data examples for two different covariance matrices. Here, the example on the right-hand side has a higher covariance, which leads to a higher value for the 2D MI estimation between X and Z.
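For reproducibility, the following minimal numpy sketch reproduces this data generating process; the uniform support and the covariance entries are illustrative choices, not values from the paper:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
n = 1_000

# 2D uniform iid sample without componentwise dependence (support is illustrative)
x = rng.uniform(-3.0, 3.0, size=(n, 2))

# illustrative covariance; larger off-diagonal entries raise the MI between X and Z
sigma2 = np.array([[1.0, 0.8],
                   [0.8, 1.0]])
phi2 = multivariate_normal(mean=np.zeros(2), cov=sigma2)

# Z_i = phi_2(X_i) + Y_i with Y_i ~ N(0, 0.01), i.e. standard deviation 0.1
z = phi2.pdf(x) + rng.normal(0.0, 0.1, size=n)
```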

Supplementary information: BannMI to quantify information propagation in molecular networks
Let $X = \{X_i\}_{i=1}^n$ with $X_i = (X_{i,1}, X_{i,2}, \ldots, X_{i,d})$ be a single cell sample with d measured phosphoproteins per single cell, such that $X_{i,s}$ with $s \in \{1, 2, \ldots, d\}$ is the expression/phosphorylation of protein s in single cell i and $X_s := \{X_{i,s}\}_{i=1}^n$ is the 1D sample of all n single cells. Now, suppose the vector components $u = (u_1, u_2, u_3)$ with $u_j \in \{1, 2, \ldots, d\}$ correspond to upstream kinases of s (see Figure 2), such that $X_{i,u} = (X_{i,u_1}, X_{i,u_2}, X_{i,u_3})$ and $X_u := \{X_{i,u}\}_{i=1}^n$ is the 3D subsample of X consisting of the upstream kinases. Then, information propagation in this small molecular network can be measured by $\hat{I}^B_{n,k}(X_s, X_u)$, which is our BannMI. It quantifies the dependency between $X_s$ and $X_u$. It is easy to see that this approach can be extended to molecular networks consisting of more than one interface.
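To make the indexing concrete, here is a short sketch that builds $X_s$ and $X_u$ from a single cell matrix; the estimator call at the end is a hypothetical placeholder, standing in for whichever BannMI routine the nn-info package exposes:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.random((500, 6))  # toy stand-in for a single cell sample: (n cells, d phosphoproteins)

s = 3            # index of the downstream kinase
u = (0, 1, 4)    # indices of its three upstream kinases

x_s = x[:, s]           # 1D target sample X_s
x_u = x[:, list(u)]     # 3D upstream subsample X_u

# hypothetical call, to be replaced by the actual BannMI routine of the nn-info package:
# mi_estimate = bann_mi(x_s, x_u, k=50)
```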

Supplementary material on methods
As mentioned in the Methods part of our paper, a popular approach to derive estimators for the entropy, the Kullback-Leibler divergence (KLD) and the mutual information is to apply a nearest neighbor (NN) distance approximation procedure. In what follows, we first introduce this approximation, which can be applied to probability density functions (pdfs). After that, we introduce the estimators and provide additional background information.

Notation
Let $X = \{X_1, X_2, \ldots, X_n\}$ with $n \in \mathbb{N}$ be an iid sample with $X_i \sim P$ and let $\|\cdot\|$ be a norm on $\mathbb{R}^d$. Now, $nn_{i,1} \in X$ is the first NN of $X_i$ in that sample if and only if

$$nn_{i,1} = \operatorname*{arg\,min}_{X_j \in X \setminus \{X_i\}} \|X_j - X_i\|,$$

with distance $r_{i,1} = \|nn_{i,1} - X_i\|$. For every positive integer $k < n$, the kth NN $nn_{i,k}$ of $X_i$ is defined equivalently. If we assume the Euclidean norm, a ball $B_r(x)$ with a small radius r around x has the volume $r^d c_{1,d}$ with

$$c_{1,d} = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)},$$

where $\Gamma$ is the Gamma function. The NN distance method takes advantage of the following approximation from Lebesgue's differentiation theorem:

$$p(x) \approx \frac{1}{r^d c_{1,d}} \int_{B_r(x)} p(y)\, dy. \quad (1)$$

Obviously, equality holds for constant $p(y)$ in $B_r(x)$. The pdf estimator in the pioneering work of Kozachenko and Leonenko (1987) is defined as

$$\hat{p}(X_i) = \frac{1}{(n-1)\, r_{i,1}^d\, c_{1,d}}.$$

An intuition is as follows: Let $X = \{X_1, X_2, \ldots, X_n\}$ be as defined above with $X_i \sim P$ and pdf p. Further, let $B_{r_{i,1}}$ be the neighborhood d-ball around $X_i$. Then, let $X^* \sim P$ be a random variable. The probability $p^*$ for $X^* \in B_{r_{i,1}}$ is $\int_{B_{r_{i,1}}} p(y)\, dy$. Now, if we were to estimate $p^*$ given the sample $X \setminus \{X_i\}$, we find that 1 out of $n-1$ realizations is inside the closed ball $B_{r_{i,1}}$. Hence, the maximum likelihood estimate (MLE) for $p^*$ is $\frac{1}{n-1}$ and therefore

$$\int_{B_{r_{i,1}}} p(y)\, dy \approx \frac{1}{n-1}.$$

The rest follows from (1).
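A minimal sketch of this first-NN pdf estimate, assuming a numpy sample of shape (n, d) without duplicate points:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gammaln

def kl_pdf_estimate(x):
    """First-NN pdf estimate of Kozachenko and Leonenko (1987) at each sample point."""
    n, d = x.shape
    r1 = cKDTree(x).query(x, k=2)[0][:, 1]  # distance to the first NN (k=2 skips the point itself)
    log_ball = d * np.log(r1) + (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)  # log(r^d * c_{1,d})
    return np.exp(-np.log(n - 1) - log_ball)  # 1 / ((n-1) * r^d * c_{1,d})
```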

Entropy estimation
The entropy estimator of Kozachenko and Leonenko (1987) is derived by averaging over the natural logarithm of the pdf estimates:

$$\hat{H}_n(X) = -\frac{1}{n} \sum_{i=1}^n \log \hat{p}(X_i) + \gamma = \frac{d}{n} \sum_{i=1}^n \log r_{i,1} + \log c_{1,d} + \log(n-1) + \gamma, \quad (2)$$

where $\gamma \approx 0.57721$ is the Euler-Mascheroni constant. Alternatively, especially in an information theoretic context, $\log_2$ is applied.
The method gained a lot of attention in the statistics community, as it is simple and has favorable statistical properties: it is asymptotically unbiased with vanishing variance for bounded p with $\mathbb{E}(p^{-\epsilon}) < \infty$ for some $\epsilon > 0$ (Leonenko et al., 2006). The latter work also extends (2) to an arbitrary $k < n$, which then suffers from increased bias but has less variance in the scenario of a finite sample size. This is intuitively clear, as the local constancy assumption for $p(x)$ in the neighborhood ball is violated to a higher degree for a growing distance $r_{i,k}$. However, the robustness of $r_{i,k}$ increases, given more data points $\{nn_{i,1}, nn_{i,2}, \ldots, nn_{i,k}\}$.
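As a sketch, here is the generalized k-NN variant of the entropy estimate in nats; with k = 1, the identities $\psi(1) = -\gamma$ and $\psi(n) \approx \log(n-1)$ recover (2):

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def kl_entropy(x, k=1):
    """k-NN entropy estimate (in nats) for an (n, d) sample x."""
    n, d = x.shape
    r_k = cKDTree(x).query(x, k=k + 1)[0][:, -1]  # k-th NN distance, self excluded
    log_c = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)  # log volume of the unit d-ball
    return d * np.mean(np.log(r_k)) + log_c + digamma(n) - digamma(k)
```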

MI estimation
For $d \geq 2$, write $X_i = (X_{i,1}, X_{i,2})$ and let $X^1$ and $X^2$ be the marginal samples $\{X_{i,1}\}_{i=1,\ldots,n}$ and $\{X_{i,2}\}_{i=1,\ldots,n}$, respectively. A straightforward MI estimator is $\hat{H}_n(X^1) + \hat{H}_n(X^2) - \hat{H}_n(X)$. Instead, the famous work of Kraskov et al. (KSG; 2004) uses the flexibility in the choice of k to construct a superior MI estimator compared to the straightforward approach. The algorithm of KSG is as follows: let k be fixed. Then, a variation of $\hat{H}_n(X)$ is computed w.r.t. the max norm, that is, $r_{i,k}$ is the kth NN distance of $X_i$ under $\|\cdot\|_\infty$, and $k_{i,1}$ is the number of elements of $X^1$ that are in the projected ball around $X_{i,1}$ with radius $r_{i,k}$. $k_{i,2}$ is defined equivalently (see Kraskov et al., 2004, for an illustrative example of this procedure). This leads to the KSG estimator

$$\hat{I}_{n,k}(X^1, X^2) = \psi(k) + \psi(n) - \frac{1}{n} \sum_{i=1}^n \big( \psi(k_{i,1} + 1) + \psi(k_{i,2} + 1) \big),$$

where $\psi$ is the digamma function. Note that due to the clever choice of $k_{i,1}$ and $k_{i,2}$, explicit distances cancel out in the formula. KSG provides excellent results if the marginal distributions $P_1$ and $P_2$ are of low dimension with rather moderate componentwise dependencies.
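A compact sketch of this first KSG variant; inputs are assumed to be (n, d) arrays (1D samples need a reshape to (n, 1)), and the strict "inside the ball" count is mimicked by a tiny radius shrink:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_mi(x1, x2, k=3):
    """KSG mutual information estimate (in nats) between two (n, d_j) samples."""
    n = x1.shape[0]
    joint = np.hstack([x1, x2])
    # k-th NN distance in the joint space under the max norm (self excluded via k+1)
    r = cKDTree(joint).query(joint, k=k + 1, p=np.inf)[0][:, -1]
    # k_{i,j}: marginal neighbors strictly inside the projected ball; subtract the point itself
    k1 = cKDTree(x1).query_ball_point(x1, r - 1e-12, p=np.inf, return_length=True) - 1
    k2 = cKDTree(x2).query_ball_point(x2, r - 1e-12, p=np.inf, return_length=True) - 1
    return digamma(k) + digamma(n) - np.mean(digamma(k1 + 1) + digamma(k2 + 1))
```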

Amelioration of the NN distance method
Many extensions of KSG exist. Berrett et al. (2019) propose a weighted average of pdf estimates derived from $k = 1, 2, \ldots, m$ neighbors, with m being small with respect to n. Gao et al. (2015) and Lord et al. (2018) apply a local principal component analysis (PCA) or singular value decomposition (SVD) to the k nearest neighbors of $X_i$. By that, the neighborhood ball around $X_i$ is replaced with a more suitable structure that captures local dependencies of the data, for example an ellipsoid in which $p(x)$ is rather constant. To this end, Lu and Peltonen (2020) enhance the PCA approach by adding a bootstrap procedure to test for potential overcorrection. However, a substitution of the neighborhood ball comes at the cost of an asymptotic bias.
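A rough sketch of the local PCA idea, as our own simplification rather than the exact procedure of the cited works: the ellipsoid axes are taken from the empirical covariance of the k nearest neighbors.

```python
import numpy as np
from scipy.spatial import cKDTree

def local_pca_ellipsoid(x, i, k=20):
    """Sketch: derive an ellipsoid (axes and scales) from a local PCA of the
    k nearest neighbors of x[i], as a stand-in for the neighborhood ball."""
    _, idx = cKDTree(x).query(x[i], k=k + 1)
    nbrs = x[idx[1:]] - x[i]               # k neighbors, centered at x[i]
    cov = nbrs.T @ nbrs / k                # local (uncentered) covariance
    scales_sq, axes = np.linalg.eigh(cov)  # principal directions and variances
    return np.sqrt(scales_sq), axes
```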

KLD estimation
A KLD estimator takes two iid samples X with $X_i \sim P$ and Y with $Y_i \sim Q$ into account. For the sake of simplicity, we assume that X and Y are of identical sample size n. The pdf approximation of (1) is not exclusive to $p(x)$; it can also be applied to approximate $q(x)$, although only data points from the X sample are taken into account. Now, a straightforward KLD estimator is

$$\hat{D}_{n,k}(P \| Q) = \frac{d}{n} \sum_{i=1}^n \log \frac{r^y_{i,k}}{r^x_{i,k}} + \log \frac{n}{n-1}, \quad (3)$$

where $r^x_{i,k}$ and $r^y_{i,k}$ are the distances from $X_i$ to its kth nearest neighbor in X or Y, respectively. This estimator was explicitly proposed by Wang et al. (2009). However, its mean square consistency was already provided in the little noticed work of Leonenko et al. (2006). More recently, Zhao and Lai (2020) provided convergence rates for bias and variance if the pdfs $p(x)$ and $q(x)$ meet some criteria. The authors further showed that the estimator's convergence rate is close to the minimax rate. The minimax rate is, heuristically speaking, the best theoretical performance of an estimator, given the worst case scenario. It is worth noting that (3) explicitly applies NN distances. Therefore, problems of well-definedness appear for data with duplicates, as there may exist $X_i \in X$ with $r^x_{i,1} = 0$. To prevent $\log(0)$ in (3), we introduced a case discrimination $\max(\cdot, \epsilon)$ in our implementation.
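A minimal sketch of (3), assuming (n, d) and (m, d) numpy samples; the $\max(\cdot, \epsilon)$ guard mirrors the case discrimination described above:

```python
import numpy as np
from scipy.spatial import cKDTree

def nn_kld(x, y, k=1, eps=1e-12):
    """NN-distance KLD estimate of D(P||Q) (in nats) from x ~ P and y ~ Q, cf. (3)."""
    n, d = x.shape
    m = y.shape[0]
    r_x = cKDTree(x).query(x, k=[k + 1])[0][:, 0]  # k-th NN within X (k+1 skips the point itself)
    r_y = cKDTree(y).query(x, k=[k])[0][:, 0]      # k-th NN of each X_i within Y
    # the max(., eps) case discrimination guards against duplicates with zero distance
    ratio = np.maximum(r_y, eps) / np.maximum(r_x, eps)
    return (d / n) * np.sum(np.log(ratio)) + np.log(m / (n - 1))
```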

Derivation of the BannMI empirical hyperparameters
In the case study of our work, we apply BannMI with empirically derived hyperparameters $\{\hat\alpha, \hat\beta\}$ for the Beta prior distribution. In what follows, we describe this empirical derivation. The approach was suggested by Gelman et al. (2015). It uses the method of moments as follows (a code sketch is given after this list):

1. For every $X_i \in X$, compute estimates for the neighborhood probabilities.
2. Estimate the mean $\hat\mu$ of this sample.
3. Estimate the variance $\hat\sigma^2$ of this sample.
4. Apply the transformation

$$\hat\alpha = \hat\mu \left( \frac{\hat\mu(1-\hat\mu)}{\hat\sigma^2} - 1 \right), \qquad \hat\beta = (1-\hat\mu) \left( \frac{\hat\mu(1-\hat\mu)}{\hat\sigma^2} - 1 \right)$$

to receive estimates for the parameters. By that, $\{\hat\alpha, \hat\beta\}$ are plugged into the BannMI.
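A sketch of the method-of-moments step, under the assumption that the Beta prior is fitted to a sample p_hat of probability estimates in (0, 1):

```python
import numpy as np

def beta_moments(p_hat):
    """Method-of-moments fit of Beta(alpha, beta) to probability estimates p_hat."""
    mu = float(np.mean(p_hat))
    var = float(np.var(p_hat, ddof=1))
    common = mu * (1.0 - mu) / var - 1.0  # positive whenever var < mu * (1 - mu)
    return mu * common, (1.0 - mu) * common
```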

Supplementary material on the benchmark study
In the Methods part of our work, we derived two KLD estimators: a frequentist NN ratio based KLD estimator, which was inspired by the f-divergence estimator presented by Noshad et al. (2017), and our BannMI. For simplicity, we refer to the estimators as NMI and BannMI, which is identical to their MI application, although KLD is estimated. Next, we first benchmark the estimators to derive an optimal parameter setting (Section 3.1.4). With the optimal estimator parameters at hand, we proceed to benchmark for MI in our main work. Furthermore, Section 4.2 provides detailed performance information about the MI benchmark with Gaussians.

Table 1. For each estimator, all combinations of parameters were tested. For the non-empirical hyperparameter setting in BannMI, it holds that $\alpha = \beta$. For NMI in the $\epsilon$-scenario, "eps" is the machine epsilon and adapt $= \frac{1}{nd}$, where n and d are sample size and data dimensionality. The slight difference in k for both methods is due to empirical testing: we wanted to make sure that the minimum for each estimator, with all parameters frozen except k, is not at the edge of our testing range. For example, k = 5 does not minimize the mean squared error for BannMI, and neither does k = 50.

KLD benchmark on pairs of non-negative random variables
4.1.1. Data generating process (DGP): We simulate componentwise uncorrelated random variables (RVs) with dimension $d \in \{2, 5, 10\}$ from the Exponential, Lognormal, Rayleigh, Gamma and Chi-squared distributions with various parameter settings. The samples are grouped into all feasible tuples (drawing with replacement). For example, (Exp(1), Exp(2)) refers to a tuple of two exponentially distributed samples with parameters $\lambda_1 = 1$, $\lambda_2 = 2$, and so on. We then apply the KLD estimators with all possible combinations of parameter settings as shown in Table 1. The sample size for all scenarios is n = 1,000. A detailed evaluation for both KLD estimators is shown for d = 2 with number of neighbors k = 50; see Figure 3 and Tables 2 and 3.
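A condensed sketch of the tuple construction, with a reduced set of distributions and illustrative parameters:

```python
import numpy as np
from itertools import combinations_with_replacement

rng = np.random.default_rng(0)
n, d = 1_000, 2

# a few componentwise independent non-negative samplers (parameters are illustrative)
samplers = {
    "exp(1)": lambda: rng.exponential(1.0, size=(n, d)),
    "exp(2)": lambda: rng.exponential(0.5, size=(n, d)),  # rate 2 corresponds to scale 1/2
    "gamma(2)": lambda: rng.gamma(2.0, 1.0, size=(n, d)),
    "chi2(3)": lambda: rng.chisquare(3.0, size=(n, d)),
}

# all feasible tuples, drawing with replacement
for name_p, name_q in combinations_with_replacement(samplers, 2):
    x, y = samplers[name_p](), samplers[name_q]()
    # ... apply the KLD estimators (e.g. the nn_kld sketch above) to (x, y) here
```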

Main results:
All hyperparameter settings for BannMI approximately halve the mean squared errors compared to NMI with its respective parameter settings; see Tables 2 and 3. These tables also demonstrate that the 0-scenarios (two identical distributions) are comparable for both estimators. Further, other parameter scenarios produce comparable results.

Key performance indicators of MI benchmark in Gaussian setting
In the Gaussian DGP of the Benchmark part of our work, we tested the estimators (u)BannMI, KSG, WMI and NMI with optimized parameter settings derived from Section 3.1.4 on data of dimension $d \in \{2, 5, 10\}$. As for BannMI, the empirical parameter setting performed comparably to an uninformative prior with $\alpha = \beta = 0.1$; we included both scenarios. Results for the key performance indicators mean squared error (MSE), standard deviation and Pearson correlation are shown in Figure 4. The estimator WMI had to be excluded for the Pearson correlation due to its production of NaN values and numerical instabilities.

Table 2. Average mean squared errors derived from BannMI estimates compared to estimates from numerical integration. Due to the componentwise independence of the distributions applied, the numerical integration could be reduced to a 1D scenario with negligible standard errors. The quantity to estimate is the KLD between sampled tuples of pairwise different non-negative distributions (diverse) and sampled tuples of one non-negative distribution (identical). The average mean squared errors were computed with respect to all tuples, whereas each tuple was drawn 25 times. Here, the column for the parameter 0.2 is not shown for brevity. The further parameters of this computation were as follows: sample size: 1,000; sample dimensionality: 2; nearest neighbors: 50. Other parameter settings produce comparable results (not shown).

(α, β)       0.1         0.5         1           emp.
diverse      3.873511    4.724244    5.127471    5.144825
identical    0.001694    0.001743    0.001589    0.001682

Random forests application
We selected time point t = 9, as most apoptotic cells are found in that joint sample of the control cell lines. We defined apoptosis as a cleaved Caspase-3 level above 5 and used 40% of the data for training. To avoid overfitting, we restricted the tree depth to four. For each 4D EfA combination, the following four random forests were trained (a code sketch follows below): 1. default parameter setting; 2. entropy as split quality measure; 3. entropy as split quality measure and balanced class weights; 4. entropy as split quality measure and balanced subsamples. The best and second best performance is achieved for the feature combination of phosphorylated ERK, p38, AMPK and STAT1 with AUC as benchmark criterion; see the confusion matrix in Table 4. Here, we want to draw attention to the sensitivity of our second best model, which is > 0.999. Thus, these particular combinations of protein phosphorylation in healthy cells are found exclusively in their apoptotic subset. A multivariate logistic regression for all 4D EfA combinations performed poorly (data not shown). After ranking the models with respect to AUC, we computed the average feature importance per EfA of the 25 top performing models; see Figure 5.
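A sketch of the four settings using scikit-learn; the feature matrix and labels below are toy stand-ins for one 4D EfA combination and the cleaved Caspase-3 based apoptosis label:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X_efa = rng.random((2000, 4))               # toy stand-in: one 4D EfA combination
y = (rng.random(2000) < 0.1).astype(int)    # toy stand-in: cleaved Caspase-3 level > 5

X_train, X_test, y_train, y_test = train_test_split(
    X_efa, y, train_size=0.4, stratify=y, random_state=0)

settings = [
    {},                                                              # 1. default parameters
    {"criterion": "entropy"},                                        # 2. entropy split criterion
    {"criterion": "entropy", "class_weight": "balanced"},            # 3. + balanced class weights
    {"criterion": "entropy", "class_weight": "balanced_subsample"},  # 4. + balanced subsamples
]
for kwargs in settings:
    clf = RandomForestClassifier(max_depth=4, random_state=0, **kwargs)  # depth 4 vs. overfitting
    clf.fit(X_train, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
```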

KLD for differential expression
Differential expression (DE) is relevant in single cell analysis, with yet no clear state-of-the-art method for its quantification (Hie et al., 2020). However, the KLD is a measure of dissimilarity between distributions and therefore well suited to analyze multivariate DE between phosphoproteomic samples. Next, although not in the context of a testing procedure (with a p-value as result), we use BannMI as a KLD estimator to quantify DE between the control phenotypes in a 4D setting. To do so, we computed the KLD between the apoptotic cells and the proliferating cells, as well as between the apoptotic cells and the resting cells. Again, we ranked results with respect to KLD and selected the 25 highest scores to count the individual EfAs in the occurring quadruples. Results are shown in Figure 6 (left). Intriguingly, the four most frequently occurring EfAs in the "apoptotic vs. resting" scenario, which are AMPK, p53, STAT1 and ERK, are identical to the phosphoproteins with the highest feature importance as shown in Figure 5.
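A sketch of this ranking procedure, reusing the nn_kld sketch from above as a stand-in for BannMI's KLD mode, with toy data in place of the phenotype samples:

```python
import numpy as np
from collections import Counter
from itertools import combinations

rng = np.random.default_rng(0)
efa_names = ["AMPK", "p53", "STAT1", "ERK", "p38", "GSK3b"]  # illustrative subset of EfAs
apoptotic = rng.random((300, len(efa_names)))                # toy stand-ins for the
resting = rng.random((300, len(efa_names)))                  # two phenotype samples

# 4D KLD per EfA quadruple, using nn_kld (see the KLD estimation sketch above)
scores = {
    quad: nn_kld(apoptotic[:, list(quad)], resting[:, list(quad)], k=50)
    for quad in combinations(range(len(efa_names)), 4)
}

# rank by KLD, keep the 25 highest scores and count individual EfA occurrences
top25 = sorted(scores, key=scores.get, reverse=True)[:25]
efa_counts = Counter(efa_names[i] for quad in top25 for i in quad)
```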
In our final analysis, we were interested in the 4D DE when comparing the cancerous to the control cells per phenotype; see Figure 6 (right). To do so, our EfA occurrence ranking procedure was applied again. Besides a strong DE of the metabolic pathway (AMPK) and of the cell stress pathway (p53), our KLD estimates show a robust difference of GSK3β phosphorylation in cancer when compared to the control. As stated in our main work, the protein plays a role in DNA repair.
Here, GSK3β expression refers to its phosphorylation at Ser9, a phosphorylation site that inactivates the kinase. Compared to control cells, GSK3β Ser9 expression in cancer cells is approximately twice as high (data not shown). However, it is conceivable that overall GSK3β expression is elevated in cancer, due to a high DNA damage frequency.

Analysis of phenotypic differential expression via symmetric KLD

Fig. 2. Graphical intuition for n-to-1 communication measured with help of BannMI, as opposed to standard graph-based 1-to-1 communication.

Fig. 3. Performance example with sample size n = 1,000 and number of neighbors considered k = 50. Per distributional tuple, the mean squared error between 25 estimates and the numerical result (nquad) was derived. The x-axis depicts KLDs based on non-identical distributions; results are sorted with respect to increasing KLD value, which means that they were sorted based on the numerical result of the tuple. Other sample sizes and other values for k produce comparable results.

Fig. 5. The phosphorylated proteins p53, AMPK, ERK and STAT1 have the largest effect on the prediction outcome in a 4D setting.

Fig. 6. EfA occurrence DE rankings with respect to cell phenotypes. Left: DE between apoptotic and proliferating/resting control cells. Right: DE between control and cancer. To do so, 4D DE between control and every cancer subtype was derived and an EfA occurrence ranking was performed. Then, the five EfA occurrence rankings were averaged for each phenotype.

Table 3. Average mean squared errors derived from NMI estimates compared to estimates from numerical integration. For further information, see the description of Table 2.

ϵ            eps         adapt       10⁻⁶        0.1
diverse      9.467923    9.158444    10.029276   9.303394
identical    0.001760    0.001662    0.001836    0.001668

For each setting with the 62 covariance matrices, we computed 25 estimates and derived the mean. Here, the MSE of each cell is with respect to the 62 mean estimates per dimension and neighbor index. In the standard deviations benchmark, each cell is the mean of the 62 standard deviations that were derived from the 25 repeated experiments. In the Pearson correlation table, a cell represents the correlation between the 62 true values and mean estimates.