Discriminative pattern discovery for the characterization of different network populations

Abstract Motivation An interesting problem is to study how gene co-expression varies in two different populations, associated with healthy and unhealthy individuals, respectively. To this aim, two important aspects should be taken into account: (i) in some cases, pairs/groups of genes show collaborative attitudes, emerging in the study of disorders and diseases; (ii) information coming from each single individual may be crucial to capture specific details, at the basis of complex cellular mechanisms; therefore, it is important avoiding to miss potentially powerful information, associated with the single samples. Results Here, a novel approach is proposed, such that two different input populations are considered, and represented by two datasets of edge-labeled graphs. Each graph is associated to an individual, and the edge label is the co-expression value between the two genes associated to the nodes. Discriminative patterns among graphs belonging to different sample sets are searched for, based on a statistical notion of ‘relevance’ able to take into account important local similarities, and also collaborative effects, involving the co-expression among multiple genes. Four different gene expression datasets have been analyzed by the proposed approach, each associated to a different disease. An extensive set of experiments show that the extracted patterns significantly characterize important differences between healthy and unhealthy samples, both in the cooperation and in the biological functionality of the involved genes/proteins. Moreover, the provided analysis confirms some results already presented in the literature on genes with a central role for the considered diseases, still allowing to identify novel and useful insights on this aspect. Availability and implementation The algorithm has been implemented using the Java programming language. The data underlying this article and the code are available at https://github.com/CriSe92/DiscriminativeSubgraphDiscovery.


Technical details of the approach
The proposed approach is based on two key notions, i.e., Strength and Relevance. In this section we provide first some technical insights aimed at clarifying how such measures have been defined and computed. Then, we show the listing corresponding to the core function of the proposed approach.

SR-Network weights computation
Let DS be a population, associated to a dataset as described in section 2 of the main manuscript, t be an individual in DS and a be a gene expressed by t. Each value t(a), which represents the expression level of gene a in t, can be normalized with respect to the mean and the standard deviation of the other values for a in the same population, according to the z-score notion, that is:t = t(a)−µ σ .
In the following, we deal with such normalized values and discuss in details how the values of Strength and Relevance have been computed for a given pair of genes a i and a j .
It is worth pointing out also that, the generation of WIGA-networks from gene expression data proposed here, depends on two thresholds, namely τ s , for the strength, and τ r , for the relevance. The strength is a measure of correlation, therefore we have considered the standard approach in statistical analysis according to which values of correlation up to 0.7 are significant, and we have set τ s = 0.7 through our experiments. As for the relevance, we note that it is a probability measure, so higher τ r more probable the detected correlation. For the experiments discussed here, we have fixed τ r = 0.9.
Strength computation Lett i andt j be the z-score values of t(a i ) and t(a j ), and X t i , X t j be the random variables associated witht i andt j , respectively. Consider the bivariate normal distribution, which is usually assumed for gene-expression data, with components X t i and X t j , mean vector µ and covariance matrix Σ, where: The bivariate normal distribution can be written as: f (x, y, ρ titj ) = 1 2π 1 − ρ titj 2 e The strength between a i and a j for t is the valueρ titj of ρ titj such that the value of f in the point (t i ,t j , ρ titj ) is maximum; in formula:ρ titj = arg max ρt i t j f (t i ,t j , ρ titj ).
To evaluate the maximum of f we move to the logarithm as the logarithmic function is a continuous monotone increasing one: log(f (x, y, ρ)) = = log 1 2π 1 − ρ 2 + − 1 2(1 − ρ 2 ) (x 2 + y 2 − 2ρxy) = = − log(2π) − 1 2 log 1 − ρ 2 − 1 2 x 2 + y 2 − 2ρxy 1 − ρ 2 Moreover, as adding a constant to a function does not alter the argument of the maximum, we add log(2π) and consider the maximum of g(x, y, ρ) = − 1 2 log 1 − ρ 2 − 1 2 Since we aim at finding the maximum of g as a function of ρ, we consider the partial derivative of g w.r.t. ρ: and look for the stationary points by calculating the values of ρ where ∂g ∂ρ = 0.
Therefore, we get a cubic equation. By setting: we obtain that the solutions of (1) depend on the sign of ∆. Particularly, two possible cases may occur: ∆ 0: We have just one real solution that, as can be easily verified, corresponds to a maximum: We should compute the square root of a negative number. This task has a solution in the set of complex numbers. Let define: Note that z 1 , z 2 ∈ C and z 2 = z 1 . It follows that the solution of (1) can be written as: ρ = xy 3 + 3 √ z 1 + 3 √ z 2 . As there are three complex roots, there are three values for rho that are solution of (1): with k = 0, 1, 2. It follows that there are three solutions in R: However, not all the solutions we have found are valid stationary points for g, because they are not in the function's domain. Therefore, among the valid values of ρ (i.e. −1 ≤ ρ ≤ 1), we have to choose only the one that maximize g.

Relevance computation
Consider again the normalized expression valuest i andt j and let ρ * be the strength we can associated to such values. We are interested in evaluating the probability of observing a strength smaller than the observed one. We refer to such a probability as P − i and P − j , respectively. Then, the relevance between a i and a j for t is: , where P − i and P − j can be rewritten as: In order to evaluate the relevance, we can compute the probability P + i (P + j , respectively) of observing a value oft i (t j , respectively) such that the strength of a i and a j for t is greater than ρ * , by keepingt i (t j , respectively) fixed.
Consider Equation (1) again. By solving it with respect to x (y, respectively) and keeping ρ and y (x, respectively) fixed, we can determinate two points x , x such that the strength of a i and a j for t is greater that ρ * for any x ≤t i ≤ x . Particularly, given values ρ 0 and y 0 , with 0 < ρ 0 ≤ 1, we aim at finding the values of x such that the value of ρ solution of Equation (1) is larger than ρ 0 ; the same line of reasoning can be followed to find a value for y such that the value of ρ solution of Equation (1) is larger than ρ 0 , keeping x 0 fixed. More formally, the following theorem holds: Theorem 1 Let ρ 0 , x 0 and y 0 with 0 < ρ 0 ≤ 1 be such that Equation (1) holds with this input. Let, also, x and x be the solutions of Equation (1), solved w.r.t. x, by setting ρ = ρ 0 and y = y 0 . For any x ≤ x ≤ x , the value of ρ such that Equation (1) holds is greater than ρ 0 .
Proof First of all note that Equation (1) is a quadratic equation w.r.t. to x then it cannot admit more than two solutions. Since ρ 0 is the solution of the equation when y = y 0 , Equation (1) admits for sure real solutions x and x .
In order to prove the theorem, we prove that ψ is concave for any x between x and x and, then, the value of ρ is greater than ρ 0 .
Consider the first derivative of ψ w.r.t. x. Since ψ is implicitly defined, according to Dini's Theorem, .
x we obtain solutions By coupling Equation (4) with Equations (5), we obtain that x is smaller than x, x is larger than x and the function increases before x and decreases after x. Thus, x is a maximum and all the values of x in [x , x ] are such that ψ(x) > ρ 0 , qde. Let call t i and t i the points obtained by solving Equation (1) when dealing with the expression level of sample t on gene a i , and t j and t j the points we get for sample t with respect to its expression on gene a j ; then, the probabilities P + i and P + j can be rewritten as: where Φ(·) denotes the cumulative distribution function of the standard normal distribution.

Upper Bound
Let N be a set of WIGA-networks partitioned into two sub-set N 1 and N 2 and let s(P, N 1 ) be the incidence of a pattern P in N 1 . The upper bound on the discriminative power that can be obtained by extending P can be computed assuming that the incidence of the patterns obtained from P remains unchanged in N 1 and becomes zero in the other set N 2 . Therefore, you need to evaluate the discriminative power pow(P) choosing s(P, N 2 ) = 0. In this scenario, the entropy terms which appears in the information entropy formula of def. 10 in the main paper become: Thus, H(N P ) is unchanged while H(N P ) becomes 0 as q1 = 1.
As already pointed out in the main paper, the definition of information entropy is not symmetric; indeed if you are intended to detect patterns characterizing N 2 but not N 1 the upper bound can be computed reasoning in the same way but and evaluating pow(P) when s(P, N 1 ) = 0.

Function PatternMine
Function PatternMine(P) Input: Current analysed pattern P Output: Set of discriminative patterns res obtained from P res ← ∅; Lp ← rankExtension(P); foreach P in Lp do if isDiscriminative(P ) then res ← res ∪ P ; if isExtensible(P ) then return res ∪ PatternMine(P ); else break; return res;

Results on Synthetic Data
The proposed approach has been evaluated in a simulation scenario built on synthetic data. In particular, a pair of datasets have been randomly generated, each containing 256 samples described by 25 attributes (i.e., the gene expression values simulated for each sample, respectively), for 10 times. Then, two groups of attributes, one of size 2 and the other of size 3, have been chosen to simulate high co-expression between genes in each group in one of the population against the other one, and check the ability of our method to detect possible discriminative patterns, injected this way, and correctly involving genes in the two modified groups. In more detail, to simulate high co-expression between genes in the two groups, their corresponding entries in one population have been substituted by data coming from a multivariate normal distribution with a number of dimensions equal to the size of the associated group. By properly choosing the covariance matrix associated to the normal distribution, we are able to ensure that the selected genes result to be correlated each other. Such a correlation will lead to a strong association, in terms of both strength and relevance, of the selected genes in the corresponding WIGA networks associated with each sample. WIGA networks have been constructed for each of the 10 runs, with each node associated to one out of the 25 attributes, and the proposed approach applied accordingly. As for the values injected to simulate high co-expression, the first group of attributes correspond to nodes labelled by 0 and 1 in the networks, while the second group to nodes 2, 3 and 4, respectively.
The approach has returned from 17 to 20 discriminative patterns, for the 10 different runs. In all runs, 4 patterns involving the two groups of nodes with injected values have been returned: a linear pattern involving the two nodes 0 and 1, and other three linear patterns involving 2, 3 and 4, respectively. We call injected patterns such 4 patterns, for short. It is worth pointing out that nonlinear patterns involving nodes 2, 3 and 4 have been discarded, since their discriminative power is lower than the linear returned ones. Table 1 summarizes the structures of the best scoring discriminative patterns detected over the 10 runs, together with the mean of their discriminative power, and their position spanning within the rank based on the discriminative power of all obtained patterns. As an example, the position of pattern 0 − 1 in the rank spans from 1 to 4 over the 10 runs. The 4 injected patterns are always the best scoring ones, and they alternate each other across the 4 first positions of the rank in the different runs. Therefore, this first experiment has been successful in showing the effectiveness of our approach.
Pattern Structure Mean ± Standard Deviation Position spanning 0 -1 0.0571 ± 0.0199 1/4 4 -2 -3 0.0646 ± 0.008 2 2 -3 -4 0.0598 ± 0.014 1/3 2 -4 -3 0.0686 ± 0.0101 2/3 Another experiment has been performed on synthetic data to test if our technique is robust to noise, as described below. Given a contamination level c (expressed as a percentage of the number of samples in the datasets), c samples of the two populations are randomly exchanged and the ability of the proposed approach in finding out the injected patterns is evaluated in the new induced scenario. Table 2 shows the results for different values of the contamination level c. Here, again, the mean of the discriminative power values obtained over the 10 runs is considered. Note that, although the mean of the discriminative power values is lower than the one detected without noise, the method is always capable of detecting the two-size injected pattern; even for the three-size patterns, at least one of them is detected despite the contamination level, showing that the method can be considered robust to noise.
To provide also a quantitative evaluation of the accuracy of the proposed approach, the AUC over the rank obtained by sorting patterns based on (the mean of) their discriminative power values is shown in Table 3, for each contamination level. Without noise, the injected patterns are always at the top positions in the rank, therefore in that case the AUC is always equal to 1. The mean of the AUC values, obtained over the 10 runs and for each percentage of noise, spans from 0.805 to 0.989, thus confirming the robustness of our method. Moreover, this latter test shows also that, possible false positives returned by the proposed approach, are not among those patterns scoring the highest values of discriminative power.

Results on Real Data
Microarray chips are designed to analyse a fixed number of probes, i.e. single helix DNA fragments, to quantify the expression of specific mRNA sequences complementary to them. The number of probes depends on the platform used for the experiments (see #Probes column in tab. 4). Data used for our experiments have been normalized using Robust Multi-array Average method Irizarry et al. (2003) implemented in Bioconductor using default settings. After that, probe ids have been mapped on genes names in order to remove those data not corresponding to any gene. Furthermore, as some genes can exist that are mapped on more than one probes, we have handled them by taking into account the median among the expression levels measured for the target gene by each probe. Table 4 summarizes the main characteristics of the datasets we use for ouu experiments. All dataset are accessible at NCBI GEO database Edgar et al. (2002) through the GEO Series accession number reported in the table.
In this section some more details are provided about the results we get by running our algorithm on the datasets presented above. Table 5 reports the main characteristics of the result-sets referred to each dataset.

Functional Enrichment Analysis
As an example, we report in Figure 1 one of the connected components for the global view of Pancreas Cancer (Healthy).

Shape of the Resulting Patterns
To go in depth with the structural characteristics of the patters detected by our method we consider, for each dataset, the top-12 patterns which score the highest values of commonness. In Table 6

Frequency of Occurrence Analysis
Patters detected by our method are often built based on a small subset of genes. Tables 7 -10 list, for each dataset, the top-10 most frequent single genes, pairs, triples and quadruples detected in the result-set .

Comparison Against a Standard Approach
Here the proposed approach is compared against a standard one for studying gene expression variation between two populations, that is, searching for genes differentially expressed in the two populations, without considering co-expression between genes. The main idea guiding this set of experiments is that, genes involved in co-expression patterns, are not necessarily expected to be found as relevant, if considered alone. Instead, part of them come out in the analysis only within discriminative patterns involving other significantly co-expressed genes, characterizing a population in contrast to the other one. Therefore, the proposed approach is useful to identify genes with important roles in the onset and progress of diseases, that otherwise would remain unseen. The comparison has been performed over all the datasets associated to the four considered diseases. In particular, for each gene, the expression value in all the available samples has been considered, and the t-test has been computed to compare its expression trend in the two populations. The p-value resulting from the test has been used to quantify the statistical significance of the differences in the expression trend over the two sub-populations, and genes have been sorted accordingly.
Then, for each dataset, genes involved in at least the 20% of the returned patterns have been considered, identifying their position within the p-value-based ranking. Such genes are listed in Tables 13-16, together with the scored p-value (second column), their position in the p-value ranking divided by the whole number of genes in the dataset (third column) and the value of log fold change (fourth column). Highest the value in the third column, lowest the gene position in the ranking. For three out of the four considered datasets, as expected, the statistical significance of many (although not all) considered genes is poor, according to the t-test, when they are considered alone. Interestingly, for the Prostate Cancer dataset, genes coding for ribosomal proteins and frequent in patterns characterizing the healthy population, come out with high significance also in the differentially expressed search. Also for the log fold change values, which divergence to the zero value represents to what extent the gene expression value differs between the two population, it is evident that the genes considered in the tables would not be considered as extremely significant. This emerges also in the Volcano plots represented in Figure 2, which combine a measure of statistical significance (the p-value of the t-test) with the magnitude of the change in expression level of the genes between the two populations involved in the analysis (fold change). In more detail, the negative logarithm of the p-value has been reported on the y-axis, such that data points with low p-values, i.e., the most significant ones, are shown at the top of the plot. On the other hand, those points that are both at the top of the plot and far to either the left-or right-hand sides, correspond to large magnitude fold changes as well as to high statistical significance.
The green and yellow squares on the plots denote the areas where the genes reported in Tables 13-16 are located, based on their p-value and log fold change. Interestingly, they are not in the positions that are worth of attention for a traditional differentially expressed analysis, thus confirming that it would be difficult to identify them through a standard analysis. GSE68907 GSE15471 GSE65801 GSE13355 Figure 2: Volcano Plots. The plots combine a measure of statistical significance (the p-value of the t-test) with the magnitude of the change in expression level of the genes involved in the analysis (fold change). The red dots refer to the genes that are more expressed in the healthy samples, while the blue ones refer to genes more expressed in unhealthy samples. The green and yellow squares delimitate the area on the plot where the genes which are more frequent in our discriminative patterns are located, for healthy and unhealthy samples, respectively.  Table 13: Genes frequent in the discriminative patterns for the considered dataset (first column), scored p-value resulting from the t-test (second column), position in the p-value rank divided by the whole number of genes in the dataset (third column).   Table 15: Genes frequent in the discriminative patterns for the considered dataset (first column), scored p-value resulting from the t-test (second column), position in the p-value rank divided by the whole number of genes in the dataset (third column).  Table 16: Genes frequent in the discriminative patterns for the considered dataset (first column), scored p-value resulting from the t-test (second column), position in the p-value rank divided by the whole number of genes in the dataset (third column).