Abstract

Motivation

Learning probabilistic graphs over mixed data is an important way to combine gene expression and clinical disease data. Leveraging the existing, yet imperfect, information in pathway databases for mixed graphical model (MGM) learning is an understudied problem with tremendous potential applications in systems medicine, the problems of which often involve high-dimensional data.

Results

We present a new method, piMGM, which can learn with accuracy the structure of probabilistic graphs over mixed data by appropriately incorporating priors from multiple experts with different degrees of reliability. We show that piMGM accurately scores the reliability of prior information from a given expert even at low sample sizes. The reliability scores can be used to determine active pathways in healthy and disease samples. We tested piMGM on both simulated and real data from TCGA, and we found that its performance is not affected by unreliable priors. We demonstrate the applicability of piMGM by successfully using prior information to identify pathway components that are important in breast cancer and improve cancer subtype classification.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

We undoubtedly live in the era of data deluge. Massive amounts of data from different social and scientific fields are collected daily and stored in databases. Recently, many research efforts are focused on developing computationally efficient data analysis methods that can mine and reveal ‘hidden’ patterns, trends and associations in such large datasets. Probabilistic Graphical Models (PGMs) offer an attractive solution to this problem since they can: i) discover and elegantly represent (as a graph) the conditional (in)dependences between variables in a dataset; ii) estimate the joint probability distribution of the data (Koller and Friedman, 2009); information that can be used to develop predictive models of any variable in the network.

Many algorithms have been proposed for learning the underlying structure of PGMs (Koller and Friedman, 2009; Spirtes et al., 2000). One of the most popular is graphical lasso (glasso) (Friedman, 2008), which can quickly learn the underlying graph structure. Although glasso can only handle datasets of continuous variables, recent extensions can learn PGMs over mixed data (hereby referenced as Mixed Graphical Models—MGM) (Lee and Hastie, 2015; Sedgewick et al., 2016; Tsagris, 2017). MGMs are gaining popularity in biomedical research due to their ability to reveal complex associations among multimodal variables that jointly influence the disease or biological mechanism that generates the mixed dataset. However, due to the fact that most biomedical datasets are high-dimensional (i.e. small number of samples, large number of variables), their accuracy depends on the selection of regularization (sparsity) parameters, which control the sparsity of the graph (in terms of the number of edges) (Liu, 2010). Incorporating prior information during structure learning can significantly improve accuracy by biasing the computed structure on known biological associations.

During the last few years several methods have been proposed to incorporate prior information in graph structures learned over continuous variables. (Wang et al., 2013) proposed a modified glasso algorithm, named prior information dependent lasso (plasso), which aims to incorporate prior knowledge about the presence of the graph’s edges. To achieve this, the authors use two sparsity parameters: one for the edges for which no prior exists and one for those with prior information. To select the values of these parameters, the authors proposed two modified versions of the Bayesian Information Criterion. Both BIC versions introduce extra parameters (i.e. minimum average node degree and proportion of the true edges in the given prior information) where their values must be ‘guessed’ by the user. However, in real world applications, it is impossible for the user to know the ‘true’ values of these parameters, a fact that reduces the algorithm’s accuracy. Other modified versions of the glasso which can incorporate prior information into structure learning have been proposed by (Li and Jackson, 2015) and (Zuo et al., 2017). Their prior incorporation methods are based on the estimation of a confidence p × p matrix (where p is the number of variables) using prior information. The elements of this matrix take their values in [0-1], where a 1(0) indicates that there is (not) a connection between the corresponding pair of variables. To select the value of the glasso’s sparsity parameter, they propose the BIC and Cross-Validation (CV) methods. However, (Liu, 2010) in their seminal work have shown that although BIC and CV perform well in low dimensional data (i.e. small number of variables) they tend to perform poorly in high dimensional data.

We note that all these methods can incorporate prior information into graph structure learning over datasets with continuous only variables. In this paper we propose a new method, piMGM (prior incorporation Mixed Graphical Model), which incorporates prior information over mixed datasets, addresses the limitations of the aforementioned methods, and introduces the following unique aspects: (i) Incorporates prior information from multiple sources that may have different degrees of reliability. (ii) Introduces a novel probabilistic scheme to evaluate the reliability of each prior information source. (iii) Introduces a weighted scheme that fuses the multi-source information and represents it in a probabilistic way, which is used by the proposed prior information incorporation method. (iv) Introduces a novel score function for the selection of the regularization parameter values, which not only favors the incorporation of the most reliable prior information, but it also produces stable graphs (i.e. graphs insensitive to data variations). To the best of our knowledge this is the first method that attempts to incorporate prior information to graph structure learning over mixed data.

2 Materials and methods

2.1 Preliminaries

Suppose that we have a dataset S of size n × p where n is the number of samples and p is the number of random variables. Using this dataset, the objective of a graph structure learning algorithm is to find a graph G=(V,E) that best represents the conditional dependences between the random variables. In graphical models, the nodes V of the graph, have a 1-1 correspondence to the random variables, and the presence of an edge εkE indicates the conditional dependence relation between the kth pair of random variables. Throughout this paper, we assume that there is a fixed ordering over node pairs and we denote the set of corresponding edges as E={ε1,ε2,,εK} where K=|V|(|V|1)2 (number of edges in a fully connected graph with no self-loops). We assume that εk=1(0) if an edge is present (absent) between the kth kthpair of nodes in the graph.

We denote the sources of prior information as T={t1,t2,,tR}. The prior information of a source tr where r={1,2,,R} is given as a vector Mtr of size K, where each of its elements mktr, describes the probability of the corresponding edge (εk) to appear in the graph structure. For the edges with no available prior information about their presence, we assign a Null value to the corresponding vector elements. In general, each source trT may provide information for a different subset of edges. wptr (with prior) and nptr (no prior) denote the set of edges for which the source trdoes or does not provide prior information, respectively. It holds that forumla. Finally, we use wp (np) to denote the set of edges for which we have (no) prior information.

2.2 Learning the structure of graphical models over mixed variables

In Sedgewick et al. (2016), the authors proposed a novel algorithm named CausalMGM which accurately learns the structure of graphical models over mixed variables (continuous and discrete), and it is an improvement over the work of Lee and Hastie (2015). CausalMGM’s novelties are: a) It utilizes edge type-specific regularization parameters {λcc,λcd,λdd} for the model [see Equation (1)] for continuous-continuous, continuous-discrete and discrete-discrete edges, respectively, that control the sparsity (in terms of the number of edges) of the estimated graph structure. b) It proposes a computationally efficient subsampling method (StEPS) to select the value of these parameters which give the most stable graph structure, as high stability graphs should be closer to the true graphs.

Assume that we have p Gaussian variables x, and q categorical y. Equations (1)–(4) (Sedgewick et al., 2016) summarizes the MGM algorithm. The subscripts {1, 2, F} in Equation (1) denote the l1, Euclidean and Frobenius norms respectively.βωϕ represents the interaction term between the continuous variables xω and xϕ; ρωυ represent the interaction term between continuous variable xω and discrete variable yυ; ϒζυ represents the interaction term between discrete variables yζ and yυ. Θ is shorthand for all the model parameters (i.e. {βωϕ,αω,ρωυ,ϒζυ}). (Sedgewick et al., 2016) describes the MGM algorithm in detail.

Parameter estimation in the MGM model is done via:
argminΘ l~(Θ)+ λccϕ<ω|βωϕ|1+λcdω,υρωυ2+λddζ<υϒζυF 
(1)
where l~(Θ) is the negative log pseudo-likelihood:
l~(Θ)=ω=1plogp(xω|xω,y;Θ)ζ=1qlogp(yζ|x,yζ;Θ)
(2)
and p(x,y;Θ) is a pairwise Markov random field.
p(x,y;Θ)exp(ω=1pϕ=1p12βωϕxωxϕ+ω=1pαωxω+ω=1pυ=1qρωυ(yυ)xω+υ=1qζ=1qϒζυ(yζ,yυ))
(3)

2.3 Estimating prior information dependent MGM structures

In this section we present our new method, the prior information dependent MGM (piMGM), which appropriately combines the information from observational data with information provided by multiple sources (e.g. pathways, domain experts, etc.) with different confidence.

piMGM extends the framework of the original MGM [see Equation (1)] by replacing each of the norms in Equation (1) with two norms [see Equation (4)]: λxxwp and λxxnp where xx {cc,cd,dd} and wp and np designate the edge sets with prior or no prior information, respectively. This new scheme provides greater flexibility to the model, since it independently controls the sparsity of the corresponding types of graph’s edges (cc, cd and dd) for which we do or do not have prior information.
(4)

To estimate the values of the regularization parameters λxxwp and λxxnp, we proposed the following procedure: We randomly draw Q subsamples {S1,S2,,SQ} of size b (1bn ) from the dataset S (Fig. 1). Unlike bootstrapping, each subsample is drawn without replacement. Since there is a total of (nb) different subsamples, Q should satisfy Q(nb). For each subsample Sq(q{1,2,,Q}) we run the CausalMGM algorithm for a range of regularization parameter values Λ={λ1,λ2,,λJ}. In order to reduce the time complexity of the method, we assume independence between the three different edge types (cc, cd, dd) as in Sedgewick et al. (2016), and so for each run we use the same value for all three regularization parameters λcc=λcd=λdd. These runs generate Gi graph structures, where i{1,2,,Q·J} (Fig. 1). It is worth mentioning that CausalMGM runs Q·J times independently, and therefore it can be easily parallelized for better efficiency.

Fig. 1.

The Q·J graphical models. For each regularization parameter value λ, we estimate Q graphical models’ structures

2.4 Data-driven estimation of the probabilities of presence of the graph edges

Using the graph structures that have been estimated using the regularization parameter values λjΛ (see Fig. 1), we calculate the probability of presence of an edge εkE as:
Pkp=NkQ·J
(5)
where Nk denotes the number of times εk appears in the Q·J graphs.

2.4.1 Selecting regularization parameters λxxnp for the edges with no prior

For each regularization parameter value λjΛ we have estimated Q graphs (Fig. 1). Using the structures of the Q graphs, for each edge εkE, we count the number of times, zkλjQ, that εk appears in the Q graphs. We assume zkλj follows the Binomial distribution B(Q,Pkp), where Pkp is the probability for edge εk to appear in a graph structure, and we calculate the probability θkλj for edge εk to appear exactly zkλjtimes in the Q graphs.
θkλj= (Qzkλj)·(Pkp)zkλj·(1Pkp)Qzkλj
(6)
The higher the θkλj, the better the regularization parameter λj ‘explains’ the presence of the edge εk in the Q graphs. Next, for each edge εknp, we calculate the relative frequency of its presence in the Q graphs as follows:
fkλj=zkλjQ
(7)
Using the fkλj we then calculate,
gkλj=4·fkλj·(1fkλj)
where
0gkλj1
(8)

The algebraic expression in Equation (8) allows us to answer the following question for each edge εknp: ‘how often do theQgraphs disagree on the presence ofεk?’ Therefore, gkλj can be considered as a measure of instability for an edge εk in the Q graphs (Liu, 2010).

After applying this procedure for all edges εknp, we calculate for each λjΛ the following score:
scorenpλj= k=1|np|θkλj·(1gkλj)
(9)

We select as λxxnp the λjΛ that maximizes the score function [Equation (9)]. This selection is justified if we observe that the value of the score function increases with an increase in the probabilities θkλj, and decreases with an increase in the instabilities gkλj. Therefore, the λ that maximizes the score function defines a stable graph structure that ‘best explains’ the presence of the edges for which no prior information exists.

2.4.2 Selecting regularization parameters λxxwp for the edges with priors

In this section we present the selection procedure of the regularization parameters λxxwp, which control the incorporation of the prior information to the graphical model. First, we present the simplest case, where the prior information is provided by a single source (t1), and then we continue with the general case, where multiple sources T={t1,t2,,tR}, provide information about different subsets of edges {wpt1,wpt2,,wptR} when each source may have a different degree of reliability for the given dataset. For example, if we consider the KEGG pathways as different sources of prior information, then not all pathways are expected to be active (relevant) in a given gene expression dataset. Thus, the information included in the ‘irrelevant’ pathways will be ‘unreliable’ for this dataset.

Quantifying reliability of the prior information source(s): The probability mktr (see Section 2.1) can be used to answer the question ‘how many times edgeεkis expected to appear inQgraphs?’ In our case, for each λjΛ we have Q graphs and therefore the answer is:
ψktr=mktr·Q 
(10)
The estimated probability of presence Pkp [see equation (5)] of an edge εkwptr, allows us to answer the same question. As discussed in Section 2.4.1, we assume that the number of times where an edge εkwptr will appear across the Q graphs is described by a Binomial distribution B(Q, Pkp). It has been proven that Normal distributions are very good approximations to Binomial distributions (Papoulis and Pillai, 2002). Therefore, to make our calculations computationally tractable, we approximate the corresponding Binomial distributions with normal distributions Dk=N(μkd,Varkd) with parameters:
μkd=Pkp·QVarkd=Pkp·(1Pkp)·Q
(11)
Using ψktr and μkd, we calculate:
τtr=k=1|wptr||ψktrμkd||wptr|
(12)

Equation (12) measures the confidence we have on the source tr. More specifically, the larger the τtr the less ‘confident’ we are about source tr. For the derivation of this formula, we assumed that the estimations ψktr of the source tr, should be similar to the estimations μkd, which are derived from the data (i.e. the Q·J graphs, Fig. 1).

However, prior information datasets may inform about different numbers of edges, which can affect the value of τ. For example, having prior information about the presence of only three edges may match the edges learned by a graphical model just by chance, even if this prior information is not well represented in the data (i.e. unreliable prior). To mitigate this, we learn an empirical null distribution similarly to GSEA (Subramanian et al., 2005). Given a prior source (e.g. a pathway) tr with |wptr| edges, we randomly permute the labels of the nodes in these edges (randomly select two other nodes to be in the edge) N=10000 times to produce a distribution of random pathways of equal size. For each pathway r in this distribution, we compute τtr exactly as specified in Equation (12). Then, the empirical P-value of the prior tr is the percentage of values in the null distribution greater than τtr. To accurately measure τtr independently of the size of the prior, we normalize τtr by dividing by the mean of the empirical null distribution. With this, P-value we can quantify whether we believe a prior source is well represented by the system under study.

Using Equations (10) and (12), we can model the prior information of a source tr, with a normal distribution Uktr=N(ψktr,(τtr)2), which describes the number of times where an edge εkwptr is expected to appear in Q graphs. Note that if more than one source of prior information is available (T={t1,t2,,tR}), then each source is modeled by a different normal distribution Ukti (i={1, 2,,R}).

Selecting  λxxwp  using prior information from a single source: If we have a single source of prior information (t1), then for each edge εkwp we estimate two normal distributions: Dk=N(μkd,Varkd), which is estimated from the data; and Ukt1=N(ψkt1,(τt1)2) which constitutes the t1 prior information. By applying Bayes theorem, it can be proved (Bromiley) that since the prior Ukt1 and the likelihood Dk are both Normal distributions, the posterior will also be Normal N(μk*,Vark*) with parameters given by the following closed form expressions:
μk*=μkd·(τt1)2+ψkt1·Varkd(τt1)2+Varkd, Vark*=(τt1)2·Varkd(τt1)2+Varkd
(13)

The posterior distribution N(μk*,Vark*) is derived from the fusion of the information provided by source t1 (prior information) and the data, and describes our updated ‘belief’ about the probability of an edge εkwp to appear a specific number of times in the Q estimated graph structures.

For each regularization parameter value λjΛ we have estimated Q graphs (see Fig. 1). For each edge εkwp, we count the number of times zkλjQ that it appears in the Q graphs. From the corresponding posterior distributions, we can calculate for each edge εkwp, the probability θkλj, to appear exactly zkλj times in the Q graphs
θkλj=zkλjɛzkλj+ɛN(x|μk*,Vark*)dx
(14)
where ɛ is an arbitrarily small positive quantity.
Using Equations (7) and (8) we calculate for each edge in εkwp, the relative frequency fkλj of its presence in the Q graphs, and its corresponding instability gkλj. After applying this procedure for all edges εkwp we calculate for each λjΛ the following score:
scorewpλj= k=1|wp|θkλj·(1gkλj)
(15)

We select as λxxwp the value of λjΛ where the score function takes its maximum value. The justification of this selection is similar to the one provided in Section 2.4.1.

Selecting  λxxwp  using prior information from multiple sources: In this section we present the selection procedure of λxxwp when prior information is provided by multiple sources T={t1,t2,,tR}. Using Equations (10)–(12) we can estimate for each source trT the normal distributions Uktr that correspond to the edges εkwptr. For each edge εk we have |Tk||T| normal distributions that were extracted by the information of the sources TkT that contain prior information for edge εk.

Similar to the single source case, for each edge εkwp we estimate the following normal distributions: Dk=N(μkd,Varkd) from the data; and |Tk| normal distributions Uktr=N(ψktr,(τtr)2), which are estimated using the prior information of the sources trTk. To combine their information, we propose the following procedure:

First, we calculate the mixture of the |Tk| normal distributions [black curve in Fig. 2a]
Φk(x)= i=1|Tk|wti·N(ψktr,(τtr)2)
(16)
Fig. 2.

(a–c) The steps of the proposed information fusion method. (a) The prior information (normal distribution {Ukt1,Ukt2,Ukt3}) of the 3 sources ({t1,t2,t3}- blue, green and red curves) and their corresponding mixture Φk (black curve). (b) The normal distribution FkKL (black dashed curve) that has minimum Kullback-Leibler divergence from the mixture Φk(black continuous curve). (c) The posterior distribution which consists of the information fusion of the probability distribution Dk and the FkKL(see text for details)

To calculate the mixture weights wti, we apply the following expressions:
wti=atii=1|Tk|ati where ati=i=1|Tk|τtiτti
(17)
For the calculation of the weights (wti), we use the assumption that the normal distributions with the smaller variances (τti)2 should be trusted more. Note that i=1|Tk|wti=1. Next, by applying variational calculus, we approximate the Gaussian mixture Φk(x) using a single normal distribution FkKL=N(μkKL,VarkKL) which minimizes the Kullback-Leibler (KL) discrimination (maximizes the similarity) from the Gaussian mixture. The parameters of the normal distribution FkKL are calculated using the following analytical algebraic expressions (Runnalls, 2007):
μkKL=i=1|Tk|wti·ψkti
(18)
VarkKL=i=1|Tk|wti·((τti)2+(μkKLψkti)2)
(19)

After applying this procedure for all edges εkwp we have 2 normal distributions for each edge εk, a) the Dk=N(μkd,Varkd), which is estimated using the given dataset [pink dashed curve in Fig. 2c], and b) the distribution FkKL=N(μkKL,VarkKL) [black dashed curve in Fig. 2c]. Thus, for each edge εkwp, the parameters of the corresponding posterior distribution N(μk*,Vark*) can be calculated using the expressions in Equation (13). Next, using Equations (14) and (15) we select the corresponding regularization parameter λxxwp as described in the single source case.

2.4.3 Limiting the selection range of the regularization parameters

In this section we propose a procedure that limits the selection range Λ of the regularization parameters values, which help us to further improve the estimation of the graph structure. In many real world applications (e.g. biological, clinical, etc.) we expect the underlying graphical structures (e.g. gene networks) to be sparse (Wang et al., 2013). Our method exploits this by limiting the selection range (Λ*Λ) of regularization parameter values such as to avoid the selection of parameter values that result in very dense or very sparse graph structures.

For each regularization parameter value {λ1,λ2,,λJ} we calculate the average number of edges that appear across the Q graph structures, estimated using the Sq (q{1, 2,,Q}) subsamples (see Fig. 1). In Figure 3 each red circle corresponds to the average number of edges that appear across the Q graphs for a specific regularization parameter value. The curve formed by the red circles in Figure 3a indicates that as λ increases (x-axis) the number of edges (y-axis) in the graphs monotonically decreases. Note that for small (large) values of λ the estimated graphs are almost fully connected (empty). Below we propose to use a new procedure to limit the selection range of the regularization parameter values.

Fig. 3.

(a–c) The ‘knee-points’ and the estimated lines (blue and green) that minimize the corresponding sums of fit errors. (d) The new range of the regularization parameter values Λ*Λ determined by the ‘knee-points’ 2 and 3 on the x-axis (black line segments)

We traverse the monotonically decreasing curve one point at a time (‘current point’) and fit two lines. The first line [blue line in Fig. 3a] is fitted to the points that belong on the left side of the ‘current point’ and the second line [green line in Fig. 3a] is fitted to the points that belong on the right side of the ‘current point’. The ‘current point’ that corresponds to the minimum sum of the corresponding lines’ fit errors becomes a ‘knee-point’ [see ‘knee-point’ 1 in Fig. 3a]. We apply this procedure to the points that are located on the left and right side of ‘knee-point’ 1 [see Fig. 3b and c] and we determine the ‘knee-point’ 2 and 3 respectively. The projections of the ‘knee-points’ 2 and 3 on the x-axis [see Fig. 3d], determine the ‘new’ selection range (Λ*={λ1,λ2,,λJ*} Λ) of the regularization parameter values which will be used by our regularization parameter selection method (see Section 2.4).

3 Results

To demonstrate the value of piMGM we used simulated and real data. In the latter case, we also address two important biological problems: i) pathway scoring on expression datasets and ii) network structure inference for understanding of genomic drivers of disease subtype.

3.1 Description of data sources

Simulated data. Simulated datasets of varying sizes were generated using the Lee and Hastie simulation method from Tetrad VI (http://www.phil.cmu.edu/tetrad/). First, a Directed Acyclic Graph (DAG) was generated uniformly at random with number of edges equal to the number of nodes. Each node was randomly assigned to be a continuous or categorical variable with equal probability. This DAG was parameterized with random edge weights in the range:

[−1.5, −0.5], [0.5, 1.5]. Independent samples were generated from this parameterized graph to produce a final dataset, according to the model from Lee and Hastie (2015). To compare the estimated graph learned by piMGM, the DAG was converted to its equivalent ‘moralized’ undirected version, which maintains the independence relationships present in the original DAG (Cowell, 1999).

Prior sources were generated based on the ground truth DAG using two different methods depending on whether we wanted to evaluate learning of the full network or scoring the prior sources (also called ‘pathways’ here to highlight their relevance to biological pathways in gene expression data). Specifically, for the pathway experiments: we defined a ‘pathway’ i as a random selection of Ei edges, where Ei is in the range [10, 2N] and N is the number of edges in the data generating moralized graph. We then randomly selected Ti edges from the ground truth graph to include in the pathway, where. Ti is randomly selected from the range: [1, min(Ei, N)]. Thus, the reliability of pathway i is Ti/Ei, which measures what percent of a pathway’s edges are present in the data generating graph. Finally, we randomly selected Ei-Ti ‘false’ edges from all edges not in the data generating graph. These pathways were given as ‘hard priors’ which means that the prior information given by a pathway only contained the values null and 1 corresponding to the absence or presence of an edge in the pathway, respectively.

For the full network inference experiments, it was assumed that all prior sources provide information about the same edges, but with different reliability, to test the ability of piMGM to successfully synthesize prior information from multiple sources. We tested cases where the sources only provide information for the true edges in the ground truth graph and for edges uniformly at random. The number of edges provided, and the number of experts were experimentally controlled. The edges for which the experts give prior information were determined randomly, and the information itself was a ‘soft prior’ with a real numbered value ranging from (0.6, 1) for a reliable source, and from (0, 1) for an unreliable source; in contrast to the ‘hard priors’ from the previous pathway experiments.

Biological data. We used the TCGA-BRCA RNA-Seq expression dataset to demonstrate the applicability of piMGM. This data included gene expression measurements from 800 breast tumor samples and 95 matched normal samples. Prior information consisted of 33 Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways (Kanehisa and Goto, 2000), which were selected by the same criteria as a related pathway enrichment method (Ma et al., 2016), but excluding those with fewer than 10 expressed genes or fewer than 20 gene–gene interactions. The gene–gene interactions encoded in each pathway were used as a ‘hard prior’ with a value of 1 if the edge existed in the pathway, and no information otherwise. 645 genes were included in the final dataset, consisting of the union of the genes in the 33 pathways, excluding 2735 consistently lowly expressed genes or those without a variance of at least 0.5.

To evaluate the usefulness of full network inference for classification, breast cancer subtype information for each tumor sample was obtained from Jiang et al. (2016). This determination was used as a ground truth for classification experiments, and the fifty genes commonly used to compute this classification on microarray data, Prediction Analysis of Microarray 50 (PAM50) was used as a prior information source. In this case, the edges between each gene in the PAM50 list and the categorical variable denoting subtype was given a prior probability of 1, while all other edges had no prior information.

3.2 piMGM for pathway activation assessment

We first evaluate how well piMGM determines the state of activation of given pathways using both simulated and real data.

3.2.1 piMGM correctly estimates the reliability of pathways

Figure 4 presents the results of applying piMGM on 25 simulated datasets with 100 variables each, 15 pathways for prior information, and 200 samples. The figure demonstrates the strong inverse correlation between the predicted deviance [Equation (12)] of each simulated pathway from the ground truth and its reliability score that piMGM calculated. The major outliers from the trendline are those pathways that provide information about relatively few gene–gene interaction edges. This is because if a pathway has few interactions, they might match spurious correlations in the dataset by chance. Thus, it is prudent to consider both the reliability score of the pathway as well as its P-value to determine its reliability. On a graph by graph basis, the mean correlation between τ and the reliability score across all 25 datasets was 0.992 (±0.006), further confirming the accuracy of the piMGM model for each individual dataset.

Fig. 4.

Association between the true reliability of priors (experts) and the deviance (τ) of the data from the priors as calculated by piMGM

3.2.2 Identifying gene pathways associated with breast cancer subtypes

As a first application of piMGM on biological data, we used it to identify de-regulated pathways in breast cancer patients from TCGA. In this application, we used the pathways as ‘experts’ and we scored their ‘reliability’ on tumor and normal samples. ‘Reliable’ pathways are those that are active in the given sample dataset (low P-value; Section 2.4.2), and ‘unreliable’ are those that are not. We selected 33 pathways from KEGG for this test, as we describe in the Section 2.

piMGM found eight of the 33 pathways to be differentially regulated (FDR-corrected P-value < 0.1) between receptor positive (Luminal A and B subtypes) and receptor negative (HER2 and Triple-Negative) subtypes (Table 1). Five of the eight pathways were also found to be differentially expressed by a related method, NetGSA, so we further examined the three remaining pathways to understand what piMGM found mechanistically.

Table 1.

Differentially regulated pathways by receptor status (positive: Luminal A and B; negative: HER2, triple negative) in Breast Cancer (FDR < 0.1)

PathwayP-value (+)P-value (−)Reference
Glutathione metabolism0.5070.091Lien et al. (2016)
Glycolysis0.0000.129Schramm et al. (2010)
Neurotrophin signaling0.7020.074Patani et al. (2011)
Notch signaling0.0000.223Hossain et al. (2017)
Pentose phosphate0.0250.239Cha et al. (2017)
B Cell Receptor signaling0.1410.004Hill et al. (2011)
Insulin signaling0.0980.384(see text)
T cell receptor signaling0.5070.058(see text)
PathwayP-value (+)P-value (−)Reference
Glutathione metabolism0.5070.091Lien et al. (2016)
Glycolysis0.0000.129Schramm et al. (2010)
Neurotrophin signaling0.7020.074Patani et al. (2011)
Notch signaling0.0000.223Hossain et al. (2017)
Pentose phosphate0.0250.239Cha et al. (2017)
B Cell Receptor signaling0.1410.004Hill et al. (2011)
Insulin signaling0.0980.384(see text)
T cell receptor signaling0.5070.058(see text)

Note: Pathways in italics found by piMGM but not NetGSA. The significance of the bold values is an FDR < 0.1.

Table 1.

Differentially regulated pathways by receptor status (positive: Luminal A and B; negative: HER2, triple negative) in Breast Cancer (FDR < 0.1)

PathwayP-value (+)P-value (−)Reference
Glutathione metabolism0.5070.091Lien et al. (2016)
Glycolysis0.0000.129Schramm et al. (2010)
Neurotrophin signaling0.7020.074Patani et al. (2011)
Notch signaling0.0000.223Hossain et al. (2017)
Pentose phosphate0.0250.239Cha et al. (2017)
B Cell Receptor signaling0.1410.004Hill et al. (2011)
Insulin signaling0.0980.384(see text)
T cell receptor signaling0.5070.058(see text)
PathwayP-value (+)P-value (−)Reference
Glutathione metabolism0.5070.091Lien et al. (2016)
Glycolysis0.0000.129Schramm et al. (2010)
Neurotrophin signaling0.7020.074Patani et al. (2011)
Notch signaling0.0000.223Hossain et al. (2017)
Pentose phosphate0.0250.239Cha et al. (2017)
B Cell Receptor signaling0.1410.004Hill et al. (2011)
Insulin signaling0.0980.384(see text)
T cell receptor signaling0.5070.058(see text)

Note: Pathways in italics found by piMGM but not NetGSA. The significance of the bold values is an FDR < 0.1.

We found that both the T and B cell receptor signaling pathways shared a common subnetwork that was driving their identification by piMGM as significant. This network included genes AKT3, PIK3CA, PIK3CD and PIK3R1 (Fig. 5, left), all of which are genes of significance in cancer. Several studies have found changes in the regulation of AKT3 in receptor negative breast cancer (Chin et al., 2014; Nakatani et al., 1999) and one recent study has found changes in expression of PI3K/Akt across receptor subtypes (Cizkova et al., 2013), consistent with our identified subnetwork An interesting finding from this pathway is that the direct connection between PIK3CA and AKT3 is more present in receptor negative tumors. PIK3CA is an oncogene whose aberrant activation results in AKT3 activation which can lead to uncontrolled cell proliferation and tumorigenesis (Hernandez-Aya and Gonzalez-Angulo, 2011). It is possible that this subnetwork elucidates distinct mechanisms of AKT3 overactivation in different breast cancer subtypes. piMGM also identified a NFATC1 subnetwork of T cell receptor pathway as critical for breast cancer development (Fig. 5, right). NFAT1 is a nuclear factor that alters T-cell transcription in response to T-cell receptor stimulation, but NFAT1 mRNA is also found in breast tissue (Safran et al., 2010). A recent study found that the NFAT1 pathway was active in triple negative breast cancer but not in other subtypes, and that the pathway was critical for metastasis and tumorigenesis (Quang et al., 2015). Though KEGG labels these pathways as lymphocyte related pathways (T and B cells), the reason these pathways were found to be differentially regulated was due to the subcomponents critical to breast cancer progression in different subtypes.

Fig. 5.

Edge differences between receptor positive (luminal A and B) and receptor negative (HER2, triple negative) breast tumor samples for a common sub-network of the T cell and B cell receptor signaling pathways (left) and the T cell receptor signaling pathway (right). Red edges are those that are more stable in receptor positive breast cancer, while green refers to receptor negative breast cancer. AKT3 and NFATC1 regulation appear to be driving the change between these phenotypic groups

The third pathway not corroborated by NetGSA was the insulin signaling pathway. Studies have shown that estrogen receptor positive cells can have increased proliferation in the presence of estrogen due to upregulation of the insulin receptor (IRS1), which binds insulin growth factor (IGF) (Molloy et al., 2000). Several of the most divergent connections found by piMGM involve IRS1 in the insulin signaling pathway (Supplementary Material).

All eight differentially regulated pathways appear to have an established relationship to breast tumor subtype. piMGM not only identified them as significant, but it also identified which parts of them are the most significant in breast cancer subtyping (Fig. 5), thus pointing to the mechanisms that influence this subtyping. Follow up studies can further investigate the specific mechanisms implied by piMGM networks. We should also note that piMGM tends to identify fewer differentially regulated pathways than conventional analysis techniques, but this is largely because piMGM uses stronger conditions than typical pathway enrichment methods. piMGM uses independence changes between genes while conventional enrichment methods ignore network connectivity and focus on gene expression changes. Individual gene expression changes may be related to established pathways, but independence relationships query the precise network information stored in KEGG pathways to determine differential regulation.

3.3 piMGM for full network inference

Next, we present piMGM results on simulated and biological data for inference of the full network structure on datasets with mixed continuous and categorical variables. Again, the prior information sets (i.e. ‘experts’) can have different degrees of reliability.

3.3.1 piMGM learns accurate networks despite unreliable priors

To evaluate the ability of piMGM to incorporate prior information, we compared piMGM with several baseline approaches on simulated data we generated as described in Section 2. We used four baseline methods for comparison (Fig. 6, left columns, 0% prior). STARS (Liu, 2010) is a network stability approach that tunes a single regularization parameter for MGM without prior information. STEPS is an extension of this approach that uses three regularization parameters, one for each edge type (Continuous-Continuous, Continuous-Discrete, Discrete-Discrete). The Oracle graph is MGM run with the set of three regularization parameters that maximize accuracy, and Oracle One λ is an equivalent approach using only one regularization parameter for the whole network.

Fig. 6.

Accuracy of piMGM inferred networks vs. other approaches on simulated data with 100 variables, 500 samples and 5 prior sources. The datasets differ in the percent of edges with prior information (0, 10, 30, 60%) and the number of reliable priors (1, 3, 5 out of 5). The blue bars represent the F1 score accuracy of the baseline networks (no prior)

We compared the baseline results with piMGM runs with five prior information sources, varying both the percent of edges with prior information (10%, 30%, 60%) and the number of reliable experts among those 5 (reliable experts: 1, 3 or all 5). Figure 6 displays the amount of prior information (x axis) vs. the F1 score of the learned graph compared to the ground truth (moralized) graph. As expected, increase in the ratio of reliable to unreliable priors increases overall accuracy (F1 score) for all percentages of edges with prior information. We also see that in most cases the F1 score is not affected by unreliable priors even when 4 out of 5 experts are unreliable (Fig. 6, yellow bars; compared to STARS and STEPS methods with no prior). Interestingly, if prior information is available for 10% of edges, piMGM is equivalent to approaches with no priors, regardless of the reliability of the information sources. However, if 60% of edges have prior information, piMGM outperforms even the Oracle graph given that the prior information is somewhat reliable (P < 0.01). If the prior information is highly unreliable, with 30% prior, piMGM does not have degrading performance, and it maintains at least the quality that it would have had without any prior (P > 0.05). This is desirable in cases where prior information may or may not be well represented in the system under study. When ‘experts’ provide priors for edges that are not in the data generating graph the results are nearly identical (Supplementary Material).

3.3.2 Prior knowledge helps stabilize predictive models

Our final experiment evaluates the ability of piMGM to recover network structure from biomedical mixed data. Since ground truth for the whole network is difficult to come by in biological systems due to incomplete understanding of them, we evaluated our approach by determining how well the learned network can be used for a classification task. One of the advantages of graphical models is that you can use the variables that are connected to a target variable in the network (the Markov blanket, if the network is directed) to predict the target variable. Breast cancer consists of four subtypes with varying prognoses: triple negative, Luminal A, Luminal B and HER2. Using nested 10X cross validation, we tested the ability of our piMGM graphs to correctly classify the four subtypes. We compared classification accuracies on networks learned using four sets of priors: No prior information, five random sets of fifty genes each connected to the Subtype variable (Random Prior—unreliable), the PAM50 set of genes (PAM 50 Prior) and the same five random sets along with the PAM50 gene set (Both Priors). We ran piMGM with each of these sources as prior information and inferred a full network from data consisting of the 50 genes in PAM50 along with the genes used in Section 3.2.2. We then used the genes connected to the subtype variable as features in a multiclass logistic regression model.

We found that there is no significant difference in classification accuracy between each of the four models (Fig. 7, left columns). This means that piMGM is not affected by random (unreliable) priors even when they are the only source of information piMGM gets. On the other hand, incorporating appropriate prior information results into models that require significantly fewer features to achieve the same accuracy (Fig. 7, right columns). In addition, piMGM found the PAM50 gene set to be active in the BRCA dataset of tumor samples (P < 0.01) unlike the random gene sets (P > 0.1). piMGM does not have a reduction in classification performance when given a random prior source, further proving that this method is resilient to unreliable prior information sources on real data.

Fig. 7.

piMGM Network Inference to classify breast cancer subtypes with varying prior information. Performance is measured using classification accuracy and stability (y-axis left) and the number of features used in the model (y-axis right)

4 Discussion

We have presented a novel method to incorporate prior knowledge for learning mixed graphical models; and two applications of this method to genomic and clinical data. Our method, piMGM, which can incorporate multiple priors with varying degrees of reliability, consists of three steps, each of which presents a new algorithmic novelty. Step-1, we developed a novel reliability score [Equation (12)] that quantifies how well the prior information from a given expert fits the data. The reliability score has an interesting application: it can be used to assess whether a gene pathway is active in the data. Our analysis on TCGA data identified several previously known and novel pathways implicated in breast cancer. Step-2, we developed a novel method that uses the reliability score to weight and merge prior information from multiple experts into a single prior distribution for each edge. Step-3, we presented a new approach to incorporate prior information in the learning of probabilistic graphs over mixed data types. This method uses separate regularization parameters for edges with and without priors. The regularization parameters are also edge type specific to allow more flexibility and reduce false positive edges as we have shown in (Sedgewick et al., 2016). In simulated data we showed that piMGM is not affected by unreliable priors, and its performance increases with increasing amounts of prior information. When 60% of edges have informative (i.e. relatively reliable) priors, piMGM outperforms the Oracle graph without prior. The efficiency of piMGM in analyzing omics and clinical data was demonstrated in a classification example, where the variables surrounding the breast subtype variable in the learned graph were used in classification. We compared classification efficiencies of graphs learned with no priors, PAM50 genes, random priors and combination of PAM50 and random priors. We showed that the classification accuracy of piMGM is not affected by the inclusion of random priors (Fig. 7), but including PAM50 priors reduces the number of variables required for efficient classification. All parameters of piMGM can be ‘best’ selected using a subsampling procedure presented in Section 2.3, thus eliminating arbitrary choices. Furthermore, this procedure is parallelizable for increased efficiency.

To the best of our knowledge, this is the first method to incorporate unreliable prior information sources to infer a full network from mixed data and the first method to analyze pathway activation in single phenotype samples. We believe that piMGM is a valuable method that will help future analyses of datasets with mixed data types (continuous and discrete variables). In systems biology research, it will not only identify deregulated pathways, but it will also generate hypotheses about specific edges that are deregulated to guide intervention experiments.

Acknowledgements

We thank Prof. Peter Spirtes, Carnegie Mellon University, for critically reviewing a previous version of this manuscript. This work was supported by NIH through grants U01HL137159, R01LM012087 and T32CA082084).

Funding

This work was supported by the National Institutes of Health (NIH) under award numbers R01LM012087, U01HL137159 to Prof. Panayiotis V. Benos. Vineet K. Raghu was supported by a NIH training fellowship (T32CA082084). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

Conflict of Interest: none declared.

References

Bromiley
 
P
. (
2003
) Products and Convolutions of Gaussian Distributions Medical School. University of Manchester, Manchester, UK, Technical report , Vol. 3, p. 2003.

Cha
 
Y.J.
 et al.  . (
2017
)
Differential site-based expression of pentose phosphate pathway-related proteins among breast cancer metastases
.
Dis. Markers
,
2017
,
1
.

Chin
 
Y.R.
 et al.  . (
2014
)
Targeting Akt3 signaling in triple-negative breast cancer
.
Cancer Res.
,
74
,
964
973
.

Cizkova
 
M.
 et al.  . (
2013
)
PIK3R1 underexpression is an independent prognostic marker in breast cancer
.
BMC Cancer
,
13
,
545
.

Cowell
 
R.G.
(
1999
)
Probabilistic Networks and Expert Systems
.
Springer-Verlag
,
New York
.

Friedman
 
J.
 et al.  . (
2008
)
Sparse inverse covariance estimation with the graphical lasso
.
Biostatistics
,
9
,
432
441
.

Hernandez-Aya
 
L.F.
,
Gonzalez-Angulo
A.M.
(
2011
)
Targeting the phosphatidylinositol 3-kinase signaling pathway in breast cancer
.
Oncologist
,
16
,
404
414
.

Hill
 
L.
 et al.  . (
2011
)
Differential roles of estrogen receptors alpha and beta in control of B-cell maturation and selection
.
Mol. Med.
,
17
,
211
220
.

Hossain
 
F.
 et al.  . (
2017
)
Non-cannonical notch signaling pathways regulate breast cancer stem-like cells function in triple negative breast cancer
.
FASEB J.
,
31
,
671.676
671.676
.

Jiang
 
G.
 et al.  . (
2016
)
Comprehensive comparison of molecular portraits between cell lines and tumors in breast cancer
.
BMC Genomics
,
17
,
525
.

Kanehisa
 
M.
,
Goto
S.
(
2000
)
KEGG: kyoto encyclopedia of genes and genomes
.
Nucleic Acids Res.
,
28
,
27
30
.

Koller
 
D.
,
Friedman
N.
(
2009
)
Probabilistic Graphical Models: Principles and Techniques
.
MIT Press, Cambridge, Massachusetts, USA.

Lee
 
J.D.
,
Hastie
T.J.
(
2015
)
Learning the structure of mixed graphical models
.
J. Comput. Graph. Stat.
,
24
,
230
253
.

Li
 
Y.
,
Jackson
S.A.
(
2015
)
Gene network reconstruction by integration of prior biological knowledge
.
G3 (Bethesda)
,
5
,
1075
1079
.

Lien
 
E.C.
 et al.  . (
2016
)
Glutathione biosynthesis is a metabolic vulnerability in PI(3)K/Akt-driven breast cancer
.
Nat. Cell Biol.
,
18
,
572
578
.

Liu
 
H.
 et al.  . (
2010
)
Stability Approach to Regularization Selection (StARS) for high dimensional graphical models
. In: Proceedings of the Twenty-Third Annual Conference on Neural Information Processing Systems (NIPS). Curran Associates, Inc., Red Hook, NY.

Ma
 
J.
 et al.  . (
2016
)
Network-based pathway enrichment analysis with incomplete network information
.
Bioinformatics
,
32
,
3165
3174
.

Molloy
 
C.A.
 et al.  . (
2000
)
Insulin receptor substrate-1 expression is regulated by estrogen in the MCF-7 human breast cancer cell line
.
J. Biol. Chem.
,
275
,
12565
12571
.

Nakatani
 
K.
 et al.  . (
1999
)
Up-regulation of Akt3 in estrogen receptor-deficient breast cancers and androgen-independent prostate cancer lines
.
J. Biol. Chem.
,
274
,
21528
21532
.

Papoulis
 
A.
,
Pillai
S.U.
(
2002
)
Probability, Random Variables, and Stochastic processes
.
McGraw-Hill
,
Boston
.

Patani
 
N.
 et al.  . (
2011
)
Brain-derived neurotrophic factor expression predicts adverse pathological & clinical outcomes in human breast cancer
.
Cancer Cell Int.
,
11
,
23
.

Quang
 
C.T.
 et al.  . (
2015
)
The calcineurin/NFAT pathway is activated in diagnostic breast cancer cases and is essential to survival and metastasis of mammary cancer cells
.
Cell Death Dis.
,
6
,
e1658
.

Runnalls
 
A.R.
(
2007
)
Kullback-Leibler approach to Gaussian mixture reduction
.
IEEE Trans. Aerospace Electr. Syst.
,
43
,
989
.

Safran
 
M.
 et al.  . (
2010
)
GeneCards Version 3: the human gene integrator
.
Database
,
2010
,
baq020
baq020
.

Schramm
 
G.
 et al.  . (
2010
)
Analyzing the regulation of metabolic pathways in human breast cancer
.
BMC Med. Genomics
,
3
,
39
.

Sedgewick
 
A.J.
 et al.  . (
2016
)
Learning mixed graphical models with separate sparsity parameters and stability-based model selection
.
BMC Bioinformatics
,
17
,
175
.

Spirtes
 
P.
 et al.  . (
2000
)
Causation, prediction, and search
. In:
Adaptive Computation and Machine Learning
.
MIT Press
,
Cambridge, MA
, pp.
xxi, 543

Subramanian
 
A.
 et al.  . (
2005
)
Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles
.
Proc. Natl. Acad. Sci. USA
,
102
,
15545
15550
.

Tsagris
 
M.
 et al.  . (
2017
)
Constraint-based Causal Discovery with Mixed Data
. In:
KDD
.
Halifax
,
Nova Scotia, Canada
.

Wang
 
Z.
 et al.  . (
2013
)
Incorporating prior knowledge into Gene Network Study
.
Bioinformatics
,
29
,
2633
2640
.

Zuo
 
Y.
 et al.  . (
2017
)
Incorporating prior biological knowledge for network-based differential gene expression analysis using differentially weighted graphical LASSO
.
BMC Bioinformatics
,
18
,
99
.

Author notes

The authors wish it to be known that, in their opinion, Dimitris V. Manatakis and Vineet K. Raghu authors should be regarded as Joint First Authors.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Supplementary data