- Split View
-
Views
-
Cite
Cite
Wenping Deng, Kui Zhang, Sanzhen Liu, Patrick X Zhao, Shizhong Xu, Hairong Wei, JRmGRN: joint reconstruction of multiple gene regulatory networks with common hub genes using data from multiple tissues or conditions, Bioinformatics, Volume 34, Issue 20, October 2018, Pages 3470–3478, https://doi.org/10.1093/bioinformatics/bty354
- Share Icon Share
Abstract
Joint reconstruction of multiple gene regulatory networks (GRNs) using gene expression data from multiple tissues/conditions is very important for understanding common and tissue/condition-specific regulation. However, there are currently no computational models and methods available for directly constructing such multiple GRNs that not only share some common hub genes but also possess tissue/condition-specific regulatory edges.
In this paper, we proposed a new graphic Gaussian model for joint reconstruction of multiple gene regulatory networks (JRmGRN), which highlighted hub genes, using gene expression data from several tissues/conditions. Under the framework of Gaussian graphical model, JRmGRN method constructs the GRNs through maximizing a penalized log likelihood function. We formulated it as a convex optimization problem, and then solved it with an alternating direction method of multipliers (ADMM) algorithm. The performance of JRmGRN was first evaluated with synthetic data and the results showed that JRmGRN outperformed several other methods for reconstruction of GRNs. We also applied our method to real Arabidopsis thaliana RNA-seq data from two light regime conditions in comparison with other methods, and both common hub genes and some conditions-specific hub genes were identified with higher accuracy and precision.
JRmGRN is available as a R program from: https://github.com/wenpingd.
Supplementary data are available at Bioinformatics online.
1 Introduction
Though all cells in a multicellular organism carry out some common processes that are essential for survival, different tissues can exhibit some unique patterns in gene expression that helps define their phenotypes. In addition, some organisms like plants may experience various environmental conditions in particular stresses. These common and tissue/condition-specific processes are ultimately controlled by GRNs that contain both common and tissue/condition-specific hubs. These hubs play critical roles for organisms to complete their life cycle. For example, abiotic and biotic stresses-responsive genes in rice have 70% in common and these genes showed conserved expression status, and the majority of the rest were down-regulated in abiotic stresses and up-regulated in biotic stresses (Shaik and Ramakrishna, 2014), indicating the presence of common hubs and network between two conditions. The local GRNs for different environmental conditions have been built in Arabidopsis (Barah et al., 2016; Hickman et al., 2013). Tissue-specific genes for 38 tissues have been identified in humans and the GRNs for each of these 38 tissues in humans have been built (Sonawane et al., 2017) and analyzed.
Comparison of global GRNs (Boyle et al., 2014) has revealed that the GNRs are largely conserved and share remarkable commonalities though they can change in response to environmental stimuli or at different tissue types. The high similarity in GRNs is primarily caused by the relatively smaller number of tissue/condition-specific nodes, for example, 23.4% genes were indicated to be tissue-specific (with complicity equal to one) after studying multiple tissues of humans (Sonawane et al., 2017). However, some local GRNs may be subject to some local topology changes (Faisal and Milenkovic 2014; Martin et al., 2016), making some regulatory interactions exist in all tissues or conditions while some others exist only in specific tissue or specific treatment.
Therefore, identification of both common and tissue- or condition-specific gene regulation provides key insights into complex biological systems (Tian, et al., 2016). In the past two decades, advances in microarray and RNA-seq technology have led to the generation of enormous wealth of gene expression data across various cell/tissue types and conditions. Although these datasets provide valuable opportunity to more robustly reconstruct condition-specific GRNs, there are very limited methods for modeling the complicated GRNs with high accuracy. Advanced and highly efficient methods are still in great demand.
Gaussian graphical models (GGMs) are widely used to reconstruct gene networks using gene expression data (Kumari et al., 2016). The models assume that gene expression data on genes from each sample follows a multivariate normal distribution with mean and covariance matrix , where is a vector with elements and is a positive definite matrix. The conditional independence of two genes given other genes corresponds to a zero entry in the inverse covariance matrix (also called the precision or concentration matrix) (Lauritzen, 1996). Usually, we set , called precision matrix or concentration matrix. Gaussian graphical models have the advantage of reconstructing direct dependencies between genes that represent edges in the reconstructed network: an edge corresponds a non-zero entry in . A natural way to estimate is by maximizing the log-likelihood of the data, which result in an estimation of precision matrix where is the sample covariance matrix.
However, directly applying GGM to reconstruct GRN is not applicable due to two problems. First, since the number of samples () is generally much less than the number of genes () from gene expression data, the sample covariance matrix becomes singular and thus it is impossible to computing the inverse. Second, even if the sample covariance matrix is not singular, the elements in the estimated precision matrix are in general not exactly equal to zero. For these reasons, Yuan and Lin (Lin et al., 2007) proposed to maximize a regularized log-likelihood function. Similar to LASSO regression (Tibshirani, 1996), they put a penalization on the sum of absolute value of each element in precision matrix, which leads to a sparse and positive definite estimation of .GLASSO (Friedman et al., 2008) is a fast algorithm to solve this optimization problem.
When applying GGMs to reconstruct gene regulatory networks, the underlying assumption is that each observation is drawn from the same distribution. However, when the gene expression data come from different tissues or under different treatments, this assumption is inappropriate. In this case, if one insists on modeling the gene expression data by one GRN, the results would be dubious and we cannot obtain the differential network e are interested in. A straightforward method to obtain the differential network is to reconstruct the network of each condition separately and then find the difference between them. However, this procedure ignores the similarity shared between GRNs across different tissues/treatments, which is critically important to reconstruct the GRNs, especially when the sample size is small. To reconstruct these dependent GRNs, Guo et al. (2011) proposed a joint penalized model using a hierarchical penalty and derived the convergence rate and sparsity properties of the resulting estimators. Danaher et al. (2014) proposed a joint graphical lasso model (JGL) to estimate multiple GRNs simultaneously. They proposed a fused graphical lasso penalization and a group graphical lasso penalization in addition to the sparsity penalization. In fused graphical lasso, the corresponding elements in the precision matrices are encouraged to have the same values. In group graphical lasso, the precision matrices in different conditions are encouraged to have similar sparsity pattern.
The above-mentioned methods do not impose any structural information of gene networks. That is, each gene has approximately the same number of interactions within the network, and each pair of nodes has equal probability to be an edge and all edges are independent. However, recent evidence points to scale-free properties in biological networks (Han et al., 2004; van den Heuvel and Sporns, 2013), in which most genes interact with a few partners whereas a small proportion of genes, called hub genes, are densely-connected to many other genes (high connectivity). To incorporate hub genes in GRNs, Liu and Ihler (2011) replaced the regularization in GLASSO with a power law regularization and optimized the objective function by solving a sequence of iteratively reweighted regularization problems, where the regularization coefficients of nodes with high degree were reduced, which encouraged the appearance of hub genes. Tan et al. (2014) introduced a row-column overlap norm penalty to incorporate hub genes explicitly. In their model, called hub graphical lasso (HGLASSO), the precision matrix was decomposed into two parts, one is elementary matrix , the other is hub matrix , where is a symmetric matrix that is encouraged to be sparse, is a matrix whose columns are encouraged to be either entirely zero or almost entirely non-zero through the norm penalization. The non-zero columns of correspond to hub genes. A detailed description of existing GGM related methods (GLASSO, JGL and HGLASSO) are given in Supplementary File 1 (S1).
The aim of this research is to develop new and more accurate method for: (i) construction of GRNs containing the important common hubs that may play essential roles for survival and/or adaptation; (ii) construction of GRNs containing tissue/condition-specific regulatory relationships that help us to understand phenotypes/traits of interest. In this manuscript, we assumed that a network for a specific tissue/condition can be decomposed into an elementary network that is unique to the tissue/condition, and a common network centered on hub genes that is shared across multiple tissues/conditions. Based on this hypothesis, we proposed a new method to jointly reconstruct multiple GRNs for multiple tissues/conditions in just one effort. Our method, JRmGRN, is different from the aforementioned methods. The methods from Yuan and Lin (Lin et al., 2007), Danaher et al. (2014) cannot model hub genes. Although the methods from Liu and Ihler (2011) and Tan et al. (2014) can be used to model hub genes, their methods are dedicated to reconstruction of a gene network from each dataset independently. With the availability of enormous amount of gene expression data from multiple tissues/conditions in public repositories, it is important to use datasets from multiple tissues or conditions together to identify common hub genes across multiple tissues or conditions and some tissue- or condition-specific hub genes, which will advance our understanding on regulation of biological processes and pathways. Our method hypothesizes that there are common hub genes in different tissues or under different environmental conditions. Figure 1 illustrates two example networks obtained from two tissues or conditions. There are many common edges (dash green) between two networks and some tissue- or condition-specific edges belonging to only one of the two networks (e.g. solid red and solid blue).
2 Materials and methods
2.1 Gaussian graphical model and regularization
. The prior assumption is that elementary network under each condition is sparse, meaning that most of elements in is zero. Therefore, we use penalty to encourage the off diagonal elements of to be zero.
. The prior assumption is that each contains some unique edges, representing the specific network for the kth tissue/condition; but have many common edges due to similarity among networks. Therefore, we use penalty to encourage the elementary networks across different conditions to be the same.
. We assume matrix contains zero columns and dense non-zero columns, where the non-zero columns represent the common hub genes across all tissues/conditions. Therefore, we use group lasso penalty to force some columns of to be zero columns.
. For the non-zero columns of , we also use penalty to encourage some elements to be zero, so a hub gene will not connect to all other genes.
2.2 Algorithm to estimate parameters
For fixed values of tuning parameters , the expression of (2.1) is a convex optimization problem, which can be solved by efficient algorithms available. The convexity of (2.1) can be proved by the following facts: the function of negative log determinant is a convex function, the norm functions are convex functions, and the nonnegative combination of convex functions is a convex function. We solved the problem (2.1) using the alternating directions method of multipliers (ADMM) algorithm, which allows us to decouple some of the terms in (2.1) that are difficult to optimize jointly. For more details on ADMM algorithm and its convergence properties, please consult the previous publication (Boyd et al., 2011).
2.3 Selection of tuning parameters
BIC is just a guide for turning parameter selection. In reality, we may also consider other factors in addition to BIC. For example, network interpretability, stability, and the desire for an edge set with a low false discovery rate, as pointed out by some researchers (Meinshausen and Bühlmann, 2010).
We used the grid search to find the tuning parameters that maximized the expression of (2.10). The computational complexity for the network construction with a fixed set of tuning parameters mainly depends on the number of genes included in the analysis. The grid search is feasible when the number of genes falls into small to moderate ranges but quickly becomes impractical for large number of genes. In this situation, we need to explore some theoretical properties of the problem that can be used to guide our search of tuning parameters.
Similar to lemma 4.1 in Danaher et al. (2014), the following fact exists.
Proof of Theorem 1 is given in Supplementary File 1 (S3.1).
Proof of Theorem 2 is given in Supplementary File 1 (S3.2).
If the conditions for Theorem 1 are satisfied, we decomposed the reconstruction of a big network into the reconstruction of two or more small networks separately, thus the computational time for (2.1) could be greatly reduced. With Theorem 2, we could reduce the search space of parameters and as these four tuning parameters are related. If is large and are too small, then the elementary network matrices can become very spare and the number of hub genes becomes too large. On the other hand, if is small and are too large, the elementary network matrices can become dense, and the number of hub genes will become too small. In this paper, we used a uniformed grid of log space from 0.01 to 20 (size = 30) for parameter , set to be 0.5, 1 and 2 folds of , set to be 0.5, 1, 2,… folds of , and set to be 0.1, 0.5, 1, 2,…, folds of .
3 Results
3.1 Results from simulated data
We simulated two types of gene networks, Erdős-Rényi (ER)-based network (Mendes et al., 2003) and Barabási-Albert(BA)-based network (Barabási and Albert, 1999), and then generated corresponding gene expression data to assess and validate the method developed. We then compared our method, JRmGRN, with three GGM based methods, the graphical lasso (GLASSO) (Friedman et al., 2008), the joint graphical lasso (JGL) (Danaher et al., 2014), and the graphical lasso with hubs (HGLASSO) (Tan et al., 2014). The precision recall curves were constructed based on the edges instead of hub genes in the network since GLASSO and JGL do not model hub genes explicitly.
3.1.1 Results on ER-based network
In an ER-based network, each pair of nodes was selected with equal probability and connected with a predefined probability. To simulate scale-free ER-based networks, we adopted a similar procedure used in Tan et al. (2014) with some modifications. Specifically, for a given number of tissues or conditions (), genes (), samples (), we used the following procedures to simulate ER-based network and corresponding gene expression data. (i) We generated a sparse matrix by setting each element to be a random number in with probability (elementary network sparsity ) and zero otherwise. This step is the same as the simulation procedure in Tan et al. (2014); (ii) We first set the matrix to be a zero matrix, and then randomly chose m hub genes. For each element in the column of H that represents a hub gene, , we set it to be a random number in with probability (hub sparsity 1 –) and zero otherwise, then set to . (iii) To construct the elementary network, , we first set it equal to , and then randomly chose a fraction of (network difference) of elements and reset its value to be a random number in with probability (elementary network sparsity 1 –) and zero otherwise. We set for each so that is symmetric. (iv) We defined the precision matrix, as . If was not positive definite, we added the diagonal element of by , where is the minimum eigenvalue of . (v) W generated the gene expression of samples for the kth tissue or condition with independent multivariate normal distribution .
For the sake of clearness in network display, the simulation was conducted based on 2 tissues or conditions, 40 samples for each tissues or condition. The elementary network sparsity, the hub sparsity and the network difference were set as 0.98, 0.70 and 0.20, respectively. We simulated 3 networks with 80, 160 and 300 genes, respectively. As we have described, we used the BIC and the grid search to find the tuning parameters and best model.
We first evaluated how well JRmGRN could find hub genes and their edges. Figure 2 shows the simulated ER gene network and the estimated gene network with 80 genes. There were 5 hubs genes that had an average of 30 edges. JRmGRN successfully identified 4 hub genes, and 95 out of 119 original edges of these 4 hub genes. Only one hub gene (76), which had only 26 edges, was not identified by JRmGRN. Two genes, 36 and 51, as shown in Figure 2B, were not hub genes but were identified as hub genes by JRmGRN. We found that the number of edges of these two genes were 15 and 14, respectively. These numbers were slightly higher than other non-hub genes and deviated toward 30, the average number of edges from 5 hub genes. These results manifested the usefulness of JRmGRN in identifying true hub genes and their corresponding edges through network reconstruction.
We then compared JRmGRN with several other methods. The precision recall curve was constructed based on the edges instead of hub genes in the network since GLASSO and JGL do not model hub genes explicitly (Fig. 3). Additional evaluations of the performance of JRmGRN under different network settings with varying sparsity and similarity were shown in Supplementary File 1 (S4). The results clearly showed that our method, JRmGRN, performed the best in all circumstances. JRmGRN jointly modeled multiple networks simultaneously so that the common network could be constructed more accurately by using datasets from multiple tissues/conditions, which resulted in more accurate tissue/condition-specific networks. The comparison of JRmGRN and other methods in identifying tissue/condition-specific edges were shown in Supplementary File 1 (S5). Hub genes were explicitly modeled by JRmGRN and HGLASSO, and the comparison of the capability to identify true hub genes were shown in Supplementary File 1 (S6). All results manifested that JRmGRN had higher precision, and comparable recalls with other methods.
3.1.2 Results on Barabási-Albert (BA)-based network
BA-based network is used in Danaher et al. (2014) to evaluate the performance of network inferring algorithm. A big network consisted of a number of disconnected BA subnetwork. There are no explicit hub genes in these subnetworks; genes that have more high connectivity were considered hub genes. For a given number of tissues or conditions (), genes (), samples (), we used the following procedures to generate a BA-based network using corresponding gene expression data. (i) We divided genes into m groups evenly. (ii) For each of first (1 – gene groups, the BA-based subnetwork was the same across all tissues or conditions and were generated with the function ‘barabasi.game’ from the R ‘igraph’ package. (iii) For each of the rest gene groups, the BA-based subnetwork was different for each tissue or condition and were generated separately. (iv) For each edge in the network, we set the corresponding element in the precision matrix of the th tissue/condtion,, to be a random number in . (v) We generated the gene expression of samples for the kth tissue or condition with independent multivariate normal distribution .
The simulation was conducted based on 2 tissues or conditions, 40 samples for each tissues or conditions and 8 subnetworks. The elementary network sparsity and hub sparsity were not explicitly implemented. We varied the parameters in the function ‘barabasi.game’ from R ‘igraph’ package to generate BA-based networks with desired elementary network sparsity ( = 0.98). We set seven out of eight subnetworks to be the same across two tissues or conditions, and one subnetwork to be different. We simulated 3 networks with 80, 160 and 320 genes, respectively.
The simulated BA gene network (Fig. 4A) and the estimated gene network (Fig. 4B) with 80 genes. JRmGRN successfully identified 191 edges with a true positive rate of 0.702, and falsely identified 290 edges with a false positive rate of 0.048. JRmGRN identified 17 genes as hub genes. The average number of edges connected to these 17 genes in the true network was 5.76, and the average number of edges connected to the rest 63 genes in true network are 3.05. Therefore, the hub genes identified by JRmGRN had much higher degree of connectivity. As pointed out in Han et al. (2004) and van den Heuvel and Sporns (2013), the genes with higher degrees of connectivity may be more important in biological development, which validates and manifests the usefulness of JRmGRN.
The comparison of PR curves of JRmGRN and other methods are shown in Figure 5. When the number of genes were 80 or less, JRmGRN and JGL had similar performance, and they were better than the other two methods. As the number of genes increased, the performance of JRmGRN surpassed that of JGL and became the best.
3.2 Results from real RNA-seq data of Arabidopsis thaliana
The networks built based on the above method are shown in Figure 6. The common network of both low and high R and FR conditions is represented by the green edges with the 15 common hub genes being highlighted in yellow. All of these common hub genes had a connectivity > 172, which is at least 5 times larger than that of any the non-hub gene. Among these 15 hub genes, 8 were up-regulated in overall trends (BZO2H3, CCL, TCP11, PLPC5, AT1G62310, AT3G15570, NAC102 and AT3G45260) light-responsive. PLPC encodes a blue light receptor protein while BLH10 while 7 were down regulated in overall trends (BLH10, ELIP1, PD1, PEX11B, PLIM2a, WAV2 and POP12) upon low and high R and FR treatments. At least 8 genes, including PLPC, BEL10, CCL, PD1, ELIP1, PEX11B, AT3G15570, POP1 and WAV2, were previously reported to be encodes a protein that interacts with PLPC (PAS/LOV PROTEIN). Their interaction diminishes by blue light (Ogura et al., 2008). PD1 encodes a plastid-localized arogenate dehydratase required for blue light-induced production of phenylalanine (Warpeha et al., 2007) while PEX11B is involved in light response (Hu and Desai, 2008). ELIP1 is light-responsive (Rus Alvarez-Canterbury et al., 2014) and plays an essential role in the assembly or stabilization of photosynthetic pigment-protein complexes (Beck et al., 2017). CCL’s transcripts are differentially regulated at the level of mRNA stability at different times of day controlled by a circadian clock (Lidder et al., 2005). AT3G15570 encodes a phototropic-responsive NPH3 family protein. POP1 encodes a member of the NAP subfamily of ABC transporters whose expression pattern is regulated by light and sucrose (Marin et al., 2006). WAV2 negatively regulates root bending when roots alter their growth direction in response to environmental stimuli such as light (Mochizuki et al., 2005).
4 Discussion
The results from synthetic data, which were generated with ER-based network or the BA-based network, clearly showed that JRmGRN outperformed several other methods, including GLASSO, HGLASSO and JGL for the reconstructing GRNs. Since common hub genes were explicitly modeled in the ER-based network, it was not surprised to see that JRmGRN had a higher accuracy in identifying common hub genes and had the largest area under the PR curves when the synthetic dataset from the ER-based network was used in the evaluation. In contrary, common hub genes were not clearly modeled though genes that had high connectivity can be seen in the BA-based network. When the synthetic dataset from the BA-based network was used in the evaluation, JRmGRN and JGL had similar performance as a small number of genes was included, whereas JRmGRN had a much better performance than JGL as a large number of genes was used.
When JRmGRN was used to construct gene networks using real gene expression dataset generated from Arabidopsis cotyledons under low to high red: far-red light regime conditions, it successfully built three networks and identified 15 common hub genes, and at least 9 of them were explicitly documented in existing literature for their involvement in light-response or related biological processes. Some of them, like ELIP1, PLPC and BLH10, play important roles in light perception and harvest in photosynthesis. In addition to these common hub genes, some other genes like bHLH071, a blue-light regulated gene (Jiao et al., 2003), and RSM1, a light-responsive gene (Soitamo et al., 2008), are identified to be hubs in low and high R: FR- specific networks, respectively, indicating the usefulness of the method in building two or more condition-specific networks and a common network across all conditions. We also implemented other three methods to the real dataset from Arabidopsis under far-red light and shade condition. We found JRmGRN and JGL identified much more numbers of common edges and less number of condition-specific edges than GLASSO and HGLASSO. HGLASSO identified 37 hub genes, 14 red: far-red-specific and 23 shade-specific hub genes. Of these 37 hub genes, TCP11, PD1 PLIM2A and AT3G45260 are among the 15 hub genes of JRmGRN. Among the 15 hub genes identified by JRmGRN, 9 are involved in light response and considered to be positive, whereas among 37 hub genes identified by HGLASSO, 8 genes are involved in light response and considered to be positive [Supplementary File 1 (S8, S9)]. It appears that JRmGRN is more efficient in recognizing positive hubs.
In the model of JRmGRN, four tuning parameters were used and a grid search was employed to find the optimal turning parameters, resulting in the optimal model based on the BIC like criterion. For a fixed set of tuning parameters, an efficient ADMM algorithm was derived and implemented to enable a fast estimation of precision matrices. Theoretical properties of the penalized likelihood function were also investigated and used to reduce the search space of tuning parameters. For a fixed set of tuning parameters, using a Mac desktop computer with 2.2 GHz Intel Core i7 processor and 16 GB 1600 MHz DDR3 memory, the average running times for estimating the precision matrices were about 30 s for 100 genes, 2.5 min for 200 genes, 6 min for 300 genes, 25 min for 500 genes and 3.7 h for 1000 genes, respectively. Therefore, implementation of JRmGRN allows us to reconstruct GRNs for 500 genes within a reasonable time frame by an ordinary desktop computer. In future, we will explore two possible strategies to reduce the computational burden so that JRmGRN can be used for a large number of genes. First, instead of using the grid search, we will investigate how the heuristic search algorithms, such the genetic algorithm (Grefenstette, 1987) and taboo (Glover, 1989; Glover, 1990) perform. Secondly, we will find out how the domain knowledge on gene networks and differentially expressed genes can be used to reduce the search space of tuning parameters.
In the current model of JRmGRN, it is assumed that all hub genes are shared across different tissues or conditions. In many situations, hub genes that are specific to an individual network also exist. One of our future works is to extend the current model to incorporate both common and unique hub genes. This can be done by adding an additional symmetric matrix to the decomposition of the precision matrix. The corresponding penalized log likelihood function and an efficient algorithm will be developed accordingly.
5 Conclusion
We proposed JRmGRN as a novel method for joint construction of GRNs using gene expression data from either several tissues or environmental conditions. The model was based on a convex penalized log likelihood function that not only took gene network sparsity and similarity into account but also explicitly modeled common hub genes across multiple GRNs, leading to multiple networks with common network moieties being highlighted. The resulting networks can significantly advance our understanding of genetic regulation of various biological processes. Reconstruction of both common moieties and each individual network corresponding to a tissue or a condition was improved by borrowing information of common hub genes and regulatory relationships from other individual networks. Therefore, JRmGRN can potentially generate more accurate gene networks, as manifested by the precision recall curves.
Funding
This work was supported by the Advances in Biological Informatics (ABI) National Science Foundation collaborative research grant awards DBI-1458130 to H.W., DBI-1458597 to P.X.Z. and DBI-1458515 to S.X. The funders had no roles in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Conflict of Interest: none declared.
References