Summary: Graphical Gaussian models (GGMs) are a promising approach to identify gene regulatory networks. Such models can be robustly inferred by solving the sparse inverse covariance selection (SICS) problem. With the high dimensionality of genomics data, fast methods capable of solving large instances of SICS are needed.
We developed a novel network modeling tool, Ultranet, that solves the SICS problem with significantly improved efficiency. Ultranet combines a range of mathematical and programmatical techniques, exploits the structure of the SICS problem and enables computation of genome-scale GGMs without compromising analytic accuracy.
Availability and implementation: Ultranet is implemented in C++ and available at www.broadinstitute.org/ultranet.
Supplementary information:Supplementary data are available at Bioinformatics online.
Graphical Gaussian models (GGMs) are rapidly gaining traction as a promising approach to identify gene regulatory networks from genomics data (Markowetz and Spang, 2007; Schäfer and Strimmer, 2005b). GGMs are created by estimating the inverse of the sample covariance matrix , and then computing the partial correlation matrix . Assuming that the observations (typically gene expression values) have a multivariate Gaussian distribution, indicates that variables (genes) and are conditionally independent, given the other variables (i.e. all correlation between them can be explained by co-correlation with other variables); indicates that they are conditionally dependent (i.e., the correlation between them cannot be explained in full by co-correlation with other observed variables). The advantage of GGMs is that edges (non-zero s) are more likely to reflect direct dependencies between genes than edges in correlation networks (Markowetz and Spang, 2007; Schäfer and Strimmer, 2005a).controls the impact of the regularization term, and hence the sparsity of the network. This is called the sparse inverse covariance selection (SICS) problem. Solving SICS for high-dimensional genomics data is challenging because current standard methods, originally developed for lower-dimensional data, are associated with computational requirements that grow steeply with the number of genes. This leads to long wait times, increases the need for extraordinary computing resources and complicates analysis.
To address this issue, we developed an efficient network modeling tool, Ultranet, that solves the SICS problem significantly faster. As shown, Ultranet readily computes genome-scale GGMs with ∼20 000 genes without compromising the analytic accuracy.
As described in detail in Supplementary Information, Ultranet uses several different techniques to accelerate the computations.
First, as a preprocessing step, Ultranet identifies the connected components of the network by block-partitioning the adjacency matrix obtained by soft-thresholding at . This decomposes the original problem into a set of smaller independent SICS problems. Because the time needed to solve SICS grows superlinearly with dimension, solving multiple small subproblems instead of one big problem is faster.
Second, to solve subproblems, Ultranet uses a first-order dual method based on accelerated projected-gradient descent (PGD). This method improves on earlier PGD approaches (Duchi et al., 2008) by using simpler, yet effective, step-length selection rules and a heavy-ball momentum term to avoid zig-zagging. This eliminates most computations needed for step-length selection, and promotes smoother descent paths. The descent step is given by, denotes the regularized covariance matrix at each iteration, and denotes projection onto the λ-box centered at . The parameters α and control the step length and impact of the momentum term, respectively.
Third, to initialize the solver, Ultranet uses a continuation approach where the problem is solved to low accuracy for decreasing values. This helps finding initial solutions near the optimum.
3 RESULTS AND DISCUSSION
We tested Ultranet on gene expression microarray data from various sources and of varying dimensionality. We made the following recurrent observations (Fig. 1): first, adding block-partitioning to basic PGD (implemented in the same coding style) yielded substantial speed-ups for large ; second, adding continuation yielded marginal speed-ups for intermediate in some datasets but did not yield significant speed-ups in general; third, in contrast, adding the momentum term accelerated convergence several fold for small because of dampened zig-zagging. Thus, we observed significant improvements over basic PGD across the full range of values.
We compared Ultranet with previous methods (Duchi et al., 2008; Friedman et al., 2008; Lu, 2009; Scheinberg et al., 2010; Yuan, 2009). In our tests, the two fastest competitors were the methods by Duchi et al. (2008) (a PGD variant) and Yuan (2009) (an augmented Lagrangian method). Ultranet was 1–2 orders of magnitude faster than these methods without block partitioning, and 1–4 orders of magnitude faster with block partitioning (Fig. 1c and d).
We finally applied Ultranet to a large set of genome-wide gene expression profiles of bone marrow samples from a wide range of conditions (Haferlach et al., 2010, NCBI Gene Expression Omnibus GSE13159) to estimate a draft regulatory network for human blood cell development. Examining the results, we focused on the neighborhood of GATA1, a transcription factor that drives the formation of red blood cells, a well-studied process. For all tested, the inferred GATA1 neighborhood was clearly enriched for known red cell genes, including genes for transcription factors (KLF1, GFI1B, TAL1), blood groups (RHD, RHAG, KEL, GYPB, XK), heme synthesis enzymes (UROD, HMBS), and the erythropoietin receptor (EPOR) (Fig. 1e). Most neighbor genes contained GATA1 ChIPseq (Chromatin Immuno-Precipitation followed by next-generation sequencing) peaks, supporting that they are true GATA-1 targets (Supplementary Table S1). The results illustrate the ability of GGMs to find biologically relevant networks, and the use of GGMs as a complement to motif searches and ChIP-seq to identify relevant transcription factor target genes. The results also illustrate Ultranet’s ability to compute GGMs at a genome-wide scale.
In summary, Ultranet is a fast tool to infer gene networks based on GGMs. Ultranet is available for Microsoft Windows and Linux operating systems, exploits the multicore capacity of Intel x64 processors, and accepts data in tab-delimited text format.
Funding: This work was supported by the Swedish Foundation for Strategic Research (grant no. ICA08-0057), Marianne and Marcus Wallenberg's Foundation (grant no. 2010.0112) and the Swedish Children's Cancer Fund (grant no. 10/125).
Conflicts of Interest: none declared.