A graphical model approach visualizes regulatory relationships between genome-wide transcription factor binding profiles

Abstract Integrated analysis of multiple genome-wide transcription factor (TF)-binding profiles will be vital to advance our understanding of the global impact of TF binding. However, existing methods for measuring similarity in large numbers of chromatin immunoprecipitation assays with sequencing (ChIP-seq), such as correlation, mutual information or enrichment analysis, are limited in their ability to display functionally relevant TF relationships. In this study, we propose the use of graphical models to determine conditional independence between TFs and showed that network visualization provides a promising alternative to distinguish ‘direct’ versus ‘indirect’ TF interactions. We applied four algorithms to measure ‘direct’ dependence to a compendium of 367 mouse haematopoietic TF ChIP-seq samples and obtained a consensus network known as a ‘TF association network’ where edges in the network corresponded to likely causal pairwise relationships between TFs. The ‘TF association network’ illustrates the role of TFs in developmental pathways, is reminiscent of combinatorial TF regulation, corresponds to known protein–protein interactions and indicates substantial TF-binding reorganization in leukemic cell types. With the rapid increase in TF ChIP-Seq data sets, the approach presented here will be a powerful tool to study transcriptional programmes across a wide range of biological systems.

TFs (except with itself), and further assume that each ChIP-seq experiment correspond to one variable in the model regardless of multiple experiments immuno-precipitating the same TF in the same cell type. Consequently, this network is a direct dependence graph for the set of TFs = [1, ] where each ∈ , = 1, … , is one ChIP-seq experiment (note that consists of the non-unique set of all TFs). This network contains the set of vertices (i.e. nodes) = { 1 , 2 , … , } and the set of undirected vertex pairs (i.e. edges) ⊂ {� , ′ �, , ′ = 1, … , , ≠ ′ }. In a TF association network, a node corresponds to one TF ∈ while an edge corresponds to the direct dependence between the bindings of a pair of TFs.
When analyzing multiple ChIP-seq samples, many studies infer TF binding similarity by using correlation or mutual information measures. However, a high correlation coefficient or mutual information between the bindings of any two TFs may suggest different scenarios: a 'direct' interaction, an 'indirect' interaction or regulation by a common TF. Therefore, in the context of a 'TF association network', only 'direct' interactions are relevant as these correspond to edges between nodes (TFs) in the resulting graph and the main purpose of a network inference algorithm is to discover the set of direct interactions or significant links between nodes (TFs) from the input data.
Crucially, the algorithms should be able to distinguish 'direct' interactions from 'indirect' interactions (e.g. two highly correlated TFs that are similarly regulated by another TF). Instead, the notion of conditional independence commonly applied in gene regulatory network algorithms provides a useful way to determine the set of 'true' interactions. To achieve this, we utilized various network inference algorithms (see below) to obtain a matrix of size × where each entry in the matrix is a measure of direct dependence between and ′ . To select important or significant edges, a threshold (differs for each algorithm) is applied to the matrix to obtain an adjacency matrix and discover 'true' interaction edges as follows -matrix elements that pass the threshold are given a score of '1' and '0' otherwise. This denotes the presence and absence of an edge, respectively. The adjacency matrix defines which edges should be plotted in the resulting network and all edges are treated as un-weighted. Only off-diagonal elements in the upper or lower matrix are used for constructing the network as these correspond to an undirected graph with no self-regulation. In other words, we do not assume any directionality or hierarchy in the TF interactions and TFs do not regulate itself.
In this study, four types of network inference algorithms were employed to obtain the matrix of direct dependence measures. They were chosen based on their known performance in gene regulatory network inference, computational efficiency and coverage across different classes of graphical modeling theory.
(1) Gaussian Graphical Model (GGM) -This method was first developed to model the gene association structures from microarray data [1] and produces an undirected graph. We adapted this method to generate a 'TF association network' as follows. Given a TF binding matrix, , as input data, conditional dependence between TFs and ′ is given by the partial correlation of and ′ : Standard graphical modeling theory shows that the partial correlations may be calculated by inversion of the covariance matrix. However, for large datasets, conventional approaches for obtaining the covariance matrix (i.e. the unbiased empirical covariance matrix and the maximum likelihood estimate) are known to be statistically inefficient and ill-conditioned. Hence, Schäfer and Strimmer proposed the use of shrinkage estimators of the covariance matrix [2] and showed that the shrinkage approach provides a more accurate, well-conditioned and always positive-definite covariance matrix. Moreover, it is computationally more efficient compared to conventional approaches. In this study, we used the GeneNet package in R [3] to compute a reliable estimate of the partial correlation matrix for all pairs of TFs. Statistical significance is then assigned to the edges in the GGM network by fitting a mixture model to the observed partial correlations and computing the p-value and posterior probabilities for each edge. Significant non-zero partial correlations are obtained using the empirical Bayes local false discovery rate (fdr) statistic where edges less than 1e-10 are considered significant. In total, 366 nodes and 2076 undirected and non-self-regulating edges were identified in the GGM network.
(2) Graphical lasso -Graphical lasso [4] is another approach to estimate the covariance matrix as part of learning the structure of an undirected graphical model. Under this framework, the inverse covariance matrix is assumed to be sparse and the ℓ 1 regularization parameter controls the number of zeros in the matrix. Any inverse covariance matrix element with the value of zero implies that the corresponding variables and ′ are conditionally independent, given the rest of the dataset. The simultaneous estimation and selection of the partial correlations, therefore, is an intuitive way of introducing sparsity in the network.
The TF association network in this study was generated by calculating a shrunken partial correlation matrix using the glasso R package [5]. We set the ℓ 1 regularization parameter ('rho' option in the 'glasso' function) to 0.01 and considered an edge to be significant if the partial correlation value is non-zero. For any pair of TFs, the partial correlation value is taken as the average between elements in the corresponding upper and lower triangular matrix values. Any non-zero self-regulating edges were removed. We obtained 196 nodes and 2033 undirected edges in the network generated using this algorithm. By identifying non-zero regression coefficients, we can discover which TFs are directly interacting with and thereby assign an edge. In this scenario, network inference is a feature selection problem performed node by node to identify significant and non-zero coefficients which corresponds to direct dependence relationships. TIGRESS [6] is one such network inference algorithm that combines the multivariate feature selection method known as least angle regression (LARS) [7] with stability selection [8] to achieve this. Given a TF binding matrix, , as input data, we used TIGRESS to model each TF as a function of all other TFs ′ ( ′ ≠ ) in the input dataset by using the linear regression model: For each TF, LARS selects the set of predictor variables ′ that best explains the binding of that TF so that indirect interactions are omitted. However, LARS is known to be sensitive and unstable for datasets where variables are highly correlated. Therefore, stability selection is employed to run LARS many times (samples and variables resampled at each run) to compute the frequency with which each variable were selected by LARS. This is then used to calculate the normalized score for each TF,  which states that each variable in the model (node) is conditionally independent of the set of all its predecessors in the graph, given the value of its parents. The structure of a BN in a 'TF association network', therefore, represents the factorization of the joint probability of all ChIP-seq samples: where is the set of parents of node .
Unlike the three methods described earlier, the conditional independence tests in a BN are estimated concurrently as part of the network structure learning procedure. The resulting network takes the form of a directed acyclic graph (DAG) where directed edges between two nodes indicate the direction of causal influence. However, we do not enforce the notion of causality in our networks since the directions only represent associations but not genuine causal relationships. Instead, we interpret all edges as undirected.
Using the bnlearn R package [9], the conditional independence structure of the input data was learned using the score-based hill climbing algorithm. The adjacency matrix was computed by taking into account only off-diagonal entries in the upper/lower matrix and treating all edges as undirected.
In total, there were 367 nodes and 2013 edges in the network.