Summary: Network component analysis (NCA) is a method to deduce transcription factor (TF) activities and TF-gene regulation control strengths from gene expression data and a TF-gene binding connectivity network. Previously, this method could analyze at most as many regulators as there were samples, because of the identifiability limit in data decomposition. As such, the total number of source signal components was limited to the total number of experiments rather than the total number of biological regulators. However, networks that have fewer transcriptome data points than regulators are of interest. Thus it is imperative to develop a theoretical basis that allows realistic source signal extraction from relatively few data points. On the other hand, such methods inherently increase the numerical challenges, leading to multiple solutions. Therefore, solutions to both problems are needed.
Results: We have improved NCA for transcription factor activity (TFA) estimation, based on the observation that most genes are regulated by only a few TFs. This observation leads to the derivation of a new identifiability criterion, tested during numerical iteration, that allows us to decompose data when the number of TFs is greater than the number of experiments. To show that our method works with real microarray data and has biological utility, we analyze Saccharomyces cerevisiae cell cycle microarray data (73 experiments) using a TF-gene connectivity network (96 TFs) derived from ChIP-chip binding data. We compare the results of NCA analysis with the results obtained from ChIP-chip regression methods, and we show that NCA and regression produce TFAs that are qualitatively similar, but the NCA TFAs outperform regression in statistical tests. We also show that NCA can extract subtle TFA signals that correlate with known cell cycle TF function and cell cycle phase. Overall we determined that 31 TFs have statistically periodic TFAs in one or more experiments, 75% of which are known cell cycle regulators. In addition, we find that the 12 TFAs that are periodic in two or more experiments correspond to well-known cell cycle regulators. We also investigated TFA sensitivity to the choice of connectivity network; to this end, we constructed two networks using different ChIP-chip p-value cut-offs.
Transcription factors (TFs) act alone or in combination on target promoters to control gene expression. TF regulatory activity is controlled by higher-level cellular functions, such as signaling pathways that reflect cellular physiology and environment, typically through post-transcriptional regulation such as phosphorylation, oligomerization, ligand binding or subcellular translocation. Activated TFs control transcriptional initiation through interaction with DNA or RNA polymerase. Thus TFAs can in principle be inferred from the transcript levels of the genes they control. In the simplest case, where one TF controls one gene, the activity of this TF is directly proportional to the expression of the gene it controls. In reality, however, TF-promoter connectivity is more complicated, requiring non-trivial mathematics for deconvolution.
Network component analysis (NCA) (Liao et al., 2003; Tran et al., 2005) is a model-based decomposition method to deduce such information, namely transcription factor activity (TFA) and regulation control strength (CS), from transcriptome data and transcription factor (TF)-promoter connectivity network. Connectivity networks are constructed by preprocessing ChIP-chip data, DamID methylation profiling or extensive literature search (Lee et al., 2002; van Steensel and Henikoff, 2000). To allow unique mathematical decomposition, three identifiability criteria need to be satisfied (Liao et al., 2003; Tran et al., 2005), which govern the applicability of this method. In particular, one of the criteria requires that the number of data points be greater than the number of TFs. In this work, we relax this theoretical limitation to allow analysis based on limited data points.
Other approaches to deduce TFA include the REDUCE (and equivalently MATRIX REDUCE) methods that fit a linear model of gene regulation to mRNA expression data by regression (Bussemaker et al., 2001; Gao et al., 2004). These methods are ‘cluster free’ and only require an input promoter sequence (or ChIP-chip binding data) and mRNA expression data. From this input REDUCE evaluates all possible candidate motifs, where each motif corresponds to a known or unknown (but shared sequence) regulatory factor. The motifs are fit individually across genes, and the authors select a final set of regulatory motifs as those that give the largest reduction of variance in the corresponding gene expression. Here the normalized motif binding copy number is taken as the regulation strength of the TF for that gene, and TFA profiles are obtained from gene expression data by linear regression. The advantage of REDUCE is that it is not limited by the identifiability criteria governing NCA, since its CS is fixed. However, because the method uses the copy number of binding sites as a surrogate for regulatory strength, it does not distinguish between positive and negative regulation of the same TF on different genes and may compromise the accuracy of TFA estimates.
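The regression step underlying REDUCE can be sketched in a few lines: once the control strengths are fixed (here, simulated motif copy numbers standing in for REDUCE's normalized binding copy numbers), the TFAs for each experiment follow from ordinary least squares. This is a minimal illustration on synthetic data under stated assumptions, not the REDUCE implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 200 genes, 10 motifs/TFs, 5 experiments.
N, L, M = 200, 10, 5

# Fixed control strengths: motif copy numbers per promoter. Because these
# are fixed and non-negative, they cannot encode the sign of regulation.
A = rng.poisson(1.0, size=(N, L)).astype(float)

true_P = rng.normal(size=(L, M))                  # underlying TFA profiles
E = A @ true_P + 0.01 * rng.normal(size=(N, M))   # log-expression data

# With A fixed, the TFAs for each experiment are an ordinary
# least-squares fit (lstsq solves each column of E independently).
P_hat, *_ = np.linalg.lstsq(A, E, rcond=None)

print(np.allclose(P_hat, true_P, atol=0.1))       # TFAs recovered when A is known
```

The key limitation noted in the text is visible here: the estimate is only as good as the assumption that binding copy number equals regulatory strength.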
Other methods to deduce underlying regulatory signals (not TFA per se) include principal components analysis (PCA) and independent components analysis (ICA). These methods explain the collection of gene expression profiles by a set of underlying basis profiles (vectors) termed eigengenes or expression modes (Alter et al., 2000). In PCA the eigengenes are orthonormal. In ICA the expression modes are statistically independent and non-Gaussian (Lee and Batzoglou, 2003; Liebermeister, 2002). In this case independence is a stronger assumption than uncorrelatedness. The top k significant basis vectors describe the majority of the observed gene expression variance, and they have been shown to correspond to major biological phenomena in the cell cycle (Alter et al., 2000).
However, both ICA and PCA approaches fail to incorporate the biological structure of the network, and accordingly the description is based on statistical properties. To provide biological insight, one may integrate these basis vectors with expression profiles by robust regression (Yeung et al., 2002) or with binding site data (ChIP-chip) by pseudo-inverse projection (Alter and Golub, 2004). The latter provides a dynamic temporal view of TF-gene binding activity in select cell-cycle and replication initiation factors.
That PCA and ICA methods prove useful for mapping highly coordinated transcriptional responses, such as the cell cycle, is encouraging because it suggests that data decompositions can find independent features of related biological processes. However, orthogonal (or independent) basis vectors do not necessarily have to correspond to underlying biological features and in fact may depend on rotation. We argue that it is the topology (or structure) of the network that is important for determining TFA because TFs may work together to regulate gene expression. This insight is the motivation for NCA, a method that takes network topology into account in order to compute the component TFAs and control strengths from a dynamic non-linear model (Kao et al., 2005; Kao et al., 2004; Liao et al., 2003; Tran et al., 2005).
1.2 Network component analysis
NCA formulates gene expression as the product of the contribution of each regulating TFA using a combinatorial power-law model, which can be viewed as a log-linear approximation of any non-linear kinetic system in multiple dimensions (Almeida and Voit, 2003; Savageau, 1976; Torres and Voit, 2002; Voit and Almeida, 2004). It captures some non-linear synergistic effects and yet still remains mathematically tractable and generally applicable to most genes.
TFA itself is driven by interactions between proteins (P), metabolites (M) and other internal and external parameters (Θ). The change in activity over time can be represented mathematically as a non-linear function:

d(TFAj)/dt = fj(P, M, Θ)    (1)
Since DNA microarray data are commonly expressed as log-ratios with respect to a reference state, we derive the following log-linearized form of Equation (1) at steady state:

log(Ei(t)/Ei(0)) = Σj Aij log(TFAj(t)/TFAj(0))    (2)

or, in matrix form, E = AP + Γ, where Γ is the residual.
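As a numerical illustration (with small made-up matrices, not data from the paper), the combinatorial power-law model and its log-linear matrix form agree exactly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical small system: 4 genes, 3 TFs, 2 conditions.
A = np.array([[1.2,  0.0, -0.5],
              [0.0,  0.8,  0.0],
              [0.6, -0.3,  0.0],
              [0.0,  0.0,  1.1]])              # control strengths (zeros = no edge)
tfa_ratio = rng.uniform(0.5, 2.0, size=(3, 2)) # TFA(t)/TFA(0) per condition

# Power-law model: expression ratio of gene i = prod_j (TFA ratio_j)**A_ij.
expr_ratio = np.prod(tfa_ratio[None, :, :] ** A[:, :, None], axis=1)

# Taking logs turns the product into the bilinear form E = A P.
E = np.log(expr_ratio)
P = np.log(tfa_ratio)
print(np.allclose(E, A @ P))   # True: log-linearization is exact here
```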
NCA differs from PCA and ICA in that it constrains solutions using the known absent connections between TFs and promoters (i.e. zeros in the A matrix) and the known zero TFAs (i.e. zeros in the P matrix) from mixed wild-type and knockout datasets. The connectivity constraints are derived from genome-wide DNA binding assays or extensive literature search. The presence of a potential TF-promoter interaction indicates a non-zero control strength to be estimated as an element in A. Otherwise, the corresponding elements in A are constrained to zero.
Note that A and P are both unknown, which makes the objective of NCA decomposition similar to PCA and ICA. However, the results of PCA and ICA yield full connectivity between TFs and genes. The orthogonality or independence assumption and the resulting full connectivity are not always satisfied by biological networks. Yet without additional constraints (orthogonality or independence), the solution to the matrix decomposition problem is not unique.
The data matrix E can be a composite of parallel experiments from different strains and different conditions, in addition to time series; the corresponding columns of P represent the TFA ratio under that condition or in that strain. Thus, if the TFA ratio is known to be unchanged between an experimental condition and the reference (e.g. TF knockouts), the log TFA ratio in the corresponding elements of P can also be constrained to zero. If the zero patterns of A and P satisfy a set of mathematical criteria, then the decomposition of E into A and P is unique (Liao et al., 2003; Tran et al., 2005).
2.1 The revised third criterion for NCA
To ensure unique solutions to the matrix decomposition problem, NCA requires that three criteria be satisfied (Liao et al., 2003; Tran et al., 2005). In particular, one criterion requires that the P matrix have full row rank. This implies that the number of TFAs analyzed must be less than or equal to the number of data points. This criterion significantly limits the number of TFAs that can be derived from microarray data. To alleviate this problem, we noted that most genes are regulated by far fewer TFs than the total number of TFs of interest. Based on this observation, the third criterion is revised as follows (see proof in Section 5).
Given a data matrix E, the decomposition E = AP + Γ, s.t. A ∈ ZA, P ∈ ZP, is unique up to scaling factors (or essentially unique) for a given residual Γ if each reduced matrix Pri (i = 1, …, N) has full row rank and P ∈ ZP. The other two criteria remain the same (see Section 5).
Here E is an N × M matrix of gene expression data with N genes and M conditions; A, defined by a suitable zero pattern ZA, is N × L, where L is the number of TFAs (which, under the revised criterion, may exceed M); and P is L × M, also defined by a zero pattern ZP (Tran et al., 2005). Pri is defined as the matrix composed of the rows of P corresponding to the non-zero elements in row i of A. This new criterion is checked gene by gene during iteration. The revised criterion implies that the number of data points must be greater than or equal to the number of TFs regulating each gene. This improves the applicability of NCA, because for most organisms the maximal number of TFs regulating a gene is likely <5 or 6. For example, at least in Saccharomyces cerevisiae, genes regulated by a large number of TFs probably make up a small portion of the whole genome (Lee et al., 2002).
According to the revised third criterion, the TFAs of the TFs regulating each gene must be linearly independent across the available experiments. This mathematical condition cannot be verified a priori, but it can be tested during the numerical procedure. In general, analyzing multiple dissimilar experimental conditions is likely to generate linearly independent TFAs. In practice, the condition number is checked instead of the rank of the matrix, because noisy data may have full row rank but a high condition number. Therefore, the condition number of each Pri must be lower than a set threshold. This condition number is checked during the numerical iteration, which minimizes the Frobenius norm of Γ, ‖Γ‖F = ‖E − AP‖F, subject to the specific zero constraints on A and P:

min ‖E − AP‖F  s.t.  A ∈ ZA, P ∈ ZP
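The per-gene identifiability check can be sketched as follows. The threshold value and the handling of flagged genes are implementation choices not fixed by the text; the function below simply reports which genes violate the revised criterion.

```python
import numpy as np

def check_revised_criterion(A, P, cond_max=100.0):
    """Revised third NCA criterion: for every gene i, the rows of P
    corresponding to its regulators (non-zeros in row i of A) must form a
    well-conditioned reduced matrix Pri. Returns the violating genes."""
    violations = []
    for i in range(A.shape[0]):
        regulators = np.flatnonzero(A[i])      # TFs with an edge to gene i
        if regulators.size == 0:
            continue
        Pri = P[regulators, :]                 # reduced matrix for gene i
        # More regulators than data points can never give full row rank,
        # so flag that case immediately; otherwise test the conditioning.
        if Pri.shape[0] > Pri.shape[1] or np.linalg.cond(Pri) > cond_max:
            violations.append(i)
    return violations

# Toy check: 5 TFAs over 3 experiments (deliberately simple values).
P = np.array([[1., 0., 0.],
              [0., 1., 0.],
              [0., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.]])
A = np.zeros((2, 5))
A[0, [0, 1]] = 1.0           # gene 0: two regulators -> identifiable
A[1, [0, 1, 2, 3]] = 1.0     # gene 1: four regulators, only 3 experiments
print(check_revised_criterion(A, P))   # [1]
```

Note that the check depends only on each gene's own regulators, which is why the total number of TFs may exceed the number of experiments.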
2.2 Probabilistic aspects of identification
Since A and P are unknown, we seek essentially unique decompositions of E, which are related by an invertible scaling matrix X that preserves the zero patterns of A and P. Thus Ā = AX⁻¹ and P̄ = XP represent all equivalent solutions. If X is diagonal, it represents legitimate scaling factors. If X is not diagonal, then the non-diagonal elements of X recombine contributions between different TFs and degenerate the results.
If the NCA criteria are violated, then it is possible to have a non-diagonal X such that AX⁻¹ is still in ZA, but the CS values are degenerate. The result is an infinite number of decompositions of E into A and P beyond simple scaling. This is problematic for data decomposition because, when X is not diagonal, the solutions may be linear combinations of the columns of A and rows of P. This problem is common to any decomposition of a matrix into two unknown matrices; it also manifests itself in multivariate regression and factor analysis as collinearity, and is generally addressed by ridge regression or by a requirement of orthogonality (Anderson, 1984). Note that the degeneracy of solutions does not represent stochastic behavior of the system; it is a result of an ill-posed model.
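The degeneracy can be demonstrated in a few lines with synthetic matrices: when no zero constraints restrict A (full connectivity), any invertible non-diagonal X produces a different decomposition that reproduces the data exactly.

```python
import numpy as np

rng = np.random.default_rng(2)

A = rng.normal(size=(6, 2))    # fully connected: no zero constraints on A
P = rng.normal(size=(2, 4))    # two TFA profiles over four conditions
E = A @ P

# A non-diagonal invertible X mixes the two TFAs...
X = np.array([[1.0, 0.7],
              [0.0, 1.0]])
A2 = A @ np.linalg.inv(X)
P2 = X @ P

# ...yet reproduces the data exactly: the decomposition is degenerate.
print(np.allclose(E, A2 @ P2))     # True
print(np.allclose(P2[0], P[0]))    # False: TFA 1 is now a mixture of both
```

With a zero pattern ZA that satisfies the NCA criteria, such an X would violate the constraints, leaving only diagonal (pure scaling) transformations.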
In a Bayesian framework, the identifiability problem can be treated using a probabilistic approach. Degenerate solutions owing to violation of the identifiability criteria may in principle be distinguished using informative priors (Sabatti and James, 2005). However, uninformative priors, as in most common cases, result in ‘variance inflation’ in the posterior distribution of parameters that violates the identifiability criteria (Gelman, 2004). This is readily apparent from a simple example [Equation (6)].
To investigate the significance of this problem, Figure 1c details what happens when we increment a non-diagonal term of X in simulated data. The result is a uniform spread of the values of each point estimate of TFA2. Thus, in a probabilistic method such as Bayesian NCA (Sabatti and James, 2005), variance inflation will be seen in posteriors of A and P when the NCA identifiability criteria are not satisfied.
2.3 Enforcing the third criterion
It is possible that the revised third criterion is violated during iteration. In this case, we propose to orthogonalize each reduced matrix (subnetwork) whenever the regulating TFs of each gene violate criterion 3. This renders each co-regulating TFA independent, reducing the condition number and satisfying the third criterion. The new TFAs are then used to continue the NCA iteration.
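One way to sketch this orthogonalization step is via a QR decomposition that replaces the violating TFA rows with an orthonormal basis of their span. The exact update rule used inside the NCA iteration is a design choice; this is one possible version, assuming the number of regulators does not exceed the number of experiments.

```python
import numpy as np

def orthogonalize_subnetwork(P, regulators):
    """Replace the TFA rows regulating one gene with an orthonormal basis
    of their row space, so the reduced matrix becomes well-conditioned.
    Assumes len(regulators) <= number of experiments (columns of P)."""
    Pri = P[regulators, :]
    Q, _ = np.linalg.qr(Pri.T)            # QR on the transpose orthogonalizes rows
    P = P.copy()
    P[regulators, :] = Q.T[:len(regulators), :]
    return P

# Toy example: two nearly dependent TFA rows violate the criterion.
P = np.array([[1.0, 1.0, 1.0],
              [1.0, 1.0, 1.001],          # nearly parallel to row 0
              [0.0, 1.0, 0.0]])
regs = [0, 1]
print(np.linalg.cond(P[regs, :]) > 1e3)                  # badly conditioned
P_new = orthogonalize_subnetwork(P, regs)
print(np.linalg.cond(P_new[regs, :]) < 10)               # now well-conditioned
```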
2.4 Whitening transformation
Computation time is a significant factor for algorithms that analyze large regulatory networks. For NCA faster convergence may be obtained by preprocessing the microarray data by a whitening procedure. This method, common in neural network and data preprocessing applications, has the principal feature of flattening noise and amplifying the signal. Optimization can be performed in this transformed feature space, and the solution in the original space can be obtained by the inverse procedure as with ICA methods (Hyvarinen and Oja, 2000).
A zero-mean random vector x is white if its elements are uncorrelated and of unit variance. Any input data matrix can be made white by a linear transformation that decorrelates the experimental input vectors and scales them to unit variance (Hyvarinen and Oja, 2000). For microarray data, the whitening transformation can be computed from the eigenvalue decomposition of the variance–covariance matrix of E. Let ΣE represent the variance–covariance matrix of the microarray data E. Since ΣE is positive definite and symmetric, it can always be represented as ΣE = UDUT, where U is an orthogonal matrix whose columns e1, … , en are the eigenvectors of ΣE, and D is the diagonal matrix of eigenvalues of ΣE. Then one possible whitening transformation for E is:

Ẽ = D−1/2 UT E
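The transformation can be computed directly from the eigendecomposition. The sketch below whitens the condition dimension of a synthetic E; the centering step and covariance estimator are our assumptions, and this is one of several valid whitening matrices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: 500 genes x 8 conditions with correlated columns.
E = rng.normal(size=(500, 8)) @ rng.normal(size=(8, 8))

Ec = E - E.mean(axis=0)                   # zero-mean columns
Sigma = Ec.T @ Ec / (Ec.shape[0] - 1)     # variance-covariance matrix (8 x 8)
evals, U = np.linalg.eigh(Sigma)          # Sigma = U D U^T

# One possible whitening transformation: rotate by U, rescale by D^{-1/2}.
E_white = Ec @ U @ np.diag(evals ** -0.5)

# Whitened columns are uncorrelated with unit variance.
cov_white = E_white.T @ E_white / (E_white.shape[0] - 1)
print(np.allclose(cov_white, np.eye(8)))   # True
```

The optimization can then be run on E_white, with the solution mapped back to the original space by inverting the same transformation, as in ICA preprocessing.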
3 APPLICATION AND RESULTS
3.1 Network component analysis of the S.cerevisiae cell cycle
To test the biological applicability of our new algorithm, we analyzed S.cerevisiae cell-cycle data (Spellman et al., 1998). We restricted the analysis to the α-factor, cdc15, elutriation and cdc28 experiments (73 data points). We generated the connectivity network from ChIP-chip binding data by selecting TF-gene interactions with binding p-value < 0.001 (Lee et al., 2002): a TF-gene pair is connected if the binding p-value < 0.001, and disconnected (set to zero in the A matrix) otherwise. Note that the presence of a connection does not determine its final value; an edge may go to ∼0 after decomposition, or to some regulation strength value. The initial values of the non-zero connections in A may be set randomly. This procedure results in a connectivity network (N1) that covers 96 TFs over ∼2200 genes. The remaining 17 TFs were excluded owing to violations of NCA condition 2. Note that this network is data limited and not identifiable (96 TFs > 73 experiments) under the previous NCA criteria (Liao et al., 2003; Tran et al., 2005). However, most genes are regulated by <7 TFs, so we can apply our new criterion and numerical procedure. We ran NCA using whitening and the subnetwork orthogonalization (SO) procedure (as described above), starting with a random initial guess for the non-zero CS values in the connectivity network.
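The network-construction step described above can be sketched as a simple threshold on a genes × TFs matrix of binding p-values; the random p-value matrix and 0.05 planting rate below are purely illustrative stand-ins for real ChIP-chip data.

```python
import numpy as np

def build_zero_pattern(pvals, cutoff=1e-3):
    """Build the NCA zero pattern Z_A from a genes x TFs matrix of
    ChIP-chip binding p-values: an edge exists where p < cutoff.
    Genes with no regulator and TFs with no target are dropped."""
    Z = pvals < cutoff
    genes = Z.any(axis=1)
    tfs = Z.any(axis=0)
    return Z[np.ix_(genes, tfs)], genes, tfs

rng = np.random.default_rng(4)
pvals = rng.uniform(size=(100, 10))                   # toy p-value matrix
pvals[rng.uniform(size=pvals.shape) < 0.05] = 1e-4    # plant some bindings
Z, gene_mask, tf_mask = build_zero_pattern(pvals)

# Random initial guess for the non-zero control strengths, as in the text;
# edges may converge toward ~0 or toward some regulation strength.
A0 = np.where(Z, rng.normal(size=Z.shape), 0.0)
print((A0[~Z] == 0).all())    # True: the zero pattern is enforced
```

Further filtering for NCA condition 2 violations (which removed 17 TFs here) would follow this step and is omitted.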
3.2 Comparison of methods that compute TFA: NCA and REDUCE
To compare these results with REDUCE (e.g. MATRIX REDUCE) we obtained the TFAs reported by Bussemaker et al. (2001) and Gao et al. (2004). These data consist of 37 TFAs reported from the analysis of 113 TFs covering ∼6000 genes over ∼750 experimental data points. The authors determined that only 37 of the 113 transcription factor ChIP-chip binding patterns were significant predictors of gene expression in one or more experiments. As REDUCE analyzes each microarray experiment individually, we can extract cell cycle TFA data points without affecting the results. The common experiments include the alpha factor arrest, cdc15 and elutriation experiments (Spellman et al., 1998). In addition, Gao et al. (2004) analyzed additional cell cycle data from Zhu et al. (2000) that include a fkh1/fkh2 double knockout under α-factor treatment, as well as fkh1/fkh2 double knockout cell cycle experiments. The cdc15 data include a replicate at time t = 160, so we averaged these two data points for further analysis.
The S.cerevisiae cell cycle is periodic, with cell divisions approximately every 60 min. An interesting question to ask is which TFs have periodic TFAs, as these may indicate regulated/coordinated transcriptional responses and possibly provide additional evidence that a TF is in fact a cell cycle regulator. We tested the TFAs for periodicity in each individual cell cycle experiment using a robust estimator (Ahdesmaki et al., 2005) with the standard false discovery rate (FDR) adjusted p-value < 0.05. We did not detect periodicity in any of the exclusive experiments (cdc28, fkh1/fkh2 knockouts); however, in the common experiments we did detect periodic TFAs for both REDUCE and NCA (Table 1). In the case of NCA some TFAs were periodic in two or more experiments. In REDUCE most TFAs were statistically periodic in one cell cycle experiment only. For the cdc15 data we adopted an additional Bonferroni p-value correction (0.05/23) because the data might be noisy owing to periodic stress responses induced by the heat-shock arrest method (Zhang, 1999).
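The periodicity screen can be illustrated with a simpler classical test. The sketch below uses Fisher's exact g-test on the periodogram followed by Benjamini-Hochberg FDR adjustment; this is a stand-in for the robust estimator of Ahdesmaki et al. (2005) actually used in the paper, and the synthetic 18-point TFA series are illustrative only.

```python
import math
import numpy as np

def fisher_g_pvalue(x):
    """Fisher's exact g-test for a single periodic component in a series."""
    x = np.asarray(x, float) - np.mean(x)
    spec = np.abs(np.fft.rfft(x)[1:]) ** 2     # periodogram, zero freq dropped
    m = len(spec)
    g = spec.max() / spec.sum()                # fraction of power at the peak
    p = sum((-1) ** (j - 1) * math.comb(m, j) * (1 - j * g) ** (m - 1)
            for j in range(1, int(1 / g) + 1))
    return float(min(1.0, max(p, 0.0)))

def bh_adjust(pvals):
    """Benjamini-Hochberg FDR-adjusted p-values."""
    p = np.asarray(pvals, float)
    order = np.argsort(p)
    scaled = p[order] * len(p) / (np.arange(len(p)) + 1)
    adj = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty_like(p)
    out[order] = np.clip(adj, 0.0, 1.0)
    return out

rng = np.random.default_rng(5)
t = np.arange(18)                                          # 18 time points
periodic_tfa = np.sin(2 * np.pi * t / 6) + 0.1 * rng.normal(size=18)
flat_tfa = 0.1 * rng.normal(size=18)

p = [fisher_g_pvalue(periodic_tfa), fisher_g_pvalue(flat_tfa)]
print(bh_adjust(p)[0] < 0.05)      # True: the periodic TFA is detected
```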
Figure 2 shows plots of NCA- and REDUCE-computed TFAs for a few common cell cycle TFs. Overall the results suggest that the REDUCE TFAs (Fig. 2a–f, dotted lines) are noisier than the NCA TFAs (Fig. 2a–f, solid lines), but the two are qualitatively in agreement. For example, Figure 2a shows that the results of NCA and REDUCE are similar for the well-known cell cycle regulator ACE2. Both methods predict periodic phases that peak near cell division, and this peak is characteristic of ACE2 regulation of early-G1-specific gene transcription (Spellman et al., 1998). Figure 2b shows that the TFA of SWI4 exhibits two major peaks within a cell cycle, both of which correspond to late G1 transcriptional activity (Spellman et al., 1998). Thus, unlike the original periodicity measure of Spellman et al. (1998), we are able to extract and detect multiple periodic peaks that correlate with the cell cycle.
The remaining REDUCE TFAs (Fig. 2d–f) are not statistically periodic where the corresponding NCA TFAs are. The most probable reason is that REDUCE introduces numerical error by fitting a fixed CS value. Of interest is the TFA of DIG1: Figure 2c shows the NCA TFA, which peaks approximately once per cell cycle. This peak correlates with the currently known function of DIG1, a TF that mediates cell cycle arrest in response to mating factor (α-factor) at Start (Mendenhall and Hodge, 1998).
3.3 What is the effect of false connectivity?
False connectivity includes incorrect edges (false positives) and the non-detection of true edges (false negatives) by binding site analysis or experimental methods. The sensitivity of TFAs to false connectivity in the S.cerevisiae cell cycle was investigated previously (Yang et al., 2005). The authors altered up to 10% of the connectivity for each network by randomly deleting and inserting connections between TFs and genes. They found that most TFAs, especially the cell cycle regulators, are robust and not affected by perturbation.
In this work, we assessed TFA variability by using two connectivity networks constructed from ChIP-chip data with different p-value cut-offs. Here network N1 is selected from ChIP-chip data with a binding p-value < 0.001 (as described previously), and network N2 is selected analogously with a binding p-value < 0.01. We eliminated any genes and TFs that do not satisfy NCA condition 2. In N1 we found 31 periodic TFAs out of 96 in total (FDR p-value < 0.05), of which 23 are known cell-cycle regulators (Table 1). In N2 we found 29 periodic TFAs out of 112 (FDR p-value < 0.05), of which 14 are known cell-cycle regulators. Between these two networks, we identified nine well-known cell cycle regulators and computed the correlation coefficients of their TFAs between N1 and N2 (Table 2). These results are consistent with previous results obtained with NCA by subnetwork decomposition on similar cell cycle data (Yang et al., 2005), and with the results of other statistical methods that identify cell-cycle-regulated TFs by a non-parametric test of differential expression between sets of genes bound or not bound by a ChIP-chip TF (Tsai et al., 2005). Thus TFA computations are robust to connectivity network selection provided the NCA criteria are satisfied.
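The cross-network comparison reduces to a per-TF Pearson correlation between the two TFA profiles. The sketch below shows this computation on synthetic profiles (the TF name and toy series are illustrative, not the paper's values).

```python
import numpy as np

def tfa_correlations(P1, P2, tf_names1, tf_names2):
    """Pearson correlation of each shared TF's activity profile between two
    decompositions (e.g. networks built with different p-value cut-offs)."""
    rho = {}
    for tf in tf_names1:
        if tf in tf_names2:
            a = P1[tf_names1.index(tf)]
            b = P2[tf_names2.index(tf)]
            rho[tf] = float(np.corrcoef(a, b)[0, 1])
    return rho

# Toy data: the same underlying TFA recovered from two networks with noise.
rng = np.random.default_rng(6)
base = np.sin(np.linspace(0, 4 * np.pi, 20))
P1 = np.vstack([base + 0.05 * rng.normal(size=20)])
P2 = np.vstack([base + 0.05 * rng.normal(size=20)])

rho = tfa_correlations(P1, P2, ["SWI4"], ["SWI4"])
print(rho["SWI4"] > 0.9)    # True: a TFA that is robust to the cut-off choice
```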
Table 2: TF | ρ(N1,N2) | Number of genes in N1 | Number of genes in N2
3.4 Statistical significance of TFA
The researcher may be interested in identifying experiments where one or more TFAs are significantly perturbed with respect to the reference. To determine statistical significance, we construct a null hypothesis distribution for each experimental data point by keeping the network connectivity pattern and shuffling (scrambling) the genes. The distribution is built by multiple iterations of permutation testing (∼100). From this distribution we compute the Z-score of the original TFA for each experimental data point. The result is a p-value at each data point that determines whether the computed TFA differs from the null hypothesis.
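The permutation scheme can be sketched as follows. Here `solve_nca` is a placeholder for the full NCA decomposition (E → P); for a self-contained demonstration we substitute plain least squares with a known A, which is only a stand-in.

```python
import numpy as np

def tfa_zscores(E, solve_nca, n_perm=100, seed=0):
    """Permutation null for TFA significance: shuffle gene rows so the
    connectivity pattern no longer matches the data, re-solve, and express
    the observed TFAs as Z-scores against the null distribution."""
    rng = np.random.default_rng(seed)
    P_obs = solve_nca(E)
    null = np.stack([solve_nca(E[rng.permutation(E.shape[0])])
                     for _ in range(n_perm)])            # (n_perm, L, M)
    return (P_obs - null.mean(axis=0)) / null.std(axis=0)

# Toy system: 100 genes in two regulons, 2 TFs, 6 experiments.
rng = np.random.default_rng(7)
A = np.zeros((100, 2))
A[:50, 0] = 1.0
A[50:, 1] = 1.0
P_true = rng.normal(size=(2, 6))
E = A @ P_true + 0.1 * rng.normal(size=(100, 6))

solve = lambda e: np.linalg.lstsq(A, e, rcond=None)[0]   # stand-in for NCA
z = tfa_zscores(E, solve, n_perm=50)
print(z.shape)    # (2, 6): one Z-score per TFA per experiment
```

Converting each Z-score to a p-value (e.g. via the normal tail) then flags the data points where a TFA departs significantly from the null.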
NCA is a method to deduce TFAs and CS from gene expression data and a connectivity network. However, previous mathematical requirements limited the total number of TFAs inferred in any single analysis to the experiment sample size. In this work, we developed a new identifiability criterion that relaxes this requirement, such that the number of data points only has to exceed the maximal number of TFs regulating any single gene. Since most genes are regulated by fewer than five TFs, one needs only five transcriptome data points for NCA, yet can analyze far more TFs than data points.
4.1 NCA compared with linear methods for computing TFAs
Another method similar to NCA is REDUCE, which computes TFAs from gene expression data and a connectivity network. Like REDUCE, NCA uses the connectivity network as a model of the global regulatory network. The significant difference between these methods is that NCA estimates CS from the gene expression data, whereas REDUCE fixes CS to binding affinity (as determined by ChIP-chip assay or from promoter binding site analysis). Thus, unlike NCA, REDUCE does not distinguish between positive and negative regulation of the same TF on different genes, and this may affect the accuracy of TFA estimates. It is well established that binding strength does not necessarily correlate with regulatory activity; using binding data as the fixed CS may therefore compromise TFA estimates.
On the other hand, NCA is also similar to PCA and ICA in that it specifies constraints that guarantee a unique bi-linear data decomposition. These constraints are derived from the connectivity network. From a bipartite network point of view, PCA and ICA assume full connectivity between TFs and genes. In addition, PCA and ICA require that each basis vector be decorrelated from or independent of the others. Thus these decompositions represent a statistical interpretation of gene regulation. We argue that integrating the connectivity network directly, rather than through projection or regression methods, provides a more detailed description of biological phenomena at the TF level, and thus more accurate quantification of TFAs and CS.
4.2 The importance of identifiability
NCA identifiability violations result in infinitely many solutions that represent linear combinations of TFAs. Therefore, these degenerate results are not interpretable from a biological standpoint. One possible way to circumvent this problem is to use a Bayesian approach that assigns different prior probabilities to different solutions (Sabatti and James, 2005). In this framework, the equivalent solutions due to non-identifiability are distinguished based on given prior probability distributions. This approach may work well if informative prior distributions are available. However, without such information, violation of the identifiability criteria leads to variance inflation, and the ability to distinguish among equivalent solutions is lost. This situation is particularly severe if a large fraction of the network violates the identifiability criteria. Therefore, it is important to determine whether the problem is well-posed before analysis.
In conclusion, we note that it is important to determine the identifiability of bi-linear data decomposition such as NCA. In this regard, this article provides a new criterion that allows realistic TFAs to be determined from less data in both deterministic and probabilistic approaches.
5 PROOF OF CRITERION 3 FOR NETWORK COMPONENT ANALYSIS
To define the reduced matrices Grj (j = 1, … , L), which are used in Criterion 2 below, we first build the combination matrices Gj from A (N × L) and P (L × M). Gj is mathematically described as
Given a matrix E (N × M), if the following conditions are satisfied, then the decomposition E = AP + Γ, where A (N × L) ∈ ZA and P (L × M) ∈ ZP, is unique up to a non-singular diagonal scaling matrix X (L × L). In other words, for the same allowable Γ, any alternative decomposition E = ĀP̄ + Γ, where Ā ∈ ZA and P̄ ∈ ZP, differs only by a diagonal matrix X (L × L), i.e. Ā = AX⁻¹ and P̄ = XP.
Criterion 1: A has full column rank.
Criterion 2: Each reduced matrix Grj (j = 1, … , L), defined in Tran et al. (2005) and in Definition 2, has a rank of L −1.
Criterion 3: Each row of E is regulated by rows of P that are linearly independent.
Criteria 1 and 2 are the same as those proven previously in Tran et al. (2005); Criterion 3 is proved in the following section.
5.1 Proof of the third criterion:
Therefore, the necessary condition is z ≥ q. If r = M = min(L, M), then z ≥ L − M, and hence the number of non-zero elements in Ai (i.e. the number of transcription factors regulating gene i), l = L − z, must be less than or equal to M (i.e. the number of experiments).
Note that the left null matrix N ∈ ℝq × L, having rank q, represents q sets of linear constraints on the dependent rows of P; each column j of N gives the linear coefficients relating row j to the other rows of P. This means that if row j of P is linearly independent of all other rows, column j of N will be zero. For example, if rows 1 and 2 of P (5 × 4) are linearly dependent (say row 2 = c × row 1), the left null matrix is N = [c, −1, 0, 0, 0].
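This structure of the left null matrix can be verified numerically. The matrix below is an illustrative 5 × 4 P whose second row is twice its first; the left null space is computed from the SVD.

```python
import numpy as np

P = np.array([[1., 2., 0., 1.],
              [2., 4., 0., 2.],    # row 2 = 2 * row 1: linearly dependent
              [0., 1., 1., 0.],
              [1., 0., 2., 1.],
              [0., 0., 1., 3.]])   # rows 3-5 are independent of the rest

# Left null space of P: vectors n with n P = 0, spanned by the trailing
# left singular vectors (those beyond the rank of P).
u, s, vt = np.linalg.svd(P)
rank = int((s > 1e-10).sum())      # here rank = 4, so q = 5 - 4 = 1
N = u[:, rank:].T                  # q x 5 left null matrix

print(np.allclose(N @ P, 0))       # True: rows of N annihilate P
print(np.allclose(N[0, 2:], 0))    # True: columns for the independent
                                   # rows 3-5 of P are zero, as stated
```

Up to scale, N is proportional to [2, −1, 0, 0, 0], the single dependence relation among the rows of P.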
This work was supported by NSF grant CCF-0326605 and the UCLA-DOE Genomics and Proteomics Institute. S.J.G. was partially supported by a UCLA-IGERT bioinformatics traineeship (NSF-IGERT 9987641). L.M.T. was supported by a UCLA-IGERT bioinformatics traineeship (NSF-IGERT 9987641).
Conflict of Interest: none declared.