Information-incorporated gene network construction with FDR control

Abstract Motivation Large-scale gene expression studies allow gene network construction to uncover associations among genes. To study direct associations among genes, partial correlation-based networks are preferred over marginal correlations. However, FDR control for partial correlation-based network construction is not well-studied. In addition, currently available partial correlation-based methods cannot take existing biological knowledge to help network construction while controlling FDR. Results In this paper, we propose a method called Partial Correlation Graph with Information Incorporation (PCGII). PCGII estimates partial correlations between each pair of genes by regularized node-wise regression that can incorporate prior knowledge while controlling the effects of all other genes. It handles high-dimensional data where the number of genes can be much larger than the sample size and controls FDR at the same time. We compare PCGII with several existing approaches through extensive simulation studies and demonstrate that PCGII has better FDR control and higher power. We apply PCGII to a plant gene expression dataset where it recovers confirmed regulatory relationships and a hub node, as well as several direct associations that shed light on potential functional relationships in the system. We also introduce a method to supplement observed data with a pseudogene to apply PCGII when no prior information is available, which also allows checking FDR control and power for real data analysis. Availability and implementation R package is freely available for download at https://cran.r-project.org/package=PCGII.


S-1.1 Simulation Results of Moderate Sample Sizes
This subsection presents additional simulation results on the performance of E-TEST, F-GGM, CLEVEL, and PCGII.Supplementary Figure S-1 is the result for count data under dense network settings with moderate sample sizes in a way presented similar to Figure 1 in the main manuscript.The advantage of PCGII becomes increasingly evident as the network structure grows more intricate.To sum up, Figure 1 and Supplementary Figure S-1 together illustrate that E-TEST cannot achieve high power as it is very conservative after FDR control, while F-GGM can detect more edges but fails to control FDR, resulting in a great number of false edges.CLEVEL shows effective FDR control, and PCGII as a power-enhancing alternative to CLEVEL can detect a larger number of true edges while maintaining the FDR controlled to the desired level.

S-2 Supplementary Results for Analysis of Excess Power
This section presents supplementary figures of excess power, defined in Section 3 of the main manuscript.Supplementary Figures S-5, S-6, and S-7 show results for count data with moderate sample sizes obtained under the scale-free, random and block-diagonal structures, respectively.S-4 Supplementary Results for Supplemented Data Analysis on E. coli Dataset

S-4.1 Methodology
When prior knowledge of gene-gene associations is not available, incorporating information can be challenging.We propose supplementing the original network by including a pseudogene and analyzing the supplemented data.Connections between the pseudogene and the original set of genes will be used as prior knowledge.In addition, this supplemented information can be used to evaluate a method's performance.
Suppose a pseudogene connects with a subset of p nodes under the original study.Then based on the observed dataset {Y i,j }, we can generate a supplemented dataset { Ỹi }, where Ỹi = ( Ỹi,1 , ..., Ỹi,p , Z i ) ′ , Ỹi,j = Y i,j + a j Z i , and Z 1 , ..., Z n are i.i.d.from standard normal distribution.It is important to note the difference between Y i,j and Ỹi,j , where the former is a sample of the j th node under the original p×p network but the latter is supplemented data under the (p+1)×(p+1) pseudo-network.We use Σ p×p and Σ(p+1)×(p+1) to denote the corresponding covariance matrices.Note that, a j demonstrates the marginal covariance between the pseudogene and the j th node in the pseudonetwork.It can be shown that As Σ + aa ′ − aa ′ is indeed the original covariance matrix Σ and is always invertible, by the property of block matrix inverse, where a = (a 1 , ..., a p ) ′ , a j ∈ R, for j = 1, ..., p.
Equation 2 implies the first p × p block matrix in Σ−1 is exactly the original Σ −1 .So, when we analyze the partial correlation graph of the supplemented data Ỹi = ( Ỹi,1 , . . ., Ỹi,p , Z i ) ′ , the upper p × p block Ψ of the expanded partial correlation matrix of { Ỹi } is the same as the partial correlation matrix of the original data.Furthermore, the block matrix −a ′ Σ −1 , denoted as h ′ (a 1 × p row vector), characterizes the conditional correlations between the pseudogene Z and the original set of p genes.In practice, h ′ is unknown as Σ −1 is unknown.However, it is feasible to specify a sparse vector a, then each component of h is a linear combination of only a few entries in Σ −1 .Note that, non-zero a j leads to non-zero h j as the jth diagonal entry of Σ −1 is not zero.Moreover, since Σ −1 is assumed to be sparse, h is also a sparse vector that can be estimated precisely based on the original data.We consider estimating Σ −1 by Graphical Lasso (with R package glasso), denoted as Σ−1 glasso , and then compute ĥ′ = −a ′ Σ−1 glasso .As a is designated to be sparse, the estimation error of ĥ′ is expected to be small.Lastly, true connections between the pseudogene and the original set of genes are determined by non-zero a j s, because non-zero a j leads to nonzero h j .Nulls are estimated by ĥj , where ĥj are ranked by their absolute value, and those of the lowest magnitudes are defined as "estimated nulls".

S-4.2 Application on E. coli Dataset
To assess the performance of different approaches on E. coli dataset, we randomly selected 20 out of 102 a j s, set as 0.1, while the rest were all 0. The set {j : a j = 0.1, j = 1, 2, ..., 102} was denoted as L, |L| = 20.Note L determines the true connections between the pseudogene and the original set of genes.To estimate null connections, Graphical Lasso was applied to the original dataset and the estimated precision matrix was denoted as Σ−1 glasso .We then ranked ĥj by its absolute value and considered the 20 of the lowest magnitude as "estimated nulls".For PCGII, 20 out of 20 true connections were incorporated as prior information because the knowledge is true and free to use under the supplemented data analysis framework.To control variability arising from sampling Z i , we fixed a and repeatedly generated supplemented dataset { Ŷi } (k) , k = 1, 2, ..., K.For each supplemented dataset { Ŷi } simulated, we applied four approaches.We stored the set of significant edges S k found by each approach given FDR controlled at .05.The final results were aggregated from S k by majority voting.Supplementary Table S-1 shows the total discoveries, true discoveries of edges determined by nonzero a j s, and false discoveries of estimated null edges.PCGII identified 19 out of 20 true edges due to information incorporation, with a 0% error rate.The other three approaches all discovered some estimated null edges while identifying fewer true edges.Two sets depict the outcomes of supplemented data analysis using CLEVEL and PCGII, denoted as "CLEVEL on supplemented data" and "PCGII on supplemented data".The third set represents the results of the original data analysis by CLEVEL, serving as the reference.Four Venn diagrams suggest analysis of the original data without the pseudogene is highly consistent with the analysis of the supplemented dataset.This indicates that supplementing the observed data with the pseudogene did not seem to change the main results of the other methods.

S-5 Supplementary Results for Sensitivity Analysis
In this section, we present the simulation setting and results of investigating how sensitive PCGII is concerning the size and accuracy of prior information.Supplementary Figures S-11, S-12, and S-13 present results for count data with moderate sample sizes under the scale-free structure, the random structure, and the block diagonal structure, respectively.
To further investigate how PIS and PA affect the performance of PCGII, we considered a 3 by 3 factorial design additionally, where PIS = 100%, 70%, 30% and PA = 1, 0.7, 0.3.The power and empirical FDR of PCGII were computed by averaging across 500 datasets.
Note the two prior information scenarios, one with PIS=100% and PA=0.7 and the other with PIS=70% and PA=1, both provide 70% of true edges.The distinction lies in the nature of the prior information utilized.In the first case, the prior information contained 30% false information, whereas in the second case, only true information was incorporated.Comparing the results from these two applications will shed light on the impact of including false positives in the prior set of the proposed method.
The findings indicate that the higher the proportion of correct information in the prior, the greater the power of PCGII.Whether the prior set contains false edges or not, the ROC curves and power are quite similar as long as an equivalent amount of correct information is provided to PCGII.For instance, the ROC curves from the settings with PIS=100% and PA=0.7, and PIS=70% and PA=1, exhibit substantial overlap, and the box plots of total power for both settings show comparable distributions.
When the prior information is accurate, FDR is effectively controlled regardless of the prior information size.However, inaccurate prior information inflates the empirical FDR, and the extent of inflation is influenced by the amount of false information included.When the prior information is inaccurate (e.g., PA=0.3),PCGII does not solely rely on the prior information but also leverages the data to recover the network.Notably, a network constructed purely based on the prior information with PA=0.3 results in a false discovery proportion of 0.7.In such instances, PCGII improves the network with a much lower FDR of approximately 0.3, indicating its capability to fully use the data for network recovery.
Furthermore, larger sample sizes enhance power and FDR control, whereas the number of nodes and the complexity of the network do not significantly impact the results.These findings are consistent across various network settings.

Figure S- 1 :
Figure S-1: Simulation Results for Count Data under Dense Network Settings with Moderate Sample Sizes.Panel (A): ROC curves.Panel (B): False Discovery Rate.Panel (C): Precision-Recall curves.Each panel includes three rows of plots under different network structures and dimensions.Colors distinguish approaches.All presented network structures are dense.Dashed and solid lines represent n = 60 and n = 80, respectively.Line colors indicate different approaches.The prior setting used for PCGII is PIS=.3 and PA=1.

Figure S- 3 :
Figure S-3: Simulation Results for Count Data with Small Sample Sizes.For both sub-figures (a) and (b), panels are listed as follows.Panel (A): ROC curves.Panel (B): False Discovery Rate.Panel (C): Precision-Recall curves.Each panel includes three rows of plots under different network structures and dimensions.Colors distinguish approaches.All presented network structures are dense.Dashed and solid lines represent n = 10 and n = 15, respectively.Line colors indicate different approaches.The prior setting used for PCGII is PIS=.3 and PA=1.

Figure S- 4 :
Figure S-4: Simulation Results for Normal Data with Small Sample Sizes.For both sub-figures (a) and (b), panels are listed as follows.Panel (A): ROC curves.Panel (B): False Discovery Rate.Panel (C): Precision-Recall curves.Each panel includes three rows of plots under different network structures and dimensions.Colors distinguish approaches.All presented network structures are dense.Dashed and solid lines represent n = 10 and n = 15, respectively.Line colors indicate different approaches.The prior setting used for PCGII is PIS=.3 and PA=1.
Figure S-5: Excess Power for Count Data under The Scale-free Structure.Excess power versus empirical FDR curves under different combinations of p, η, PIS and PA.Dashed and solid lines represent n = 60 and 80, respectively.Line color indicates different methods.
Figure S-8: Supplementary Figures for FERONIA Network Analysis.(a)Prior network used for PCGII; (b) Venn diagram: common connections between networks constructed by PCGII, CLEVEL, and F-GGM.
Venn diagrams in panels (a), (b), and (c) of Supplementary Figure S-10 showcase the intersections and distinctiveness of edges identified by CLEVEL, E-TEST, and F-GGM across both original and supplemented data analyses.Each diagram highlights the comparison between results obtained from the original and supplemented data analyses for each method.Panel (d) of Supplementary Figure S-10 is the Venn diagram displaying the relationship among three sets.Each of the three sets represents the reconstructed edges in the target E. coli network.
Figure S-11: Performance of PCGII across Different Settings of Prior Information under The Scalefree Structure.(a)ROC curve; (b) empirical FDR; (c) box-plots of power across 500 datasets for each setting while FDR was controlled at the nominal level of 0.10; (d) box-plots of empirical FDR across 500 datasets for each setting while FDR was controlled at the nominal level of 0.10.PIS: prior information size; PA: prior accuracy.
Supplementary Figure S-2 presents all results for normal data with moderate sample sizes.It shows the results are highly consistent with those for count data (Figure1and Supplementary FigureS-1) and suggests that four methods are robust to count data.This subsection presents simulation results for both count and normal data with small sample sizes that may occur in some real studies.Supplementary Figures S-3 and S-4 are the results for count data and normal data respectively.The relative rankings of different methods stay the same, although power and FDR control are worse with small sample sizes than moderate sample sizes for all methods.

Table S -
1: Summary of Supplemented Data Analysis on E. coli Dataset.