Materials and Methods Literature mining and manual curation of transcriptional regulatory interactions in Arabidopsis thaliana

Literature mining and manual curation of transcriptional regulatory interactions in Arabidopsis thaliana In this study, we used two data sources, PubMed Abstracts and ResNet Plant 3.0 (now called Pathway Studio Plant), a commercial knowledgebase for molecular relationships mined from more than 190,000 PubMed abstracts and 60,000 full-text articles from 28 plant-specific journalsplant) (Nikitin, et al. 2003), to collect verified transcriptional regulatory interactions in A. thaliana. Using each of the 1,701 A. thaliana transcription factors (TFs, " proteins that show sequence-specific DNA binding and are capable of activating or/and repressing transcription ") in PlantTFDB 2.0 (Zhang, et al. 2011) as input keywords, we retrieved 4,150 TF-associated interactions (supplementary fig. S1A) from ResNet Plant 3.0 and identified 3,211 TF-associated interactions from PubMed abstracts using MedScan (Novichkova, et al. 2003). After pooling the results, we obtained 4,663 TF-associated interactions. We subsequently manually assessed each interaction in the following manner: 1) Screened for transcriptional regulatory interactions. Transcriptional regulatory interactions represent interactions between TFs and the promoters of the target genes. After reviewing the original texts, we removed 3,134 interactions falsely recognized by the text-mining tools or non-transcriptional regulatory interactions (e.g., protein-protein interactions). 2) Identified transcriptional regulatory interactions missed by the text-mining tools. Through reviewing the original texts, we further identified 195 transcriptional regulatory interactions that were reported in original texts but missed by the text-mining tools. 3) Assigned a regulatory activity for each interaction. TFs activate and/or repress the transcription of target genes through binding specific cis-elements. Through reviewing original texts and original papers, we further assessed the regulatory activity (activation/repression) of each interaction and removed 55 interactions without determined regulatory activity. 4) Mapped the gene names of interactions to The Arabidopsis Information Resource (TAIR) IDs. To this end, we removed 26 interactions with gene names that could not be mapped to a unique TAIR ID and 212 redundant interactions. Ultimately, we collected 1,431 functionally confirmed transcriptional regulatory interactions, 44.5% (637 of 1,431) of which represented regulations between two TFs in Arabidopsis, and constructed an Arabidopsis transcriptional regulatory map (ATRM) (supplementary fig. S1B). involved in developmental processes (GO:0032502 developmental process) and genes involved in stress response processes (GO:0006950 response to stress, GO:0009607 response to biotic stimulus, or GO:0009628 response to abiotic stimulus). The genes involved in both developmental and stress response processes were labeled " Dev. & Res. " Genes lacking " biological process " annotations (including those without biological process annotation or with …


4) Mapped the gene names of interactions to The Arabidopsis Information Resource (TAIR)
IDs. To this end, we removed 26 interactions with gene names that could not be mapped to a unique TAIR ID and 212 redundant interactions.

Biological process assignment
Based on Gene Ontology (GO) annotation with experimental evidence (evidence code: EXP, IDA, IPI, IMP, IGI, or IEP; version: TAIR 6/05/2012) (Berardini, et al. 2004), we identified genes involved in developmental processes (GO:0032502 developmental process) and genes involved in stress response processes (GO:0006950 response to stress, GO:0009607 response to biotic stimulus, or GO:0009628 response to abiotic stimulus). The genes involved in both developmental and stress response processes were labeled "Dev. & Res." Genes lacking "biological process" annotations (including those without biological process annotation or with biological process annotation but without experimental evidence) and genes not involved in the developmental or stress response processes were classified as "other".

Quality evaluation of the ATRM
Because TFs regulate the transcription of target genes, TFs and target genes co-exist in the same biological process. The proportion of regulations that co-exist in the same biological process is typically larger in high-quality transcriptional regulatory networks than in low-quality networks.
After combining all TFs ("TF list") in the regulator column of AtRegNet and the ATRM, we mapped their "biological process" (BP) annotations onto plant GO slim using the map2slim tool (http://search.cpan.org/~cmungall/go-perl/scripts/map2slim). We subsequently selected GO slim terms with no fewer than 10 mapped TFs and filtered out those terms too general for a TF, such as "regulation of RNA biosynthetic process," for continued analysis. TFs in the "TF list" and all genes mapped to the selected slim terms were marked as "mapped TFs" and "mapped genes", respectively. All combinations between "mapped TFs" and "mapped genes", except the selfregulations, were used as the background. The regulation in any of the selected biological processes described above was regarded as "co-existing" in the same process. One-tailed binomial tests between different datasets were performed to confirm that the proportion of regulations in the same biological process of the test sample was no greater than that of the background. We evaluated the quality of the ATRM through comparisons of the proportion of regulations co-existing in the same biological process with the proportion of regulations in the background, AtRegNet, and AtRegNet (confirmed) (high-reliability regulations in AtRegNet (Yilmaz, et al. 2011)).

Identification of transcriptional regulatory communities
In networks, communities are defined as components with more intra-regulations than interregulations, which usually perform relatively specific functions (Fortunato 2010). Employing a Markov clustering algorithm, we classified the ATRM into 156 communities using CytoMCL 1.1 (Guzzi and Cannataro 2012) with an inflation parameter of 2.0. Sixty-two of the identified communities containing no fewer than five members were used for subsequent analyses. GO enrichment for each community was performed using topGO (Alexa and Rahnenfuhrer 2010), and genes with BP annotation were used as the background. The P values were adjusted for multiple tests using the method of Benjamini and Hochberg (Benjamini and Hochberg 1995).
Based on the enriched GO terms, we assigned a name for the community whose enriched terms were consistent in terms of biological processes.

Measurement of the global topological structure
We used the largest connected component covering 98.7% and 94.2% of the regulations of the developmental and stress response sub-networks, respectively, as the representative network for the following analysis. Four parameters (<Targets per TF>, <TFs per target>, <Path length>, and <Clustering coefficient>; "< >" indicates the average value in networks) were used to measure the global topological structure of the transcriptional regulatory networks. In a transcriptional regulatory system, <Targets per TF> indicates how many targets could be immediately regulated by this TF; <TFs per target> measures the complexity with which a gene is regulated; <Path length> measures how long it takes for a signal to transfer from a TF to terminal genes; and <Clustering coefficient> indicates the complexity of regulations among TFs (Luscombe, et al. 2004). The global topological parameters of the representative developmental sub-network and stress response sub-network were calculated using igraph 0.6 (Csardi and Nepusz 2006).

Calculation of the information content of the TF binding matrices
High-quality TF binding matrices in plants were downloaded from TRANSFAC (professional 2011) (Matys, et al. 2006). The following methods were used to map the TRANSFAC id to the TAIR id. 1) For A. thaliana TFs with a gene alias in TRANSCFAC, we used the gene alias to map the TFs directly. 2) For TFs of other plant species or those without a gene alias in TRANSFAC, we used BLAST (Altschul, et al. 1997) to search against Arabidopsis genes to identify the best one-to-one pair. For TFs with two or more binding matrices, the matrix with the most sequences in construction was selected for the subsequent analysis.
The information content (IC) of the binding matrix of a TF measures the distinction of the binding profile from arbitrary sequences (Schneider, et al. 1986;Hertz and Stormo 1999). In this study, the previously described methods (Hertz and Stormo 1999) (equations 2 and 3) were adopted to calculate the IC of the binding matrices. To avoid any bias resulting from the difference in the sequence numbers used to construct the matrices, we adopted an adjusted pseudo count in the calculation (equation 1). TFs with two or more binding matrices were used to determine to ensure that the ICs of the TF binding matrices were comparable and not subject to systematic bias among matrices from different sequence numbers. AT1G13260 from the RAV family, which has two different types of DNA-binding domains, represented a special case, and we added the ICs of the two corresponding binding matrices together to represent its IC.
is the pseudo count added according to the number of sequences s used to construct this matrix; is the count of nucleotide at position in the binding matrix, is the prior probability of nucleotide , and is the corrected frequency of nucleotide at position ; is the width of the matrix, and is the IC of the binding matrix.
To determine whether the calculated IC of the binding matrices of the TFs (Available at http://atrm.cbi.pku.edu.cn/download.php) could successfully represent the binding specificity of each TF, we predicted the putative target genes in the upstream 1,000 bp of the A. thaliana genome using Match (Matys, et al. 2006). A predefined cutoff for each matrix in TRANSFAC was used to minimize the sum of false positives and false negatives (Matys, et al. 2006). The high negative correlation between the IC of the binding matrix and the number of predicted targets (Spearman"s rank correlation ρ = -0.76 and P = 7.72e-15) suggests that the IC of the binding matrices of TFs calculated using this method successfully represents the binding specificities of the TFs.

Identification of network motifs
We used Mfinder 1.2 (Milo, et al. 2002) to screen all possible three-node regulatory patterns (supplementary fig. S6A) and to identify enriched regulatory patterns among them. By generating 1,000 randomized networks with out-degree, in-degree, and mutual-degree conserved, we identified the three-node regulatory patterns that appeared significantly higher than those in randomized networks under default thresholds (P < 0.01, Mfactor > 1.10, and Uniqueness ≥ 4). P was calculated based on 1,000 randomized networks, Mfactor is the ratio between the number of this regulatory pattern in the real network and its number in randomized networks, and Uniqueness is the number of distinct sets of nodes involved in this regulatory pattern in the real network (Milo, et al. 2002).
To determine whether there was any novel network motif in the Arabidopsis transcriptional regulatory network compared with those of unicellular organisms, we also identified network motifs in E. coli and S. cerevisiae. The transcriptional regulatory network of E. coli was downloaded from RegulonDB 8.0 (Salgado, et al. 2013), and that of S. cerevisiae with direct evidence and confirmed function was retrieved from YEASTRACT (Abdulrehman, et al. 2011).
We further classified the network motifs in the ATRM into two classes, motifs in development and motifs in stress responses, based on the following criteria: when two or more nodes of the three nodes were involved in the developmental process and no node was involved in the stress response process, this motif was assigned as a motif in development, and vice versa for motifs in the stress response. The TFs with BP annotations were used as the background for the biological process enrichment analysis of the TFs involved in network motifs. P was adjusted for multiple tests using the method of Benjamini and Hochberg (Benjamini and Hochberg 1995).

Kinetic simulation of the novel network motifs in Arabidopsis
The changing rate of the transcription level of a gene represents the combined effect of the basic transcription rate, the rate of transcriptional activation/repression of other genes, and the degradation rate. We referred to kinetic equations of a previous study (see equations 4-6) (Mangan and Alon 2003) to simulate the function of the novel network motifs in Arabidopsis. In equations 4-6, B i represents the basic transcription rate of gene i. K ji is the transcriptional activation or repression coefficient of gene i through gene j, and ( ) is the transcriptional the degradation rate of gene i. For a gene activated through two TFs, e.g., Z activated through X and Y, the equation f(X, K xz , Y, K yz ) = (K xz C x + K yz C y )/(1 + K xz C x + K yz C y ).
In our simulation, the lowest transcription level of gene i was 0, and the highest level was 1, with a maximum allowable activation/repression rate of 1 (∑ ≤ 1). To fulfill this constraint, we used B i = 0, α = 1, β = 0.5, and K ji = 1. The high expression of gene X represented the signal occurring during a defined period (supplementary fig. S6B), and X was transcribed with the highest rate 1 at the beginning (supplementary figs. S6C and S6D). Genes Y and Z were initially transcribed with rate 0. When the transcription levels of Y and Z were no less than 0.5, these genes were activated to activate/repress the transcription of related genes. The high expression of X represented one state, and the high expression of Z represented another state. The kinetic simulations were performed using ODE45 in MATLAB. The MATLAB source codes are available at http://atrm.cbi.pku.edu.cn/download.php.

Classification of ancient and novel TF families
TFs are classified into 58 families according to their signature domains in PlantTFDB 2.0 (Zhang, et al. 2011). By dating their birth times based on 28 plants with sequenced genomes, we classified 54 TF families appearing in the most recent common ancestor (MRCA) of land plants into two types: ancient and novel families. TF families present in any of the nine green alga species were defined as ancient families, and TF families present in the MRCA of 19 land plants, but absent from the nine green alga species, were defined as novel families.
To determine whether the ancient families were previously present in E. coli, S. cerevisiae, or H. sapiens, we used the built-in TF prediction pipeline of PlantTFDB 2.0 (Zhang, et al. 2011) to scan the genome proteins of the three species under a relaxed cutoff (e-value ≤ 0.01 for sequence cutoff and domain cutoff). Based on the presence in E. coli, S. cerevisiae, or H. sapiens, ancient families were further divided into two types, "Ancient1" and "Anicent2". When ancient families were identified in any of the three species, these families were classified as "Ancient1"; otherwise, these families were classified as "Ancient2".

The wiring preference of ancient and novel TF families in biological processes
We used a "preference index," the proportion of genes involved in the developmental or stress response processes, to represent the wiring preference of TF families in biological processes. TFs involved in the developmental or stress response processes were clustered using BLASTClust (Altschul, et al. 1997) to merge highly redundant sequences (cutoff: coverage 0.9 and identity 0.9). Each cluster was regarded as a "Refgene" in this study. Refgenes only in developmental or stress response processes and families that included no fewer than five Refgenes were used to calculate the preference index.

The wiring positions of ancient-and novel-family TFs in the ATRM
We compared the following aspects of the wiring of ancient-and novel-family TFs: "Targets per TF," "TFs per target," the number of Motifs (5, 6) involved, the number of Motifs (10,11,12) involved, and the proportion of TFs to target genes. The first two aspects were used to determine whether there was any bias in the connectivity between the TFs of ancient and novel families; the last three aspects were used to determine whether there was any wiring preference in the transcriptional regulatory network between the TFs of ancient and novel families. Selfregulations in the ATRM were removed in this analysis. TFs in the ATRM with a degree of no fewer than four (the median degree of TFs in the ATRM) were used to calculate the wiring of these TFs in the network. When calculating the proportion of TFs in the targets, we only used TFs with no fewer than four targets. For each aspect, we summarized the numbers of novel-and ancient-family TFs that were fewer than and more than the average value. One-tailed Fisher"s exact tests were performed to compare the wiring preferences of novel-and ancient-family TFs.

Classification of ancient and novel TF families in E. coli, S. cerevisiae, and H. sapiens
The TFs of E. coli, S. cerevisiae, and H. sapiens were downloaded from DBD (Wilson, et al. 2008), and only TFs predicted through Pfam HMMs were used in this study. The taxonomic distribution of TF families was adopted from V. Charoensawan et al. (Charoensawan, et al. 2010).
Novel families in E. coli, S. cerevisiae, and H. sapiens were considered as those with taxonomic distributions limited to Proteobacteria, Fungi, and Metazoa, respectively.

Binding specificities of TFs and the proportion of TFs to target genes in E. coli, S. cerevisiae, and H. sapiens
The transcriptional regulatory networks in E. coli, S. cerevisiae, and H. sapiens were downloaded from RegulonDB 8.0 (Salgado, et al. 2013), YEASTRACT (Abdulrehman, et al. 2011), and the ENCODE project (Gerstein, et al. 2012), respectively. The TF binding matrices of E. coli were downloaded from RegulonDB 8.0 and those of H. sapiens were downloaded from TRANSFAC (professional 2011, only matrices from SELEX were used) (Matys, et al. 2006 (Kersey, et al. 2012), respectively. These data were used to cluster TFs that descended from the common ancestors at

Clustering of TFs descended from a common ancestor
four key time points in the evolution of A. thaliana as "Refgenes" to investigate the distribution of these TFs in biological processes (supplementary fig. S8). Only TFs with orthologous genes in these species were used in this study.

Identification of TF individuals born during plant landing and classification of old and young TF individuals
Using the methods described above, we clustered the TFs  1D). Interestingly, novel interactions for AP2 revealed a potential mechanism for the function of AP2 as an A-class gene.

Robustness and significance of the differences in the global topological structures of the developmental and stress response sub-networks
We observed that the developmental and stress response sub-networks were different in global We further performed the following analyses to assess the robustness and significance of the observed differences. To determine whether the representative sub-networks sampled from the Arabidopsis transcriptional regulatory networks robustly reflected the differences between the two sub-systems, we randomly sampled 50%, 60%, 70%, 80%, and 90% of the regulations from the developmental and stress response sub-networks 1,000 times, and the results showed that the differences between these two sub-networks were robust ( fig. 2B). To determine whether the observed differences reflected the different sizes of the developmental and stress response subnetworks, we sampled developmental sub-networks comprising the same number of regulations as those in the stress response sub-networks 10,000 times, which revealed that the differences between these sub-networks were significant (supplementary fig. S4A). In addition, we obtained consistent results using a different version of the GO annotation (supplementary fig. S4B) or with genes involved in both developmental and stress response processes counted during network classification (supplementary fig. S4C). Moreover, the binding specificities of the TFs and the predicted networks in A. thaliana generated consistent results (supplementary fig. S5 and supplementary table S2).
These results demonstrate that the differences in the global topological structures of the developmental and stress response sub-networks are robust and significant.

Other possible reasons for the wiring preference of novel-family TFs in biological processes
Why are TFs of novel families preferentially wired into developmental processes rather than stress response processes? Potential explanations include the following: 1) bias duplication of some TF individuals; 2) selective pressure for development during plant landing; 3) the wiring preferences of young TF individuals; or 4) the properties of novel-family TFs. Discussions of explanations 1-3 follow: 1) Does the wiring preference reflect the bias duplication of some TF individuals?
Clustering TFs descended from the common ancestors as "Refgenes" at four key time points in the evolution of A. thaliana yielded consistent results (supplementary table S9), suggesting that the observed wiring preference was persistently present in its evolutionary history and cannot be explained as the bias duplication of some TF individuals.
2) Did the wiring preference result from selective pressure for development during plant landing?
By dividing ancient families into families present or absent in E. coli, S. cerevisiae, or H.
sapiens, we determined that later-born ancient families were also preferentially wired into developmental processes (supplementary fig. S9). Compared with the ancient-family TFs born during plant landing, TFs of novel families still showed the same wiring preference (supplementary table S10), demonstrating that the observed preference did not reflect the potential selective pressure for development during certain periods.
3) Did the wiring preference result from the wiring preferences of later-born TF individuals?
A comparison of the wiring positions of TF individuals born during different periods revealed that later-born TF individuals did not display wiring preferences for developmental processes (supplementary table S11), confirming that the properties of novel-family TFs, and not their time of birth, affected the wiring preferences of the novel-family TFs.  Table S2. Topological structures of the predicted developmental and stress response sub-networks in A. thaliana. We used Match to predict these networks in the upstream 1,000 bp of the gene transcription start site (TSS) using binding matrices from TRANSFAC.

Supplementary Figures
Regulations with binding sites of no fewer than two and an expression correlation coefficient value (Pearson correlation coefficients, PCCs) of no lower than 0.30, 0.35, and 0.40 were adopted as the putative regulations. The PCCs for Arabidopsis genes were downloaded from ATTED-II (Obayashi, et al. 2009