Putative cis-Regulatory Elements Predict Iron Deficiency Responses in Arabidopsis Roots1[OPEN]

More than 100 putative cis-regulatory elements robustly predict Arabidopsis root iron deficiency responses in computational models and shed light on the mechanisms of transcriptional regulation. Plant iron deficiency (−Fe) activates a complex regulatory network that coordinates root Fe uptake and distribution to sink tissues. In Arabidopsis (Arabidopsis thaliana), FER-LIKE FE DEFICIENCY–INDUCED TRANSCRIPTION FACTOR (FIT), a basic helix-loop-helix (bHLH) transcription factor (TF), regulates root Fe acquisition genes. Many other −Fe-induced genes are FIT independent, and instead regulated by other bHLH TFs and by yet unknown TFs. The cis-regulatory code, that is, the cis-regulatory elements (CREs) and their combinations that regulate plant −Fe-responses, remains largely elusive. Using Arabidopsis root transcriptome data and coexpression clustering, we identified over 100 putative CREs (pCREs) that predicted −Fe-induced gene expression in computational models. To assess pCRE properties and possible functions, we used large-scale in vitro TF binding data, positional bias, and evolutionary conservation. As one example, our approach uncovered pCREs resembling IDE1 (iron deficiency-responsive element 1), a known grass −Fe response CRE. Arabidopsis IDE1-likes were associated with FIT-dependent gene expression, more specifically with biosynthesis of Fe-chelating compounds. Thus, IDE1 seems to be conserved in grass and nongrass species. Our pCREs matched among others in vitro binding sites of B3, NAC, bZIP, and TCP TFs, which might be regulators of −Fe responses. Altogether, our findings provide a comprehensive source of cis-regulatory information for −Fe-responsive genes that advance our mechanistic understanding and inform future efforts in engineering plants with more efficient Fe uptake or transport systems.

The micronutrient iron (Fe) is crucial for survival of all organisms. Plants encounter Fe deficiency (2Fe) on calcareous and alkaline soils or during developmental phases with increased sink demands. As a central component of heme and Fe-sulfur (FeS) clusters, Fe acts in redox processes in plants in basically all important metabolic processes, such as the respiratory and photosynthetic electron transport chains, chlorophyll biosynthesis, DNA replication and repair, and nitrogen and sulfur assimilation. Consequently, plants react to 2Fe with a range of molecular, physiological, and morphological adjustments, which is reflected in transcriptional alterations of more than 1000 genes in Arabidopsis (Arabidopsis thaliana; Dinneny et al., 2008;Rodríguez-Celma et al., 2013;Mai et al., 2016). In the shoots, the photosynthetic machinery is remodeled, leading to visible leaf chlorosis symptoms, and essential Fe-requiring processes are prioritized, which can be achieved through break-down of dispensable Fe-bound proteins and Fe redistribution between organelles (Blaby-Haas and Merchant, 2013;Balk and Schaedler, 2014;Hantzis et al., 2018). In the roots, genes controlling soil Fe uptake and detoxification of other transition metal ions acquired along with Fe are up-regulated. Additionally, Fe is mobilized from internal storages and distributed to Fe sinks. 2Fe also leads to changes in root architecture and root hair morphology (Brumbarova et al., 2015;Curie and Mari, 2017;Jeong et al., 2017;Li and Lan, 2017).
To acquire soil Fe, grasses secrete mugineic acid (MA) family phytosiderophores and import Fe 31 -MA complexes into the root ("Strategy II"). In contrast, nongrass monocots and dicots, such as Arabidopsis, acquire Fe via a reduction-based mechanism, in which soil Fe 31 is solubilized by lowering the local pH through proton extrusion, followed by reduction to plant-accessible Fe 21 at the root epidermis and Fe 21 uptake ("Strategy I"; Marschner and Römheld, 1994). In Strategy I, secreted chelators (mainly phenylpropanoidderived coumarins or riboflavin derivatives) aid efficient Fe 31 solubilization and reduction (Fourcroy et al., 2014;Schmid et al., 2014). Thus, Fe chelation is important during acquisition in both strategies.
Despite these conserved regulatory and functional interactions of bHLH proteins between Arabidopsis and rice, it remains unclear if other regulatory components between Strategy I and II are conserved. For example, in grasses, IRON DEFICIENCY-RESPONSIVE ELEMENT BINDING FACTOR1 (IDEF1, ABI3VP1 subfamily of B3 TF) and IDEF2 (NAC TF) coordinate 2Fe responses through binding to IDE1 (iron deficiencyresponsive element 1) and IDE2 Ogo et al., 2008). IDE1 has been connected to induction of genes involved in Strategy II MA biosynthesis and Fe-MA uptake (Kobayashi et al., 2005;Ogo et al., 2007). However, whereas barley (Hordeum vulgare) IDE1 can drive reporter gene expression in tobacco (Nicotiana tabacum) in a 2Fe-dependent manner and IDE1-like motifs are present in several Arabidopsis 2Fe response genes, a function for IDE1 has not been shown in Strategy I plants (Kobayashi et al., 2003;Kobayashi et al., 2005;Kobayashi et al., 2007;Murgia et al., 2011).
A coexpression network built with 2Fe-responsive genes has provided insight into the 2Fe regulatory system in Arabidopsis (Ivanov et al., 2012;Schwarz and Bauer, 2020). Among coexpression clusters, one contains root-specific and FIT-dependent genes involved in Fe acquisition, whereas another one is composed of root-and shoot-expressed FIT-independent genes, regulated by IVc and IVb bHLH proteins. In this work, we refer to robust (i.e. consistently identified in different studies) FIT-dependent and FIT-independent genes as the gold standard (GS; i.e. benchmark) 2Fe-induced genes. The concept to discriminate FIT-dependent and FIT-independent coexpression clusters is informative for interpreting mutant phenotypes and to place regulators into the 2Fe response cascade (e.g. Zhang et al., 2015;Liang et al., 2017;Gratz et al., 2019;Kim et al., 2019;Schwarz and Bauer, 2020). FIT-independent network genes mostly act in subcellular and long-distance transport and distribution of Fe and in Fe signaling, and they include subgroup Ib BHLH genes and PYE (Ivanov et al., 2012;Schwarz and Bauer, 2020). The latter, PYE, controls NICOTIANAMINE SYNTHASE4 (NAS4), ZINC-INDUCED FACILITATOR1 (ZIF1), and FERRIC RE-DUCTION OXIDASE3 (FRO3) of the same coexpression regulon (Long et al., 2010).
For most 2Fe-responsive genes, including reliable marker genes (Ivanov et al., 2012;Mai et al., 2016), the specific cis-regulatory elements (CREs) that coordinate their expression are unknown. Computational approaches uncover regulatory connections on a genome-wide scale, such as through elucidating the cis-regulatory code, i.e. the collection of CREs and the genes they regulate in a given regulatory context (Yáñez-Cuna et al., 2013). Putative CREs (pCREs) could be identified computationally by the overrepresentation of sequences in the promoter regions of coregulated genes. Combining with data for TF binding motifs (TFBMs) in Arabidopsis (Weirauch et al., 2014;O'Malley et al., 2016), regulatory connections can be made between TFs, binding sequences, and target genes. To further improve the confidence of computationally derived cis-regulatory code, machine learning algorithms (for review, see Ma et al., 2014) can be applied to build models with pCREs to predict gene expression levels or transcriptional responses. These models can be evaluated by how well they can predict expression levels or responses of genes that are not used for model training. Most importantly, a good model indicates that the pCREs used are most likely important for regulating the expression/response of interest. Previous work has demonstrated the suitability of machine learning for elucidating the cis-regulatory code of environmental stress responses in Arabidopsis (Zou et al., 2011;Uygun et al., 2019).
To get a deeper understanding of 2Fe response regulation, we elucidate its underlying cis-regulatory code using a combination of coexpression, motif finding, and machine learning approaches. Because some TFs have well established roles in 2Fe response, we use these TFs to validate our findings. In addition, we combined genome, transcriptome, and in vitro protein-DNA interaction data to uncover links between pCREs controlling 2Fe responses and their upstream TFs. With pCREs overrepresented in promoters of coexpressed genes we modeled 2Fe-induced up-regulation and identified over 100 informative pCREs of 2Feresponsive processes.

Overview of Approach and Functions of 2Fe-Responsive Genes
To identify root 2Fe-associated CREs at a genomewide scale, we defined root 2Fe response coexpression clusters; then we identified k-mers (sequences with k number of nucleotides) enriched in the promoter regions of those genes, and finally we modeled 2Fe response on the basis of the enriched promoter k-mers. An overview of our complete workflow including functional analysis of the identified pCREs is shown in Figure 1A. Because many factors impact the discovery of functional connections between genes, such as the choice of data set or the measure used to define expression similarity (Uygun et al., 2016), we used multiple expression data combinations and algorithms with varying parameters (see "Materials and Methods").
2Fe-responsive genes (log 2 fold-change [log2FC] $ 1 or # 21, q , 0.05) were identified using transcriptomic data available for six time points after a 2Fe treatment in Arabidopsis seedling roots (Dinneny et al., 2008). Enrichment analysis (Fisher's exact test, or FET; q , 0.05) of biological process gene ontologies (GOs) showed that, next to Fe-related GOs (e.g. Fe transport, homeostasis, and Fe-Sulfur [FeS] cluster assembly), responses to several hormones were overrepresented, including auxin, ethylene, abscisic acid (ABA), and jasmonic acid (Fig. 1B). This is consistent with the roles of hormones in 2Fe response (Brumbarova et al., 2015) and in root and root hair morphology and development (Schmidt et al., 2000), which were also enriched GOs. 2Fe affects the photosynthetic machinery and often correlates with oxidative stress responses (Rodríguez-Celma et al., 2013), which is reflected in enrichment of GOs regarding oxidative stress, photosynthesis, and primary metabolism even in roots ( Fig. 1B; Supplemental Fig. S1). Figure 1. 2Fe pCRE identification workflow and transcriptomic data. A, pCRE identification workflow. B, Heatmap of enrichment (FET, q , 0.05) of selected GO terms in genes that were significantly up-(red) or down-regulated (blue; q , 0.05) at $1 of 6 time points in 2Fe-treated roots of 6-d-old seedlings (Dinneny et al., 2008). Differential regulation was defined as log 2 fold-change (log2FC) $ 1 or # 21 (treatment vs. control). GOs are sorted by category, and expression patterns of genes corresponding to Fe-related GOs are shown below the heatmap. Yellow genes indicate 2Fe GS genes. C, Transcriptomic data combinations, which were used for clustering of coexpressed genes. Gray filled boxes in columns depict which expression data were used in each data combination top), and indicate if only up-regulated genes (up) or upand down-regulated genes (up & down) were considered (bottom). 1 Dinneny et al., 2008;2 Kilian et al., 2007;3 Goda et al., 2008;4 Schmid et al., 2005; *DC 5 tested with (5a) and without (5b) genotoxic stress data, (5b) input only upregulated genes.

Using Multiple Expression Data Sets to Define 2Fe Coexpression Clusters
We next grouped differentially regulated 2Fe response genes into coexpression clusters using two approaches: k-means clustering and correlation to robust GS 2Fe response genes. K-means clustering was based on the transcriptional responses to 2Fe alone and on 2Fe responses combined with different responses to other stress and developmental conditions ( Fig. 1C; Schmid et al., 2005;Kilian et al., 2007;Dinneny et al., 2008;Goda et al., 2008). Correlation-based clusters were generated for each gene in our curated list of 28 GS 2Fe response genes (see "Materials and Methods"; Supplemental Table S1). These clusters contain the query GS gene, and all 2Fe response genes (GS and non-GS) with a significantly similar expression pattern to the GS gene (Pearson's Correlation Coefficient, or PCC; see "Materials and Methods"). Again, the different combinations of transcriptional data were used (Fig. 1C).
The main reason for using both k-means and correlation-based approaches was to maximize chances of detecting clusters of coregulated genes. Our earlier studies demonstrated that the clustering approach used impacts how well we can recover genes with similar functions from expression data and that different clustering approaches complement each other (Uygun et al., 2016). The k-means algorithm is commonly used and generally provides good results; however, k-means clusters are impacted by the arbitrarily chosen value of k. We took this into account by testing different k values (see "Materials and Methods"); yet coexpression associations, most importantly with GS genes, might have been missed. Thus, in addition to using k-means, we applied the widely used coexpression measure PCC. For this correlation-based approach, GS genes were used as queries to establish GS-centric clusters that regulate central 2Fe response mechanisms. Genes that did not correlate with GS genes were not considered in this PCC-based approach but would be captured by the k-means approach. Therefore, these two approaches are complementary and enabled us later to compare regulatory elements of cluster types with and without GS genes.
In total, 1,520 k-means and 439 PCC-based clusters were defined. To characterize them, we determined enriched GOs. We grouped clusters with similar biological functions according to their enriched GOs (FET, q , 0.05) into "superclusters" ( Fig. 2A; Supplemental Fig. S2A; see "Materials and Methods"). Superclusters were defined as groups of at least 20 clusters with significantly higher similarity to each other than the average of all clusters (all Mann-Whitney U, P , 2.2e-16; Supplemental Fig. S2, B and C). We found that not all k-means clusters have directly Fe-related functions. For example, while k-means supercluster C was enriched in the Fe-related GO "cellular response to Fe" (GO shared by $75% of coexpression clusters within the supercluster), k-means superclusters A and B shared GOs related to different stress responses.
Because the current The Arabidopsis Information Resource (TAIR; http://www.arabidopsis.org) GO annotation for 2Fe response-related processes does not contain all 2Fe-responsive genes of interest (missing, for example, the GS genes MYB10, UDP-GLUCOSYL TRANSFERASE 72E1 [UGT72E1], AT3G07720, FE UP-TAKE-INDUCING PEPTIDE3 [FEP3] or NAS4; Ivanov et al., 2012), we also determined if k-means coexpression clusters were enriched with GS genes ("GSenriched"; FET, q , 0.05; Fig. 2A, right). While many of these clusters were part of the Fe-related GO supercluster A, the GS approach allowed us to identify an additional 23 Fe-related coexpression clusters that would have been overlooked by conventional GO enrichment analysis. In total, 7% of the k-means clusters were GS enriched (Fig. 2B, left). Applying this same analysis to the PCC-based clusters ( Fig. 2A; Supplemental Fig. S2A), we found higher levels of similarity between PCC-based clusters compared with k-means clusters (Mann-Whitney U, P , 2.2e-16; Supplemental Fig. S2D), because we precondition their identification on GS genes, some of which are tightly coregulated (Ivanov et al., 2012). Accordingly, we found that 93% were GS enriched (Fig. 2B, right).
GS genes are either regulated by FIT ("FIT-dependent", typically Fe acquisition genes) or by other factors ("FITindependent" genes, typically involved in Fe distribution). We found this reflected in our GS-enriched clusters: 71% of the GS-enriched clusters were specifically enriched with either FIT-dependent (39%) or FIT-independent genes (32%; FET, q , 0.05; Fig. 2B, bottom). The remaining GS-enriched clusters were enriched with both ("mixed"). Interestingly, clusters based on combined expression data (i.e. data combinations [DC] 2, 3, 5a/b, 6) were more often enriched with FIT-dependent or FIT-independent genes, whereas 2Fe time course data alone (DC 1) produced mainly mixed category clusters (Fig. 2C). The utility of including spatial or developmental data (DC 2, 6) to define coexpression clusters reflects that 2Fe-responsive genes act at different time points and in different tissues and organs (Dinneny et al., 2008;Ivanov et al., 2012;Jeong et al., 2017). Finally, genes in clusters not enriched with GS genes ("nonenriched") tended to respond to particular abiotic stresses, for example cold (Supplemental Fig. S3 e.g. clusters 937, 973) or salt (e.g. clusters 900, 915, 936, 987), whereas expression of GS-enriched cluster genes tended to randomly oscillate under different abiotic stresses (e.g. clusters 818,835,858,889). This might indicate different regulatory networks and highlight the usefulness of incorporating additional abiotic stress data (as in DC 3, 5, 6) when defining coexpression clusters that are likely coregulated.
In summary, by using different expression data sets and clustering methods, we defined 1,959 2Fe coexpression clusters, many of which were enriched with FIT-dependent and/or FIT-independent GS genes. These represent possibly coregulated functional units in Fe acquisition and Fe homeostasis processes, wellsuited to identify pCREs, which can explain 2Feinduced up-regulation. Genes in coexpression clusters that were enriched with 2Fe-responsive genes but not GS genes (nonenriched clusters) are presumably regulated by mechanisms different from GS-enriched clusters.
A Machine Learning Approach to Model Regulation of 2Fe-Responsive Coexpression Clusters The machine learning algorithm Random Forest (RF) has been successfully used to model stress transcriptional responses using cis-regulatory sequences in plants (Deng et al., 2017;Uygun et al., 2017;Zou et al., 2011). Here, for each coexpression cluster, we used pCREs (enriched k-mers in putative promoter sequences) to build a RF model that classifies genes as belonging to the cluster in question or as a nonresponsive gene (see "Materials and Methods"). Here the model performance is measured using the F1 score defined as the harmonic mean of precision and recall, where a perfect model would have a F1 of 1 and, because we used balanced data sets (50% responsive and 50% nonresponsive) in model training, the F1 for a model performing no better than random guessing would have an F1 of 0.5. The pCREs from models performing above a defined threshold (moderate model performance, F1 $ 0.7) were considered further. Out of 1,959 coexpression clusters, 28% of the models passed the performance threshold, 60% performed poorly, and for 12% no model could be built due to small sample size (median size 5 2 genes; Supplemental  Table S2), likely due to the lack of training data. Nonetheless, 66 large clusters (.100 genes, median size 5 135) also performed poorly (median F1 5 0.64). This is likely because these large clusters are too heterogeneous containing genes with multiple regulatory codes (Uygun et al., 2016;Uygun et al., 2017), and/or are coregulated but not at the transcriptional level (e.g. posttranslationally controlled). Interestingly, of the 28% of clusters with models above the threshold, only 36% were GS-enriched clusters (Supplemental Fig. S4A). Nonetheless, models built for GS-enriched clusters (median F1 5 0.68) tended to perform better than models built for nonenriched clusters (median F1 5 0.65; Mann-Whitney U, P , 2.358e-09; Fig. 3, A-C). The 159 GS-enriched (27 k-means, 132 PCC based) and 358 nonenriched (349 k-means, 9 PCC based) cluster models (total n 5 517) performed above the threshold and were considered further.
Good model performance indicates that genes in a cluster are more likely coregulated, and, because pCREs were used to build the model, these pCREs are likely the regulatory sequences contributing to the coregulation. Each pCRE had an importance score, Figure 2. Characterization of the defined coexpression clusters by GO terms, 2Fe GS gene content, and FIT-dependent/FIT-independent gene content. A, Similarity between coexpression clusters evaluated by enrichment of GOs. GO similarity between cluster pairs was calculated with the Jaccard index. Top, Heatmap of GO similarities between k-means clusters (n 5 985). Bottom, PCC clusters (n 5 238). Clusters were generated from expression data containing only up-regulated genes ( Fig. 1C; similarity between clusters containing up-or downregulated genes: Supplemental Fig. S2A). Clusters were grouped by hierarchical clustering. Superclusters (A-F) were defined as groups of .20 clusters that have a within-mean Jaccard index significantly higher than the mean Jaccard index of all clusters. Enriched GO terms shared by $75% (k-means) and $90% (PCC) of the clusters in each supercluster are shown (left). Coexpression clusters enriched with 2Fe GS genes are designated (yellow, right). B, Proportions of all k-means (top left) and PCC (top right) clusters in which 2Fe GS genes are significantly overrepresented (yellow). Of those, the proportion of clusters enriched with FIT-dependent genes (FIT, blue), FIT-independent genes (non-FIT, red), or with both (mixed, gray) was calculated (bottom). C, Proportions of enrichment categories FIT, non-FIT, and mixed clusters that resulted from using different expression data combinations (as in Fig. 1C). 1, 2Fe time course; 2, time course 1 root zones; 3, time course 1 abiotic stresses; 4, time course 1 hormone treatments; 5a, time course 1 abiotic stresses 1 hormones; 5b, as 5a, genotoxic stress deleted; 6, time course 1 abiotic stresses 1 developmental data. *PCC clusters only. All enrichment analyses: FET, q , 0.05. reflecting how useful a pCRE was for predicting 2Fe response genes in a cluster (see "Materials and Methods"). Because the models were trained on presence/absence of pCREs, the rate of a gene being correctly predicted (true positive; TP) is positively correlated with the total number of pCREs present in its promoter (Spearman's rank correlation coefficient, calculated for 159 GS-enriched clusters; median 5 0.8; Supplemental Fig. S4D). However, not all pCREs were necessary for overall good model performance, mostly due to pCRE redundancy. The minimum set of pCREs (termed min-pCREs), which are sufficient to predict a cluster model, will be addressed in a later section. Figure 3, D-F, demonstrates the principle of RF models for coexpression clusters with good ( Fig. 3, D and E) and poor (Fig. 3F) performance. Roughly speaking, the more genes are correctly predicted as -Fe responsive, the better the overall model performance.
Across 517 well-performing cluster models, we identified 5,639 pCREs enriched in promoters of 2Feresponsive genes that may be predictive of 2Fe-induced up-or down-regulation. The pCREs of each cluster model were specific for the coexpressed genes in these clusters. As control, we applied the models to all 2Fe response genes in general. As expected, the specific models could not generally predict any 2Fe response genes (Supplemental Fig. S4E; Supplemental Table S3). Instead, pCREs were even specific for models of 2Fe-responsive expression at certain time points or in different root zones. For example, expression of cluster 44 (expression starts 24 h upon 2Fe treatment) is predicted by a different set of pCREs than cluster 120 (expression starts at 12 h). Similarly, cluster 532 (genes expressed in root zone 4) and cluster 552 (genes expressed in root zones 3 and 4) are predicted by different sets of pCREs (Supplemental Fig. S5).
To further evaluate the biological relevance of pCREs, in the following sections, we assess pCREs based on their association with GS-enriched or nonenriched clusters, importance for model performance, and similarity to known TF binding sites. Known 2Fe CREs from Arabidopsis and also from Left, expression profile of all genes in the coexpression cluster (log 2 foldchange [log2FC]). Center, percentage of times each gene was correctly predicted as 2Fe-responsive across 50 RF replicates (TP; black 5 100%, white 5 0%). Right, all pCREs of the cluster sorted by importance rank (top ranked pCRE on the left). Heatmap designates whether pCRE was present (gray) or absent (white) in the gene's promoter. T, 2Fe treatment time course. R, 2Fetreated root zones 1-4. F1 score, harmonic mean of precision and recall, with 1 5 perfect prediction and 0.5 5 random guessing. Detailed cluster information (IDs, F1 scores, genes, pCREs): Supplemental Table S2. grasses, for example E-/G-boxes (bHLH TF binding sites) and IDE1, will serve as positive controls.

Identifying Common pCREs Across Coexpression Clusters
We expect true 2Fe response CREs to be (1) important for building models with good performance in predicting 2Fe response and (2) reliably identified in coexpression clusters with similar gene content. For an overview of the 5,639 pCREs, we calculated for each the proportion of clusters enriched with the pCRE and its average importance rank across those clusters (Supplemental Table S4). The importance rank of a pCRE was derived from its importance score in the trained RF models.
pCREs were identified in between 1 (0.6%) and 56 (35%) GS-enriched clusters and in between 1 (0.3%) and 54 (15%) nonenriched clusters. Also, 173 pCREs were considered "frequent pCREs" (freq-pCREs), because they were identified in .5% of GS-enriched or nonenriched clusters (Supplemental Table S4). Across GSenriched clusters, pCREs tended to occur at higher proportions and with higher importance ranks than across nonenriched clusters (Mann-Whitney U, P , 2.2e-16; Fig We next determined whether GS-enriched clusters are regulated by a different set of pCREs than nonenriched clusters. Out of the 5,639 pCREs, 15% (n 5 860) were unique to GS-enriched clusters, whereas 73% (n 5 4,109) were unique to nonenriched clusters. Twelve percent (n 5 670) were found in both GS-enriched and nonenriched clusters (Fig. 4A, inset). This indicates that GS-enriched clusters and nonenriched clusters are regulated partly by different pCREs, but also by a fraction of shared pCREs. However, 43% (n 5 286) of the 670 shared pCREs were predominant to GSenriched clusters (i.e. having only low proportion and low importance rank in nonenriched clusters; Supplemental Fig. S6C, top). This indicates that pCREs that were categorized as shared might not be equally important for regulating both GS-enriched and nonenriched clusters. Interestingly, unique GS-enriched freq-pCREs represented 59% (n 5 102) of the 173 freq-pCREs, whereas 35.5% (n 5 58) were unique nonenriched, and only 7.5% (n 5 13) freq-pCREs were shared between GS-enriched and nonenriched clusters, indicating that pCREs with high proportion are also the ones that seem to regulate almost exclusively either GS-enriched or nonenriched cluster functions, but not both (Fig. 4A,inset;Supplemental Fig. S6C,bottom). Furthermore, freq-pCREs tended to have higher importance ranks than nonfrequent pCREs (Mann-Whitney U, P , 1.924e-14; Supplemental Fig.  S6D). Together, this suggests that freq-pCREs could be particularly relevant for regulation of 2Fe response mechanisms.
To characterize the freq-pCREs, we grouped them according to sequence similarity using pairwise PCC distances of pCRE position weight matrices (PWM; see "Materials and Methods"). Sixty-two percent (n 5 107) of all freq-pCREs could be placed into one of eight pCRE groups (Fig. 4, C and D; Supplemental  Fig. S6E). Freq-pCREs of the same group tended to be predictors of the same cluster category (GS-enriched/ nonenriched).
In summary, we identified more than 100 2Fe pCREs that were reliably associated either exclusively with GSenriched or nonenriched coexpression clusters or with high preference for one of the categories. Those pCREs were also ranked as important for machine learning models and might therefore be candidates for functionally relevant motifs to different responses to 2Fe. Based on threshold similarities that differentiate between TFBMs and random sequences (Uygun et al., 2017), we were able to match a specific TF and/or a specific TF family to each of the 173 freq-pCREs (see "Materials and Methods"; Fig. 5A; Supplemental  Fig. S7). This indicates that these freq-pCREs are not randomly occurring promoter sequences. To gain an overview that TF families could be associated with GS-enriched clusters, we calculated the proportions of TF family TFBMs that match freq-pCREs from GS-enriched and nonenriched clusters (Fig. 5B). We found most TF families with higher proportion in either GS-enriched clusters (14 TF families) or nonenriched clusters (12), whereas only four were equally distributed between both categories.
Most known 2Fe regulators in Arabidopsis are bHLH TFs (e.g. FIT, subgroup Ib, IVb, and IVc bHLH proteins; Jakoby et al., 2004;Wang et al., 2007;Long et al., 2010;Zhang et al., 2015;Gao et al., 2020). Two more important TFs are MYB10 and MYB72 (Palmer et al., 2013). Consistently, we identified bHLH and MYB TF families associated with our clusters, and even with higher proportion in GS-enriched clusters than in nonenriched clusters. Other matching TF families overrepresented in GS-enriched clusters were bZIP (FET, q , 0.05), B3, TCP, and NAC. Although a B3 TF (IDEF1; ABI3VP1 subfamily of B3) and a NAC TF (IDEF2) are important regulators of Strategy II Fe acquisition in grasses Ogo et al., 2008), the role of these TF families in Strategy I nongrass plant species has not yet been described. In contrast, ARID, WRKY (both FET, q , 0.05), Homeobox, and CAMTA TF families were matched more in nonenriched clusters. That these TF families were not primarily associated with GS genes might point toward them regulating 2Fe responses, which are not directly Fe uptake or distribution. Alternatively, because several nonenriched clusters contain downregulated 2Fe-responsive genes, some of these TFs might be repressors.
While the DAP-seq and CIS-BP TFBM databases contain binding information for many TFs, they are far from exhaustive. For example, out of 162 known Arabidopsis bHLH proteins (Bailey et al., 2003), only 46 were available to be included in the analysis. Therefore, some TF families were likely underrepresented in our analysis and some top match TFBMs may not accurately reflect the binding partner for certain pCREs. Consequently, some important pCRE-TFBM matches might not be detectable at this time. However, as new experimental TF binding data are collected, we might gain more biological insight into our 2Fe pCREs. . Analysis of pCREs predictive of 2Fe coexpression clusters. A, Proportion of GS-enriched (yellow) and nonenriched (gray) clusters in which each pCRE (total n 5 5,639) was identified (y axis) and mean importance rank (1 5 most important) of that pCRE in those clusters (x axis). Inset, Numbers of unique and shared pCREs of GS-enriched and nonenriched cluster categories. Top: all 5,639 pCREs. Bottom: pCREs identified in .5% of GS-enriched or nonenriched clusters (n 5 173; freq-pCREs). B, Frequency of normalized mean importance ranks across all pCREs in GS-enriched (yellow) and nonenriched (gray) clusters. C, Cytoscape network of the 173 freq-pCREs based on sequence similarity, where similar pCREs (nodes) are connected by edges representing pairwise correlation (PCC) distance of freq-pCRE PWMs. Bold black edges: distance 5 0. Light gray edges: distance # 0.22. Highly interconnected freq-pCREs were arranged in groups and numbered. Hierarchical clustering representation of PCC distances: Supplemental Figure S6E. Yellow filled, freq-pCRE unique for GS-enriched clusters; gray filled, freq-pCRE unique for nonenriched clusters; not filled, shared freq-pCRE. D, PWMs of merged freq-pCREs from the same group (as in Fig. 4C).

Inferring Upstream Regulators of the 2Fe Response
Because our genome-wide approach for identifying regulatory elements may shed light on areas of 2Fe responses that are less well understood, we next interpreted our findings in context with open questions in the field. For example, the ABI3VP1/B3-type TF IDEF1, a key regulatory factor of 2Fe responses in rice and barley roots, binds in vivo and in vitro to the CATGC core of IDE1 Kobayashi et al., 2009;Kobayashi et al., 2010). With ten of our freq-pCREs having a CATGC (or GCATG) core and matching ABI3VP1/B3 family TFBMs, IDE1-likes were fairly dominant among the freq-pCREs and unique to GS-enriched clusters (Supplemental Table S4). Besides validating our approach, this strongly suggests an important function for IDE1-like motifs in Arabidopsis. IDE1-containing grass promoters function in a heterologous manner in nongrass plants, as shown in vivo in transgenic tobacco (Kobayashi et al., 2003). Corresponding tobacco TFs are not known yet. The closest homologs of the rice IDEF1 in Arabidopsis are AFLs (B3 family TFs ABA- ). In fact, ABI3 and FUS3 bind in vitro to RY-like elements (CATGCA), regulating FeS cluster subunit formation during seed maturation (Roschzttardtz et al., 2009). However, ABI3 or FUS3 functions during later developmental stages, particularly in the root during 2Fe response, remain to be elucidated. Because the FUS3 TFBM matched our top most abundant IDE1-like (CATGCC; Supplemental Table S5), and because FUS3 is expressed in the root epidermis and in lateral root primordia during later developmental stages (Boulard et al., 2017;Tang et al., 2017), FUS3 might be an IDEF1 homolog in Strategy I plants.
Another 2Fe response-related TF in rice and barley, IDEF2, belongs to the NAC family and binds to the CA(A/C)G(T/C)(T/C/A)(T/C/A) core in IDE2 (Ogo et al., 2008). Although we did not have a perfect (PCC 5 1) pCRE-NAC TFBM match, we found matched NAC TFBMs slightly overrepresented in GS-enriched clusters (Fig. 5B). Furthermore, two of the top ten most abundant freq-pCREs unique to GS-enriched clusters matched NAC TFBMs (PCC . 0.9), with one freq-pCRE being highly similar to the IDE2 core (CACGCC). This indicates that IDE2-like motifs might also play a role during Arabidopsis 2Fe responses.
Other freq-pCREs matched to known TFBMs from TFs with unknown roles in 2Fe responses. For example, bZIP TFBMs were significantly overrepresented in GS-enriched clusters, but have no known direct roles in 2Fe responses. However, bZIP TFs are known regulators of the zinc (Zn) deficiency response, which, together with the fact that one GS gene, ZIP9, is also responsive to Zn deficiency, could indicate an interdependency of Zn and Fe homeostasis (Assunção et al., 2010;Sinclair et al., 2018). Furthermore, two matched Figure 5. Similarity of freq-pCREs to in vitro identified TFBMs. A, Sequence similarity of freq-pCREs from GS-enriched clusters to the best matching TFBM from DAP-seq or CIS-BP data. If the freq-pCRE matched TFBMs from both data sets, DAP-seq match was preferred over CIS-BP match (Supplemental Table S5). Bars 5 95th percentile (PCC) significance thresholds for within TF family (red, pCRE sequence is more similar to a specific TFBM than to other TFBMs from the same family), between TF families (light blue, pCRE more similar to a TFBM in a family than to TFBMs from other families), or random (dark blue, pCRE more similar to a TFBM from a family than to random 6-mers). Similarity of freq-pCREs from nonenriched clusters to TFBMs: Supplemental Figure S7. B, Proportion of TF family TFBMs (representing freq-pCRE matches meeting at least "between" threshold) in GS-enriched clusters (x axis) and nonenriched clusters (y axis). TFBM matches significantly overrepresented (FET, q , 0.05) in the GS-enriched or nonenriched cluster category are depicted in red and marked with "X." Dashed line marks theoretical position for TF family TFBMs with the same proportion in both categories.
TFs, bZIP3 and bZIP16, are involved in ABA signaling, which is connected to 2Fe responses by modulating root growth (Séguéla et al., 2008;Matiolli et al., 2011;Hsieh et al., 2012). Possible functions of bZIP TFs in response to 2Fe stress should be explored in the future.
PTF1 and FAR1 TFBMs are two more examples for perfect freq-pCRE matches with yet unknown specific roles of the TFs during 2Fe. Their specificity to GSenriched clusters points toward important roles in regulating GS genes. TCPs are involved in plant development, but also act in signaling of hormones that influence 2Fe responses (Davière et al., 2014;Brumbarova et al., 2015;Resentini et al., 2015;Nicolas and Cubas, 2016). For example, TCP20 was reported to bind to the BHLH39 promoter (Andriankaja et al., 2014), indicating a possible connection of TCPs and 2Fe responses during plant development. PTF1 is involved in regulating responses to light shade signals through PHYTOCHROME INTERACTING FACTORS (PIFs; Zhou et al., 2018). Interestingly, FAR1 and its homolog FAR-RED ELONGATED HYPOCOTYL3 (FHY3) also act in phytochrome-PIF signaling . Together, this suggests a connection of light perception and 2Fe responses mediated through these TFs, which is consistent with the known diurnal influence on Fe uptake (Vert et al., 2003;Santi and Schmidt, 2009;Hong et al., 2013;Salomé et al., 2013). In addition, FAR1/FHY3 act in the regulation of the phosphate starvation response, together with ethylene regulator ETHYLENE-INSENSITIVE3 (EIN3; Liu et al., 2017), which also binds FIT to promote Fe uptake . Therefore, FAR1 might also regulate Fe acquisition via the ethylene pathway.
Finally, a perfect freq-pCRE-WRKY11 match indicates that WRKY TFs, although significantly overrepresented in nonenriched clusters, are also important for regulating GS-enriched clusters. WRKY11 is involved in abiotic stress tolerance in Arabidopsis (Ali et al., 2018), with no specific role known during 2Fe response yet. However, WRKY TFs in general have already been connected to 2Fe, for example as putative regulators of the coumarin transporter gene PLEIO-TROPIC DRUG RESISTANCE9 (PDR9; Ito and Gray, 2006) and of PYE (Koryachko et al., 2015). Furthermore, WRKY46 negatively regulates the vacuolar Fe importer gene VACUOLAR IRON TRANSPORTER-LIKE1 (VTL1; Gollhofer et al., 2011;Gollhofer et al., 2014;Yan et al., 2016). We found a WRKY TFBM (GTCAAC) in several nonenriched clusters containing down-regulated Fe-responsive genes, including the VTL1 homolog VTL5 (AT3G25190; Supplemental Table S2), indicating that some of the TFs matching nonenriched cluster pCREs might act as repressors of Fe excess genes.
In summary, many pCREs commonly found among GS-enriched clusters share significant sequence similarity with known 2Fe CREs, such as IDE1, or with binding sites of known 2Fe-associated TF families, such as ABI3VP1/B3, NAC, MYB, and bHLH. Notably, we found evidence for IDE1-like motifs being relevant not only in Strategy II plants, but also in the Strategy I plant Arabidopsis. Our results also suggest a role of bZIPs or TCPs in 2Fe responses. We assessed in the next paragraph specific 2Fe response processes in which pCREs of particular interest, such as IDE1-likes, might be involved.
Identifying the Most Important pCREs (min-pCREs) and Associating Them with FIT-Dependent or FIT-Independent Functions After identifying potential regulators in the 2Fe response, we pinpointed some of those that could best explain models of 2Fe-responsive up-regulation and explored their potential functions.
More than 1,500 pCREs were identified in total in GSenriched clusters, raising the question of a core set of important pCREs needed to robustly predict 2Fe responses in each cluster. Using pCRE abundance (freq-pCREs) across GS-enriched clusters as the only selection criterion could result in missing motifs, simply due to the fact that some coexpression clusters were more unique than others. This is supported by the fact that rare pCREs (i.e. pCREs identified in few clusters) still can have a high importance rank (Fig. 4A). Those pCREs were not included in the set of freq-pCREs, although they seem to be important for regulating individual GS-enriched clusters. We identified the most important pCREs for a cluster model (defined as the minimum set of pCREs; min-pCREs) by building RF models iteratively with successively deleting the least important pCREs in each round (see "Materials and Methods"; Supplemental Fig. S8A). We applied this approach to each of the 159 GS-enriched clusters. In most cases (117 of the clusters), 50% to 90% of the pCREs were not important for model performance, mostly due to sequence redundancy with a more important pCRE (Supplemental Table S2). pCREs were regarded as redundant if they were reverse complements or if one was the subset of the other. We calculated that, on average, 27.8 pCREs were identified in a GS-enriched cluster, and 10.3 of them were min-pCREs. Supplemental Figure S8B shows the pCRE presence/ absence patterns of Figure 3, D and E, example clusters (IDs: 493, 823), with the min-pCREs highlighted. Genes in cluster 493 had on average 9.2 min-pCREs and genes in cluster 823 had on average 4.09 min-pCREs. We also counted the min-pCREs per gene in the other 157 GSenriched clusters (Supplemental Table S2) and calculated that, on average, 7.01 different min-pCREs were present in a gene's promoter. This indicates that several pCREs are needed for a gene to be 2Fe-responsive.
The collective set was 615 different min-pCREs across GS-enriched clusters. Each min-pCRE was identified in at least 1 (0.6%) and in up to 48 (30%) GS-enriched clusters. The IDE1-like CATGCC was the top most abundant min-pCRE. Together with CATGCC, two more IDE1-like motifs, TCATGC and CCATGC, were among the top ten most abundant min-pCREs (Supplemental Table S4). This supports a previous computational analysis of rice promoters, in which IDE1-like was among the top scoring motifs (Kakei et al., 2013). Together with our previous finding that IDE1-like ABI3VP1/B3 TFBMs are associated with GS-enriched clusters, it suggests an important, yet unknown, function of IDE1-like motifs in Arabidopsis 2Fe response regulation. Min-pCREs matching a bHLH (CGTGAC), a MYB (TAACTA), and the IDE2-like NAC TFBM (CACGCC; all Supplemental Table S5) were also among the top ten most abundant min-pCREs (Supplemental Table S4), further demonstrating the utility of our approach.
To determine in which processes min-pCREs might function during 2Fe, we tested if min-pCREs were more likely to be found in FIT-dependent or FITindependent coexpression clusters. More than 60% of the 159 GS-enriched clusters were classified as either FIT dependent (35% out of 159) or FIT independent (28% out of 159; Fig. 6A). We then calculated the proportion of min-pCREs (present in $5 GS-enriched clusters) in each cluster category (Supplemental Table  S6). Interestingly, the two IDE1-like motifs, CATGCC, CCATGC, and the related ABI3VP1/B3 TFBM match ATGCAT, were predominantly identified in FITdependent clusters, but IDE2-like CACGCC had no preference for either FIT-dependent or FIT-independent clusters (Fig. 6B). This suggests that the IDE1-like pCREs tend to be more important for FIT-dependent root Fe acquisition rather than FIT-independent Fe sensing, signaling, and distribution. This is also consistent with the role of grass IDE1 in Fe uptake (Kobayashi et al., 2003;Kobayashi et al., 2005). Two ARF TFBMmatching min-pCREs (AACGTA/ARF16, GTCGGA/ ARF2) were also preferentially found in FIT-dependent clusters. ARFs are involved in auxin signaling, thereby controlling, among other functions, root hair elongation (Pitts et al., 1998;Mangano et al., 2017;Choi et al., 2018). Different studies reported that 2Fe responses can be accompanied by an increase of root hair number, by elongation of root hairs, and by deformed or short root hairs (Schmidt et al., 2000;Müller and Schmidt, 2004;Dinneny et al., 2008). ARF2 and ARF16 TFs are root hair growth repressors (Choi et al., 2018), which would be consistent with a short root hair phenotype and downregulation of respective GO terms under 2Fe (Dinneny et al., 2008;Supplemental Fig. S1). During 2Fe, several root hair-acting genes are coexpressed in a regulon, which also contains IRON-REGULATED TRANSPORTER2 (IRT2; Ivanov et al., 2012;Schwarz and Bauer, 2020), indicating a possible connection of ARF TFBM-matching min-pCREs with these root hair processes.
In contrast, three bZIP TFBM-matching min-pCREs, GTGGCA, CACGTC, and CACTAC, were predominantly identified in FIT-independent clusters. As described in the previous section, bZIPs are involved in ABA signaling. ABA negatively regulates FITdependent Fe 31 reductase gene FERRIC REDUC-TION OXIDASE2 (FRO2) and Fe 21 importer gene IRON-REGULATED TRANSPORTER1 (IRT1; Séguéla et al., 2008). However, ABA signaling also leads to enhanced apoplastic and vacuolar Fe utilization and root to shoot transport under 2Fe (Séguéla et al., 2008;Lei et al., 2014). Lei et al., 2014 propose that ABA-responsive gene regulation and Fe remobilization and transport are connected through bZIPs, which is consistent with our results that bZIP TFBMs are preferentially found in FIT-independent coexpression clusters of genes involved in Fe mobilization and translocation. Similarly, two TCP TFBMs, GAC-CAC and ACCCAC, were identified almost exclusively in FIT-independent clusters, which is in agreement with TCP20 regulating BHLH39 in a FIT-independent manner (Andriankaja et al., 2014). Finally, bHLH TFBMs were identified in all cluster categories with a preference for mixed clusters. This matches the ubiquitous nature of bHLH target motifs (E-/G-boxes), which act at many levels in the 2Fe bHLH regulatory cascade. In summary, we can propose plausible roles for pCREs as TFBMs in FIT-dependent and FIT-independent processes.

Positional Bias, Conservation, and Inferred Functions of min-pCREs
Next, to explore if min-pCREs displayed significant positional bias in the promoter regions of coexpressed genes, we compared the observed min-pCRE frequencies in 100-bp bins of 21000 to 1500 bp and of 2500 to 11000 bp flanking regions adjacent to the transcription start site (TSS) and transcription termination site (TTS), respectively, with the expected frequencies from shuffled pCRE sequences (according to Uygun et al., 2017;Supplemental Fig. S9). Furthermore, we examined the noncoding as well as the coding sequences of the transcribed regions. Overall, the distributions of min-pCREs revealed a slight positional bias in the promoter regions (Fig. 6C, top). We investigated the distribution plots separately for ten selected min-pCREs: FITdependent ABI3VP1/B3 TFBM-matching min-pCREs (containing the two IDE1-likes), the IDE2-like (NAC TFBM match), FIT-independent bZIP TFBM matches, and bHLH TFBM matches (Fig. 6C). These min-pCREs had significant location bias in the putative promoters up to 1000 bp upstream of the TSS. Because known CREs often exhibit positional bias (Zou et al., 2011;Heyndrickx et al., 2014;Yu et al., 2016), this provides additional support for these pCREs having regulatory functions in Fe uptake and homeostasis. Interestingly, some of the pCREs common to mixed clusters did not show position bias in any of the genomic regions tested (e.g. ABI3VP1/CTTATA and MYB/TAACTA; Fig. 6C, bottom), indicating that genes of such clusters are less likely to be truly transcriptionally coregulated.
Next, we sought to pinpoint the specific processes of Fe homeostasis, which these ten min-pCREs might regulate. To assess this, we determined the genes containing the respective min-pCRE and counted the number of incidents in which these genes were likely Figure 6. Characteristics of the most informative pCREs (min-pCREs). A, Proportion of GS-enriched clusters enriched with FIT-dependent genes (FIT, blue), FITindependent genes (non-FIT, red), or both (mixed, gray). B, Ternary plot of min-pCREs identified in .3% (n 5 5) GSenriched clusters. Position of the min-pCREs corresponds to the normalized proportions of FIT, non-FIT, and mixed clusters in which the min-pCRE was identified. Bubble size corresponds to the overall proportion of GS-enriched clusters with min-pCRE. Labeled min-pCREs are shown in (C) and (D) or mentioned in the main text. C, Positional bias of all (mean with SD; top) and selected min-pCREs (below) in the putative promoter region (1st column), all introns (In) and all exons (Ex; 2nd column; mean with SD), and in the putative noncoding region (3rd column). 1st and 3rd column: position distributions in all coexpression clusters with min-pCRE (gray areas). Red line: mean distribution. TFBM matches (PCC) for each min-pCRE are shown (4th column), and min-pCREs are sorted by TF family. Log2(obs/exp), log 2 of the number of observed (obs) min-pCRE occurrences divided by the number of min-pCRE occurrences in randomized sequences (expected, exp). D, Genes which might be regulated by the selected min-pCREs. Count, number of GS-enriched clusters in which the min-pCRE was identified and which included the respective gene having the min-pCRE in its promoter. Dashed line, total number of GS-enriched clusters with the min-pCRE. Genes in which the min-pCRE overlaps with a CNS are highlighted black. A high-resolution image is available as Supplemental Figure S10. regulated by the min-pCRE ( Fig. 6D; high-resolution image Supplemental Fig. S10; Supplemental Table S7). For example, CATGCC was found in 48 GS-enriched clusters, and 36 of those clusters (75%) included the min-pCRE-containing gene MYB10, whereas only four of those clusters (8%) included MEK KINASE16 (MAPKKK16; Fig. 6D, second row at right). We inferred that CATGCC might be regulating predominantly processes in which MYB10 is required. As an additional line of evidence for pCRE functionality, we determined if min-pCREs overlapped with conserved noncoding sequences (CNS) of the Brassicaceae family (Supplemental Table S8; Haudry et al., 2013). As expected, bHLH TFBM-matching min-pCREs were identified in many gene promoters, including BHLH39/ BHLH101 (direct targets of bHLH IVc TFs; Zhang et al., 2015), NAS4 (direct PYE target; Long et al., 2010), and IRT1 and GENERAL REGULATORY FACTOR11 (GRF11; FIT targets; Fig. 6D; Sivitz et al., 2012;Yang et al., 2013). In BHLH39/BHLH101, IRT1, and GRF11, the respective min-pCREs overlapped with CNS, further supporting the importance of these motifs. Interestingly, bHLH matched min-pCREs were also located in CNS of BRUTUS (BTS) and BRUTUS-LIKE1 (BTSL1), two genes that negatively regulate Fe uptake by marking positive regulators for degradation (Selote et al., 2015). If BTS were to be regulated by bHLH proteins from the same regulatory cascade, this may indicate a negative feedback loop.
In summary, we identified robust 2Fe pCREs, which, in addition to sharing significant sequence similarity to known TFBMs, were also part of the core sets of pCREs needed for prediction of 2Fe responses of GS-enriched clusters (min-pCREs). Furthermore, they were preferentially located in promoter regions upstream of the TSS, and even in CNS' of some genes. Together, these findings indicate that these pCREs might be authentic 2Fe CREs. From the biological context of the genes, which are likely regulated by some of the pCREs, we were able to generate hypotheses relevant for understanding 2Fe response regulation in Arabidopsis. For example, our work highlighted that in addition to the bHLH TFBMs, IDE1-like motifs and bZIP TFBMs are likely involved in different responses to 2Fe and should be considered of high interest for future work. CONCLUSION We identified 5,639 pCREs that were enriched in promoters of coexpressed 2Fe-responsive genes and were used as features to predict 2Fe-responsive regulation of root-expressed genes on a genome-wide scale. Of those, 173 pCREs reliably predicted 2Fe response genes of .5% of our defined coexpression clusters (freq-pCREs). Because most of those pCREs were either unique to coexpression clusters enriched with our GS Fe acquisition and homeostasis genes, or unique to coexpression clusters lacking those genes, we conclude that our approach had captured motifs specifically regulating different responses during 2Fe. Several pCREs contained the previously experimentally validated core motif of IDE1. To take advantage of the publicly available in vitro TF binding information, we compared the freq-pCREs to TFBMs from two studies (Weirauch et al., 2014;O'Malley et al., 2016) and found that our approach had captured known Strategy I 2Fe recognition motifs for bHLH and MYB proteins. Our approach also led to regulatory connections of bZIP, B3, NAC, and TCP families to Strategy I 2Fe response regulation. While bZIP and bHLH TF families are also associated with high salinity stress response (Uygun et al., 2017), other high salinity stress response-associated TF families (e.g. WRKY and AP2) were not common among our 2Fe pCREs regulating GS-enriched coexpression clusters, highlighting that our approach pinpoints regulators specific to a stress condition.
We inferred possible functions of pCREs, which were most important for modeling 2Fe responses (min-pCREs) from their enrichment in FIT-dependent or FITindependent coexpression clusters and their location bias in promoters of particular 2Fe-responsive genes (Fig. 6, B and D). Our results provide evidence that IDE1-like pCREs are linked to coumarin synthesis, indicating that the function of IDE1 to ensure supply of Fe-chelating compounds for Fe acquisition could be an evolutionarily conserved function, at least among flowering plants. While our results highlight the importance of IDE1-like motifs for Fe acquisition, it was not the only prominent 2Fe pCRE. This is in contrast to Zn deficiency, where ZDRE seems to be singularly associated with multiple Zn deficiency responses (Assunção et al., 2010), and indicates that despite of overlaps of Zn and Fe homeostasis control (Briat et al., 2015), their transcriptional regulation must follow different mechanisms.
Our results support a concept in which 2Fe is not regulated by only one or few regulatory elements. In fact, 7.01 min-pCREs were present on average in the promoter of a 2Fe-responsive gene. Of the many important pCREs for 2Fe response, many share significant similarity with TFBMs of TF families known to undergo hetero-dimerization and protein interaction across families, such as bHLH, MYB, bZIP, TCP, and ABI3VP1 (Bemer et al., 2017). A combinatorial mechanism would dramatically increase the flexibility of transcriptional responses driven by a set of few TFs. It might be that some pCREs not as important in our prediction models, would become informative in combination, as suggested for high salinity stress response (Zou et al., 2011;Uygun et al., 2017). A next step would therefore be to build 2Fe prediction models that explicitly account for interactions between pCREs. A limitation of our approach is that our coexpression clusters were based on ATH1 chip microarray data, the only comprehensive 2Fe time course transcriptome set available to date. Some important 2Fe marker genes (e.g. FRO2) are not represented on the chip and others might not have passed our significance threshold because of sensitivity issues with the microarray technology. Additionally, we restricted our analysis to the promoter regions 1,000 bp upstream of the TSS. While this is expected to cover most important cis-acting elements and reduce the occurrence of promoters overlapping with adjacent genes, introns as well as more distal promoter regions are known to harbor cis-acting elements (Rose et al., 2008;Rose et al., 2016).
The large number of TFs known to be involved in 2Fe-induced up-regulation points toward the importance of transcriptional regulation. However, 2Fe responses are also heavily controlled at the posttranscriptional and posttranslational level Meiser et al., 2011;Sivitz et al., 2011;Selote et al., 2015;Zhang et al., 2015;Gratz et al., 2019). Naturally, our approach cannot cover such regulatory aspects. However, it allows us to predict TF families, that may act upstream of the known 2Fe-responsive genes. Besides bHLH and MYB TFs, we suggest TFs of the bZIP, ABI3VP1/B3, NAC, and TCP families as upstream regulators of 2Fe responses in the root. Because major Fe sinks are located in the shoot, a systemic shoot-to-root signal must exist for proper Fe supply (Vert et al., 2003;García et al., 2013). Integrating shoot transcriptomic data would expand our knowledge on how responses to 2Fe stress are coordinated at the whole-plant level.
In conclusion, we demonstrate that our machine learning-based approach can identify pCREs for 2Feinduced gene up-regulation. This strategy can be applied to various stresses and developmental conditions to elucidate regulatory mechanisms, especially when cis-and/or transacting elements were previously elusive (Zou et al., 2011;Uygun et al., 2017). We provide a comprehensive source of potential 2Fe response cisregulators for a wide range of 2Fe-responsive genes. Because the identified pCREs are potentially involved in enhancing Fe uptake and translocation, they generate potential for future applications in engineering plants with improved plant performance traits, for example, higher nutritional value because of better Fe allocation and coping with unfavorable soil conditions.

Expression Data Processing and Generation of Multiple Expression Data Combinations
Arabidopsis (Arabidopsis thaliana) expression data (Affymetrix ATH1) from a 2Fe treatment time course experiment with six time points and of four 2Fetreated root zones (both Dinneny et al., 2008) were downloaded from Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/geo/, GSE10502, GSE10497) in the Affymetrix CEL data file format (stores probe intensities from a single ATH1 chip), preprocessed, normalized, and contrasted as described below. AtGenExpress expression data (Affymetrix ATH1) of abiotic stresses (Kilian et al., 2007;GSE5620-5628 or TAIR-ME00325-330, only data of root samples were used), hormone treatment (Goda et al., 2008;GSE39384 or TAIR-ME00333-340, ME00343-344, ME00350-352, ME00356), and plant development (Schmid et al., 2005; GSE5629-5634 or TAIR-ME00319) were downloaded from The Arabidopsis Information Resource (TAIR; https://www.arabidopsis.org/ portals/expression/microarray/ATGenExpress.jsp), preprocessed, normalized, and contrasted by S. Uygun (Uygun et al., 2016) as described below. Background correction and quantile normalization of CEL files were performed with Robust Multi-Array Average expression measure using the Bioconductor affy package (Gautier et al., 2004). The log 2 fold-change (log2FC) in expression was calculated for all data sets except developmental data by pairwise comparison of treatment and control experiments for each treatment and time point. Contrast matrices and linear model fits were created using R and the Bioconductor LIMMA package (Ritchie et al., 2015;Phipson et al., 2016). Because developmental stages have no control treatment, absolute normalized fluorescence intensity values were used. The p-values for log2FC or fluorescence intensities were corrected for multiple testing (adjusted p-values 5 q) using the BH (Benjamini-Hochberg) method (Benjamini and Hochberg, 1995). Genes were regarded as 2Fe-responsive if abs(log2FC) $ 1, and q , 0.05 at least at one 2Fe treatment time point or in at least one 2Fe treated root zone. 2Fe deficiency time course data were combined with 2Fe root zone expression data or ATGenExpress data sets in different combinations (Fig. 1C) either including only genes up-regulated ("up") or all genes up-or down-regulated ("up and Plant Physiol. Vol. 182, 2020 down") in $1 2Fe time point or root zone. This resulted in 12 different expression data combinations.

Coexpression Clustering using k-Means
To cluster genes with similar expression pattern, k-means clustering was implemented in R using the Hartigan-Wong algorithm (Hartigan and Wong, 1979). Because k-means returns a local optimum solution depending on the number of clusters (k) created and the random selection of genes as initial "means", the outcome varies with run (i.e. nondeterministic). Therefore, different k (25,30,35,40,50,70,80,100) were tested, and the clustering was repeated up to four times. Machine learning models were built (see below) with all clusters generated from expression data combinations (DC) 1, 2, 3, and 5 (Fig. 1C). To prevent confusion, it should be noted that the total number of k-means clusters used to build models represents several repeated clustering events of always the same two sets of 2Fe-responsive genes (up; up and down, see above). The clustering events differ in the DC, which was used and in the k. Two DC were excluded from the analysis: DC 4 produced identical clusters as DC 1, which were therefore not considered. DC 6 contained different measuring units (log2FC and absolute normalized fluorescence intensity) and could not be handled by the k-means algorithm.

Coexpression Clustering by Correlation with GS Genes
To generate coexpression clusters based on gold standard (GS) genes (Supplemental Table S1), each GS gene was used as a query to identify genes with similar expression patterns. Briefly, for each expression DC (Fig. 1C), the PCC was calculated between the query gene and each gene in DC using SciPy (http://www.scipy.org; Jones et al., 2001). Similar to Uygun et al., 2016, a random background PCC was calculated representing the null distribution of expression correlation by calculating the PCC of 10,000 randomly selected gene pairs in DC and the 95th percentile of PCCs was tried as the threshold for classifying a pair of genes as significantly correlated. For some DC (mostly those containing only up-regulated 2Fe-responsive genes), a significance threshold was allowed below 90% down to 45%, because the PCC between random 2Feresponsive genes was already very high. On the other hand, when .50 genes were considered significantly correlated, the threshold was raised above 95% to 99% to home in on genes most likely to be coregulated. In addition, a second version of clusters was generated with .50 genes, containing only the 10 genes with highest PCC. Machine learning models were built (see below) with both versions, and the results were further used from the better performing version only. Percentiles used for each PCC-based cluster are given in Supplemental  Table S2. Two DC were excluded from the analysis: DC 4 (up), because the resulting clusters were identical to those generated from DC 1 (up), and DC 6 (up and down), because developmental data seemed to have a disproportional influence on the PCC with the result that even 2Fe up-and down-regulated gene pairs were identified as strongly correlating. As in the k-means clustering, the total number of PCC-based clusters used for modeling represents repeated clustering events of the same two sets of 2Fe-responsive genes (as described above).

The Coexpression Clusters: GO and GS/FIT-Dependent/ FIT-Independent Gene Enrichment and GO/Gene Content Similarity
GO associations for Arabidopsis were downloaded from TAIR (ftp://ftp. arabidopsis.org/home/tair/Ontologies/Gene_Ontology; Berardini et al., 2004). Biological process (BP) GO annotations were downloaded from GO (http:// purl.obolibrary.org/obo/go.obo) and parsed for BP information. Enrichment of GO terms in genes that were significantly differentially regulated (q , 0.05, abs(log2FC) $ 1) in the 2Fe time course data set was determined with aN FET implemented in Python using SciPy (Jones et al., 2001), and p-values were corrected for multiple testing (5 q) using the "qvalue" function in R (Storey, 2002;Supplemental Table S9).
All coexpression clusters were tested for enrichment of GO terms as described above. The similarity of enriched GOs between coexpression clusters was assessed using the Jaccard index (JI), or the intersection of GOs divided by the union of the GOs, where JI 5 1 if the exact same GOs were enriched in both coexpression clusters. Coexpression clusters were grouped by hierarchical clustering using the JI with the UPGMA method in the R cluster package (Maechler et al., 2017). Groups containing $20 coexpression clusters and having a within-mean JI that was significantly higher than the mean JI of all clusters were defined as "superclusters." Biological functions of superclusters were defined through GOs shared by $75% (k-means clustering) or $90% (GS gene correlation; PCC) of the clusters. Similarly, FET with p-value correction for multiple testing was used to identify coexpression clusters enriched with (A) 2Fe GS genes, (B) FIT-dependent genes, and/or (C) FIT-independent genes.
K-mer Enrichment and Identification of pCREs Predictive of 2Fe Response using Random Forest (pCRE Identification Pipeline) Promoter sequences 1 kb upstream from the transcription start site (TSS) were downloaded from TAIR (ftp://ftp.arabidopsis.org/home/ tair/Sequences/blast_datasets/TAIR10_blastsets/upstream_sequences/ TAIR10_upstream_1000_20101104). A list of all possible 6-mers of A, T, C, G was generated with the Python itertools function and the Biopython Bio.Seq module (https://biopython.org/docs/dev/api/Bio.Seq.html; Cock et al., 2009). Only one 6-mer for each reverse complement pair was kept (resulting in 2,080 6-mers).
Potentially meaningful cis-regulatory elements for 2Fe response were identified in two steps, where we first looked for enriched k-mers in the promoters of 2Fe-responsive genes and then determined how well the enriched k-mers predicted 2Fe response using machine learning. The code for this analysis is available on GitHub (https://github.com/ShiuLab/MotifDiscovery). For the first step, the promoter sequences of the genes in coexpression clusters (positive set) were searched for enriched 6-mers in comparison with promoter sequences of nonresponsive genes (negative set, as defined below; 10,068 genes). These enriched 6-mers were elongated by one base and tested again for enrichment. This process was repeated until no longer k-mer was more enriched than the shorter k-mer. Enrichment was calculated using a one-sided FET (P , 0.01).
For the second step, to determine which sets of enriched k-mers were predictive of 2Fe response, we generated features based on presence or absence of each enriched k-mer and used these features to build machine learning models using the RF algorithm. RF was implemented in Python using Scikit-Learn (Pedregosa et al., 2011). To avoid building biased models, 50 models were generated for each coexpression cluster. Each time, the same number of positive set genes were randomly drawn from the negative set to generate balanced (i.e. size positive set equals size of negative set) input data sets. A 10-fold crossvalidation approach was used to train and test the models. Briefly, the balanced data sets were divided randomly into ten even groups with a 1:1 ratio of positive to negative class genes. The model was trained on the 1-9 folds and applied to the 10th (and successively trained on 1-8110 and applied to the 9th, etc.). This cross-validation scheme was repeated ten times. Each RF model was made up of 500 decision trees, each trained on a random subset of enriched k-mers and of training set genes. The final model performance is represented by the mean F1 score (i.e. F-measure) across all 50 balanced models. The F1 score is the harmonic mean of precision, P 5 TP/(FP 1 TP), and recall, R 5 TP/(FN 1 TP), where TP 5 true positive, FP 5 false positive, and FN 5 false negative. Only coexpression clusters for which the enriched k-mers were deemed as good predictors (F1 $ 0.7) were used in the downstream analysis. Predictive k-mers (then referred to as putative cis-regulatory elements, pCREs) were ranked by importance. The importance score is based on the Gini index (Breiman et al., 1984), built into the Scikit-Learn implementation of RF. The Gini index is defined as the total decrease in node impurity (i.e. how accurately positive [2Feresponsive] and negative [nonresponsive] class genes are split by a feature [pCRE] at a particular node) averaged over all decision trees. The importance score of a pCRE increases with the frequency it was selected for a node split and with its overall ability to correctly classify -Fe-responsive and nonresponsive genes. To determine how well the models predicted individual 2Fe-responsive genes, the RF models were rerun with an updated code, which outputs the prediction scores for each gene during each of the 50 balanced replicates (https://github.com/ShiuLab/ML-Pipeline). Using the scores the ratio of correct predictions (TP) was calculated by counting the number of times that a gene was correctly predicted out of the 50 replicates.
Genes were considered robustly 2Fe nonresponsive if they were not significantly differentially expressed (abs(log2FC) , 0.4) during any time point during the 2Fe time course experiment or in any 2Fe treated root zone or in four additional 2Fe treatment experiments (Li and Schmidt, 2010: GSE16964;Long et al., 2010: GSE21443;Schuler et al., 2011: GSE24348;Sivitz et al., 2012: GSE40076). The four additional data sets were downloaded in CEL format from GEO and processed as described in the first section of "Materials and Methods." The log2FC threshold for nonresponsive genes was chosen far below abs(log2FC) , 1 to create a safe log2FC distance between positive and negative class genes. This was regarded as helpful for the RF classifier, since genes with abs(log2FC) $ 0.585 (corresponds to FC $ 1.5) can be considered weakly differentially expressed. However, to maintain a large pool of negative set genes for the random draws during RF, the threshold was not pushed below abs(log2FC) , 0.4.

pCRE Sequence Similarity
To assess sequence similarities between the 173 pCREs that were frequently identified (in .5%; freq-pCREs) in GS-enriched or nonenriched coexpression clusters with good model performance (F1 $ 0.7), sequence dissimilarity of pCRE position weight matrices (PWMs) was calculated by pairwise PCC distance and a distance matrix was generated using the TAMO package in Python (Gordon et al., 2005). Freq-pCREs were grouped by hierarchical clustering of the PCC distance matrix using the UPGMA method in the R cluster package (Maechler et al., 2017), and visualized in a dendrogram (Supplemental Fig.  S6E). Due to groupwise averaging of PCC distances during hierarchical clustering, the algorithm produced skewed PCC distances of some similar pCRE pairs. Therefore, freq-pCRE clusters were additionally visualized as a network, in which freq-pCREs with PCC distance 5 0 (identical freq-pCREs or subsets of each other) were connected with black bold edges and freq-pCREs with PCC distance # 0.22 were connected with light gray edges (Fig. 4C). Highly interconnected nodes were arranged in groups. The network was created using the Cytoscape software (Shannon et al., 2003). To show a consensus of freq-pCREs within a network group, freq-pCRE sequences were aligned using ClustalX (Larkin et al., 2007) with default parameters and a sequence logo was created with weblogo (https://weblogo.berkeley.edu/logo.cgi).

Identification of Most Informative pCREs (min-pCREs) by Nonlinear Regression
The most informative pCREs of a coexpression cluster were defined as the minimum set of pCREs (min-pCREs) needed for RF models without sacrificing performance. To identify min-pCREs, for each GS-enriched coexpression cluster, the pCREs used as features were stepwise reduced, with the least important pCREs deleted at each step. First, for pCREs that were subsets of each other (PCC distance 5 0), the lower ranked one was removed. Then, from this list of pCREs and for successively shorter lists of pCREs (n 5 40, 30, 25, 20, 15, 12, 10, 8, 6, 5, 4, 3, 2, 1), 10 replicates of RF models were trained on balanced data sets. F1 scores were plotted against the number of pCREs (x) and a nonlinear regression curve was fitted to the data points. An exponential recovery function: F1ðxÞ 5 a À 1 2 e ð 2 nxÞ Á was found to best describe the data behavior. Starting values for variables a and n were approximated by fitting a linear model to the logarithmic transformation of the function. The set of pCREs with the highest F1 closest to the inflection point of the regression curve was defined as min-pCRE set (example in Supplemental  Fig. S8A).  Table S10). DAP-seq TFBMs were preferred over CIS-BP TFBMs when a pCRE-TFBM match was present in both databases.

pCRE Similarity to TFBMs
PWMs of pCREs were compared with PWMs of TFBMs using PCC, and the pCREs were classified as similar to (A) a specific TF, (B) a TF family, or (C) to TFs generally, based on the degree of similarity to their best matching TFBM (Uygun et al., 2017). A pCRE was similar to a specific TFBM (A) if the PCC between the pCRE and the TFBM was $95th percentile of PCCs between that TFBM and TFBMs from the same TF family. Alternatively, a pCRE was similar to TFBMs from a TF family (B) if the PCC between the pCRE and a TFBM from that family was $95th percentile of PCCs between TFBMs from that family and TFBMs from other TF families. Finally, a pCRE was similar to TFBMs (C) if the PCC between the pCRE and any known TFBM was $95th percentile of PCCs between TFBMs and randomly generated 6-mers. For 95th percentile PCC thresholds see Supplemental Table S11. To determine if pCREs similar to specific TF families were enriched in GS-enriched versus nonenriched coexpression clusters, the percentage of pCREs similar to TFBMs (significance level A or B) from each TF family was calculated for each coexpression cluster category. Then, FET with multiple testing correction (q # 0.05) was used to determine if GS-enriched coexpression clusters were enriched with TF families compared with non-GS-enriched coexpression clusters and vice versa.

Positional Distribution of pCREs
To determine the positional distribution of the min-pCREs for each GSenriched coexpression cluster, min-pCREs were converted to PWMs adjusted to the Arabidopsis background AT (0.33) and CG (0.17) content using the TAMO package (Gordon et al., 2005) and mapped to the promoter sequences ranging from 1,000 bp upstream to 500 bp downstream of the transcription start site (1000-TSS-500), using Motility (https://github.com/ctb/motility). For comparison, min-pCRE PWMs were also mapped to exons and introns, respectively, and to the region 500 bp upstream and 1,000 bp downstream of the transcription termination site (500-TTS-1000). Arabidopsis sequences were downloaded from TAIR (ftp://ftp.arabidopsis.org/Sequences/ blast_datasets/TAIR10_blastsets/). Positional distributions were calculated as described in Uygun et al., 2017. In brief, min-pCREs were mapped to 100 bp bins of 1000-TSS-500 and 500-TTS-1000 and to whole exons and introns. For comparison, min-pCREs were mapped to randomized versions of the sequences. Randomization was performed within each 100-bp bin and in each exon or intron, respectively, to maintain nucleotide composition and therefore GC content. Positional distribution was calculated as log2FC of number of observed mappings divided by number of randomly expected mappings (log2FC[observed/expected]).

pCRE Coordinate Overlap with CNS Coordinates
All 615 min-pCRE PWMs were mapped to the putative promoter region (1 kb upstream of TSS) of 2Fe response genes (described above). The min-pCRE coordinates were then compared with the coordinates reported as CNS across nine species in the Brassicaceae family (Haudry et al., 2013) downloaded from the UCSC Genomics Bioinformatics Web site (http://mustang.biol.mcgill.ca:8885/ download/A.thaliana/gff/AT_CNS.gff; Supplemental Table S8).

Supplemental Data
The following supplemental materials are available.
Supplemental Figure S1. Complete GO enrichment analysis of -Feresponsive genes.
Supplemental Figure S2. GO terms and -Fe GS gene enrichments of the defined coexpression clusters containing up-/down-regulated genes, and mean similarity within designated superclusters of up-regulated and up-/down-regulated genes.
Supplemental Figure S3. Expression plots of all well-performing coexpression clusters.
Supplemental Figure S4. Coexpression cluster RF model performance.
Supplemental Figure S5. Different sets of pCREs predict gene expression at different time points of -Fe treatment or in different -Fe treated root zones.
Supplemental Figure S6. Comparison of the pCRE abundance and importance in GS-enriched clusters vs. nonenriched clusters and hierarchical clustering of freq-pCRE sequences.
Supplemental Figure S7. Sequence similarity for freq-pCRE from nonenriched clusters and the best matching known TFBM (DAP-seq or CIS-BP).
Supplemental Figure S8. Example of a nonlinear regression curve to determine the minimum set of pCREs (min-pCREs) for a coexpression cluster and min-pCREs highlighted in two more example clusters.
Supplemental Figure S10. High-resolution image of Figure 6.
Supplemental Table S2. All generated coexpression clusters. Table shows the input expression data combinations, the algorithm and parameters used for clustering, significant enrichment in GS genes and more detailed enrichment in FIT-dependent genes (5 FIT), FIT-independent genes (5 non-FIT) or both (5 mixed).
Supplemental Table S6. Proportional association of min-pCRE to FITdependent, FIT-independent or both cluster categories.
Supplemental Table S7. Total counts of incidences in which each min-pCREs (n 5 615) located in the putative promoter of a gene was also identified as predictive for the cluster containing the gene.
Supplemental Table S8. Overlap of min-pCRE coordinates with conserved noncoding sequences (CNS) in Brassicaceae.
Supplemental Table S9. P-and q-values of complete GO enrichment analysis of -Fe-responsive genes (Supplemental Fig. S1).
Supplemental Table S10. DAP-seq and CIS-BP TFBMs used in this study.
Supplemental Table S11. TF family 95 th percentiles of within, between and random PCC thresholds determining the pCRE-TFBM similarity.