A DNA-binding-site landscape and regulatory network analysis for NAC transcription factors in Arabidopsis thaliana

Target gene identification for transcription factors is a prerequisite for the systems-wide understanding of organismal behaviour. NAM-ATAF1/2-CUC2 (NAC) transcription factors are amongst the largest transcription factor families in plants, yet limited data exist from unbiased approaches to resolve the DNA-binding preferences of individual members. Here, we present a TF-target gene identification workflow based on the integration of novel protein binding microarray data with gene expression and multi-species promoter sequence conservation to identify the DNA-binding specificities and the gene regulatory networks of 12 NAC transcription factors. Our data offer specific single-base resolution fingerprints for most TFs studied and indicate that NAC DNA-binding specificities might be predicted from their DNA-binding domain's sequence. The developed methodology, including the application of complementary functional genomics filters, makes it possible to translate, for each TF, protein binding microarray data into a set of high-quality target genes. With this approach, we confirm NAC target genes reported from independent in vivo analyses. We emphasize that candidate target gene sets together with the workflow associated with functional modules offer a strong resource to unravel the regulatory potential of NAC genes and that this workflow could be used to study other families of transcription factors.
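The filtering idea in the workflow above can be illustrated with a minimal sketch. It assumes that promoter screening is a simple substring match on both strands and that the co-expression and conservation filters are plain set intersections; the function names, gene identifiers and sequences are hypothetical and are not taken from the study.

```python
# Minimal sketch of "screen promoters, then apply complementary filters".
# All names and toy data below are hypothetical illustrations.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def promoter_hits(promoters, kmers):
    """Genes whose promoter contains any k-mer on either strand."""
    patterns = set(kmers) | {reverse_complement(k) for k in kmers}
    return {
        gene
        for gene, seq in promoters.items()
        if any(pattern in seq for pattern in patterns)
    }


def filter_targets(promoters, high_scoring_kmers, coexpressed, conserved):
    """Candidate target sets before and after each functional filter."""
    screened = promoter_hits(promoters, high_scoring_kmers)
    return {
        "screen_only": screened,
        "screen_and_coexpression": screened & coexpressed,
        "screen_and_conservation": screened & conserved,
    }


# Toy promoters (invented sequences).
promoters = {
    "geneA": "AATTACGTCAAGG",   # TACGTC on the forward strand
    "geneB": "GGGGCCCCAAAA",    # no match
    "geneC": "CCTTGACGTAATT",   # GACGTA, the reverse complement of TACGTC
    "geneD": "TTTTACGTCTTTT",   # TACGTC on the forward strand
}
coexpressed = {"geneA", "geneB", "geneC"}  # assumed co-expressed with the TF
conserved = {"geneA", "geneD"}             # assumed to carry a conserved motif

result = filter_targets(promoters, ["TACGTC"], coexpressed, conserved)
for name, genes in result.items():
    print(name, sorted(genes))
```

Each filter can only shrink the screened set, which is why the filtered counts are never larger than the count from promoter screening alone.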

Supp. Figure S1 - Sequence conservation and identity of NAC protein DNA-binding domains.
A) NAC DNA-binding domain sequence similarity tree for all studied NAC proteins, showing three main clusters for our candidate TFs. Cluster I contains ANAC092, NST2, SND1, VND3, VND7, ANAC019, ANAC055, ATAF1 and NAP; Cluster II contains NTL6 and NTL8; and Cluster III contains VOZ2, ANAC003 and SOG1.
B) Multiple sequence alignment of the DNA-binding regions of the selected 14 NAC proteins. Residues that lie close to DNA in the X-ray model of the ANAC019-DNA complex are marked by a bar. Residues in black boxes are common to at least half of the sequences, and residues in grey boxes are chemically similar in half of the sequences. Asterisks highlight residues that diverge markedly between NTL6, NTL8 and the remaining NAC proteins.

Alignment rows recovered from panel B (protein, position label, aligned residues):
NTL6     68  DSEWLYFCPLDRKYPSGSRQNR--ATVAG-----YWKATGKDRKIKSGKTNIIGVKRTLVFHAGRAPRGTRTNWIIHEYRATEDDLSGTNP
NTL8     69  -NEWFYFCARGRKYPHGSQSRR--ATQLG-----YWKATGKERSVKSGNQ-VVGTKRTLVFHIGRAPRGERTEWIMHEY-----CIHGAP-
NST2     67  QNDWYFYSHKDKKYPTGTRTNR--ATTVG-----FWKATGRDKTIYTNGD-RIGMRKTLVFYKGRAPHGQKSDWIMHEYR-LDESVLISSC
SND1     72  QNDWYFFSHKDKKYPTGTRTNR--ATVAG-----FWKATGRDKIICSCVR-RIGLRKTLVFYKGRAPHGQKSDWIMHEYR-LDDTPMSN--
VND7     65  QNEWYFFSHKDRKYPTGTRTNR--ATAAG-----FWKATGRDKAVLSKNS-VIGMRKTLVYYKGRAPNGRKSDWIMHEYR-LQNSELAP--
VND3     68  QTEWYFFSHRDKKYPTGTRTNR--ATVAG-----FWKATGRDKAVYLNSK-LIGMRKTLVFYRGRAPNGQKSDWIIHEYYSLESHQNSP--
ANAC092  74  -KEWYFFCVRDRKYPTGLRTNR--ATEAG-----YWKATGKDKEIFKGKS-LVGMKKTLVFYKGRAPKGVKTNWVMHEYR-LEGKYCIEN-

Lindemose et al. Supp. Fig. S1

Figure labels (k-mer patterns): T, TACGTA, TACGTC, TACGTG, TACGTT, TCCGT, TG.CGT, TG.CGTA, TG.CGTG, TGACGT, TGCCGT, TGCGT, TGCGT.Y, TGCGTA, TGCGTG, TGGCGT, TGTCGT, TT.CGCTC, TT.CGTA, TT.CGTG, TT.CGTT, TTA.CGT, TTACGC, TTACGT, TTCCGT, TTCGTA, TTCGTG, TTG.GTA, TTG.GTG, TTGC.TA, TTGC.TG, TTGCGT, TTGCTT, TTTCGT, Other_NTL, AAG

Supp. Figure S3 - A) Boxplots of enrichment score (ES) distributions for 130 signature 6-mers for all tested TFs. TF boxplots are grouped according to the clusters in Figure 1A. For each protein, the red box shows the ES distribution of k-mers containing the TACGTC key k-mer, which is specific for Cluster 1a and 1b proteins. The green box shows the ES distribution of k-mers containing the TAAGTA key k-mer, which is specific for Cluster 3 proteins. Axis label: Enrichment Score.

Supp. Figure S4 - Overview of the number of target genes for different NAC TFs and filtering approaches. The number of target genes is shown for each TF. Blue bars indicate the number of target genes obtained through simple screening of promoters with high-scoring k-mers. Green and yellow bars show the number of target genes when integrating co-expression information and conserved motif information, respectively.
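The grouping behind the red and green boxes in Supp. Figure S3 can be sketched as follows: collect the ES values of all k-mers that contain a given key k-mer, then summarise them with the five numbers a boxplot displays. This is an illustration only. Matching the reverse complement as well as the forward strand is an assumption, the function names are hypothetical, and the ES values are invented toy numbers rather than measured data.

```python
# Sketch: ES distribution of k-mers containing a key k-mer.
# Toy ES values below are invented for illustration only.
from statistics import median, quantiles

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def es_for_key(kmer_scores, key):
    """Sorted ES values of k-mers containing `key` on either strand."""
    rc = reverse_complement(key)
    return sorted(
        es for kmer, es in kmer_scores.items() if key in kmer or rc in kmer
    )


def box_summary(values):
    """Five-number summary as drawn in a boxplot (needs >= 2 values)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": q2,
            "q3": q3, "max": max(values)}


# Hypothetical 8-mer enrichment scores for one TF.
kmer_scores = {
    "ATACGTCA": 0.48,
    "TTACGTCG": 0.46,
    "GTACGTCT": 0.44,
    "ATAAGTAC": 0.10,
    "CTAAGTAG": 0.12,
    "GGGGCCCC": -0.05,
}

red_box = es_for_key(kmer_scores, "TACGTC")    # key k-mer of the red box
green_box = es_for_key(kmer_scores, "TAAGTA")  # key k-mer of the green box

print("TACGTC:", box_summary(red_box))
print("TAAGTA:", box_summary(green_box))
```

In this toy data set the TF would look like a TACGTC binder, because the k-mers containing that key k-mer have clearly higher scores than those containing TAAGTA.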