Prediction of cooperative homeodomain DNA binding sites from high-throughput-SELEX data

Abstract Homeodomain proteins constitute one of the largest families of metazoan transcription factors. Genetic studies have demonstrated that homeodomain proteins regulate many developmental processes. Yet, biochemical data reveal that most bind highly similar DNA sequences. Defining how homeodomain proteins achieve DNA binding specificity has therefore been a long-standing goal. Here, we developed a novel computational approach to predict cooperative dimeric binding of homeodomain proteins using High-Throughput (HT) SELEX data. Importantly, we found that 15 of 88 homeodomain factors form cooperative homodimer complexes on DNA sites with precise spacing requirements. Approximately one third of the paired-like homeodomain proteins cooperatively bind palindromic sequences spaced 3 bp apart, whereas other homeodomain proteins cooperatively bind sites with distinct orientation and spacing requirements. Combining structural models of a paired-like factor with our cooperativity predictions identified key amino acid differences that help differentiate between cooperative and non-cooperative factors. Finally, we confirmed predicted cooperative dimer sites in vivo using available genomic data for a subset of factors. These findings demonstrate how HT-SELEX data can be computationally mined to predict cooperativity. In addition, the binding site spacing requirements of select homeodomain proteins provide a mechanism by which seemingly similar AT-rich DNA sequences can preferentially recruit specific homeodomain factors.


INTRODUCTION
The differential control of gene expression is fundamental for the specification of distinct cell types during de v elopment. At the transcriptional le v el, sequence-specific transcription factors (TFs) regulate gene expression by binding cis -regulatory modules (CRMs) to inhibit or promote RNA polymerase activity through the recruitment of co-factors.
Hence, the binding of TFs to their corr ect r egulatory r egions is vital for proper de v elopment. Many CRMs are conserved across metazoans by sequence and / or function ( 1 , 2 ), and not surprisingl y, m uta tions in these regula tory regions have been associated with de v elopmental, autoimmune, and cardiovascular diseases as well as cancer ( 3 ). Thus, it is essential to define the DNA binding characteristics of TFs to better understand how each TF accura tely regula tes target gene expression and ultimately cellular fates.
Metazoan genomes encode numerous sequence-specific TFs that participate in gene regulation ( 4 ). Many of these TFs can be categorized into families based on conserved DNA binding domains, and often members within a famil y bind highl y similar DN A sequences. The homeodomain (HD) family is one of the largest TF families, consisting of almost 200 family members in humans ( 5 ). HD proteins have been separated into distinct classes based on conserved sequence features such as the presence of additional DNA binding domains in the P air ed, Pou, and CUT classes; amino acid insertions within the HD in the three amino acid loop extension (TALE) and Pr osper o classes; and conserved amino acid motifs in the NK-like, HOXlike, and P air ed-like classes ( 6 ). All members of the HD family encode a helix-turn-helix DNA binding domain that contains three alpha helices. Structural and mutation studies re v ealed that HD proteins use this DNA binding domain to mediate direct contact to a AT-rich core DNA motif such as TAATNN through largely conserved residues. Within the typical 60 amino acid HD, the conserved Arginine 5 (R5) and R3 residues in the N-terminal Arginine-Rich Motif (ARM) of the first alpha-helix contact the first and second DNA positions on the minor groove of DNA, and the common N51 and I / V47 in the third alpha-helix contact the third and fourth DNA positions in the major groove of DNA ( 7 , 8 ). In addition, the 50th residue of the HD, which can vary between differ ent r esidues including Q50, K50 and S50, contacts the fifth and sixth DNA positions on the major groove of DNA, and thereby contributes to the binding site specificity of these DNA positions (6)(7)(8)(9)(10). Additional amino acids that vary between HDs can impact DNA binding specificity ( 10 ), but the above listed residues are thought to be the dominant dri v ers of site specificity.
Gi v en the high-degree of sequence conservation in residues that contact DNA between HD factors, it is not surprising that the majority of HD TFs bind similar AT-rich sequences in vitro ( 6 , 11 ). Howe v er, in sharp contrast to their similar in vitro DNA binding activities, most HD TFs regulate distinct de v elopmental pathways in vivo , and the mechanisms by which members of the HD family dif ferentia te between like DNA sequences to bind and regulate distinct targets is not well understood ( 12 , 13 ).
One solution to the specificity problem is that TFs can form homo-and heterodimer complexes that bind distinct sequences and / or lengthen the recognition site. For example, while the Hox TFs that specify anterior-posterior identities in metazoans have relatively low sequence specificity as monomers, Hox TFs form heterodimer complexes on DNA with the PBX factors, and ther eby incr ease their DNA binding specificity ( 14 ). Further, some P air ed-like HD factors, which contain a HD most homologous to those found in PAX factors but lack the accompanying paired domain ( 6 ), bind as homodimers and heterodimers with other members of the P air ed-like class to increase specificity ( 15 ) and alter transcriptional output ( 9 , 16 ). Likewise, TFs outside of the HDs including the bZIP ( 17 ), bHLH ( 18 ) and nuclear receptor ( 19 ) families bind DNA as obligate homodimers and / or heterodimers, and thereby lengthen the DN A reco gnition sequence. Many of these dimeric TF complex es ar e considered cooperati v e since the binding of the second protein is largely facilitated or dependent upon the binding of the first protein.
Gi v en the central importance of TFs in regulating gene expr ession, a gr eat deal of effort has been put toward defining the DNA sequence binding pr efer ences of each TF. In vivo methods such as chromatin immunoprecipitation sequencing (ChIP-seq) and cleavage under targets and release under nuclease (CUT&RUN) can identify genomic regions bound by a target TF. Howe v er, both methods r equir e either a high-quality antibody for each TF or expression of a tagged version of the protein which may alter TF activity. Ther e ar e also an endless number of combinations of biological tissues, de v elopmental stages, and TFs to test with these methods. Further, the most highly enriched motif in a ChIP-seq experiment is not always the motif for the TF examined in the assay. An alternative approach has been to use in vitro assays such as protein binding microarrays (PBMs) and high throughput sequencing of systematic evolution of ligands by exponential enrichment (HT-SELEX) assays, which have been used to define the sequence preferences for hundreds of transcription factors in a standardized synthetic environment ( 11 , 20-23 ). A strength of PBMs is that microarrays can be designed to contain each 8mer sequence 32 times and the assay provides semi-quantitati v e binding information since binding is measured using a fluorescent protein that does not r equir e amplification. Howe v er, PBMs hav e less utility in systematically identifying motifs longer than 8-10 bps. In contrast, HT-SELEX uses random sequences that are between 20 and 40 bps and can detect the binding of longer motifs, including those bound by TF m ultimers. Until recentl y, SELEX assays onl y obtained qualitati v e binding information unlike PBMs. Howe v er, Rube et al. de v eloped the machine learning algorithm, ProBound, to compute quantitati v e TF binding models and absolute affinities from a modified SELEX-seq protocol, kD-seq ( 24 ).
As expected, analysis of the TF binding motifs for human and mouse HD factors using the PBM and HT-SELEX methods re v ealed highly similar AT-rich monomer binding sites for most proteins ( 11 , 20 ). In addition, the HT-SELEX assays showed that some HD TFs enriched for both monomer and dimer sites, suggesting that a subset of HD TFs may gain DNA binding specificity by binding DNA as homodimers ( 11 ). Moreover, as part of a previous study on how the mouse HD TF, Gsx2, regulates gene expression during forebrain de v elopment, we re-analyzed the human GSX2 HT-SELEX data that was originally found to only enrich for a monomer site, and identified a significantly enriched dimer site consisting of 2-TAAT motifs spaced 7 bp apart ( 25 ). We subsequently used electrophoretic mobility shift assays (EMSAs) to show that Gsx2 cooperati v ely bound this novel dimer site. This capability of Gsx2 to regulate gene expression through both short monomer sites and longer dimer sites with specific spacing and orientation r equir ements enhances the DNA binding specificity of this factor compared to other HD factors. In this study, we broadly assess the prevalence of homodimer cooperativity in the HD family by first de v eloping a computational approach to predict cooperati v e HD DNA binding from existing HT-SELEX data, and second, by using quantitati v e EMSAs to systematically test a subset of the HDs for cooperati v e DN A binding. Importantl y, our findings re v eal how the relati v e rate of binding site enrichment within HT-SELEX assays and the selection for a specific spacer length within the dimer site can be utilized to accurately predict cooperati v e TF binding. Further, we show that a subset of the P air ed-like class of HDs cooperati v ely binds palindromic dimer sites spaced 3 bp apart, and we explore how amino acid differences between paired-like HDs impact cooperati v e DNA binding. In contrast to the consistent spacing r equir ements of the paired-like class, we found that a subset of other HD proteins bind cooperati v ely to unique binding site arrangements. We then re-analyzed available genomic data and found evidence that these dimeric sites are being bound in vivo f or man y of these HDs. Taken together, these findings highlight how HT-SELEX data can be analyzed to identify members of the HD family that enhance their DNA binding specificity by forming cooperating complexes on distinct dimer DNA binding site configurations.

HT-SELEX TF binding dataset acquisition and analysis
The in vitro HT-SELEX data analyzed using the Cooperativity Predictor was acquired from European Nucleotide Archi v e under accession entry, PRJEB3289 ( 11 ). The run accession IDs and download links for all the HD TFs analyzed from the HT-SELEX study are listed in Supplementary Table S1. All SELEX datasets performed with a HD DNA binding domain that had an available initial library and utilized a ligand 20 bp or longer were analyzed with the Cooperativity Predictor pipeline (code available at https://github.com/cainbn97/Cooperativity predictor ). The steps and rationale for the HD analysis criteria within the Cooperativity Predictor pipeline are provided in the Results and Figure 1 . In brief, the general steps were as follows: (i) Homer de novo motif analysis for 16-18 bp motifs was performed using 50 000 randomly selected sequences from the fourth round of HT-SELEX selection to define potential dimer sites. (ii) The two highest information content 4mer sequences that were at least 2 bp apart were defined as core 4mer motifs. The 4mer PWM was then generated with the seq2profile.pl script in Homer. (iii) The rate and ratio of dimer to monomer site enrichment after each cycle were calculated using the initial library and the data from each HT-SELEX cy cle. (i v) COSMO determined the number of dimers that were present at each spacer length after each HT-SELEX cycle ( 26 ). (v) The enrichment of a single spacer length from the fourth cycle HT-SELEX selection was tested with a Grubb's test for outliers and the change in dimer proportion between the initial library and the fourth cycle of SELEX selection was tested with a chi-square test for independence. The pipeline results for the HD family are provided in Supplementary Figure S3. PWMs of the enriched long motifs that contained dimer sites are listed in Supplementary Table S2. The Jaspar formatted PWMs (jpwm) used in the COSMO analysis can be found in Supplementary Table S3. The multiple sequence alignments were generated using the HD amino acid sequences used in the HT-SELEX assay with the MSA package ( 27 ) and the phylograms were plotted with the ape package ( 28 ) in R.

Genomic TF binding dataset acquisition and analysis
For the in vivo genomic data analysis, each respecti v e ChIPseq and CUT&RUN dataset was acquired from Gene Expression Omnibus using the accession IDs listed in Supplementary Table S4. For consistency, all genomic binding assay datasets were repr ocessed fr om their raw fastq files. The sequencing data underwent adaptor trimming with Cutadapt ( 29 ) and quality control with FastQC via the wrapper, TrimGalor e. Results wer e mapped to hg19 or mm10 using Bowtie2 ( 30 ), and duplicates were removed with Picard (Broad Institute). All reads longer than 150bp were removed prior to further analysis in CUT&RUN processing. Peaks were called with MACS3 using reads across all r eplicates ( 31 ). Peaks wer e extended to 1kb on each side of the MACS3 summits with BEDTools ( 32 ), and motif densities across these regions were determined via Homer ( 33 ). Datasets in which less than 1200 peaks were called had a moving average smoothing function applied to the motif densities to increase readability of overall trends. The log2ratios of the binding signals between the IP experiment and the IgG control were calculated and plotted using deep-Tools ( 34 ). The number of dimer sites at each spacer length within the called narrow peaks from MACS3 (not the extended peak summits) was determined via COSMO ( 26 ).

Molecular cloning
TF sub-fragments containing HD and flanking sequences were PCR amplified from cDNA clones obtained from Genscript. Oligonucleotide sequences used to amplify the TF sub-fragments are listed in Supplementary Table S5 (IDT). Accuzyme DN A pol ymerase (Bioline) or GoTaq Master mix (Promega) was used to PCR amplify regions. The Isx sub-fragment was synthesized through Genscript rather than PCR amplified from its cDNA clone. The amino acid sequences of the TF sub-fragments are listed in Supplementary Table S6. TF sub-fragments were ligated into a bacterial expression vector with T4 DNA ligase (NEB). The bacterial e xpression v ector was either pET-14b (Novagene), which contains an N-terminal 6xHis-tag, or a modified version of pET-14b called pET-14P, in which additional restriction enzyme sites and a PreScission Protease site was inserted between the His-tag and TF coding sequence in place of the original thrombin cleavage site. The vector used for each TF is indicated in Supplementary Table S6 and the full sequence of pET-14P is provided in the supplementary information. Isx, VENTX, Gsx1, MSX1, Msx2 and HESX1 were cloned in between BamHI and NotI sites. Cart1, Alx4, Arx, Barx1, Bsx and Gsx2 were cloned in between NdeI and XhoI sites. All constructs were confirmed via DNA sequencing. Figure 1. The Cooperativity Predictor uses dimer to monomer site enrichment rates and spacer length constraints between sites to predict the cooperativity of HD TFs from HT-SELEX data. ( A ) A de novo motif analysis for long motifs (16-18bps) was performed using the cycle 4 sequencing pool of HT-SELEX for each HD TF. ( B ) The two 4mers with the highest information content within each PWM were selected to define each site. Each generated PWM was then interrogated for dimer sites using the following criteria: First, the information content of the two 4mer sites (Site 1 and Site 2) had to be greater than 0.6. Second, the information content of the 4mer sites must be 1.5 times greater than those of the surrounding regions. Third, the motif had to be present in at least 5% of the sequences in the cycle 4 sequencing pool. Note, those numbers shown in bold b lue te xt passed the selected criterium. ( C ) For each dimer site that passed the selection criteria, the two 4mers and the spacer length wer e defined. ( D ) The per centage of sequences with dimer sites and monomer sites were determined after each cycle and used to calculate the fold changes of the dimer and monomer sites. Gi v en the exponential nature of cycle amplification in PCR, the data was linearized through a natural log transformation. The enrichment factor is defined as the slope of the dimer enrichment over the average slopes of the 2 individual 4mer sites. ( E ) COSMO was used to count the number of dimers composed of the 2 4mers at each spacer length. Note, cycle 0 is the initial library prior to any selection process. The specific spacer length enrichment in cycle 4 was tested via a Grubb's test for outliers, and the dimer proportions between the initial library and cycle 4 were compared with a chi-square test for independence.

Protein purification and electrophoretic mobility shift assays (EMSA)
TF sub-fragments containing HD and flanking regions were purified either from BL21 (DE3) E. coli under nati v e conditions without dialysis using Ni-chromatography as previously described ( 35 ) or via the following method: The e xpression v ector was transformed into C41DE3 (Sigma-Aldrich) E. coli and bacteria were grown in autoinduction media at 37 • C for 3 h and then cooled to 20 • C overnight. The cultur es wer e harvested by centrifugation. Cell pellets wer e r esuspended in binding buffer (1XBB; 20 mM Tris pH 8, 500 mM NaCl, 5 mM Imidazole), lysed by sonication, cleared by centrifugation, and loaded onto Ni 2+ beads. Beads were then loaded into a gravity column, washed with 1XBB with 0.1% Triton twice and 1XBB with 0.1% NP40 once. Protein was eluted using 1XBB with 0.1% NP40 and 0.5 M Imidazole. The eluted protein was dialyzed, and the His-tag cleaved with PreScission Protease (GE Healthcare) per the manufacturer's pr otocol. Pr otein was further purified via cation exchange and size exclusion chromato gra phy. Finally, the protein was concentrated in a buffer containing 20 mM MES pH 6, 150 mM NaCl, 1% ethylene glycol and 0.1 mM TCEP.
TF purity was confirmed by SDS-PAGE with GelCode blue staining (Thermo Scientific) (Supplementary Figure  S11). Protein concentrations were determined by Bradford assays (Bio-rad). EMSA probes were prepared as previously described ( 36 ), and the sequences used for EMSA probes are listed in Supplementary Table S7. EMSA binding reactions were prepared as described previously ( 35 ) and incuba ted a t room tempera ture for 20 min before being run on a 4% polyacrylamide gel for 2 hours at 150V. Gels were imaged via Li-Cor Odyssey CLx scanner and monomer , dimer , and free probe bands were quantified via the Li-Cor image studio software. Calculation of cooperativity via the Tau factor was performed as previously described ( 15 ). In brief, the Tau factor calculation is based on the dissociation constants deri v ed from the equilibrium reactions of a single protein binding to DNA ( K d1 ) and the binding of second protein to the protein-DNA complex ( K d2 ).
In this equation, [ P 2 D ] r epr esents the proportion of probe bound as a dimer, [ P D ] indicates the proportion of probe bound as a monomer, and [ D] is the proportion of unbound probe. The way in which the binding of the first protein facilitates the binding of the second protein is the coefficient of K d2 / K d1 , or the Tau factor.

Modeling HD variants in paired-like subclass
Variants were modeled on the Drosophila paired HD (PDB: 1FJL) ( 37 ) using the mutagenesis wizard in Pymol v2.2.0 (Schr ödinger). The disks and colors indicate pairwise overlap of atomic van der Waals radii. Large red disks indicate significant van der Waals overla p, w her eas gr een and yellow disks r epr esent minor overlap. The rotamer that demonstrated the least amount of clashing was modeled for each of the variants.

Pr edicting tr anscription factor cooper ativity using HT-SELEX data
Inspired by our prior successful demonstration of the cooperati v e binding of GSX2 to bioinformatically identified dimeric DNA sites and the large amount of available HT-SELEX data for HD TFs, we de v eloped a computational pipeline, termed the Coopera tivity Predictor, tha t examines the growth rate of dimer versus monomer motif enrichment and the pr efer ence for a specific spacer length between sites within each HT-SELEX cycle to predict cooperative binding behavior. First, we used Homer de novo motif analysis ( 33 ) to identify potential dimer sites by generating position weight matrices (PWMs) r epr esenting the most enriched long motifs (i.e. 16 or 18mers) after the fourth cycle of HT-SELEX compared to the initial unselected DNA library (Figure 1 A). Our rationale is that enriched long motifs are likely to contain more than one HD site, consistent with a potential dimer binding site. Using GSX2 as an example, Homer identified four enriched long motifs to consider as potential dimer sites (Figure 1 B). Howe v er, automating this unbiased approach re v ealed two fundamental challenges.
First, the HT-SELEX method does not directly differentiate between monomer versus dimer binding. Thus, enriched long motifs could either consist of a single TF (i.e. a monomer) bound to a single long site or r epr esent two TFs (i.e. a dimer) bound to two independent sites. To help discriminate between long monomer sites versus dimer sites, we first identified the two non-overlapping 4mer sequences that have the highest information content within each motif (boxed in Figure 1 B). 4mers were used in this analysis as structural studies have revealed that conserved residues in the HD primarily mediate direct contact to a core 4mer sequence ( 7 , 8 ). We then r equir ed the non-overlapping 4mers to be at least 2 bp apart and the average information content for each 4mer to be > 0.6 (see Supplementary Figure S1A-B,E for how we established the 0.6 threshold). Importantly, we used all available human HD datasets to establish the information content thresholds (Supplementary Figure S1E). A ppl ying the site information content threshold to the four enriched GSX2 motifs re v ealed that only the first and fourth motifs had sufficient information content within each 4mer sequence to pass this criterion (Figure 1 B).
The second challenge is that HT-SELEX relies upon nonlinear sequence amplification due to the polymerase chain reaction (PCR) between cycles. Thus, enriched dimer motifs may r epr esent r elati v ely rare binding e v ents that hav e been artificially enriched due to PCR bias. In general, if a sequence is over-amplified, the replicate 20mers and 30mers produce motifs with high information content across the entire PWM and not just the core 4mer sequences (Supplementary Figure S1D). To eliminate such motifs, we required the information content within the 4mers to be 1.5 times greater than the surrounding flanking and spacer sequences (see Supplementary Figure S1C-E for how we empirically established 1.5 as a threshold). In essence, by requiring sufficient variability in information content between the nucleotide positions of the PWM, we ensure that the motif was generated by multiple distinct sequences rather than by a few over-amplified sequences. When a ppl ying this filter to the GSX2 motifs, we found that the fourth motif, which had sufficient information content within the 4mers to pass the first criterion, failed this criterion due to very little variation in information content between the 4mer sites and surrounding sequence. Consistent with this motif r epr esenting a relati v ely rare binding e v ent, we found that < 1% of the sequences after the fourth selection cycle contained this motif. In contrast, GSX2 motif 1, which encodes the previously identified cooperati v e GSX2 binding site ( 25 ), passed this second criterion and was found in over 30% of the selected sequences. To ensure the exclusion of rare binding e v ents prior to further analysis, we only considered dimer sites that occurred in at least 5% of the cycle 4 HT-SELEX datasets (Figure 1 B). A visual demonstration of how these thresholds impacted dimer site selection of a sampled set of TFs is shown in Supplementary Figure S2.
Having selected candidate dimer sites using motifs from the fourth HT-SELEX cycle, we next incorporated HT-SELEX's multiple rounds of selection to calculate the rate of enrichment of dimer sites versus monomer binding sites. Our rationale is that if a TF forms a cooperati v e homodimer on DNA, the added pr otein-pr otein interactions in dimer binding and / or DNA conformation changes triggered by the binding of the first protein will increase complex stability ( 38 , 39 ). Thus, cooperati v e dimer sites should enrich at a faster rate through the HT-SELEX cycles than monomer sites. To a ppl y this idea, we compared the enrichment rate of dimer versus monomer sites through each of the four HT-SELEX cycles. Since dimer sequences contain two binding sites that would also be considered as monomer sites, we masked sequences that Homer predicted contained a dimer site when calculating the rate of monomer site enrichment. Using this approach, we calculated the number of sequences after each selection cycle that do not have a dimer site but contain at least one of the selected 4mer sequences (e.g. TAAT for GSX2). We defined the fold change as the ratio of the number of sites after each SELEX cycle versus the initial library. As sites undergo exponential enrichment due to SELEX selection and PCR amplification ( 23 ), we linearized the fold enrichment by taking the natural logarithm. Enrichment slopes of dimer and monomer sites were calculated, and the enrichment factor was defined as the ratio of the dimer site enrichment slope and the average of the two monomer site enrichment slopes. In this analysis, a positi v e prediction of cooperativity was defined by an enrichment factor > 2 as the probability of finding the dimer site was more likely than finding the two monomer sites. A ppl ying this anal ysis to the GSX2 HT-SELEX data using the predicted dimer motif in Figure 1 C and the 4mer sites (TAAT) re v ealed a calculated slope of 1.51 for the dimer site versus a slope of 0.4 for the 4mer sites ( Figure  1 D). Hence, the enrichment factor for GSX2 was calculated to be 3.78. Thus, the rate of GSX2 dimer to individual site enrichment across the HT-SELEX cycles positively correlates with GSX2's ability to cooperati v ely bind dimer sites.
An alternati v e e xplanation for the fast rate of dimer site enrichment is that HT-SELEX may simply select sequences with any two sites at a faster rate than those sequences with only a single site. Since cooperati v e TFs typically bind dimer sites with a specific spacer length ( 5 , 11 , 38 ), we next assessed for the selecti v e enrichment of specific spacer lengths between sites using combinatorics of stereospecific motif orientation (COSMO) ( 26 ). COSMO systematically counts the number of sequences with two sites at variable spacer lengths. We applied this program to count the number of dimers consisting of the 4mers at each spacer length after each round of HT-SELEX (Figure 1 E). The pr efer ence of a specific spacer length in the cycle 4 sequencing pool was tested with a Grubb's test for outliers, and the proportions of dimers at each spacer length were compared between the initial library and cycle 4 using a chi-square test for independence. If P < 0.05 for both tests, the TF was considered to select for a dimer site with a specific spacer length which positi v ely correlates with cooperativity. A ppl ying COSMO to the GSX2 HT-SELEX data re v ealed that only TAAT sites separated by a 7 bp spacer length passed the Grubb's outlier and chi-square test for independence (Figure 1 E). In summary, the Cooperativity Predictor pipeline considers both the added stability that cooperativity provides using the rate of site enrichment as well as spacer length specificity using COSMO to predict cooperativity using HT-SELEX data.

Cooper ativity pr edictor identified pr eviously known and novel cooperative DNA binding sites
The HD family has been separated into distinct classes based on defining features that include the presence of additional DNA binding domains (i.e. the P air ed, Pou, and CUT domains) ( 6 ). As these TFs are well known to use both DNA binding domains to bind dimer sites, we focused our analysis on HD TFs that lack well characterized domains that contribute to dimer binding. Mining the available human HT-SELEX data, we applied the Cooperativity Predictor to 88 human HD proteins (Figure 2 ; Supplementary Figure S3). Of the 88 HD proteins analyzed, 28 (32%) had at least one detectable dimer site within the long motif (16-18 bp) identified by Homer after cycle 4 (Supplementary Figure S3), and of these, 15 had specific spacer r equir ements within the highly enriched dimer site ( Figure  2 ). In addition, 24 mouse HD proteins have HT-SELEX da tasets tha t met our analysis r equir ements. Of these, 16 were orthologues to the above human HD factors. Analysis of the mouse data re v ealed that 7 of the 24 mouse TFs wer e pr edicted to bind cooperati v ely ( Supplementary Figure S3). Importantly, we found that the cooperativity predictions of the mouse and human orthologs were consistent for 14 of the 16 orthologues tested. For example, like the human CART1 and ALX4 TFs, the mouse orthologs Alx1 and Alx4 enriched for a sequence containing two palindromic sites spaced 3bp a part, w hich is consistent with past studies ( 9 , 40 , 41 ) (Supplementary Figure S3). In addition, the mouse and human orthologs of ARX and UNCX wer e pr edicted to bind cooperati v el y to highl y similar sites, and ther e wer e 10 cases in which both orthologs were not predicted to bind DNA cooperati v ely (Supplementary Figure S3).
To define the relationships between these HD TFs, we used the human HD factors to generate phylogenetic trees based on multiple sequence alignments of the protein re-  ( 6 ). Prior studies had not re v ealed that NK-like and HOX-like family members bind DNA as cooperati v e homodimers, e x cept for Gsx2 (members categorized b y gr ay dividers). In contr ast, some members of the P air ed-like family (b lue di vider) wer e pr e viously shown to cooperati v el y bind DN A ( 15 ). The Coopera tivity Predictor identified TFs tha t ar e pr edicted to bind cooperati v ely (bolded and colored in b lue te xt). (Top) Fi v e members of the NK-like class were predicted to bind cooperati v ely to four unique binding arr angements. B ARX1 and MSX2 wer e pr edicted to bind cooperati v el y to similar DN A binding sequences. (Bottom left) Two of the 22 members of the HOX-like class were found to bind to 1 unique binding arrangement, in which close homologues GSX1 and GSX2 both were predicted to bind cooperati v ely to a similar site. (Bottom right) Eight of the 23 P air ed-like factors were predicted to bind palindromic sites 3 bp apart.
gions tested in the HT-SELEX assays and classified these proteins into the NK-lik e, HOX-lik e, and P air ed-like subclasses as described by Burglin and Affolter ( 6 ) (Figure 2 ). Ther e wer e select human TFs in each subclass that were predicted to bind DNA cooperati v ely ( Figure 2 ). Notably, in some cases, the PWMs were largely identical, although differences in information content caused the selected top 4mers to shift by one or two nucleotide positions. For example, the BARX1 and MSX2 NK-like factors bound nearly identical PWMs, but the consensus sequences defined by the highest information content were T AAT6NTT AA and AA TT7NAA TT respecti v ely (Supplementary Figure S4A). Thus, these motifs wer e consider ed one unique binding arrangement. In total, fiv e members of the human NK-like class were predicted to bind cooperati v ely to four unique binding arrangements with spacer lengths between 3 and 9 bp (Figure 2 , top). Most of these NK-like TFs are not near one another in the phylogram tree and the unique binding arrangements found suggest that they independently obtained their dimer binding site pr efer ences.
The classic HOX factors that specify segment identities along the anterior-posterior axis are known to bind in complex with PBX and MEIS factors to increase TF-DNA specificity by lengthening the recognition sequences ( 14 , 42 , 43 ). Howe v er, members of the HOX-like subclass are not thought to bind as cooperati v e homodimers. Consistent with this idea, the Cooperativity Predictor did not predict any cooperati v e dimer sites for the human HOX-like factors apart from GSX1 and GSX2 (Figure 2 , bottom left). Mor eover, we pr eviously confirmed that the mouse Gsx2 protein cooperati v ely bound this predicted dimer sequence biochemically in vitro and via CUT&RUN analysis in vivo, and we found that the fly GSX homolo gue, Ind, cooperati v ely binds this dimer motif ( 25 ). Hence, the prediction that the close homologue GSX1 also cooperati v ely binds to a nearly identical enriched homodimer site suggests this activity is a conserved feature of the GSX / Ind HD TF family.
Se v eral members of the P air ed-like class of HD factors have been previously shown to cooperatively bind palindromic DNA sequences spaced 3 bp apart ( 15 ). Here, we greatly expanded the number of these factors with this activity by predicting that 8 of the 23 human members of the P air ed-like class cooperati v ely bind to highly similar palindromic 4mer sequences (Figure 2 , bottom right; Supplementary Figure S4B). The NK-lik e, HOX-lik e, and P air edlike HD cooperativity predictions re v ealed a di v erse set of spacing r equir ements from 3 bp for the P air ed-like HDs to 4 bp for BSX to 9 bp for VENTX. Intriguingly, howe v er, the unrelated BARX1 and MSX2 NK-like HDs and the GSX1 and GSX2 HOX-like HDs shared highly similar dimer sites (T AAT7NT AAT) (Supplementary Figure S4A). Taken together, these data suggest that a subset of HD TFs utilize dimer formation to increase DNA binding specificity through the binding of differentially spaced sites, yet others may still compete with one another for specific dimer sites under certain cellular conte xts. Howe v er, to our knowledge, only three of these factors (CART1, ALX4, and GSX2) hav e been e xperimentall y shown to biochemicall y bind such sequences in a cooperati v e manner.

Experimental validation of spacer-dependent cooperativity of homeodomains in vitro
To test the accuracy of the Cooperativity Predictor, we assessed TF spacer-dependent cooperativity using purified proteins and EMSAs. Since TF cooperativity can be masked by affinity and affinity can be heavily dependent on the flanking and spacer sequences ( 44 ), we used a systematic approach to select fiv e different DNA probes for each HD to increase the likelihood that TF cooperativity is accurately assessed in EMSAs. These probes were designed using the following criteria: Two probes were selected based upon the most frequent random 20-30mers in the fourth cycle of the TF's HT-SELEX assay, labeled SELEX 1 and SELEX 2. One probe was designed to r epr esent the optimal sequence based upon the highest sequence information content at each position across the PWM identified by Homer, labeled HOMER. Lastly, two probes were designed to contain only the core 4mer sequences separated at the ideal distance with all other sequences in the probe being randomly generated GC nucleotides to avoid making additional AT-rich HD binding sites. These two probes are labeled as RULE 1 and RULE 2. Schematics of the probe design are presented in Supplementary Figure S5. Using purified Gsx2 as an example, we found that Gsx2 pr efer entially bound as a dimer to all designed probes (Figure 3 A). Howe v er, additional flanking sequences are likely to greatly contribute to Gsx2-DNA binding, as Gsx2 did not bind as well to the RULE 1 and RULE 2 probes that contained a ppropriatel y spaced TAAT sequences flanked solely by GC nucleotides (Figure 3 A).
To provide a quantitati v e assessment of HD dimer binding, we next selected the probe with the highest affinity (i.e. for Gsx2, the Homer probe had the highest percentage of depleted free probe when tested at the same protein concentrations) to calculate a cooperativity factor, T au. T au is the multiplier that defines how much the binding of a single site facilitates the binding of the second site (see Methods and ( 15 )). A Tau multiplier of 1 would indicate independent, non-cooperati v e binding, whereas a Tau multiplier > 1 would indica te coopera ti v e DNA binding. To determine if cooperativity was spacer specific, we also tested a probe in which a single nucleotide (G) was added in the center of the spacer of the selected probe and called this the +1 probe. If the TF bound cooperati v ely in a spacerdependent manner, then the addition of a single nucleotide should disrupt cooperati v e binding to the probe, and the Tau factor for the selected probe versus the +1 probe should be statistically significant. For example, the optimal highaffinity HOMER probe (T AAT7NT AAT) strongly favored the binding of two Gsx2 proteins, whereas the HOMER + 1 (T AAT8NT AAT) probe was largely bound in an additi v e manner (Figure 3 B). We performed these EMSAs in triplicate and calculated the Gsx2 Tau factor for each probe and found that the rate at which the second binding site of the HOMER +1 probe was being filled was significantly lower than the rate of the second site of the HOMER probe (Figure 3 B, C; P = 1.06E −06). We next performed the same test using DLX2, a HD factor that was not predicted to have a cooperative dimer site. Similar as Gsx2, Dlx2 bound poorly to the RULE 1 and 2 probes, suggesting that flanking sequences significantly impact Dlx2 binding. Howe v er, . Schematics of the protein-DNA complex es ar e shown to the right of each gel. ( C , F ) Tau cooperativity factors were calculated for each binding reaction in which the TF was added. Bar graphs depict the average Tau factor for each TF with each dot r epr esenting a Tau factor from an individual binding reaction ( n = 12 for each gr oup). Err or bars denote standard deviation. Tau cooperativity factors were compared with two-sided unpaired student t-tests. Note, Gsx2 has a Tau of 189 on the Homer probe, indicating that the binding of the first site facilitates the binding of the second site 189-fold. Howe v er, the addition of the nucleotide to the spacer disrupted cooperativity and caused a significant decrease in Tau. In contrast, Dlx2 has a Tau factor of ∼1 on both probes, indicating that the binding of the first site has no impact on the binding of the second site to either pr obe. ( G ) This appr oach was applied to 9 TFs that were predicted to bind cooperati v ely (b lue te xt) and 5 TFs that were not predicted to bind cooperati v ely (gray te xt). TFs that were predicted to bind cooperati v el y had significantl y higher Tau factors on their respecti v e high affinity pr obes (Pr obe Tau) compared to the probes in which a single bp was added between the two sites (+1 Probe Tau), demonstrating spacer dependent cooperati vity. Conv ersely, TFs that were not predicted to bind cooperati v ely had similar Tau values between the two probes. Data points that fall on the gray dashed line have equal Tau factors on both probes. ( H ) Bar graph comparing the log 10 P -values of the Tau values for the high affinity compared to the +1 probes for each TF. P -values that are higher than the dashed line indicate significance. TFs predicted to bind cooperati v ely had significantly dif ferent coopera tivities between the two probes, wher eas pr edicted non-cooperati v e TFs did not with the e xception of Barhl2. Note, howe v er, that Barhl2 was weakly cooperati v e to both probes and had a significantly higher Tau factor for its +1 probe compared to its high affinity probe, demonstrating spacer independent cooperativity. as predicted, Dlx2 did not bind cooperati v ely to any of the probes, and the spacer length had no significant impact on the binding characteristics of Dlx2 to DNA (Figure 3 D-F; P = 0.31), consistent with independent monomeric binding to each site.

Cooper ativity pr edictor accur ately classified homeodomains' cooperativity capabilities
To assess the accuracy of the Cooperativity Predictor more broadly, we tested a total of 14 TFs for cooperativity using EMSAs. Of these, 9 passed all cooperativity criteria, whereas 5 did not have a predicted cooperati v e dimer site. As mentioned, the P air ed-like Alx1 and Alx4 ( 9 , 40 ) factors and the HOX-like Gsx2 factor ( 25 ) have been previously confirmed to cooperati v ely bind their respecti v e predicted dimer sites. We purified and tested these proteins in EM-SAs to ensure that our biochemical methods produced results that were consistent with past work. In addition, we tested the Gsx1 HOX-like factor; the Bsx, Barx1, Msx2 and VENTX NK-like factors; and the Arx P air ed-like factor that were all predicted to bind cooperati v el y. Importantl y, comparati v e analysis of the EMSAs ( Supplementary Figure S6; Supplementary Figure S7) re v ealed that all the predicted cooperati v e HD TFs had strong spacer length preferences as evidenced by their significantly higher Tau values for their high affinity dimer probe compared to the +1 probe ( Figure 3 G-H). Howe v er, they did differ in their strength of cooperativity as the P air ed-like Arx, Alx1, and Alx4 factors as well as the Gsx2 HOX-like factor were highly cooperati v e (Tau > 100), whereas the Bsx and Barx1 NK-like factors were less cooperati v e (Tau < 10) (Figure 3 G).
As a control for our analysis, we similarly tested fiv e HD TFs (Barhl2, Dlx2, En1, MSX1 and HESX1) that were not predicted to bind DNA in a cooperati v e manner. Intriguingly, in a prior study, a multinomial method detected dimer sites for MSX1, HESX1 and EN1 ( 11 ). Howe v er, these dimer sites did not pass the absolute site content and / or information content variability thresholds established by the Cooperativity Predictor algorithm. Importantly, our EMSA and Tau value analysis confirmed these predictions as none of these TFs cooperati v ely bound their dimer sites in a spacer-dependent manner. The Tau factors of these TFs on their high affinity probes were all approximately 1, consistent with non-cooperati v e binding ( Figure  3 G). Further, the Tau factors between the probes of different spacer lengths were not statistically significant ( Figure  3 H). In contrast, the Tau factors of the TFs predicted to be cooperati v e were significantly greater than those of TFs not predicted to be cooperati v e on their respecti v e dimer sites ( P = 0.011), whereas there was not a significant difference between the calculated Tau values of the +1 probes of these groups ( P = 0.194). The only unexpected result was obtained for Barhl2, which was not predicted to be cooperati v e, but had a small, statistically significant increase in Tau on the +1 probe (6.4 ± 1.2) relati v e to its predicted high affinity probe (5.4 ± 0.6) ( Figure 3 H Figure S7L). This finding suggests that Barhl2 has relati v ely weak cooperati vity on DNA, but without a binding pr efer ence for a specific spacer length between sites. In total, these findings suggest that the Cooperativ-ity Predictor was able to find previously undetected dimer sites in the case of GSX1, GSX2 and BSX and was able to discriminate between cooperati v e v ersus non-cooperati v e sites with increased accuracy in the cases of MSX1, HESX1 and En1. Moreover, these findings demonstrate that the Cooperativity Predictor both (i) accurately predicted whether a TF could bind cooperati v ely and (ii) identified specific DNA sequences that facilitated this cooperati v e binding.

Cooper ativity pr edictions in the pair ed-like family corr elate with the presence of key amino acids
Analysis of the P air ed-like subclass of HDs re v ealed that ov er a thir d of these TFs ar e pr edicted to bind cooperati v el y to highl y similar, if not identical, palindromic sites 3 bp apart (Figure 2 , bottom right). We next wanted to determine if this cooperativity was related to specific amino acids that were conserved in the predicted cooperati v e TFs but not the non-cooperati v e TFs. To take a targeted approach, we took advantage of information deri v ed from the crystal structure of a S50Q variant Drosophila Paired HD bound as a cooperati v e dimer on the TAA T3NA TTA palindromic site ( 37 ) (Figure 4 A). The authors identified specific interactions between residues within the 60 amino acid Paired HD tha t facilita ted coopera tivity: (i) E42 was f ound to f orm a wa ter media ted intermolecular hydrogen bond with R44 and an intermolecular hydrogen bond with R3; (ii) I28 of one HD fits tightly against the N-terminal ARM of the second HD (Figure 4 A) and (iii) A43 creates symmetrical hydrophobic interactions between the two HDs ( 37 ) (Figure  4 A). Alignment of the predicted cooperati v e v ersus noncooperati v e P air ed-like factors r e v ealed that R3, E42, and R44 are conserved across all paired-like factors, howe v er the 28th and 43rd residues vary between the paired-like factors, possibly signifying that these residues ma y con vey cooperati v e v ersus non-cooperati v e acti vity (Figure 4 B).
To assess the impact of the different residues at the 28th and 43rd positions, we first modeled residue changes present in the cooperati v e and non-cooperati v e P air ed-like factors using Pymol, and then tested a subset of these predictions within the highly cooperati v e ALX4 protein using EMSAs and Tau factor calculations. At position 28, the branched nonpolar amino acids, valine and isoleucine, are found in all the predicted cooperati v e TFs (Figure 4 B-D). Howe v er, alanine, which is predicted to not make similar favorable nonpolar interactions is found in three of the non-cooperati v e TFs (Figure 4 E). In agreement with these predictions, we found that changing the ALX4 valine to alanine in the 28th position reduced cooperativity ∼7-fold (Figure 4 F-G). Moreover, a methionine at position 28, as seen in the PITX factors, is predicted to be unable to tightly pack with the N-terminal ARM due to van der Waals overlap (red disks) (Figure 4 H), and we found that a V28M variant in ALX4 negati v ely impacted cooperativity ∼6-fold (Figure 4 I-J). Next, we modeled the different residues found at position 43, which forms a hydrophobic interaction at the pr otein-pr otein interface between the HDs in the Drosophila P air ed structur e ( 37 ). The alanine and serine residues found in cooperati v e factors possess side chains that fit within this 4.2 angstrom gap without steric clashes (Figure 4 K-L). Howe v er, valine and aspar- nM. Schematics of the protein-DNA complex es ar e shown to the right of each gel. Tau cooperativity factors were calculated for each binding reaction in which the TF was added. Bar graphs depict the average Tau cooperativity factor for each TF with each dot r epr esenting a Tau factor from an individual binding reaction ( n = 12 for each gr oup). Err or bars denote standard devia tion. Tau coopera tivity factors were compared with two-sided unpaired Student t -tests. ( H ) A methionine at position 28, as in PITX, produces van der Waals overlap that would cause clashing at this interface (red discs). ( I, J ) An ALX4 valine to methionine mutation at position 28 reduced the cooperativity of ALX4 6-fold. ( K ) Alanine residues at position 43 in the P air ed structur e form a symmetrical hydrophobic interaction. ( L, M, P ) A serine at position 43 as found in OTX can also be accommodated between the two HD proteins, howe v er a valine as seen in the RAX factors or aspartic acid as seen in the VSX factors at position 43 is predicted to cause considerable clashing between residues as shown by the red discs. ( N, O, Q, R ) Quantitati v e EMSAs showed that altering the ALX4 residues from A43V or A43D decreased cooperativity ∼9 and 5-fold respecti v ely. *Note, although HT-SELEX performed on ALX3 did not enrich for the palindromic dimer site and the factor was not predicted to bind cooperati v ely by the Cooperati vity Predictor, it should be noted that Alx3 has been shown in EMSAs to bind cooperati v ely to the P air ed-like site in past studies ( 65 ). tic acid residues in this position, which are found in the non-cooperati v e, RAX and VSX P air ed-lik e TFs, w ould be unable to fit due to clashing between amino acids ( Figure  4 M, P). Experimentally, we found that ALX4 proteins with the A43V and A43D mutations reduced cooperativity ∼9and ∼5-fold, respecti v el y (Figure 4 M-R). Similarl y, proline and glutamic acid at the 43 rd positions as seen in PITX1 and GSC2 are also predicted to reduce cooperativity due to steric clash at this pr otein-pr otein interface (Supplementary Figure S8I-J). Importantly, these residue changes did not dramatically affect DNA binding affinity, as all protein variants had similar free (unbound) probe depletion (Supplementary Figure S8A-H).
While variants in the 28th and 43rd positions provide insight into why the RAX, SHOX, GSC2, PITX and VSX factors fail to enrich for the palindromic dimer site, these two positions alone fail to provide a structural explanation for fiv e of the predicted non-cooperati v e P air ed-like factors (Figure 4 B). Further, these residue changes only influenced the magnitude of cooperativity within the ALX4 protein and did not abolish the ability of this protein to bind in a cooperati v e manner. To ensure that the Cooperativity Predictor algorithm accurately classified the TFs in which these r equir ed r esidues wer e pr esent, we tested Isx, one of the predicted non-cooperati v e factors that has the pr eferr ed I28 and A43 residues found in cooperati v e factors, on an ideal P air ed-like TAA T3NA TTA site in EMSA. As expected, based on the Cooperativity Predictor output, Isx was unable to bind cooperatively to this DNA site (Supplementary Figure S8K), suggesting that additional residues influence the cooperativity of these factors. In summary, we found that residues at the 28th and 43rd positions can clearly contribute to cooperati v e DNA binding, but further studies will be needed to determine the other residues within the P air ed-like family that ar e r equir ed for cooperati v e complex formation on DNA.

Homeodomains bind cooperative dimer sites in vivo
We next sought to assess the in vivo prevalence of the predicted cooperati v e dimer sites using available genomic binding data (ChIP-seq and CUT&RUN) for the Gsx2, PHOX2B, Phox2a, MSX2 and Barx1 TFs through the following gener al str ategy: First, we separ ated the called peaks into three categories: peaks that contained at least one HT-SELEX dimer site, peaks that contained at least one HT-SELEX 4mer but not a dimer site, and peaks that contained neither site. Second, we plotted the motif densities of the dimer-dependent and dimer-independent sites for each category to confirm motif enrichment at the peak centers. Third, we plotted the fold changes of the r ead str engths to compare the binding signals between these categories of peaks. Fourth, we applied COSMO on the called peaks to determine if a dimer site consisting of the 4mers separated at the identified spacer length found in HT-SELEX was also specifically enriched in the genomic binding data.
We first validated this approach by a ppl ying it to Gsx2 mouse forebrain CUT&RUN data, in which Gsx2 was previously shown to bind both cooperati v e and noncooperati v e sites in vivo ( 25 ) . Analysis of the dimer and 4mer motifs re v ealed each was highly enriched at the peak centers for each respecti v e category, consistent with Gsx2 binding both types of sequences (Figure 5 B). Moreover, the binding signal of the peaks with dimer sites was slightly higher than those peaks with only 4mer sites (Figure 5 C), suggesting that dimer sites ar e mor e fr equently bound and / or that the interactions between two proteins improve the stability of the Gsx2-DNA complex. Further, the category of peaks that did not contain either site type was the smallest and had the lowest binding signal (Figure 5 C), consistent with Gsx2 requiring HD sites to bind and regulate target genes. Lastl y, COSMO anal ysis of the peaks re v ealed that dimer sites spaced 7 bp apart occurr ed mor e fr equently than those at alternati v e spacer lengths (Figure 5 D), consistent with the findings from the HT-SELEX analysis ( Figure  1 E). Hence, this analysis is consistent with past work that showed Gsx2 binds both cooperati v e and non-cooperati v e sites in vivo ( 25 ).
We similarl y anal yzed available data for the following other cooperati v e TFs: PHOX2B ChIP-seq from the KELLY ( 45 ), BE2C ( 45 ) and CLB-GA ( 46 ) neuroblastoma cell lines; Phox2a ChIP-seq from mouse induced cortical motor neurons ( 47 ); MSX2 ChIP-seq from human trophoblast stem cells ( 48 ); eGFP-MSX2 ChIP-seq from the MCF7 cell line ( 49 ); and FLAG ChIP-seq performed on 3xFLAG-Barx1-EGFP infected immortalized E13.5 stomach mesenchymal cells ( 50 ). Howe v er, three of the datasets were difficult to interpret for different reasons: the Phox2a study in differentiated motor neurons co-expressed the Isl1 TF using doxy cy cline prior to ChIP-seq, making it difficult to distinguish between Phox2a homodimerization versus Phox2a / Isl1 heterodimerization (Supplementary Figure  S9I-K), whereas the two MSX2 datasets had low overall motif enrichment in called peaks (Supplementary Figure  S10A-H). In contrast, all the other in vivo datasets re v ealed significant enrichment for both the predicted monomer and dimer sites identified by the HT-SELEX data analysis. Comparati v e peak analysis largely re v ealed higher binding signal in peaks with dimer sites versus those that lacked dimer sites with the exception of BARX1 ( Figure 5 ; Supplementary Figure S9; Supplementary Figure S10). For example, the PHOX2B datasets provided clear evidence for both cooperati v e and non-cooperati v e sites in vivo with dimer and monomer site motifs heavily enriched in the center of most peaks called from PHOX2B ChIP performed on the KELLY cell line (Figure 5 E). In fact, ∼87% of the called peaks contained a dimer site and only ∼1% did not contain either site type, and those peaks with dimer sites had a higher binding signal than peaks with only monomer sites or peaks that lacked either type of site ( Figure 5 G). As mentioned above, PHOX2A, a close homologue to PHOX2B, can heterodimerize with ISL1 on a similar TAA T3NA TTA motif ( 47 ). Hence, to ensure that the increased binding signal and motif localization was not due to heterodimerization, we plotted the ISL1 ChIP-seq signal on the PHOX2B peaks ( Figure 5 G). This analysis re v ealed v ery little colocalization between these factors in this cell line, suggesting that these findings are independent of ISL1 heterodimerization and increasing the likelihood that the palindromic motif is enriched due to PHOX2B homodimerization. Further, COSMO analysis showed that a dimer site with 4mers 2bp apart (note, the PHOX2B dimer motif (AA TT2NA TTA) Figure 5. Genomic binding assays re v eal that predicted TFs bind cooperati v e dimer sites and non-cooperati v e monomer sites in vivo . ( A ) HT-SELEX re v ealed that GSX2 binds to a dimer site with TAAT 4mers spaced 7 bp apart. ( B ) Dimer motifs and the identified 4mers are highly prevalent in the centers of peaks called from mouse f orebrain CUT&RUN assa ys. ( C ) Peaks containing Gsx2 dimer sites had a higher binding signal than peaks containing only 4mer sites or neither site, consistent with dimer sites either being bound more frequently or with higher stability than monomer sites. ( D ) Dimer sites consisting of 4mers 7bp apart are highly enriched compared to 4mers spaced by alternate distances. ( E ) PHOX2B HT-SELEX data enriched for a dimer motif consisting of 4mers 2bp apart and a monomer ATTA site. ( F ) These motifs are highly prevalent at the peak centers and (G) peaks containing a dimer site have increased binding compared to peaks with only a monomer site or lacking both types of sites. ISL1 does not bind many of the PHOX2B bound regions, which is consistent with these dimer sites being bound by PHOX2B homodimers rather than PHOX2B-ISL1 heterodimers. ( H ) 4mers spaced 2bp apart occurred nearly twice as frequently as the next most frequent spacing, consistent with the spacer specificity seen in cooperativity.
identified by the Cooperativity Predictor is shifted by a single bp relati v e to the TAA T3NA TTA motif (Supplementary Figure S4B)) was highl y enriched, w hich is again in agreement with the results found in HT-SELEX ( Figure  5 H). Importantly, similar results were found across all three neuroblastoma cell lines (Supplementary Figure S9). Thus, these in vivo analyses are consistent with our in vitro predictions from the HT-SELEX data, and thereby provide further evidence that a subset of HD TFs utilize cooperati v e homodimer formation to enhance DNA binding specificity to AT-rich DNA sequences by binding sites of distinct spacer lengths. Supplementary Table S9 provides a summary of the computational pr edictions, EMSA r esults, and in vivo analyses results.

DISCUSSION
In this study, we evaluated the prevalence of cooperativity in the HD family. First, we designed the computational pipeline, Cooperativity Predictor, to identify cooperati v e HD TFs and their DNA dimer sites using only HT-SELEX data. Out of the 88 human and 24 mouse HDs analyzed, we predicted 15 human and 7 mouse HDs to exhibit cooperati v e behavior. We ne xt used quantitati v e EMSAs to calcula te the coopera tivity of 14 HDs on DNA, 9 of which wer e pr edicted to be cooperati v e and 5 of which were not predicted to bind in a cooperati v e manner. All but 1 of the tested HDs were correctly characterized by the Cooperativity Predictor: the 9 HD TFs predicted to be cooperati v e were confirmed to cooperati v ely bind dimer sites with constrained spacer lengths, whereas 4 of the 5 HD TFs predicted as non-cooperati v e were confirmed to bind in a noncooperati v e manner. Moreov er, the one e xception (Barhl2) was weakly cooperati v e on both the identified dimer site and the +1 probe, and thus the spacer-independent binding behavior of Barhl2 to DNA explains why it was not detected by the Cooperativity Predictor pipeline. Importantly, 6 of the confirmed cooperati v e TFs had not been previously shown to exhibit cooperative behavior to our knowledge. We subsequently used structural models and amino acid conservation to highlight the roles of tw o k ey HD positions that are likely to help discriminate between cooperati v e v ersus non-cooperati v e P air ed-like HDs. Lastly, we demonstra ted tha t the Gsx2 and PHOX2B factors bind to their predicted cooperati v e dimer sites in vivo through the analysis of available genomic data. Below, we highlight two key implications of these findings: First, we discuss the general utility of using HT-SELEX data and the Cooperativity Predictor to identify cooperati v e DNA binding sites. Second, we describe how the differential formation of cooperati v e HD dimers on binding sites with distinct spacing requirements can greatly enhance HD TF DNA specificity.

Streamlined approach to characterize HD cooperativity
Unbiased in vitro DNA binding assays such as HT-SELEX and PBMs followed by motif search algorithms have revolutionized our ability to systematically define TF binding sites. In the original HT-SELEX study, Jolma et al processed the HT-SELEX data to identify the most enriched 6-12 bp subsequences and then applied a multinomial method to generate models that incorporate dinucleotide base biases. Importantly, their studies re v ealed that a subset of TFs not only enriched for monomer sites, but also for dimer sites that suggested formation of homodimer complexes on specific DNA sequences ( 11 ). Since this data and analysis wer e r eleased in 2013, m ultiple other a pproaches have been applied, ranging from machine learning ( 24 , 51-53 ), deep learning ( 54 ) and complex algorithms (55)(56)(57). These approaches have typically focused on better defining monomer binding specificities or known dimer complex specificities such as those from the bZIP or bHLH families. Computa tional analyses tha t focused on coopera tivity hav e inv estiga ted heterodimer coopera tivity using both HT-SELEX ( 11 ) and consecuti v e-af finity-purifica tion SELEX (CAP-SELEX) data ( 58 ). For example, Rube et al. de v eloped a three-layer maximum likelihood model, ProBound, that combines distinct DNA binding assays to de v elop comprehensi v e TF binding models and quantify absolute binding affinities. The authors applied ProBound to discov er cooperati v e binding configurations and cooperativity contributions of individual factors through the analysis of CAP-SELEX and HT-SELEX datasets ( 24 ). Furthermore, Ibarra et al generated models to incorporate DN A sha pe, dinucleotide dependencies, and DNA sequence pr efer ence to predict the heterodimeric cooperativity of FOXO1 and ETS1 ( 59 ). They subsequently integrated the deri v ed cooperati v e kmers with ChIP-seq and ontology data to find a relationship between FOXO1-ETV6 cooperativity and chronic l ymphocytic leukemia, successfull y linking cooperativity to a disease state ( 59 ). It is important to note howe v er, that these analyses also r equir e a CAP-SELEX dataset to have been performed on the same TF with two different tags in the same experiment to assess homodimeric cooperativity. Out of the HDs analyzed here, the required HT-SELEX and CAP-SELEX datasets are only available for ALX4 and HOXB13. Ther efor e, Cooperativity Pr edictor distinguishes itself by assessing cooperativity using only HT-SELEX data.
Other studies have used datasets outside of HT-SELEX to identify dimer binding sites such as chromatin accessibility data ( 60 ), genomic binding assays ( 61 , 62 ), gene expression data ( 63 ), and transcriptional output assays ( 64 ). Howe v er, it is often difficult to dif ferentia te between homodimer binding versus heterodimer binding in genomic assays with low resolution as seen in our analysis of the Phox2a ChIPseq in dif ferentia ted motor neurons (Supplementary Figure  S9I-L). To circumvent this problem, a recent study developed a convolutional neural network on a genomic assay with single bp resolution, ChIP-nexus, to define the cooperativity syntax of four TFs ( 62 ). With the increased resolution, BPNet could dif ferentia te between heterodimeric and homodimeric binding, and the authors were able to identify spacer length dependent pr otein-to-pr otein interactions that were too subtle to be detected by previous methods ( 62 ). Unfortunately, there is limited genomic binding data available, especially at the bp resolution recommended for the described method, to broadly apply this approach to many TFs.
Despite these available tools and r esour ces, the field lacked a pipeline that identifies homodimer cooperati v e interactions in a streamlined manner from an assay with Nucleic Acids Research, 2023, Vol. 51, No. 12 6069 a bundant availa ble datasets. To address this need, we de v eloped a computational pipeline with se v eral methodological features optimized to identify HD TFs that exhibit cooperati v e behavior by tailoring already available bioinformatics tools and a ppl ying statistical tests specific to HT-SELEX data. First, we processed the data from all SELEX cycles as a single experiment, allowing us to use motif enrichment through the SELEX cycles as a measure of cooperativity. Second, as dimer sites tend to range between 12 and 18 bp in length, we specifically searched for longer enriched motifs (16 and 18 bp). Third, and most importantly, we employed two approaches based on independent ideas of cooperativity: (i) We compared the enrichment rates of dimer versus independent 4mers through the SELEX cycles as a predictor of cooperativity based on the idea that cooperati v e comple x es ar e mor e stable on DNA than monomer TF-DNA complexes. (ii) We assessed the specificity of spacer lengths between the top 4mer sequences because cooperati v e interactions typically r equir e dimer sites with a specific spacer length. Through this approach, we successfully identified se v eral ne w cooperati v e HDs and dimer binding motifs from only HT-SELEX data, se v eral of which were not predicted to bind dimer sites by other computational methods. Thus, in contrast to other more complex search algorithms that often focus on heterodimeric interactions, the Cooperativity Predictor algorithm accurately identifies homodimer interactions on constrained DNA binding sites using abundant available datasets from the HT-SELEX assay.
While we specifically designed the Cooperativity Predictor algorithm for HD TFs, future studies could focus on optimizing the algorithm to predict cooperati v e DNA binding for other TF families by modifying the site and spacer length r equir ements based on biochemical and structural data. For HD TFs, we used a 4mer length based on prior structural data that re v ealed HD proteins make key contacts with a 4bp core sequence (i.e. TAAT) ( 8 ). Howe v er, the length of the site could be shortened or extended based upon known DNA binding sequence r equir ements for each TF family. In addition, since 4mer sites are relati v ely short, we r equir ed the two 4mers to be at least 2 bp apart for HD TFs to help distinguish between long monomer sites versus dimer sites. This 2 bp spacer r equir ement means that the Cooperativity Predictor will fail to identify HD TFs that bind dimer sites without a spacer or only a 1 bp spacer. Howe v er, sites separated by 0 and 1 bp spacers have a total length of 8 and 9 bp, which would likely be detected using standard search algorithms such as the IniMotif subsequence search used by Jolma et al. ( 11 , 23 ). For other TF families, the spacer lengths between sites could be adjusted depending upon the expected differences in motif lengths of monomer versus dimer sites. As the Cooperativity Predictor classifies a dimer site using spacer information content and length, dimer sites with spacers that contain high information content or have flexible lengths will not be detected by the pipeline. Hence, the Cooperativity Predictor pipeline may underestimate the number of HD TFs capable of cooperati v e DNA binding. For example, we found that Barhl2, which was predicted to be non-cooperati v e, has weak cooperativity to dimer sites with fle xib le spacer lengths. Lastly, it should be noted that the dimer versus monomer motif enrichment module of the Cooperativity Predictor requires the TF to bind as both a monomer and dimer. We did not find this to be a significant limitation for analyzing HD factors because each was found to enrich for monomer sites.
Howe v er, if the TF is found to only bind DNA as a dimer, the COSMO module could be used to assess for distinct spacer lengths between binding sites.

HD cooperativity impact on DNA binding specificity
The HD family is one of the largest groups of TFs in metazoans, and HD factors regulate di v erse de v elopmental and homeostatic processes. Defining how proteins that bind highly similar DNA sequences in vitr o regula te dif ferent specialized processes has been a long-standing problem in the field of transcriptional r egulation. Incr easingly, studies hav e re v ealed that HD TFs increase their DNA binding specificity by forming homo-and heterodimer complexes on DNA. The r equir ement of two proteins binding distinct sites separated by a specific distance and orientation lengthens the DN A reco gnition sequence, and ther eby incr eases TF-DNA specificity via two mechanisms. (i) The likelihood of finding two sites at a specific spacer length at random is lower than the likelihood of finding a single site. This makes the TF's binding sites more specific and targeted. (ii) The enhanced cooperativity of a TF complex reduces the number of TFs that can effecti v ely compete for sites as only specific TFs cooperati v ely bind such dimer sites.
In our study, we identified se v eral ne w HDs that are capable of cooperatively binding on specific HD dimer motifs in a manner that increases DNA binding specificity. For example, while all 88 members of the HD family analyzed here bind highly similar AT rich monomer sequences, only BSX was able to cooperatively bind AT rich sites 4bp apart and only VENTX was able to cooperati v ely bind AT rich sites 9 bp apart ( Figure 2 ). Hence, BSX and VENTX would be predicted to have a selective advantage over other HD TFs to bind targets containing their respecti v e dimer motifs. Further comparisons between dimer motifs re v eals that in addition to spacer length differences, binding site orienta tion also dif fers between these motifs. For example, Gsx2 binds in a head-to-tail orientation ( 25 ), whereas the Paired crystal structure showed that this complex bound in a headto-head orientation ( 37 ). Lastly, our search algorithm both confirmed and expanded the number of P air ed-like HDs capab le of cooperati v ely forming homodimers on DNA, as we found that a pproximatel y 1 / 3 of P air ed-like HD factors selecti v ely enriched for the TAA T3NA TTA palindromic motif (Figure 2 ; Supplementary Figure S4B). Collecti v ely, these studies highlight how the selecti v e formation of cooperate HD complexes on constrained binding motifs allows for factors to bind di v erse recognition sequences despite containing highly similar HD DNA binding domains ca pable of largel y binding to the same monomer sequences.
A question that emerges from our studies is what key HD amino acid residues and / or additional domains are required for the selecti v e formation of cooperati v e homodimer complexes on DNA? Cooperati v e DNA binding typically results from additional pr otein-pr otein interactions between TFs that supplement the protein to DNA interactions to stabilize the complex on DNA and / or from a DNA conformation change induced by the binding of the first protein that increases stability of binding by the second protein ( 38 , 39 ). For example, the crystal structure of the Paired factor on its pr eferr ed dimer site demonstrated that HDs bend DNA and thereby position the adjacent protein so that they physically interact ( 37 ). These stabilizing interactions decrease the of f-ra te, w hich in turn raises TF-DN A affinity. Using sequence conservation between cooperati v e and non-cooperati v e paired-like factors and the existing P air ed-DNA homodimer structure as a model, we identified and tested the role of se v eral key residues within the HD that are likely to help explain differences in cooperati v e v ersus noncooperati v e DNA binding acti vity (Figure 4 ). In contrast, Gsx2 likely r equir es sequences outside the HD for optimal cooperati v e DNA binding, as prior studies have shown that the Gsx2 HD alone is less cooperati v e than a Gsx2 protein containing an additional 40 amino acids C-terminal of its HD ( 25 ). Less is known about the key amino acid sequences r equir ed f or the f ormation of the other cooperati v e HD comple xes. Howe v er, it is important to note that the proteins used in this study and in the HT-SELEX assay typically contained 20-30 amino acids flanking the HD. Thus, the cooperativity of these HD factors must be mediated via either direct HD-HD interactions or via nearby flanking residues. These limited regions appear to be dri v ers of cooperativity in most cases. For example, by comparing the HT-SELEX results between the TFs' DNA binding domains and full-length proteins, Jolma et al found that the binding specificities were similarly established by the truncated region in 78 of the 79 TFs tested ( 11 ). Moreover, we also found that the Cooperativity Predictor predicted like cooperati v e behavior between the tested shorter HD proteins and the full-length proteins in 19 out of 21 HD TFs (Supplementary Table S10). Thus, homodimeric cooperativity appears to be primarily mediated through the HD and flanking regions. Howe v er, future studies using sequence conservation, m utagenesis, and structural a pproaches will be needed to ascertain the likely di v erse mechanisms used by HD factors to mediate cooperati v e DNA binding.
Lastly, cooperativity has been shown to impact transcriptional output as well as DNA binding specificity. A recent massi v ely parallel reporter assay of the paired-like HD, CRX, re v ealed that enhancers with dimeric sites produce a stronger signal than those with monomeric sites ( 16 ). Moreover, the ALX factors appear to only generate transcriptional changes using dimer sites, in spite these proteins being able to bind monomer sites (Supplementary Figure S7A, B) ( 9 , 65 ). Taken together, these findings suggest that the added DNA binding affinity and / or specificity of CRX and ALX factors for their respecti v e dimer sites results in increased le v els of gene e xpression output. In contrast, prior studies re v ealed Gsx2 mediates opposing transcriptional outputs in a binding site dependent manner as Gsx2 binding to dimer sites was associated with gene stimulation whereas Gsx2 binding to monomer sites was associated with gene r epr ession ( 25 ). Mor eover, the authors tested mor e complex enhancer elements with different ratios of monomer to dimer sites to demonstrate how the higher affinity dimer sites could lead to gene stimulation at low Gsx2 le v els, whereas high le v els of Gsx2 filled both monomer and dimer sites and led to transcriptional r epr ession ( 25 ). These find-ings highlight how the differential use of cooperati v e v ersus non-cooperati v e sites can result in distinct transcriptional outputs in a TF concentration and binding site dependent manner. Even with these defined mechanisms, the extent of cooperativity's role in gene r egulation discover ed thus far is most likely understated, and the manner and scope in which these transcriptional changes relate to disease states have yet to be inv estigated. Here, we e valuated the prevalence of cooperati v e behavior in the HD family, howe v er further investigation will be needed to understand the extent in which cooperativity is altering DNA binding specificity , affinity , and transcriptional outputs to impact gene regulation.