Diverse Eukaryotic CGG-Binding Proteins Produced by Independent Domestications of hAT Transposons

Abstract The human transcription factor (TF) CGGBP1 (CGG-binding protein) is conserved only in amniotes and is believed to derive from the zf-BED and Hermes transposase DNA-binding domains (DBDs) of a hAT DNA transposon. Here, we show that sequence-specific DNA-binding proteins with this bipartite domain structure have resulted from dozens of independent hAT domestications in different eukaryotic lineages. CGGBPs display a wide range of sequence specificity, usually including preferences for CGG or CGC trinucleotides, whereas some bind AT-rich motifs. The CGGBPs are almost entirely nonsyntenic, and their protein sequences, DNA-binding motifs, and patterns of presence or absence in genomes are uncharacteristic of ancestry via speciation. At least eight CGGBPs in the coelacanth Latimeria chalumnae bind distinct motifs, and the expression of the corresponding genes varies considerably across tissues, suggesting tissue-restricted function.

Transposons can contribute to the evolution of regulatory networks by providing novel DNA and, in some cases such as the hAT family of DNA transposons, by being co-opted as sequence-specific DNA-binding proteins (DBPs) such as transcription factors (TFs) (Feschotte and Pritham 2007; Atkinson 2015). The hAT N-terminal region encodes a BED zinc finger (zf-BED) and the Hermes transposon DNA-binding and protein dimerization domain, which are involved in sequence-specific DNA binding to both subterminal repeat sequences and inverted terminal repeats (Hickman et al. 2005; Hickman et al. 2014). The ZBED family of vertebrate TFs is an example of hAT domestication, having arisen from two different exaptations (Hayward et al. 2013). Human CGGBP1 provides another example; it was first identified in a survey of proteins that bind to the Fragile X (CGG)n repeat expansion (Deissler et al. 1996, 1997), and it bears remote homology to the N-terminal region of hATs (Singh and Westermark 2015).
Transposons often evolve quickly (Khan et al. 2006; Feschotte and Pritham 2007; Kofler et al. 2015; Arkhipova 2017), and their domain models, which are based on amino acid (AA) sequence alignments, may not generalize across all instances of a transposon family. The same is presumably true of host proteins derived from transposons. Indeed, no significant domains are detected in CGGBP1 by Pfam and SMART (Letunic and Bork 2018) database searches, even though CGGBP1 contains an apparent zf-BED and Hermes DNA-binding domain (DBD). The Panther and InterPro (Mitchell et al. 2019) databases do contain a CGGBP1 superfamily definition, however.
Intriguingly, this domain model identifies high-scoring host sequences (CGGBPs) in several bony fish, the sea lamprey Petromyzon marinus, as well as several insects and fungi. The CGGBP1 domain model also identifies 29/824 hATs in the RepeatMasker database, mainly in the hAT-19 subclass (13/16), suggesting that the CGGBP1 domain model corresponds only to specific hAT subtypes. There is considerable overlap between species containing hAT-19 RepeatMasker hits and those with host genes encoding CGGBPs (supplementary table S1, Supplementary Material online). Strikingly, the genome of the coelacanth Latimeria chalumnae includes 62 CGGBP genes, with 47 containing the full zf-BED and Hermes DBDs (Amemiya et al. 2013;Yates et al. 2020).
Apart from CGGBP1 itself, these CGGBPs are largely unstudied; their only documentation is the presence in various databases of proteins that match the CGGBP1 domain model. Here, we present several lines of evidence indicating that the nonamniote CGGBPs represent a previously undocumented class of bona fide DBPs that have arisen many times through independent hAT integrations.

CGGBPs Possess Diverse DNA Sequence Preferences
Determining what sequences DBPs bind is often a starting point in their characterization (Lambert et al. 2018). We queried the sequence specificity of 101 host CGGBPs (from Ensembl; Yates et al. 2020) and five CGGBP-like hAT proteins, using two independent universal protein-binding microarray (PBM) designs (Berger et al. 2006). The two arrays ("ME" and "HK") (Badis et al. 2008) each contain all possible DNA 10-mers, but each is embedded within a completely different 35-mer relative to the other array. The CGGBPs were selected on the basis of sequence diversity (supplementary fig. S1, Supplementary Material online). As the vast majority contain introns and/or are truncated (<200 AAs), they are likely to be transcribed as mRNA and are not functional hATs (supplementary fig. S3, Supplementary Material online). PBM data were processed to obtain estimates of significance of binding to 8-base sequences (represented by multiple spots on the array), known as 8-mer "E-scores" (Berger et al. 2006), as well as motifs that summarize the top-scoring 8-mers (Ray et al. 2013).

© The Author(s) 2021. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
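The rank-based idea behind an 8-mer E-score can be illustrated with a small sketch. The code below is a simplified stand-in, not the published procedure: it computes an area-under-the-curve style enrichment (shifted to run from -0.5 to 0.5) comparing probes that contain a given 8-mer, on either strand, with probes that do not. The function names, the toy probe sequences, and the intensities are all hypothetical.

```python
from typing import List, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def contains_kmer(probe: str, kmer: str) -> bool:
    """True if the probe contains the k-mer on either strand."""
    return kmer in probe or reverse_complement(kmer) in probe


def rank_enrichment(probes: List[Tuple[str, float]], kmer: str) -> float:
    """Simplified enrichment: P(foreground > background) - 0.5.

    Foreground = probes containing the k-mer; background = all others.
    Ties count as half a win. This omits refinements of the published
    E-score and is meant only to show the rank-based principle.
    """
    fg = [v for seq, v in probes if contains_kmer(seq, kmer)]
    bg = [v for seq, v in probes if not contains_kmer(seq, kmer)]
    if not fg or not bg:
        raise ValueError("need at least one foreground and one background probe")
    wins = 0.0
    for f in fg:
        for b in bg:
            if f > b:
                wins += 1.0
            elif f == b:
                wins += 0.5
    return wins / (len(fg) * len(bg)) - 0.5


# Hypothetical toy data: (probe sequence, signal intensity).
toy_probes = [
    ("AACGGCGGCGTT", 9.0),  # contains CGGCGGCG
    ("TTCGCCGCCGAA", 8.0),  # contains its reverse complement CGCCGCCG
    ("ATATATATATAT", 1.0),
    ("GATTACAGATTA", 2.0),
    ("TTTTAAAACCCC", 3.0),
]
```

With these toy values, `rank_enrichment(toy_probes, "CGGCGGCG")` reaches the maximum of 0.5 because every probe carrying the 8-mer outranks every probe lacking it.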
Eighteen of the CGGBPs and four hAT proteins (including variants of hAT-19_Crp) displayed clear sequence preferences, that is, a set of high-scoring 8-mers with related sequences that were shared between the two array designs and yielded a clear motif. These 22 proteins all had relatively long N-terminal extensions (and thus a longer linker between the GST tag and the CGGBP domain), and most contained both Cys and His residues that complex zinc in the zf-BED, a leucine two positions N-terminal to the zf-BED, and two prolines that are widely conserved in the Hermes domain, suggesting that these residues are required for DNA sequence specificity. Supplementary figure S2, Supplementary Material online, contains an alignment of all CGGBPs tested, with key residues annotated. Conditioned on having five of the seven conserved residues and a 14-AA minimum linker, 36% (22/61) of the constructs were successful, consistent with previous results from many bona fide DBD classes (Weirauch et al. 2014).

In figure 1, the scores obtained for individual 8-mers are clustered. The groups formed, therefore, reflect only the DNA-binding data, irrespective of the protein sequences or the genomes in which they are found. In figure 2, the protein phylogram reflects the AA replacement rate, with DNA-binding motifs shown as sequence logos. The groups of closely related CGGBPs in figure 2, therefore, presumably reflect recent shared ancestry.
Several observations support the accuracy of the PBM data. Human CGGBP1 showed the highest preference for CGGCGG-containing sequences, reflecting how it was initially identified (Deissler et al. 1997). In addition, the DNA-binding motifs for all four hAT-encoded CGGBPs have high-scoring matches within subterminal regions of the corresponding Repbase model (supplementary fig. S3, Supplementary Material online). Also, in several cases, there are clear similarities between DNA-binding motifs obtained from CGGBPs with related protein sequences, even from distantly related genomes. The coelacanth gene Coel_14934 and the coelacanth hAT-13_LCh transposon encode similar protein sequences and yield very similar DNA-binding motifs, suggesting that Coel_14934 may be a recently co-opted hAT-13_LCh copy. Sea lamprey, which is >515 Ma distant (Kumar et al. 2017), also contains a related CGGBP with similar sequence specificity. A similar relationship is found between the CGGBPs from kanglang fish and climbing perch, which diverged >200 Ma, and the CGGBPs from spiny chromis and the killifish Austrofundulus limnaeus, which diverged >80 Ma.
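Locating high-scoring motif matches in a transposon's subterminal region amounts to scanning a sequence with a position weight matrix on both strands. The sketch below shows one generic way to do this. The log-odds conversion, the pseudocount, the uniform background, and the toy CGG frequency matrix are illustrative assumptions, not the scoring used in the study.

```python
from math import log2
from typing import Dict, List, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def log_odds(pfm: List[Dict[str, float]],
             background: float = 0.25,
             pseudo: float = 0.01) -> List[Dict[str, float]]:
    """Convert per-position base frequencies to log2 odds vs. a uniform background."""
    return [
        {b: log2(((col.get(b, 0.0) + pseudo) / (1.0 + 4.0 * pseudo)) / background)
         for b in "ACGT"}
        for col in pfm
    ]


def best_match(seq: str, pwm: List[Dict[str, float]]) -> Tuple[float, int, str]:
    """Return (score, start, strand) of the best-scoring window.

    For the '-' strand, start is a 0-based index into the reverse complement.
    The sequence must contain only A, C, G, and T.
    """
    width = len(pwm)
    if len(seq) < width:
        raise ValueError("sequence is shorter than the motif")
    best = None
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for i in range(len(s) - width + 1):
            score = sum(pwm[j][s[i + j]] for j in range(width))
            if best is None or score > best[0]:
                best = (score, i, strand)
    return best


# Hypothetical three-column frequency matrix favoring C, G, G.
toy_pwm = log_odds([{"C": 1.0}, {"G": 1.0}, {"G": 1.0}])
```

Scanning both strands matters because a DNA-binding protein can recognize its site in either orientation; the toy matrix finds CGG directly in `ATATCGGATA` and via the reverse complement in `ATATCCGATA`.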
Most CGGBPs possess unique DNA sequence specificity (figs. 1 and 2): the 8-mer preferences from the ME and HK arrays for each CGGBP correlate more highly with each other than with those of any other protein (fig. 1). The only clear exceptions are the human and chicken CGGBP1 orthologs, and the three single-residue variants of crocodile hAT-19_Crp. Most of the motifs contain a CGG trinucleotide, but several of the coelacanth CGGBPs in fact prefer GCGC-containing sequences, whereas AT-rich motifs were obtained from hAT transposons in the insects Rhodnius prolixus (kissing bug) and Locusta migratoria (migratory locust), and from the one CGGBP-containing protein in the fungus Rhizophagus irregularis (fig. 2). Thus, the CGGBPs display diverse sequence specificity signatures. Similarities among CGGBPs in DNA sequence specificity and AA sequence often deviate from what would be expected from the species tree (figs. 1 and 2). To explore this issue more closely, we examined orthology relationships among the CGGBPs.
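The uniqueness criterion described above, that a protein's ME profile correlates best with its own HK profile, can be written as a short check. The following is a minimal sketch assuming 8-mer scores for each protein are stored as equal-length lists in the same 8-mer order. The use of Pearson correlation, the function names, and the toy values are assumptions made for illustration.

```python
from math import sqrt
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length score vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have equal length of at least 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("constant vector has undefined correlation")
    return cov / sqrt(vx * vy)


def self_is_best_match(me: Dict[str, List[float]],
                       hk: Dict[str, List[float]]) -> Dict[str, bool]:
    """For each protein, is its own HK profile the best match to its ME profile?"""
    result = {}
    for protein, me_scores in me.items():
        best = max(hk, key=lambda other: pearson(me_scores, hk[other]))
        result[protein] = (best == protein)
    return result


# Hypothetical scores over the same four 8-mers on each array design.
toy_me = {"protA": [0.45, 0.40, 0.05, -0.10], "protB": [-0.05, 0.02, 0.42, 0.47]}
toy_hk = {"protA": [0.44, 0.38, 0.02, -0.05], "protB": [0.00, -0.03, 0.40, 0.45]}
```

Proteins for which the check returns False would correspond to cases such as orthologs or near-identical variants, whose profiles are expected to be interchangeable.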

Lack of Synteny among Nonamniote CGGBP-Containing Proteins Indicates Distinct hAT Domestications
To ask whether the CGGBPs are orthologs (i.e., related by common ancestry in the host genome), we examined the homology of nearby genes at their chromosomal locations, because common descent is often reflected in shared ordering of genes (synteny). Gene order is typically preserved between human and bony fish, for example (Jaillon et al. 2004).
In contrast, few of the CGGBPs are found at syntenic positions (based on the presence of at least two nearby homologs), even within bony fish (fig. 3). Three coelacanth proteins are adjacent on the chromosome, however (shown in fig. 3; all others are on different scaffolds), and these proteins have similar sequences, consistent with tandem duplication following initial integration. Two of these three yielded motifs (fig. 2), with no high-scoring 8-mers in common (fig. 1), suggesting that diversification in DNA sequence specificity may have promoted retention following duplication.
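The synteny criterion used here (at least two nearby homologs shared between loci) can be expressed as a simple set comparison. In this sketch each locus is a list of homology-family labels in chromosomal order. The window size, the labels, and the function names are hypothetical, since the text does not specify how "nearby" was defined.

```python
from typing import List, Set


def flanking_families(gene_order: List[str], focal: str, window: int = 5) -> Set[str]:
    """Families of genes within `window` positions of the focal gene.

    The focal label itself is excluded. If the focal label occurs more
    than once, the first occurrence is used.
    """
    i = gene_order.index(focal)
    neighbors = gene_order[max(0, i - window):i] + gene_order[i + 1:i + 1 + window]
    return set(neighbors) - {focal}


def is_syntenic(order_a: List[str], order_b: List[str], focal: str = "CGGBP",
                window: int = 5, min_shared: int = 2) -> bool:
    """True if the two loci share at least `min_shared` flanking gene families."""
    shared = (flanking_families(order_a, focal, window)
              & flanking_families(order_b, focal, window))
    return len(shared) >= min_shared


# Hypothetical loci: family labels in chromosomal order around a CGGBP gene.
locus_a = ["fam1", "fam2", "CGGBP", "fam3", "fam4"]
locus_b = ["fam9", "fam2", "CGGBP", "fam3", "fam8"]
locus_c = ["fam7", "CGGBP", "fam3", "fam6"]
```

Here `locus_a` and `locus_b` share two flanking families and would be called syntenic, whereas `locus_a` and `locus_c` share only one and would not.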
Lack of synteny is consistent with the CGGBPs originating from distinct hAT integration events. Several other observations are consistent with this notion. One is that the intron/exon structures of the CGGBPs differ from each other (supplementary fig. S4, Supplementary Material online). Another is that CGGBPs show a spotty distribution, absent from many (and perhaps most) fish and insect genomes, and apparently most fungi, in a pattern that does not strictly reflect established clades (supplementary table S1, Supplementary Material online). Likewise, as noted above, protein sequence relationships among CGGBPs often deviate from the species tree, as do CGGBP DNA sequence specificities (figs. 1 and 2). Coelacanth CGGBPs, for example, are distributed throughout trees of both CGGBP AA sequence similarity (fig. 2) and DNA-binding similarity (fig. 1), suggesting that they may derive from different hAT subtypes. In another example, CGGBPs of the spiny chromis and the killifish are similar to each other and bind very similar motifs, but their AA sequences are more related to CGGBPs from kissing bug and migratory locust than to those of other fish, suggesting that they may have originated from horizontal transfer of related hATs. Likewise, the CGGBP from yellow fever mosquito is more closely related to crocodile hAT-19_Crp than it is to the hAT-19 proteins we analyzed from other insects, and these two proteins also have similar DNA sequence specificity.

Varied Expression of Coelacanth CGGBPs Suggests Tissue-Restricted Function
Collectively, these observations indicate that many independent domestications of the hAT CGGBP domain have spawned a relatively large family of DBPs with distinct sequence preferences, which are present in a variety of fish and insects and some fungi. In particular, the coelacanth CGGBPs have highly disparate sequences, relative to other CGGBPs, and we speculate that they originate from different hAT-19 families, or other related hATs. It is known that DNA transposons have been recently active in coelacanth (Naville et al. 2014), and indeed, relatively few of the 62 CGGBPs present in L. chalumnae can be found in the transcriptome shotgun assembly of the other extant coelacanth species, L. menadoensis, which diverged ~22 Ma from L. chalumnae (Amemiya et al. 2013); we identified only eight CGGBPs using reciprocal BLAST and scanning with the Panther HMM.
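A reciprocal BLAST comparison of the kind mentioned above is commonly implemented as a reciprocal-best-hit filter. The sketch below assumes that search results have already been parsed into (query, subject, bit score) tuples. The identifiers and scores are invented, and the exact thresholds and criteria used in the study are not stated, so none are applied here.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, str, float]  # (query ID, subject ID, bit score)


def best_hits(hits: List[Hit]) -> Dict[str, str]:
    """Map each query to its single highest-scoring subject."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: List[Hit], b_vs_a: List[Hit]) -> List[Tuple[str, str]]:
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical parsed search results between two species' protein sets.
toy_a_vs_b = [("Lc1", "Lm1", 250.0), ("Lc1", "Lm2", 90.0), ("Lc2", "Lm1", 120.0)]
toy_b_vs_a = [("Lm1", "Lc1", 245.0), ("Lm2", "Lc1", 88.0)]
```

In the toy data only the pair (`Lc1`, `Lm1`) survives: `Lc2` also hits `Lm1`, but `Lm1` points back to `Lc1`, so that one-directional match is discarded.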
Lack of orthology presents a challenge in examining evolutionary pressures on the CGGBPs. The fact that the nonamniote CGGBPs are found entirely in nonmodel species also complicates functional analysis. Nonetheless, the L. chalumnae CGGBPs are often expressed in different tissues from each other (supplementary fig. S5, Supplementary Material online), suggesting that their functions may be tissue restricted in addition to being differentiated by distinct binding specificities.
Finally, the fact that the CGGBP domain model represents only a small subset of hATs, mainly the hAT-19 subclass, raises two intriguing possibilities. First, the observation that both the CGGBPs and hAT-19 are found mainly among a few disparate classes of eukaryotes suggests horizontal movement of hATs between very different branches, even though these classes (e.g., Rhizophagus and bony fish) inhabit very different environments. Second, the narrow coverage of the CGGBP domain model among hATs (29/824) raises the possibility that domain models corresponding to the same regions of other hAT subclasses may uncover additional classes of host DBPs.

Materials and Methods
We designed and built expression constructs for the 114 CGGBP proteins to include the entire domain plus 50 flanking AAs on both sides (unless the end of the protein was reached), fused to an N-terminal GST tag and driven by a T7 promoter. Three proteins with unknown AA residues (X) were run as all possible variants consistent with corresponding codon sequences. PBM data were generated as described (Weirauch et al. 2014), with motifs derived using Top10AlignZ (Ray et al. 2013).

Fig. 3. Synteny of CGGBP-containing loci. Orthologous loci are displayed as boxes along a bold line, and relationships between homologous genes indicated by thin connecting lines. All species with a sequence-specific CGGBP are included, as well as two fish with non-DBPs for comparison. Gene order was obtained from Ensembl or NCBI gene tracks. Nonhomologous intervening genes have been removed, with two diagonal lines indicating such gaps, and compared with orthologous loci in other species. For species with multiple paralogous CGGBPs, the name/accession of the visualized CGGBP is given. An asterisk is shown next to the name of species for which the CGGBP loci display no synteny, but which do contain conserved genes. Species tree from TimeTree (Kumar et al. 2017).