TnpAREP and REP sequences dissemination in bacterial genomes: REP recognition determinants

Abstract REP, diverse palindromic DNA sequences found at high copy number in many bacterial genomes, have been attributed important roles in cell physiology but their dissemination mechanisms are poorly understood. They might represent non-autonomous transposable elements mobilizable by TnpAREP, the first prokaryotic domesticated transposase associated with REP. TnpAREP, fundamentally different from classical transposases, are members of the HuH superfamily and closely related to the transposases of the IS200/IS605 family. We previously showed that Escherichia coli TnpAREP processes cognate single stranded REP in vitro and that this activity requires the integrity of the REP structure, in particular imperfect palindromes interrupted by a bulge and preceded by a conserved DNA motif. A second group of REPs rather carry perfect palindromes, raising questions about how the latter are recognized by their cognate TnpAREP. To get insight into the importance of REP structural and sequence determinants in these two groups, we developed an in vitro activity assay coupled to a mutational analysis for three different TnpAREP/REP duos via a SELEX approach. We also tackled the question of how the cleavage site is selected. This study revealed that two TnpAREP groups have co-evolved with their cognate REPs and use different strategies to recognize their REP substrates.


INTRODUCTION
Although bacterial genomes are small and compact compared to their eukaryotic counterparts, they harbor multiple repeated sequences playing various functions (for review see (1,2)). Among them, REP elements (for Repetitive Extragenic Palindrome) are small palindromic sequences of 20-50 nts preceded by a conserved tetranucleotide, most often GTAG. REPs are present in great numbers, mostly in intergenic regions of bacterial genomes: about six hundred in the Escherichia coli K12 genome or thousands of copies in some Pseudomonas strains. They are often organized in BIMEs (for Bacterial Interspersed Multiple Elements). These structures combine two REPs in inverse orientation, REP and inverted REP (iREP), separated by a variable linker and frequently found as consecutive tandem copies. Various cellular functions have been attributed to REP/BIME in the structuring and plasticity of the genome, or in the regulation of gene expression at transcriptional, post-transcriptional levels, and in the regulation of stress response (3)(4)(5)(6)(7)(8)(9)(10).
A tnpA REP gene was described to be associated with REPs (11) in its immediate proximity in structures called REPtrons (23) (see examples in Figure 1). For simplicity later on in the text, we will refer to REPtron as to a given encoded protein TnpA REP and the ensemble of cognate REPs. It is important to note that the majority of REPs are generally distributed genome-wide but a given tnpA REP exists in most cases as a single copy and there is no evidence of tnpA REP mobility. While the presence of tnpA REP is often found to be correlated with the abundance of REPs in a given genome (11,12), tnpA REP behavior, based on several criteria (copy number per replicon, presence on plasmids, duplication rates) resembles more housekeeping genes than transposase genes (13). TnpA REP has thus been proposed to be a domesticated transposase mobilizing REPs over bacterial genomes. However, the underlying dissemination mechanism remains to be elucidated.
TnpA REP are members of the HuH recombinase superfamily, which includes Rep proteins (rolling circle replication, RCR; not to be confused with REP), relaxases (conjugative transfer) and certain transposases (Helitrons, IS91/ISCR and IS200/IS605 families). All these proteins cleave and join DNA and carry the characteristic HuH motif (histidine-hydrophobic residue-histidine), which is crucial for coordinating a metal ion. The metal ion is essential for the nucleophilic attack by the characteristic catalytic Tyr residue, generating a covalent 5′ P-tyrosine intermediate and a free 3′ OH after DNA cleavage. The 3′ OH extremity can then serve as a primer for RCR, or act as a nucleophile to attack the P-tyrosine bond and resolve it (for more details see (14)). TnpA REP , while constituting a separate family, are closely related to the transposases of the IS200/IS605 family (TnpA IS200/IS605 ) of bacterial insertion sequences (IS), whose members are bordered by palindromic ends (for review see (15)). Transposition of IS200/IS605 elements occurs on single-stranded (ss) DNA and is strand-specific (16)(17)(18). Moreover, IS200/IS605 cleavage sites are chosen via a peculiar DNA-DNA complementarity between the cleavage sites and the respective 'guide' sequences located 5′ to each palindrome (19) (example of the model element IS608 in Supplementary Figure S1).
Nucleic Acids Research, 2021
Figure 1 (legend, partial). In the presented REPtrons, tnpA REP (bold arrow) is represented in blue and red in group 2 and 3, respectively. REP and iREP structures are represented in blue/grey (group 2) and red/orange (group 3), respectively; purple and green bold lines indicate the GTAG motif and the complementary sequence CTAC, respectively. In the REP detailed structures, the GTAG is boxed in purple. Blue ovals represent irregularities in the group 2 REPs stem. This colour code is maintained throughout the text. For simplicity, in REPtron Ec, y, z1 and z2 REPs are presented without distinction.
tnpA REP genes have been found in about 25% of all bacterial species (13). They are present largely in γ-proteobacteria, but also exist in other, distant genera. Based on their protein sequences, TnpA REP can be classified into several groups. In particular, groups 2 and 3 (13) (also called groups 2.2 and 2.5, respectively (20)) are associated with the first and best described REPs (7,11,12,21,22). Group 2 mostly includes TnpA REP from different enterobacteria, while group 3 mainly comprises members from Pseudomonas species.
These two TnpA REP groups are associated with two types of REPs. Group 2 TnpA REP are associated with long REPs interrupted by an irregular zone/bulge in their stems (Figure 1A), whereas group 3 TnpA REP are associated with REPs carrying perfect palindromes (Figure 1B). TnpA REP from E. coli (TnpA Ec ) is the sole TnpA REP for which experimental studies of interactions with REP substrates have been performed. We have previously shown that TnpA Ec specifically recognizes ss REP (but not iREP) and catalyzes its cleavage and recombination in vitro. Cleavage occurs at the dinucleotide CT situated 5′ or 3′ to the REP structure (23). The conserved tetranucleotide GTAG is crucial for this activity. Consistent with this functional role, the GTAG motif forms contacts with several TnpA REP residues, as shown in the co-crystal (24) (see Figure 6B bottom). E. coli REPs (REP Ec ) include two conserved mismatches that form a bulge within the REP stem (Figure 1A). This bulge is required for activity, since compensatory mutations restoring a regular stem eliminated activity. Although these analyses helped to shed light on the importance of the conserved tetranucleotide GTAG and the bulge in REP Ec recognition by TnpA Ec , the role of other components (loop, stem) remained ambiguous. How group 3 TnpA REP recognize their perfect palindromic REPs, as well as how the cleavage site is selected, remains to be elucidated.
Here, to further decipher TnpA REP activity, we developed a sensitive in vitro activity assay, CST (for Cleavage and Strand Transfer), to detect and map REP cleavage sites, and then adapted it to a CST-based SELEX. Combining this robust approach with a mutational analysis allowed us to re-examine the importance of different structural features in REP recognition by group 2 TnpA REP . In parallel, we extended the analysis to group 3, for which no data were available, and tackled the question of cleavage site selection in this group. We show that each group uses a different strategy to recognize its REP substrates and demonstrate the role of the GTAG motif in cleavage site selection for a group 3 member. These results represent considerable progress in understanding the distinct mechanism of TnpA REP -mediated mobility and the specificity of these expanding elements, and lead us to discuss potential evolutionary routes of REPtrons.

TnpA REP purification
TnpA Ec -His6 was purified as previously described (23). TnpA Mb and TnpA Sm coding sequences were synthesized and cloned in a suitable expression vector under control of an arabinose promoter. TnpA Mb and TnpA Sm were purified by affinity as N-terminal Strep-tag fusion proteins. The corresponding proteins were expressed in the E. coli K12 strain Rosetta (DE3) (Novagen). A preculture grown at 37 °C in L broth containing Amp was diluted 50-fold into the same medium at 30 °C. Protein expression was induced at OD600 = 0.5-0.6 by adding arabinose to 0.8% final concentration. After 3 h, bacteria were centrifuged and the pellet was resuspended in buffer NP (50 mM phosphate buffer (NaH2PO4 and Na2HPO4) pH 8, 400 mM NaCl, 0.2% Triton, 10% glycerol, 1 mM DTT) supplemented with 1 mg/ml lysozyme and EDTA-free protease inhibitor cocktail (Roche). Bacteria were sonicated and the lysate was cleared by centrifugation. The supernatant was then mixed with Strep-Tactin Superflow Plus resin (Qiagen) for 2 h at 4 °C. After washes in buffer NP, the proteins were eluted in buffer NPD (50 mM phosphate buffer (NaH2PO4 and Na2HPO4) pH 8, 400 mM NaCl, 0.2% Triton, 10% glycerol, 1 mM DTT, 2.5 mM desthiobiotin). An additional purification step was performed using a Superdex 200 column (HiLoad 16/60, GE Healthcare). The samples were then dialysed against 25 mM HEPES pH 7.5, 400 mM NaCl, 1 mM EDTA, 1 mM DTT and 20% glycerol and stored at -80 °C.

CST-test on circular substrates in vitro
Proteins and substrates were incubated together for 45 min at 37 °C in the reaction buffer in a final volume of 10 µl containing 50 ng of ∼4 kb pBluescript SK-derivative ss phagemid circular substrate, 0.5 µg of poly-dIdC and 1.5 µM TnpA REP . Then, 3 µl of a 10 µM stock of attacking oligonucleotide B457 were added and incubation continued for 30 min. The reaction was stopped and de-proteinized by adding an equal volume of 25 mM EDTA, 0.6 mg/ml Proteinase K and 2% SDS and incubating for 1 h at 37 °C. Products were purified on Promega columns (Wizard SV Gel and PCR) and subsequently served as templates for PCR amplification with GoTaq polymerase using B457 and a Cy5 or Cy3 substrate-specific fluorescent primer (98 °C, 2 min, 30× (98 °C 30 s, 56 °C 30 s, 72 °C 30 s)). PCR products were separated on an 8% native polyacrylamide gel and revealed by scanning on a GE Healthcare Typhoon Trio Imager.

CST-based SELEX
1 µl of 1 µM degenerate substrates (Eurofins Genomics) was incubated with the corresponding TnpA REP in the standard reaction mixture for 45 min at 37 °C. The following steps were as described for CST. Amplification was carried out with 457 or other attacking primers and 321, common to all substrates. After sequencing with 321, ss substrates were prepared for the next round by asymmetric PCR with Phusion polymerase using 0.1 µM 321 and 10 µM of the corresponding attacking primer (98 °C 30 s, 45× (98 °C 10 s, 56 °C 10 s, 72 °C 10 s)). The quantification procedure is described in detail in Supplementary Materials and in Supplementary Figure S3C.

Experimental REPtron models
In this study, we focused on E. coli MG1655 REPtron (called Ec) as principal model for the group 2 ( Figure 1A, top). E. coli MG1655 genome harbors 3 types of REP: y (35nts), z1 (29nts) and z2 (37nts) often combined in BIME as mosaics of y-z1 or y-z2 REPs at multiple loci in the genome (7). The three REP Ec types are imperfect palindromes preceded by the characteristic GTAG and can form stem-loop structures interrupted by a conserved AA-GC mismatch forming a bulge, and certain unpaired bases in the loop. In addition, they share several conserved positions in the stem (Supplementary Figure S2B).
To investigate the group 3 TnpA REP /REP, several TnpA REP candidates were tested for their expression and solubility in E. coli. We chose the Stenotrophomonas maltophilia K279a and Marinomonas sp. MWYL1 genomes (Figure 1B). Organisms of this group often host several REPtrons and carry hundreds of REPs in their genomes (12). Furthermore, Stenotrophomonads are omnipresent environmental bacteria often present in the soil, and S. maltophilia is an opportunistic pathogen commonly associated with hospital-acquired infections. A phylogenetic analysis of REP distribution in a S. maltophilia collection has pointed out the dynamic character of the REP/BIME distribution in these genomes, suggesting an ongoing proliferation process (12). We chose to study Sm, one of the REPtrons in the S. maltophilia K279a strain. REPtron Sm carries perfect palindromic REPs (REP Sm ) of 16 nts interrupted by 3 nts and directly preceded by the conserved tetranucleotide GTAG (Figure 1B).
The Marinomonas sp. MWYL1 genome carries a group 3 REPtron Mb (Figure 1B, bottom) and also a group 2 REPtron Ma (Figure 1A, bottom and see below). REPtron Mb comprises small perfect palindromic REPs (REP Mb ) of 10 nts, interrupted by 4 nts and separated by 2 bases from the GTAG tetranucleotide. Interestingly, in contrast to the general genome-wide distribution, for both REPtron Ma and REPtron Mb the physical association between tnpA REP genes and REPs is quite pronounced (11). REP Ma and REP Mb are concentrated in proximity to tnpA Ma and tnpA Mb , suggesting that the arrival of these REPtrons was recent and that the corresponding REP copies subsequently multiplied in their vicinity.
We concentrated our analyses principally on REPtrons Ec, Sm and Mb. The three purified TnpA REP were then used to examine their activities on their cognate REPs. The study on the group 2 was also supplemented by activity tests performed with TnpA Ec on group 2 REP Ma from Marinomonas sp. MWYL1 genome ( Figure 1A, bottom). REPtron Ma includes two types of long REPs (REP Ma1 and REP Ma2 of 42 and 38 nts) with different irregularities in the stems followed by large loops for which an alignment showed few conserved positions (Supplementary Figure S2C).

Cleavage and strand transfer assay (CST)
We previously showed that TnpA Ec is capable of cleaving and recombining ss REP Ec in vitro (23). Cleavages occur 5′ or 3′ of REP substrates at a dinucleotide C|T. To better understand the REP mobility mechanism, we developed an activity assay called CST (Cleavage-Strand Transfer). The CST assay takes advantage of the general property of HuH enzymes, which form a 5′ P-tyrosine link and a 3′ OH extremity upon cleavage (14) (Figure 2A3). The 3′-OH can then be used in different ways. Upon cleavage by Rep proteins (single-stranded phages and RCR plasmids) and conjugative relaxases, the 3′-OH group can serve to prime replication. The 3′-OH can also act as the nucleophile for strand transfer to resolve the 5′ P-tyrosine link in the termination step of RCR replication, conjugative transfer and transposition. Both possibilities might be exploited to disseminate REP/BIME sequences (23).
The CST assay was first developed with the REPtron Ec (Figure 2). After incubation of ss REP substrates with TnpA Ec in a reaction buffer allowing cleavages to occur (Figure 2A2-3), an excess of an 'attacking' oligonucleotide is added and incubation is continued. The 3′ OH end of the 'attacking' oligonucleotide can then attack the 5′ P-tyrosine covalent link to resolve it. This strand transfer reaction leads to the formation of a new molecule where the attacking oligonucleotide is covalently joined to the cleaved ss REP substrate (Figure 2A4). A pilot experiment with attacking oligonucleotides carrying variable 3′ extremities showed that the 3′ base must be a C, whereas the upstream sequence is less important (not shown). To characterize joint products, purified DNA was used as template for PCR amplification using the attacking oligonucleotide and a primer specific for the REP substrate (Figure 2A5). A typical profile obtained with a ss phagemid substrate carrying a wild-type REP/BIME is shown in Figure 2B, lane 1, compared to that obtained in the absence of TnpA Ec (lane 2). No significant amplification products were observed using substrates carrying only an iREP or mutations in the essential GTAG motif (Figure 2B, lanes 3 and 4 respectively). In all cases, amplification was specific to wild-type REP/BIME substrate and wild-type TnpA Ec , in contrast to catalytic mutant TnpA Ec Y115F (not shown).
Figure 2 (legend, partial). (A) ... (1) were first incubated together with TnpA REP in a reaction buffer leading to their binding and cleavage (2), resulting in the formation of a covalent complex TnpA REP Tyr-5′ P and a 3′-OH group (3). Afterwards, an attacking oligonucleotide was added in excess (4), resolving the covalent link and fusing it to the 5′ of cleaved substrate (5). Cleavage sites were mapped by PCR amplification with attacking and substrate-specific primers. The purple oval represents TnpA REP ; CT/black arrow, the cleavage site; the purple star, the 3′-OH; and Y circled in yellow, the covalent link Tyr-5′ P. Attacking and substrate-specific primers are represented as green and red arrows, respectively. The curved red arrow represents attack by the 3′-OH group present on the attacking primer. (B) Profile of cleavage sites on ss circular DNA phagemid substrates. The same conditions were used for all the substrates. '−' or '+' indicate no TnpA Ec (lane 2) or with TnpA Ec , reactions performed on substrates carrying wild-type REP Ec on a BIME, only iREP or a BIME carrying mutant GTAG (lanes 1-2, 3 and 4 respectively). Black and red arrows (right) represent mapped cleavage sites 5′ and 3′ to REP structure and major cleavage sites in wild-type substrate, respectively.
The assay was further validated by sequencing the amplification products. As in previously documented experiments, cleavage occurred mainly in proximity, 5′ or 3′ of the REP structure (Figure 2B, Supplementary Figure S3A). We also observed discrete distant cleavage sites. In addition, the attacking oligonucleotide was systematically abutted to the T of the C|T cleavage sites, confirming that the amplification products all resulted from cleavage and strand joining events (not shown).

CST-based SELEX
To gain direct insight into REP structural features potentially important for TnpA REP activity, we took advantage of the CST assay to develop a CST-based SELEX (Systematic Evolution of Ligands by Exponential Enrichment) (25). In contrast to the CST assay described above, where phagemid-derived circular ssDNA molecules were used, generating multiple cleavage sites 5′ and 3′ to the REP, SELEX substrates are simple oligonucleotides carrying a unique 5′ cleavage site and degenerate zones in the REP defining features (the GTAG motif and the palindrome: bulge, loop). These were incubated with cognate TnpA REP as in the CST assay (Supplementary Figure S3B, R 0 ). After the first PCR amplification, bulk amplified products were sequenced with a common substrate-specific primer (first round, Supplementary Figure S3B, R 1 ). For the next round, ss substrates were prepared by asymmetric PCR using an excess of attacking oligonucleotide as described in Materials & Methods (Supplementary Figure S3B, R 2 ). In each round, different 'attacking' oligonucleotides were used, all carrying a 3′ C permitting reconstitution of the cleavage site for the next round. Finally, from sequencing data, the enrichment of each base at a given position was estimated by the enrichment factor $E_{N,0}$, calculated as the ratio of the fraction of a given base at round $R_N$ to that at round $R_0$: $E_{N,0} = F_N / F_0$. The level of selection (S for score) at each position was then estimated as the variance of $E_{N,0}$ over all bases: $S = V(E_{N,0})$. The calculation method is detailed in Supplementary Materials and an example of this analysis is illustrated in Supplementary Figure S3C.
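The two quantities defined above can be illustrated with a minimal sketch. It assumes that base fractions at one position are already available as dictionaries and that the variance is the population variance over the four bases; the actual procedure is the one given in the Supplementary Materials, and the function names and example fractions here are hypothetical.

```python
from statistics import pvariance

BASES = "ACGT"


def enrichment(f_n, f_0):
    """Enrichment factor E_{N,0} = F_N / F_0 for each base at one position."""
    return {base: f_n[base] / f_0[base] for base in BASES}


def selection_score(f_n, f_0):
    """Score S = variance of E_{N,0} over the four bases.

    Assumption: population variance; the source does not specify which
    variance estimator was used.
    """
    return pvariance(list(enrichment(f_n, f_0).values()))


# Hypothetical fractions: a fully degenerate position at round 0 and
# a position dominated by G after one round of selection.
round0 = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
round1 = {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05}

print(enrichment(round1, round0))
print(selection_score(round1, round0))
```

With these illustrative numbers, G has an enrichment factor of about 3.4 and the other bases about 0.2, giving a score of about 1.92; an unselected position (all factors equal to 1) scores 0.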
We first tested the CST-based SELEX to re-examine the importance of the conserved GTAG in the REP Ec . Supplementary Figure S3C shows sequencing profiles obtained with initial substrate (R 0 ) carrying degenerate bases at the GTAG motif and those obtained at the first round (R 1 ). Remarkably, the four positions in GTAG motif were selected with high scores at the first round, as illustrated in Figure 3A and Supplementary Figure S3C. This confirmed the crucial role of the motif previously observed: no mutations were tolerated, any substitution abolished binding and cleavage (24, Supplementary Figure S4C

What has been learned from CST-SELEX on the E. coli REPtron Ec
We have shown previously that TnpA Ec is active on the three REP Ec y, z1 and z2 (23). In this section, except indicated otherwise, we generally used oligonucleotides substrates carrying derivatives of the y REP Ec , the most studied at biochemical and structural levels. REP coordinates were kept as used previously (24).
The REP Ec bulge. The conserved mismatches A 12 A 13 -G 26 C 27 are located in the middle of the y REP stem and the C 27 base is specifically contacted by TnpA Ec ( Figure 6B bottom; (24)). Mutations A 12 A 13 -T 26 T 27 or G 12 C 13 -G 26 C 27 introduced to correct the mismatches severely affected activity (24,23). One could therefore expect a significant or exclusive selection of these bases in the CST-based SELEX assay. Instead, while some selections occurred for the three positions A 12 A 13 and C 27 , the enrichments were far from those observed with the GTAG motif ( Figure 3B). In particular, both C and T were only moderately enriched at the C 27 position which is in contact with the protein. The same was observed with conserved positions A 12 A 13 where T 12 and A 13 were merely enriched with medium scores, respectively ( Figure 3B). Medium and low scores could result from the poor selection of independent bases at each position. Alternatively, multiple specific combinations of nucleotides may have been selected. However, bulk Sanger sequencing cannot capture associations between positions and only provide an average picture of the selection process. Since individual selected molecules were not sequenced the analysis cannot inform us directly about synergism or antagonism between substitutions.
To investigate the impact of this 'low selection', we tested substrates carrying substitutions A 12 T, C 27 T, replacing natural mismatch positions by those suggested by SELEX or by other bases A 13 T, G 26 T, both keeping bases unpaired. Cleavage of these variants was maintained as judged by the presence of cleavage products ( Figure 3C, compare lanes 3-4, 5-6 to 1-2). Thus, the unpaired state (mispairing in this case) instead of the sequence, seems to be crucial for recognition of the REP Ec by TnpA Ec .
The REP Ec stem-loop. Beyond the conserved bulge, hundreds of y, z1 and z2 REP Ec s share several common features (see consensus alignment in Supplementary Figure S2B) including a position in upper stem and several conserved positions in the lower stem, and in particular T 11 and G 32 contacted by TnpA Ec (24) (Figure 6B bottom). Among these 3 types of REPs, stem lengths and loop sequences are variable while relatively conserved in each group. To get access to the role of respective loops, we performed CST-SELEX on the 3 types of REPs. No specific enrichment of degenerate loop nucleotides occurred even after several rounds (with E 3,0 around 1 and low scores for all positions) ( Figure 3D, result shown for y and Supplementary Figure S4A-B for z1 and z2 REP E c, respectively). We further tested the importance of the upper stem sequence and length by mutations. Binding and cleavage of a P 32 -labelled oligonucleotide substrate for which the loop sequences or the upper stem were swapped to their complement, were still observed, as shown by EMSA (Supplementary Figure S4C, compare lanes 1-3, lanes 4-6 and not shown) and sequencing gel ( Figure 3E, lanes 1-2, 3-4 and 5-6), respectively. Similarly, no notable effect was observed upon modification of y REP lower or upper stem to simulate the z1 and z2 structures (not shown). Nevertheless, ablation of the upper stem and loop abolished binding and severely affected cleavage as judged by the absence of retarded complex (Supplementary Figure S4C, lanes 7-9) and reduction of cleavage products ( Figure 3E, lanes 7-8). These results suggest a non-specific structural role of the REP Ec upper stem-loop. This is in contrast to the role of the conserved positions T 11 and G 32 in the lower stem, which mutations T 11 A or G 32 C seriously affected binding (Supplementary Figure S4C, lanes 10-12 and 13-15, respectively) and cleavage ( Figure 3E, lanes 9-10 and 11-12, respectively).
Cross-activity. Taken together, these results suggest a relative flexibility in the substrates of TnpA Ec . This implies that other REP structures, harboring only few conserved features with REP Ec could be recognized and processed by TnpA Ec . Examination of REP structures in two group 2 REPtrons has pointed out some potential common features in REP Ec and REP Ma (Supplementary Figure S5A). Consistently, TnpA Ec exhibits robust activity on REP Ma1 and REP Ma2 substrates (Supplementary Figure S5B, lanes 1-2 and 3-4, respectively). The importance of the bulge for activity could be confirmed by experiment where mutations introduced to form perfect stem affect TnpA Ec cleavage activity on REP Ma1 and REP Ma2 substrates (Supplementary Figure S5C, lanes 3-4 compared to 1-2 and lanes 7-8 compared to 5-6).
On top of the crucial GTAG motif, a handful of REP additional structural features appears sufficient to be recognized and processed by TnpA Ec .

S. maltophilia REPtron Sm: different strategy to recognize cognate REP
The REPtron Sm includes REPs of 23 nts (REP Sm ) composed of an 8-bp perfect palindrome and a 3-nt loop (Figure 1B). Purified Sm TnpA REP (TnpA Sm ) cleaves the REP Sm substrate (an oligonucleotide carrying the REP structure and a 3′ cleavage site) at a CT dinucleotide, as shown in Figure 4A (lanes 1-3). No cleavage product was observed with the catalytic mutant derivative TnpA Sm Y130F (lanes 4-6) nor in the presence of a substrate carrying the mutant cleavage site CT-TT (lanes 7-9). As observed for the Ec REPtron, the iREP Sm displayed no binding or cleavage activity (not shown). To examine the importance of the conserved GTAG motif and the palindrome features of ss REP Sm , we assayed different ss REP Sm substrates for binding and cleavage in vitro and by SELEX.
The GTAG motif. TnpA Sm formed a specific retarded complex with ss REP Sm , as shown in EMSA experiments (Figure 4B, lanes 1-2). Single mutations in the GTAG motif did not affect the binding profile (Figure 4B, lanes 3-10), showing that, in contrast to TnpA Ec (Supplementary Figure S4C, lanes 16-18 and 19-21 and (24)), TnpA Sm binding to its substrate tolerates mutations in the conserved tetranucleotide. However, these mutations seriously affected cleavage, as shown in Figure 4C. Activity was reduced with the CTAG mutant and barely detected with the GCAG substrate (Figure 4C lanes 3-4 and 5-6, respectively, compared to wild-type GTAG, lanes 1-2). Mutations in the third and fourth positions completely abolished cleavage (GTTG, lanes 7-8 and GTAC, lanes 9-10). In agreement with these results, in a SELEX experiment, the GTAG motif was selected mainly with good scores (Figure 4D).
The REP Sm stem-loop. In a first series of experiments, we used a mutant carrying a reverse complement of the loop sequence (G 13 C 14 T 15 -A 13 G 14 C 15 ). Cleavage was severely affected, as shown in Figure 4A (lanes 10-12). These mutations also largely compromised binding, since no retarded complex was observed by EMSA (not shown), suggesting a critical role of the loop in REP recognition. We further investigated the importance of the loop by CST-based SELEX (Figure 4E). Among the 3 bases G 13 C 14 T 15 , G 13 was largely enriched with a high score whereas C 14 and, in particular, T 15 were not. Accordingly, mutation of the guanine to a cytosine, G 13 C (C 13 C 14 T 15 ), abolished cleavage (Figure 4F, compare lanes 1-2 and 3-4), confirming the SELEX result and highlighting the crucial role of this specific position in the REP Sm loop.
To assess the importance of the REP Sm stem, we introduced mutations, mostly by changing nucleotides to their complements in blocks, and subsequently at individual positions. These experiments showed that the central and upper parts of the stem play some role in cleavage (Supplementary Figure S6A, compare lanes 1-2 with lanes 3-4, 5-6, 9-10, 11-12 and 13-14), although the effect was not drastic. Interestingly, such mutations in the three bottom positions improved cleavage (lanes 7-8). We also tested the importance of a perfect stem by introducing a mismatch near the middle of the REP Sm stem. These mutations affected or almost eliminated cleavage (Supplementary Figure S6B, compare lanes 4-6, 7-9 to lanes 1-3).
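The REP Sm architecture described above (GTAG, an 8-bp stem, a 3-nt loop, then the complementary stem arm) can be checked on a sequence with a short sketch. The stem sequence used here is hypothetical and chosen only for illustration (the loop GCT is the one named in the text), the helper names are not from the source, and only Watson-Crick pairs are counted as pairing.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def stem_mismatches(rep, stem_len, loop_len, motif_len=4):
    """List stem positions (0-based, counted from the motif side) that
    cannot form a Watson-Crick pair in a motif/stem/loop/stem layout."""
    left_start = motif_len
    right_start = motif_len + stem_len + loop_len
    left = rep[left_start:left_start + stem_len]
    right = rep[right_start:right_start + stem_len]
    paired_with = reverse_complement(right)
    return [i for i, (a, b) in enumerate(zip(left, paired_with)) if a != b]


# Hypothetical 23-nt REP-like sequence: GTAG + 8-nt arm + GCT loop + 8-nt arm.
left_arm = "GCCGGATG"  # illustrative only, not the real REP Sm stem
perfect = "GTAG" + left_arm + "GCT" + reverse_complement(left_arm)
# Same sequence with one substitution near the middle of the left arm.
mismatched = "GTAG" + "GCCAGATG" + "GCT" + reverse_complement(left_arm)

print(len(perfect), stem_mismatches(perfect, 8, 3))
print(stem_mismatches(mismatched, 8, 3))
```

An empty list corresponds to a perfect palindrome; a single substitution in one arm, as in the mismatch substrates tested above, is reported as one unpaired stem position.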

Marinomonas sp. MWYL1 REPtron Mb: 'flexibility' in cleavage site selection
The Marinomonas group 3 REPtron Mb comprises small perfect palindromic REPs with a 5-bp stem, separated by 2 bases from the GTAG tetranucleotide (Figure 1B). Since TnpA Mb binding to its REP substrate cannot be visualized by EMSA, probably due to instability of the complexes, here we examined only cleavage activity. Interestingly, the system turned out to be more flexible: TnpA Mb (Mb TnpA REP ) cleaved cognate REP Mb at two sites, CT and CA. A ss DNA substrate of 55 nts carrying these cleavage sites both 5′ and 3′ to the stem-loop exhibited 4 cleavage products (Figure 5A, lanes 1-3). Cleavage sites were confirmed by CST assay and mutational analysis. As expected, no cleavage product was observed with the catalytic mutant derivative TnpA Mb Y125F (lanes 4-6). A substrate carrying the mutant cleavage site CT-TT gave rise to cleavage products at the CA sites only (lanes 7-9).

The GTAG motif: SELEX and role in cleavage site selection.
In the case of REPtron Ec, the GTAG tetranucleotide is not only involved in TnpA Ec recognition of REP Ec but is also thought to participate in cleavage site selection (24). Hence, we examined the importance of the GTAG motif by SELEX in oligonucleotide substrates carrying 5′ CT or CA cleavage sites separately. We first observed that the profile selected with a CT-carrying SELEX substrate contrasted with the result obtained with REPtron Ec: only the last two positions were strongly enriched with high scores after a single round of enrichment (Figure 5B, left). The profile obtained with the CA-carrying substrate showed mainly moderate, more homogeneous selection with relatively good scores for the motif (Figure 5B, right).
The cleavage site of IS200/IS605 family members is selected by a particular DNA-DNA linear and cross complementarity with guide sequences, tetranucleotides located 5′ to the palindromes at the left and right IS ends (19) (Supplementary Figure S1). A simple model of REP cleavage site selection would thus involve the GTAG tetranucleotide as a guide sequence (24). Accordingly, CT and CA can be chosen by cross complementarity with A 3 G 4 and T 2 G 4 , respectively (Figure 5C, top). To test this hypothesis, we designed simple 38-nt substrates carrying mutated GTAG variants and a unique cleavage site located 3′ to the stem-loop. The wild-type GTAG substrate was cleaved at CA and CT sites (Figure 5C lanes 1-2 and not shown). Although less efficiently, a substrate carrying a mutation of the third base (GTAG-GTGG) was again cleaved at CA, as expected (lanes 3-4). Changing GTAG to GTGG resulted in cleavage at CC (lanes 5-6), and changing it to GTAC resulted in cleavage at GA (lanes 7-8) and GT sites (lanes 9-10). Importantly, no cleavage was detected in the absence of the corresponding cleavage site (lanes 11-12).
Thus, different positions of the GTAG motif were selected in substrates carrying CT or CA cleavage sites and although efficacy varied, changing a subset of the motif could modify REP Mb cleavage sites in a predictable way according to two presumed schemas and examples shown in Figure  5C. This confirmed the active role of the motif in cleavage sites selection of this 'flexible' REPtron, in a manner similar to that described for the IS200/IS605 elements (19,26).
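The guide-sequence model discussed above can be written down as a small rule, given here as a minimal sketch rather than a validated predictor. It assumes that the first base of the cleaved dinucleotide pairs with position 4 of the motif and that the second base pairs with position 3 (the CT schema) or position 2 (the CA schema); the function name is hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def predicted_cleavage_sites(motif):
    """Return the two dinucleotides predicted by cross complementarity.

    The first element pairs with motif positions 3 and 4 (CT for GTAG),
    the second with positions 2 and 4 (CA for GTAG).
    """
    if len(motif) != 4 or any(base not in COMPLEMENT for base in motif):
        raise ValueError("motif must be a 4-nt DNA sequence")
    n2, n3, n4 = motif[1], motif[2], motif[3]
    first_base = COMPLEMENT[n4]
    return first_base + COMPLEMENT[n3], first_base + COMPLEMENT[n2]


for motif in ("GTAG", "GTGG", "GTAC"):
    print(motif, predicted_cleavage_sites(motif))
```

Under this rule GTAG gives CT and CA, GTGG gives CC and CA, and GTAC gives GT and GA, which matches the sites at which the corresponding substrates were cleaved in Figure 5C.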
The Mb stem-loop. Swapping the entire REP Mb stem for its complement moderately affected cleavage (not shown), indicating a slight role in REP recognition/activity. We further analyzed the importance of different stem portions by the same procedure. We observed a reduction in cleavage activity for mutations of the fourth position in the REP Mb stem, but no effect for mutations of the second and third positions (Figure 5D, compare lanes 1-2 and 7-8, and not shown). As with REPtron Sm, introduction of a mismatch in the REP Mb stem seriously diminished cleavage activity (Figure 5E, compare lanes 3-4 and 5-6 to lanes 1-2). We then examined the importance of the REP Mb loop by SELEX. Experiments were performed separately on CT- or CA-carrying substrates with a degenerate loop. For both substrates, three of the four positions (T 12 , T 13 and A 15 ) were strongly enriched with good and excellent scores (Figure 5F and not shown). Accordingly, negative values of log2(E 1,0 ) (for E 1,0 below 1) clearly illustrated exclusion of the other bases at these three positions (Supplementary Figure S7). These counterselections were further confirmed by the mutational analysis shown in Figure 5G. Mutations in the first and second positions (G 12 T 13 T 14 A 15 and T 12 G 13 T 14 A 15 ) greatly reduced cleavage (Figure 5G, lanes 3-4 and lanes 5-6, compared to lanes 1-2). Also, the replacement T 13 C (T 12 C 13 T 14 A 15 ) completely abolished activity (lanes 7-8), while a substrate carrying a degenerate 14th base (T 12 T 13 N 14 A 15 ) exhibited wild-type behavior (lanes 9-10). Finally, exchange of T 12 and A 15 (A 12 T 13 T 14 T 15 ) or the individual substitutions T 12 A (A 12 T 13 T 14 A 15 ) or A 15 T (T 12 T 13 T 14 T 15 ) also compromised activity, as shown in Figure 5G, lanes 11-12 and not shown.

Figure 5 (legend fragment). [...] (lanes 1-3), TnpA Mb catalytic mutant derivative Y125F on wild-type substrate (lanes 4-6) and substrate carrying mutations CT-TT at two CT sites (lanes 7-9). CA and CT cleavage products are shown by blue and black arrows, respectively. (B) REP Mb GTAG SELEX on CT-carrying substrate (left) and on CA-carrying substrate (right). The same schema as described previously is shown: REP Mb carrying a 5′ CT or CA cleavage site and the degenerate sequence N 1 N 2 N 3 N 4 (in red) at the GTAG motif, where G, C, T, A are in blue, grey, green and red, respectively. Scores are indicated at corresponding positions. Underneath: initial sequence of the motif.
These results confirmed the crucial role of three positions in the REP Mb loop in cleavage activity and suggested that the two bases T 12 and A 15 are complementary in the REP Mb structure and might be considered part of the stem.
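The sign convention used for the loop counterselection, where a negative log2(E 1,0 ) corresponds to E 1,0 below 1, can be illustrated with a minimal sketch. It assumes that E 1,0 for a base is simply its frequency after one round of selection divided by its frequency in the initial pool; the exact definition and scoring used in this study may differ. The function name and the counts are hypothetical and do not come from the data.

```python
import math


def log2_enrichment(counts_round0: dict, counts_round1: dict) -> dict:
    """Per-base log2(E 1,0) at one degenerate position.

    Assumption: E 1,0 = (frequency in round 1) / (frequency in round 0).
    Counts must be positive for every base so that the ratio and its
    logarithm are defined.
    """
    total0 = sum(counts_round0.values())
    total1 = sum(counts_round1.values())
    result = {}
    for base in "ACGT":
        c0, c1 = counts_round0[base], counts_round1[base]
        if c0 <= 0 or c1 <= 0:
            raise ValueError(f"non-positive count for base {base}")
        result[base] = math.log2((c1 / total1) / (c0 / total0))
    return result


# Hypothetical read counts at a single loop position (illustration only).
round0 = {"A": 25, "C": 25, "G": 25, "T": 25}
round1 = {"A": 5, "C": 5, "G": 10, "T": 80}
for base, value in log2_enrichment(round0, round1).items():
    print(base, round(value, 2))
```

With these made-up counts, T is enriched (positive value) while A, C and G are counterselected (negative values), which is the pattern described for T 12 , T 13 and A 15 .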

DISCUSSION
Our analysis demonstrated that TnpA REP of the two groups employ diverse strategies to recognize their REP substrates. Clearly, both REP components, the GTAG tetranucleotide motif and the palindrome, were involved in TnpA REP activity, but their respective impacts varied in each system. In Figure 6A, we summarize the importance of these features. While GTAG is instrumental in REPtron Ec, mutations are largely tolerated in REPtron Sm for binding and in REPtron Mb for cleavage (and by deduction for binding). Although involvement of the GTAG motif in cleavage site selection has been suggested for REPtron Ec, this role was not experimentally supported. Interestingly, the REP Mb tetranucleotide motif was selected differently in CA- or CT-carrying substrates, probably reflecting the distinct contributions of individual motif positions to selection of the respective cleavage sites. The role of loop sequences also differs between representatives of the two groups. No mutations were tolerated in certain REP Mb or REP Sm loop positions, while only a non-specific structural role was suggested for the REP Ec loop.

TnpA REP of the two groups
Catalytic center and C-term tail. Groups 2 and 3 REPtrons differ in their encoded TnpA REP and corresponding REPs. As shown by an alignment performed on a limited collection of TnpA REP (Figure 6B), the catalytic center, composed of the metal coordination module (HuH motif and other additional residues (24)) and the catalytic Tyr, is well conserved in both groups. Some differences are found in the N-term and C-term portions: group 3 members include several supplementary residues in the N-term, whereas group 2 members carry a C-term extension of about 20 residues, comprising a short helix α5 and an unstructured region in the case of TnpA Ec (Figure 6B). The helix α5 and the adjacent downstream region appeared to be important for TnpA Ec activity, since derivatives 131 and 144 (deletions of 34 and 21 C-terminal residues, respectively) exhibit serious defects in binding and cleavage (data not shown). However, deletion of the 13 extreme C-terminal residues resulted in a mutant, 152, with higher activity than the wild-type (24), suggesting a regulatory function for these residues. In group 3, the C-term part also comprises a short helix of unknown function.
Contacts with REP. The REP Ec GTAG motif, which is exclusively selected in SELEX and which tolerates no substitution for binding and catalytic activity, is heavily contacted by TnpA Ec protein residues (group 2). These residues are distributed in the regions comprising β1 and surrounding β4 and also the C-terminal extremity (24) (Figure 6B). While these residues are well conserved in group 2, only some (Q95, D100 and R104) are relatively conserved in group 3. In particular, G160 and E161, situated in the TnpA Ec C-term tail and absent in group 3, contact the last two bases of the GTAG motif. These differences may partly explain the discrepancy in GTAG requirement in the two groups.
In REP Ec (group 2), the conserved mismatches forming a bulge in the middle of the stem, A 12 A 13 -G 26 C 27 , were also important, since mutations recreating a perfect palindrome affected activity; C 27 is specifically contacted by residue K82, situated in the conserved DNA-binding α3 helix (Figure 6B, (24)). Nevertheless, these positions were not, or only moderately, selected by SELEX, suggesting that different combinations of nucleotides are possible. In the case of C 27 , we obtained a C/T mixture, suggesting that a pyrimidine might be required at this position. Concerning group 3 REPs, exclusive selection of unique loop positions, G 13 (Sm) and T 13 (Mb), and the impact of mutations on activity demonstrated their crucial role (Figures 4 and 5). Since no structural data are available, we can only speculate about contacts with the cognate TnpA REP . In spite of the differences, a parallel might be drawn between loop positions in the small REPs of group 3 and unpaired positions in group 2 REP, and residues on the equivalent DNA-binding helix α3 might be responsible for these contacts. The same helix and downstream region might mediate binding of the cognate TnpA REP to the group 3 REP stems, as observed for TnpA Ec.

Binding to folded ssDNA hairpin
TnpA REP , like TnpA IS200/IS605 , recognize their ss DNA REP substrates in a strand-specific manner. Only REP with the characteristic features is bound and processed; iREP is not. In group 2 REP, the conserved GTAG motif is clearly involved in strand discrimination, while its role in group 3 is more limited. Furthermore, the effect of single-stranded features (loop or irregular zones such as mismatches or a bulge) is undeniable.
These properties echo those displayed by some proteins encoded by mobile genetic elements that work on folded ss DNA, such as the integron-encoded integrases IntI, plasmid relaxases and the ss DNA transposases TnpA IS200/IS605 . For conjugative transfer, the relaxase recognizes oriT as a single-stranded folded hairpin (27). While its contacts with the stem remain non-specific, it establishes specific contacts with the ss DNA cleavage region downstream of the hairpin. In the recombination reaction between the integron attC and attI sites, the attC site, normally ds DNA, takes the form of a ss structure folded from the bottom strand, reconstituting a ds recombination site (28). In the crystal structure, Int establishes specific contacts with several flipped-out bases in the attC site, and these interactions are essential for recombination (29). Moreover, efficient insertion of integron cassettes is also influenced by two other unpaired regions of the attC recombination sites (30). Recently, the impact of these structural specificity determinants of integron cassettes has been refined using synthetic biology combined with large-scale mutagenesis, next-generation sequencing and machine learning (31). This powerful approach will be a valuable tool to re-examine and obtain a global view of specificity determinants and synthetic evolution pathways in diverse systems, including REPtrons.

Figure 6 (legend fragment). [...] TnpA REP (boxed in blue and red, respectively) based on TnpA Ec structural data. Catalytic tyrosine and HuH motif are indicated by red and orange-coloured stars, respectively. TnpA Ec residues involved in specific contacts with the GTAG motif, specific interaction with the bulge, specific and non-specific contacts with REP Ec stem are indicated by purple, blue, black and grey points, respectively. Bottom: TnpA Ec residues contacting minimal y REP Ec structure (24) where the same colour code is used: residues contacting specifically GTAG (boxed in purple), bulged C 27 (in light blue) and stem specific positions T 11 , G 32 (in black) and stem non-specific interactions (in grey), respectively.

Nucleic Acids Research, 2021
TnpA REP are so far the closest relatives of TnpA IS200/IS605 , among which the transposases of IS608 and ISDra2 are the most studied. To recognize the correct REP structure, TnpA REP proteins contact the loop or irregularities in the palindrome stem, as do ss transposases. The IS608 unpaired base T 17 is sandwiched between aromatic residues in a hydrophobic pocket, whereas T 10 in the loop is specifically contacted by two residues (17). Similarly, the ISDra2 transposase displayed contacts with T 14 in the loop and a mismatched base within the stem (18). Thus, TnpA REP proteins employ one or the other of these binding determinants in combination with the conserved tetranucleotide GTAG. This last feature clearly distinguishes TnpA REP from ss transposases, which mostly contact only the palindromes.

Cleavage sites selection
Left and right cleavage sites of the IS200/IS605 family members are selected via a network of particular complementary interactions with corresponding 'guide' sequences, which are tetranucleotides 5′ to the palindromes (19) (Supplementary Figure S1). Consequently, IS608 cleavage sites could be modified by changing the corresponding 'guide' sequences, which also resulted in retargeting of the IS (26). The position of the GTAG tetranucleotide in REPs could be equivalent to that of the 'guide' sequences. According to the proposed model of cleavage site selection based on the examples of IS608 and ISDra2, the common CT and the Mb CA cleavage sites would be chosen via interactions with subsets of the conserved GTAG. TnpA Mb turned out to be more flexible and cleaves the REP substrate at both CT and CA sites. Thanks to this flexibility, we could explore this question and managed to vary the REP Mb CT and CA cleavage sites by changing certain positions in the GTAG motif. Although the cleavage sites could be changed in this simple way in these experiments, cleavage efficiency varied and it cannot be excluded that other factors are involved.
In the cases of REPtrons Ec and Sm, similar attempts to change cleavage sites did not succeed (not shown). REPtron Ec is known not to tolerate any GTAG mutation. In the case of Sm, while GTAG mutants still form complexes with TnpA Sm , they show severely reduced cleavage, in particular when mutations concern the last two positions, consistent with their postulated role in cleavage site selection. We suppose that the CT site is indeed selected by the GTAG motif but that, in the case of REPtron Ec, the GTAG is 'protected' from mutation by specific contacts with the protein, as shown by the structure. Alternatively, TnpA Ec or TnpA Sm could also accommodate the CT dinucleotide directly in the catalytic site. This information is missing from the available structure.

REPtrons and potential evolutionary route
REPtrons and IS200/IS605 family members share major features. They exhibit an equivalent genetic structure in which coding sequences are bordered by palindromes, and they encode proteins with a similar catalytic domain. Large-scale phylogenetic analyses have confirmed the evolutionary relationship between TnpA REP and TnpA IS200/IS605 (13,20). TnpA REP have been proposed to originate from ancient TnpA IS200/IS605 ancestors in Enterobacteria and Pseudomonas, where tnpA REP are the most widespread. Alternatively, this distribution may reflect their successful establishment following arrival via horizontal transfer in these bacterial groups (ISfinder, https://isfinder.biotoul.fr/) (32).
Our results suggest that these two TnpA REP groups have co-evolved with their respective REP sequences. On the other hand, this does not seem to be the case for the IS200/IS605 family, which includes two subgroups: one carries palindromes with irregularities (e.g. IS608 and ISDra2), whereas the other is associated with perfect palindromic ends (e.g. IS200, IS1451). Yet TnpA IS200/IS605 appear very homogeneous, with no distinction observable in the corresponding transposase sequences (ISfinder). It will be interesting to know whether the two REPtron groups described here have evolved from a common ancestor, and whether that ancestor is shared with IS200/IS605 family members.
In spite of the close relationship between TnpA REP and TnpA IS200/IS605 , tnpA IS200/IS605 exhibit the typical behavior of IS transposase genes, whereas tnpA REP are, in many respects, very close to housekeeping genes (13), supporting the previous consideration of TnpA REP as the first described bacterial domesticated transposases. The maintenance of tnpA REP in bacterial genomes also implies that they have been co-opted to fulfil functions beneficial to the host cell. Diverse documented functions of REP sequences in cell physiology suggest roles in improving the fitness of the bacterial host in a given niche or environment. This notion has been reinforced by a recent genome-wide CRISPRi analysis in E. coli using the catalytic null mutant of the Cas9 RNA-guided nuclease (CRISPR-dCas9) for silencing genes of interest (33). Interestingly, this study revealed a fitness defect caused by dCas9 binding at different REP sequences. Work is in progress to decipher the dissemination pathway of these important elements.