Novel Two-Component System-Like Elements Reveal Functional Domains Associated with Restriction–Modification Systems and paraMORC ATPases in Bacteria

Abstract Two-component systems (TCS) are important types of machinery allowing for efficient signal recognition and transmission in bacterial cells. The majority of TCSs utilized by bacteria is composed of a sensor histidine kinase (HK) and a cognate response regulator (RR). In the present study, we report two newly predicted protein domains—both to be included in the next release of the Pfam database: Response_reg_2 (PF19192) and HEF_HK (PF19191)—in bacteria which exhibit high structural similarity, respectively, with typical domains of RRs and HKs. Additionally, the genes encoding for the novel predicted domains exhibit a 91.6% linkage observed across 644 genomic regions recovered from 628 different bacterial strains. The remarkable adjacent colocalization between genes carrying Response_reg_2 and HEF_HK in addition to their conserved structural features, which are highly similar to those from well-known HKs and RRs, raises the possibility of Response_reg_2 and HEF_HK constituting a new TCS in bacteria. The genomic regions in which these predicted two-component systems-like are located additionally exhibit an overrepresented presence of restriction–modification (R–M) systems especially the type II R–M. Among these, there is a conspicuous presence of C-5 cytosine-specific DNA methylases which may indicate a functional association with the newly discovered domains. The solid presence of R–M systems and the presence of the GHKL family domain HATPase_c_3 across most of the HEF_HK-containing genes are also indicative that these genes are evolutionarily related to the paraMORC family of ATPases.


Introduction
Monitoring environmental conditions is determinant for the success of microbial organisms. To address this demand, biochemical cascades that enable programmatic responses to diverse stimuli have evolved across the three domains of life (Lodish et al. 2008;Alberts 2017). A dominant signaling archetype in nature is comprised of a receptor/transmitter protein paired with a ligand-binding regulator to be affected downstream, collectively known as two-component systems (TCS) (Hoch and Silhavy 1995;Stock et al. 2000). Specifically, in bacteria, preferable conservation of histidine-kinase (HK) as the transmitter module in receptor proteins has been recurrently documented since the discovery of TCSs roughly three decades ago (Ninfa and Magasanik 1986;Nixon et al. 1986;Koretke et al. 2000;Wuichet et al. 2010). HKs commonly associate with intracellular response regulators (RR), which transduce signals and generate responses through diverse mechanisms. Such mechanisms range from binding DNA or proteins to enzymatic catalysis (Romling et al. 2005;Batchelor et al. 2008;Gao and Stock 2009). As the genomic TCS collection in a single cell may comprise dozens of duplets, interaction specificity among the cognate partners of a TCS is critical to avoid cross-talk with other pathways (Galperin 2005;Skerker et al. 2005).
In this context, the canonical transduction chain in this system depends on five key domains. The first domain is a highly conserved catalytic ATP-binding domain (CA) typified by the HATPase domains (Pfam: PF02518; PF13581; PF13589; PF14501) which integrate the C-terminal regions of all known HKs and harbor a characteristic nucleotidebinding a/b sandwich known as the Bergerat fold (Bergerat et al. 1997;Dutta and Inouye 2000). The proteins harboring this structural fold are classified into the GHKL superfamily (Dutta and Inouye 2000). The second domain is a dimerization domain containing the active His site for phosphorylation (DHp). Together, DHp and CA domains comprise the two well-conserved regions in HK proteins. The prototypic DHp domain harbors the active site His in a double a-helical structure that is part of a four-helix bundle termed H-box (Karniol and Vierstra 2004). Structural variants of DHp have been described over the years, which are commonly called HiskA domains (Pfam: PF00512; PF07568; PF07730; PF07536) (Bilwes et al. 1999;Grebe and Stock 1999;Karniol and Vierstra 2004). The third well-conserved domain in TCS is the receiver domain (REC), which is commonly found in the RRs, and features a highly conserved Asp residue within its active site (Pfam: PF00072) (West and Stock 2001). This active site Asp is typically found within a characteristic (ba) 5 fold conserved in RR proteins (Stock et al. 1989). The remaining two domains among the aforementioned five typical domains in TCS are highly variable and comprise: 1) a sensor/input domain (in HKs) and 2) effector/output (in RRs), respectively responsible for recognizing specific signals, and eliciting responses. The high-sequence variability observed in these domains enables a myriad of signal/target recognition patterns (Gao and Stock 2009).
The functional classification of HKs usually provides necessary insight that allows the initial characterization of the TCS role. This classification is generally inferred employing domain architecture, which encompasses transmembrane regions (TMRs) and sensor domain arrangement within the sequences . Three major groups of HKs can be determined following this criterion. The largest among these groups is represented by the periplasmic-sensing HKs carrying an extracellular sensor typically flanked by two TMRs . Members of this group commonly harbor a so-called "linker" domain (e.g., HAMP, or PAS) flanked by the TMR2 and the DHp domain in the protein sequence (Aravind and Ponting 1999). The second major group includes both soluble and membrane-anchored sensors typified by inward-facing input domains ). Among these, NtrB along with its cognate NtrC comprise a wellcharacterized TCS model responsible for gene regulation under nitrogen deprivation scenarios in Escherichia coli (Ninfa and Magasanik 1986;Jiang et al. 2003). The role of another soluble HK termed HoxJ has also been established in Ralstonia eutropha as being part of a regulation mechanism for hydrogenase genes in response to H 2 (Lenz and Friedrich 1998;Kleihues et al. 2000). Interestingly, hoxJ is invariably found adjacently to hoxBC genes, which encode two proteins that help assemble the sensory/transduction complex HoxJ-BC necessary for proper H 2 perception . This is unsurprising given the typical colocalization of functionally related genes in the same gene-neighborhoods found in prokaryotic genomes (Qi et al. 2010). HKs of the third group are characterized by the conservation of 2-20 TMRs that actively enable stimulus perception often associated with the membrane itself (e.g., mechanical stress) (Mascher et al. 2003. In this report, we predict two novel domains structurally related respectively to DHp and REC in bacteria. In addition, the extensive colocalization observed in genes that encode these domains supports the possibility of these domains comprising a novel TCS. These genes also tend to be neighbored by restriction-modification (R-M) systems, which provided further insights on their putative biological role.

Establishing Homologous Sets and Inspecting Genomic Contexts
On the course of a large-scale gene expression analysis conducted by our group on potato tubers using the highly virulent plant pathogen Pectobacterium brasiliense strain PBR 1692 (Pb1692), three adjacent uncharacterized genes were detected exhibiting significant activation (Bellieny-Rabelo et al. 2019, 2020. The context in which these genes were up-regulated involves different stresses that may indicate their recruitment to either support or participate in bacterial competition, or interaction with the eukaryotic host. One of these three genes (PCBA_RS21900, locus tag in the previous version of the genome from NCBI; AED-0003799 updated locus tag from https://asap.genetics.wisc.edu/, last accessed February 25, 2021) was identified as a member of the GHKL superfamily. The encoded product of PCBA_RS21900 exhibits two HATPase domains (Pfam: PF02518; PF13589) (Dutta and Inouye 2000). These domains are typically found in Hsp90 chaperones, DNA-gyrases, MutL, and HK family members (Dutta and Inouye 2000). The other two neighboring genes (PCBA_RS05725-05720, previous version's locus tags; AED-0003798 updated locus tag that merges those from the previous version), contrarily, did not harbor any known conserved domains. With further inspection of the PCBA_RS05725-05720 homologs in other strains from the same organism (i.e., LMG21371, PcbHPI01) through nucleotide alignment, we found that those two hypothetical entries probably comprise in reality one single gene, which may result from a minor misassembly and/or gene-prediction error in the Pb1692 genome. Curiously, these homologs from LMG21371 and PcbHPI01 strains were also neighbored by GHKL superfamily members. Hence, aiming to improve the accuracy in the subsequent computational analyses involving sequence comparisons, the sequence from Pb LMG21371 (KS44_RS10365) was used instead of those from Pb1692. The goal here was to provide an assessment of the frequency of genomic colocalization between KS44_RS10365-related genes and GHKL superfamily members in other bacterial genomes.
To perform an initial gene-neighborhood screening on KS44_RS10365-related sequences, we collected a wide range of similar protein sequences from prokaryotes in an extensive search on the NCBI database (see Materials and Methods for details). From the initial sequence search, 677 bacterial strains returned positive hits for KS44_RS10365, and their respective genomic data were obtained to conduct subsequent analyses (supplementary table S1, Supplementary Material online). Next, to predict the orthology relationship among the 441,897 sequences collected from the 677 strains, a sequence clustering step was conducted using OrthoMCL (Li et al. 2003). The sequence clustering analysis produced 23,397 distinct orthologous groups (OG) in total, which were populated by ranges of 2-990 sequences. This approach was paralleled by conserved domains inspection (Eddy 2009) in the same protein data set. Conserved domains assessment was used to provide functional annotations, and to carefully check the cohesiveness of domain architectures within the clustered sequences in the OGs.
The integration of domains and OGs annotation with genomic coordinates enabled extensive gene-neighborhood analysis on the genes encoding the OG_4 members, which harbors KS44_RS10365 and its predicted orthologs. The OG_4 sequences spread through 628 different strains in which 30 conserve two to three paralogs, totaling 664 proteins (supplementary table S2, Supplementary Material online). Out of these 664 paralogs, 20 were in regions unsuitable for contextual analysis due to incompletion of the respective genome assemblies, which resulted in the lack of necessary data in the region of interest. In the 644 analyzed regions (after the exclusion of the 20 unsuitable regions), we observed a remarkable linkage of 91.6% between OG_4 and GHKL superfamily members (OG_5), in a distance of up to three genes. Next, by scanning a broader region including ten genes up/downstream to OG_4 homologs, the HATPase-containing genes (GHKL) were among the most prevalently detected domains ( fig. 1). In the OG_4 members' genomic neighborhoods, we also found the dominant presence of DNA_methylase (Pfam: PF00145) containing genes, frequently located upstream of the OG_4 homologs. This strong colocalization raised the possibility of the association of OG_4 members with R-M systems, which will be addressed in detail in a subsequent topic. Other domains associated with two-component signal transduction systems, such as HiskA (Pfam: PF00512) and Response_reg (Pfam: PF00072) were also frequent in the OG_4 members' regions ( fig. 1). The presence of immunity-and polymorphic toxinrelated domains such as CdiI_3 (Pfam: PF18616) and Ntox28 (Pfam: PF15605) in OG_4 members' vicinity indicates that these genes can frequently be found in association with toxin-antitoxin systems. Additionally, several genes located in these regions for which no known domains could be detected were grouped in orthologous groups, such as the members of OG_20, OG_16, OG_24, OG_8, and OG_294 ( fig. 1). The traceable evolutionary relationship among these uncharacterized genes can potentially drive the discovery of novel functionalities in subsequent studies. Together, these observations revealed a strong colocalization pattern among OG_4 and OG_5 (GHKL-containing) members in a wide range of bacteria as well as the most important functional groups to which they frequently associate. Below, we lay emphasis on sequence analysis to examine other conserved features in OG_4/5 groups aiming to determine which biological role they might undertake.

Domain Architectures Characterization
As a preliminary step, we performed a sequence analysis step using the OG_5 sequences in comparison with the major GHKL families (i.e., Hsp90, DNA-gyrases, MutL, and HK). This revealed that OG_5 sequences only produced results when aligned with HKs. This preliminary observation, in combination with the remarkable trend for colocalization of OG_4 and 5 indicated that OG_5 and OG_4 could comprise a TCS. Hence, taking advantage of diverse REC and DHp domain profiles publicly available (Finn et al. 2010), members of OG_4 and OG_5 were analyzed along with those profiles to verify the possible presence of similar features in the multiplesequences alignments (MSA). These analyses revealed two previously unknown conserved regions respectively in OG_4 and OG_5 bearing strong similarities with REC and DHp domains.
The most pronounced feature of REC domains is the active site Asp residue (Stock et al. 1989), which was remarkably conserved with 100% consensus found across representative OG_4 sequences ( fig. 2A). The active site Asp was also consistently conserved within the C-termini ends of b 3 . It has also been reported that two consecutive negatively charged residues located N-terminal to the active site are involved in cofactor coordination (Stock et al. 1993). These residues were strikingly conserved in the C-terminal end of b 1 both in OG_4 and in the canonical REC structure ( fig. 2A and B). The Thr column located at the C-termini of b 4 comprises another residue known to be conserved in REC domains, which was also shown to be highly conserved in the OG_4 representatives exhibiting a high consensus level within this group. Additionally, the C-terminal end of b 5 has a conserved Lys residue in the canonical REC, which was also observed in the OG_4 MSA. Although there was a lowly supported split of b 5 structure into 2 b sheets, here, the C-terminal end of the second sheet seems highly preserved in OG_4 sequences exhibiting strong conservation of the Lys residue similarly to the REC domain ( fig. 2A). This same b 5 split probably reflects the difficulty in secondary structure prediction for some sequences in the MSA, which is not frequently observed when some of the sequences are inspected individually ( fig. 2B). Notably, the tolerance in OG_4 sequences toward loop extensions interspersing the b/a structures, especially between a 1 /b 1 and a 5 /b 5 , have presumably deepened the evolutionary distance to the canonical Response_reg (REC domain) resulting in the solid evolutionary separation between these motifs ( fig. 2A and C). Although the OG_4 conserved motif and Response_reg have diverged, the topology found in these domains exhibits remarkable similarity in the characteristic (ba) 5 regulatory fold. This observed (ba) 5 also exhibits key amino acid residues in identical positions to those reported in the canonical domain structure of REC domains.
On the other hand, domain alignments of OG_5 members suggested a close relatedness between a conserved region within these sequences and the HiskA domain (Pfam: PF00512). Thus, by framing OG_5 and HiskA-containing representative sequences, the remarkable conservation of core His residue followed by a negatively charged residue (Glu or Asp) was observed ( fig. 3A). The two predicted a-helices in the analysis comprised the characteristic DHp found in Hboxes, in which a 1 harbored the core His residue ( fig. 3A and B). Four randomly picked models of OG_5 sequences depicted the presence of the core His residue in a 1 ( fig. 3B). Such strong conservation of H-box helices constituted important functional evidence because this structure is responsible for both dimerization among HKs and phosphorylation of the active site Asp residue in the cognate RR. The phylogenetic reconstruction of H boxes from representative sequences of four canonical DHp domain families (see Materials and Methods for details) suggested that the domain found in OG_5 sequences might have evolved from HiskA ( fig. 3C). These two closely related domains observably conserved a negatively charged residue (Glu or Asp) adjacent to the core His ( fig. 3A). Since HKs attachment to the membrane, or lack thereof, is intrinsically associated with their function, we quantitatively compiled TMR predictions on representative sequences carrying those same four DHp domains (i.e., HiskA, HiskA_2, HiskA_3, and HWE_HK). The results revealed a strong preference by HiskA_3 representatives to conserve six TMRs, whereas most of HWE_HK, HiskA_2, and OG_5 representative members tend to be found in soluble HKs ( fig. 3D). In OG_5 proteins specifically, none of the sequences exhibited positive TMR predictions, which suggests that these group members function as soluble proteins. Taken together, these results strongly suggest that members of OG_4/5 FIG. 1.-Top 1% best-represented protein domains encoded in OG_4 members' genomic regions. In total 628 genomes containing OG_4 genes were analyzed ten genes up-and downstream. The functional domains encoded by each locus in that range were computed and the total amounts are displayed in the bar plot. Domain names were detected by the hmmscan tool and are annotated according to the Pfam database. For each gene product, in the absence of known domains (na-not applicable), the respective orthologous group label (prefix: OG_; suffix: numeric tag) is displayed. groups could comprise a cytoplasmic TCSs. The two respective novel domains found in OG_4 and OG_5 will be hereafter referred to as Response_reg_2 and HEF_HK, and the sequences carrying these domains RR-and HK-like, respectively.
The Association of RR-and HK-Like with R-M Systems Next, to gain further insight into the members of both families through "guilty by association" inference, we conducted indepth genomic context analysis in these RR-like-harboring regions. By using the previously obtained 644 genomic regions from 628 bacterial strains carrying RR-like orthologs as a reference, we first examined the functional domains encoded by their neighboring genes. This step aimed to inspect a genomic range of ten neighboring genes up-and downstream of the RR-like members. Besides, based on the previous assessment showing the solid presence of C-5 cytosine-specific DNA methylase (C5 Mtase) in the RR-like vicinity, we laid special emphasis on identifying R-M elements. To keep this analysis stringent from the functional prediction perspective (see Materials and Methods for details), we used the "gold standard" library of R-M-associated proteins from the REBASE database (Roberts et al. 2015). The REBASE gold standard set comprises only experimentally characterized components that have been associated with either restriction or modification functions. Aiming to determine which known functional domains are associated with R-M systems, we scanned the entire REBASE gold standard protein set using the Pfam database as a reference. Subsequently, by inspecting the domains encoded by the genes neighboring the RRlike encoding genes and comparing these with the previous results, we found a consistent presence of R-M-related domains.
Next, to evaluate the correlation of R-M systems elements in close proximity to RR-like genes, a computational simulation step was implemented. This technique shuffles the entire gene pools from known genomes, generating "pseudogenomes." By analyzing the results from these randomized pseudogenomes, we could infer up to which frequency particular gene associations might occur by chance, as well as overrepresented associations. Here, we utilized 25 randomly picked complete bacterial genomes from our initial set of 628 RR-like-containing organisms to conduct the simulations. From each of the 25 genomes, 2,000 pseudogenomes were generated. Next, the RR-like neighboring genes (ten genes up-and downstream) from each of the 50,000 simulations were inquired for the occurrence of R-M domains. The number of R-M domains in these regions was computed for each strain. For the vast majority of strains analyzed individually, the pseudogenomes analysis showed that up to three R-M domain occurrences should be expected by chance in the RR-like gene neighborhoods ( fig. 4A). Furthermore, we found that between five and seven R-M domains were found within a maximum distance of ten genes from the RR-like encoding gene in 41.9% of the 628 bacterial genomes analyzed, whereas 9.9-15% of the 50,000 pseudogenomes exhibited between five and seven R-M domains in the same genomic range. On the other hand, only 9.7% of the 628 genomes exhibit up to two R-M-related domains in the RR-like neighboring regions, whereas 49.3-56.8% exhibits the same frequency of R-M domains in the pseudogenomes. These observations further underscored the conspicuous presence of R-M systems elements in the genomic contexts of the newly described RR-like orthologs. On top of these comparisons, statistical analyses showed that for all strains, the frequency of R-M domains in RR-like neighborhoods exhibited significant contrast (P value <0.0001) compared with the frequency found in the real data set ( fig. 4B). The nonrandom genomic association of RR-Like and R-M-related genes strongly suggests a mechanistic relationship between these themes in a vast range of bacteria.
By further inspecting the RR-like regions (ten genes up-and downstream) with the REBASE annotation support, we found the prominent presence of type I and especially type II R-M system elements ( fig. 5A). Although functional domains can often be coopted to perform their roles in different proteins from different systems (and can thus be classified into more than one R-M system type), types I and II are still the most prevalent in these regions. Several nuclease domains from type II R-M systems were found in these vicinities, such as MvaI_BcnI (Pfam: PF15515), ParBc (Pfam: PF02195), EcoRII-C (Pfam: PF09019), ResIII (PF04851), HSDR (PF04313), and variants of the HNH domain ( fig. 5B and supplementary table S2, Supplementary Material online). Specifically, the HNH_2 (Pfam: PF13391) is the best-represented endonuclease domain in RR-like neighborhoods. The HNH_2 can be found in seven different genera of Gram-negative bacteria: Three from the Bacteroidetes (Chryseobacterium, Flavobacterium, and Hymenobacter) and four from the Proteobacteria (Klebsiella, Pectobacterium, Salmonella, and Yersinia) phyla ( fig. 5B). Besides the type II-associated domains, the widespread presence of the Vsr endonuclease domain (Pfam: PF03852) was also observed. The Vsr-containing genes encode the very short patch repair (VSR) proteins. VSRs are often classified as V genes, and play a role in correcting mispairs in the DNA that arise following spontaneous deamination of 5methylcytosine (Bhagwat and Lieb 2002). In this analysis, we found Vsr-containing genes in 20 different genera, from four different phyla (Actinobacteria, Proteobacteria, Bacteroidetes, and Verrucomicrobia) encompassing: four cproteobacteria (Acinetobacter, Azotobacter, Pseudomonas, Shewanella), five a-proteobacteria (Agrobacterium, Mesorhizobium, Pleomorphonas, Rhizobium, Sphingomonas), two d-proteobacteria (Desulfovibrio, Geobacter), five bacteroidias (Bacteroides, Hallella, Hymenobacter, Porphyromonas, Prevotella), one chlorobium (Chlorobium), one flavobacterium (Chryseobacterium), one actinobacterium (Diaminobutyricimonas), and one verrucomicrobiae (Akkermansia). The most prevalent genomic organization of VSR-encoding genes across the different taxa was that in which VSR genes were adjacently located downstream to the RR-like genes ( fig. 5B). These analyses underscored the solid association of RR-like with type II R-M systems, which may be preferentially recruited to work along with the newly discovered TCS-like system.

Phyletic Distribution of Protein Architecture Containing the Newly Characterized HK-Like
The next step was to consolidate the newly found domain profiles based on the representative sequences respectively from OG_4 and OG_5. Consolidation of the new profiles enabled us to perform broad domain-oriented searches of Response_reg_2 and HEF_HK conserved motifs against large databases using hmmsearch (Potter et al. 2018). By conducting this sensitive search, we aimed to test the results from the previous analysis and assess the prevalence of the new domains across the bacterial lineages.
Hence, Ensembl, UniProtKB, and Reference proteomes (uniprotrefprot) were inspected for the presence of the two Novel Two-Component System-Like in Bacteria GBE Genome Biol. Evol. 13(3) doi:10.1093/gbe/evab024 Advance Access publication 10 February 2021 domains, which returned respectively from these databases: 529, 163, and 50 hits for Response_reg_2 and 199, 139, and 52 hits for HEF_HK. As expected, based on our first analysis, the Response_reg_2 domains were detected in proteins lacking any other currently known domains. On the other hand, the HEF_HK proteins exhibited diversity in terms of associated known domains. These sequences contained GHKL superfamily motifs (i.e., HATPase_c, and HATPase_c_3), which strikingly confirmed the pattern observed in the first section of this study. Furthermore, the variable presence of functional domains could be exploited to glean additional information on the phyletic distribution of the HEF_HK-containing proteins. To achieve that, all HEF_HK significant hits found by hmmsearch in the three databases mentioned above were acquired and then merged with the OG_5 sequences from the previous analyses. Next, this merged data set was filtered to remove redundant sequences, which typically belong to different strains of the same species. This step prevented the inflated representation of species that were prevalent in public databases. This resulted in a data set including unique sequences to be further analyzed. Next, aiming to ensure that only bona fide HEF_HK-containing proteins were to be included, we performed an additional in-house domain profile recognition on the unique sequences set with a stringent setting (e-value: 1e-05) using hmmscan (supplementary table S3 (sbla_dsm4481), Salmonella enterica subsp. enterica serovar Senftenberg (sent_nctc10384), Serratia odorifera (sod-o_dsm4582), Vibrio anguillarum (vang_vib43), Yersinia pseudotuberculosis (ypse_ep2). (B) The box plots depict the statistical analysis using one-tailed Student's t-test between the 0_pool and five compilations ("mix") of the 25 pseudogenomes previously simulated. Each mix contains five strains. t-Test results are represented as follows: ns (P > 0.05), * (P 0.05), ** (P 0.01), *** (P 0.001), **** (P 0.0001). The mix data sets include the following above mentioned genomes: mix1-abau_d36, acav_fdaargos, aste_pqq42, bsil_dsm23676, chsp_g972; mix2-cwoe_iso977n, dear_dsm27393, dcro_ipl20, dsol_ds04321, ecol_2653396; mix3-ecol_kte191, enum_dsm25634, etar_kcpchb1, fcol_pf1, glov_sz; mix4-kmic_fdaargos, mmed_macl11, nure_nm42, rhsp_p5, sbla_dsm4481; mix5-cpha_dsm266, sent_nctc10384, sodo_dsm4582, vang_vib43, ypse_ep2. Proteobacteria phylum (69.8%). Furthermore, it underscored the remarkable association of this domain with two other domains: 1) the HATPase_c_3, found within N-terminal regions (83.3%), and/or the 2) HATPase_c within C-terminal regions (93.7%) (supplementary table S3 HEF_HKþHATPase_c). Group 3 (11/222), instead, comprised those architectures in which the C-terminal HATPase_c was lost (i.e., HATPase_c_3þHEF_HK). As the most unusual architecture, group 4 (3/222) conserved only the HEF_HK, with no other known domain.
Notably, group 1 was the only architecture found in HEF_HK-containing sequences conserved in Betaproteobacteria, Deltaproteobacteria, or in the Firmicutes phylum ( fig. 6B and supplementary table S3, Supplementary Material online). Numerous orders of the Alphaproteobacteria class were shown to conserve any of the three major groups (i.e., 1, 2, and 3). Among these orders, Rhodobacterales seemed to obligatorily conserve the C-terminal signature (HATPase_c), compared with the N-terminal HATPase_c_3 may not be essential. Contrarily, in the Rhizobiales order, the N-terminal HATPase_c_3 seems to be mandatory. The Gammaproteobacteria class encompasses most of the organisms carrying HEF_HK-containing sequences (51.8%). Within this class, bacteria from Yersiniaceae (including Yersinia and Serratia spp.) and Pectobacteriaceae (including Pectobacterium and Dickeya spp.) families are constricted to group 1. Conversely, Moraxellaceae and Vibrionaceae families may conserve either group 1 or 2, meaning that the C-terminal HATPase_c is essential in these organisms' sequences. These results underscore the wide diversity of organisms in which the HEF_HK domain is conserved and the contrasting conservation for the companion HATPase domains in different phyletic groups.

Discussion
In this study, two novel conserved domains were characterized: HEF_HK and Response_reg_2. The newly described domains conserve characteristic secondary structure organization canonically described DHp and REC domains found respectively in HKs and RRs. In HK proteins, although several DHp domain variants have been described, some key features must remain unaltered such as the structure of the kinase core and the adjacent nucleotide-binding fold . The archetypal organization among the highly conserved domains of HK proteins includes five motifs known as H, N, G1, F, and G2 boxes (Tanaka et al. 1998). The H box harbors the core His residue and remains typically conserved adjacently to the C-terminal nucleotide-binding pocket which encompasses the N, G1, F, and G2 boxes (Tomomori et al. 1999). As a highly conserved structure, the C-terminal CA domain containing the nucleotide-binding pocket could be promptly detected in the sequences from our acquired data set by exhibiting the HATPase signature. Within these sequences, the structure reminiscent of an H box could be detected adjacently to the C-terminal HATPase_c domain ( fig. 6A). Of the sequences carrying a HEF_HK domain, 83.4% also conserved an N-terminal HATPase_c_3 (Pfam: PF13589) domain. The HATPase_c_3 domain is classified into the GHKL superfamily similarly to HATPase_c (Pfam: CL0025). By looking at the complete set of sequences carrying the HATPase_c_3 domains, the strikingly strong preference for N-terminal localization was clear, contrary to the C-terminal preference observed for the HATPase_c domain (Finn et al. 2010). The N-terminal localization of HK-associated HATPase_c_3 occupies the classic site of sensor domains in HK proteins (Szurmant et al. 2007). Intriguingly, none of the canonical DHp domains can be found in HATPase_c_3 containing proteins (Finn et al. 2010). These observations could suggest specific recruitment of HATPase_c_3 domains to play a role as a sensor in the newly predicted HEF_HK containing HKs. Since the HEF_HK containing proteins are predominantly soluble ( fig. 2D), it is possible to speculate that HATPase_c_3þ HEF_HK containing proteins could be involved in interactions with intracellular concentrations of nucleoside triphosphate molecules (NTP).
Characterization of TCSs utilizing sequence analysis and extensive comparative genomics has been largely exploited in the last decades, comprising a powerful strategy for the identification of new systems (Mascher 2006;Wecke et al. 2006;Revilla-Guarinos et al. 2013). Besides revealing the solid linkage between genes encoding HEF_HK-and Response_reg_2-containing products, this approach also sheds light on the striking conservation of upstream genes to this duplet encoding methyltransferases (supplementary table S2, Supplementary Material online). The DNA_methylase domains integrate the DNA methyltransferases (DNMT) protein family which became specialized in cytosine-specific methylation . In this context, DNA methylation in prokaryotes has been largely associated with R-M systems, which typically contain two opposing utilities: endonucleases (performs double-strand DNA cleavage) and DNA methyltransferase (prevents double-strand cleavage by the cognate endonuclease) (Marinus and Casadesus 2009). The linkage between R-M systems and predicted ATPase-encoding genes has been previously observed by (Furuta et al. 2010), however, the presence of a predicted response regulator has not been revealed thus far.
The modification systems are effective barriers against bacterial lineages of varied descent which contain different epigenetic identities as defined by their R-M systems (Kobayashi 2001). Since some R-M systems can occasionally target the host genome if eliminated, they are typically referred to as selfish elements as they promote their survival in the cell lineages (Kobayashi 2001;Casades us and Low 2006;Mruk and Kobayashi 2014). Thus, if the concentration of M proteins is not sufficient to protect the host DNA against cognate R endonucleases in the cell, this may cause cell death (Fukuda et al. 2008). The overrepresented presence of R-M-related domains in the vicinity of HEF_HK-/Response_reg_2-containing genes observed in a wide range of bacterial genomes corroborates that the association between these themes does not occur by chance ( fig. 5 and supplementary table S2, Supplementary Material online). Hence, one might speculate that these newly discovered domains could be involved in signaling cascades controlling, directly or indirectly, R-M systems.
The N-terminal presence of the GHKL domain HATPase_c_3 in HEF_HK-containing proteins is a wellestablished characteristic feature of the Microrchidia (MORC) ATPase family, originally described as a necessary gene for the completion of mammalian spermatogenesis (Watson et al. 1998). Subsequent investigations revealed the presence of MORC family members in prokaryotes , as well as in the major crown group lineages of eukaryotes Koch et al. 2017). Although mechanistic details concerning the majority of MORC's functions remain elusive, they constitute a widespread family primarily involved in diverse processes associated with epigenetic regulation . In Arabidopsis thaliana, for example, AtMORC1 and AtMORC6 have been implicated in gene silencing through chromatin condensation (Moissiard et al. 2012). Another MORC1 (SlMORC1 from Solanum lycopersicum) was further reported to exhibit similarities with type II topoisomerases (Manohar et al. 2017). Furthermore, the ability to form dimers through the N-terminal GHKL domain was elucidated in the murine MORC3 protein (Li et al. 2016). In contrast to the recent advances in the understanding of eukaryotic MORCs, the molecular and biochemical aspects of the bacterial MORCs remain poorly understood. Hence, it is currently unclear whether the focus in chromatin superstructure manipulation is conserved across the eukaryotic and prokaryotic members. In prokaryotes, the MORCs were classified as part of larger radiation that includes several GHKL proteins termed paraMORCs . The paraMORCs were shown to remain preferably in genomic regions coinhabited by R-M systems, endonucleases, and helicases .
The striking similarity of sequence and contextual signatures between the paraMORCs and HEF_HK-encoding genes strongly suggests that this newly found HK-like family may have branched from the paraMORCs through the acquisition of the C-terminal HATPase_c and HEF_HK domains. These findings open an opportunity for further evaluation of the probable regulatory outspread of these genes in a vast number of bacterial taxa, such as via transcriptomics and/or proteomics of mutant strains. Besides, the ability to ascertain which environmental cues or chemicals can trigger a response transduced by this system would also comprise and important step toward its characterization. Altogether, this study revealed two novel protein conserved domains and shed light on the possibility of a TCS-like system undertaking a regulatory role mechanistically linked to R-M systems in a large variety of bacterial lineages.

Materials and Methods
Sequence Search, Orthology, and Gene-Neighborhood Screenings The initial search for RR-similar sequences was conducted by using TBlastN 2.8.0þ (Gertz et al. 2006) using default parameters in five iterations on the RefSeq Genomes database available on NCBI webpage (https://www.ncbi.nlm.nih.gov, last accessed February 25, 2021), allowing up to 1,000 positive hits restricted to Bacteria (taxid: 2). The query used for this search is 600 aa long and was obtained from Pectobacterium carotovorum subsp. brasiliense (KS44_RS10365). A total of 684 positive hits returned, for which 628 came from bacterial strains with support of genome-wide information. These data sets were then obtained, enabling subsequent geneneighborhood screenings in the strains in which the hits were found. Full records in GenBank format of the 628 entries representing their respective genomic constructs (complete genome, scaffold, or contig) containing the KS44_RS10365similar sequences were downloaded from the RefSeq database. Additional ecological and taxonomical information on these 628 strains was then gleaned from different sources, including NCBI taxonomy-, Uniprot-, and PATRIC bacterial databases (Sayers et al. 2009;Wattam et al. 2014;The UniProt Consortium 2017). Taxonomic records of interest include the bacterial class, order, and family.
Following the preliminary sequence search, two distinct methods were carried out aiming to provide relevant information on the obtained sequences for subsequent geneneighborhood screening. First, the OrthoMCL (Li et al. 2003) pipeline was applied to predict homology relationships among the sequences by orthologous clustering based on MCL (inflation ¼ 1.5). Each orthologous group (OG) is numerically labeled for easy identification (e.g., OG_1, OG_2, etc.). Second, all sequences were scanned for known conserved domains using the HMMER3 package (Eddy 2009) using default parameters supported by the PFAM domains database (Finn et al. 2010). To combine both sources of information in the gene-neighborhood screenings, the genomic coordinates were then integrated into these results by custom PERL scripts (https://www.perl.org, last accessed February 25, 2021). The identification of R-M-related domains in this analysis was achieved by retrieving the gold standard sequences from the REBASE database (Roberts et al. 2015) and performing in-house detection of functional domains using HMMER just as described above. This information was integrated into the gene-neighborhood screenings, and the resulting network was then generated by using Cytoscape software (Shannon et al. 2003).

Sequence Alignment, Phylogeny, and Domain Analysis
The orthologous sequences of both OG_4 (RRs) and OG_5 (HKs) were aligned by Clustal Omega (Sievers and Higgins 2018) and PROMALS3D (Pei et al. 2008), and visualized by Jalview (Waterhouse et al. 2009) in comparison with canonical REC and DHp domains obtained from Pfam database (https://pfam.xfam.org/, last accessed February 25, 2021). All alignments performed involving canonical REC and DHp domains use the Pfam "seed" entries available in the online repository for each described domain, which includes only representative sequences carrying the respective domains. This approach aimed to comparatively assess the conservation of key RR and HK residues in the sequences from OG_2 and OG_4. Secondary structures were also inferred in the alignments by using JPred4 (Drozdetskiy et al. 2015), and membrane topology predicted by the TopCons server (Tsirigos et al. 2015). To establish a comparison with OG_2 sequences, we obtained the domain families from Pfam clan CL0304 seed sequences carrying a highly conserved Asp residue (PFAM: PF00072-Response_reg; PF16359-RcsD_ABL; PF09456-RcsC) to perform alignments with OG_2 sequences. Aside from the Response_reg domain, the other combined alignments with OG_2 sequences showed poor quality and were hence discarded. Sequences from OG_4 were aligned with domains families from Pfam Clan CL0025 seed sequences carrying highly conserved His residue (PFAM: PF00512-HiskA; PF07568-HiskA_2; PF07730-HiskA_3; PF07536-HWE_HK), resulting in a good quality combined alignment. Individual alignments including only sequences carrying Response_reg_2 and HEF_HK domains respectively were then performed to determine the limits of each domain. These limits were determined based on secondary structure analyses performed using JPred. The Response_reg_2 and HEF_HK domains were then cut from the alignments, and filtered to avoid sequence redundancy, and then supplied to SMS (Lefort et al. 2017) for evolutionary model selection. Next, the domain sequences were supplied to FastTree (Price et al. 2010) to reconstruct the domains phylogeny. The cut domains from Response_reg_2 and HEF_HK were converted into HMM-profiles, which were consolidated by hmmbuild and hmmpress from the HMMER3 package. After the domain profiles were consolidated (Response_reg_2 and HEF_HK), these were used in a high sensitivity search in public databases through hmmersearch under default parameters (Potter et al. 2018). The positive matches were obtained, merged with the sequences from the respective orthologous groups, and then filtered to avoid sequence redundancies. The nonredundant sets of Response_reg_2 and HEF_HK were then subjected to inhouse domain scanning under stringent criteria (e-value <1e-05) using hmmscan, to avoid low-confidence domain predictions. The 3D model predictions of the conserved motifs in RR-like and HK-like were carried out by using RaptorX (Kallberg et al. 2014) which selects the most suitable PDB structure for each analyzed sequence and uses it as a reference. Four randomly picked representative sequences from both the RR-like (BJP29_RS22595-PDB ID: 2ZWM; H147_RS0115650-PDB ID: 1A2O; B038_RS0119475 PDB ID: 3Q9S; VOL_RS02805 PDB ID: 3Q9S) and the HK-like (CFY84_RS11065-PDB ID: 5OO7; HMPREF0127_RS22520-PDB ID: 5OO7; WKA_RS00445-PDB ID: 5OO7; KORLAC_RS15945-PDB ID: 5JRW) groups were analyzed using this method. Subsequent visualization, image rendering of domain models used Pymol (https:// pymol.org/, last accessed February 25, 2021), in combination with EnvZ (PDB ID: 5B1N) and CheY (PDB ID: 2FKA) obtained from PDB online repository (Berman et al. 2000) for comparison.

Supplementary Material
Supplementary data are available at Genome Biology and Evolution online.