The third restriction–modification system from Thermus aquaticus YT-1: solving the riddle of two TaqII specificities

Abstract Two restriction–modification systems have been previously discovered in Thermus aquaticus YT-1. TaqI is a 263-amino acid (aa) Type IIP restriction enzyme that recognizes and cleaves within the symmetric sequence 5′-TCGA-3′. TaqII, in contrast, is a 1105-aa Type IIC restriction-and-modification enzyme, one of a family of Thermus homologs. TaqII was originally reported to recognize two different asymmetric sequences: 5′-GACCGA-3′ and 5′-CACCCA-3′. We previously cloned the taqIIRM gene, purified the recombinant protein from Escherichia coli, and showed that TaqII recognizes the 5′-GACCGA-3′ sequence only. Here, we report the discovery, isolation, and characterization of TaqIII, the third R–M system from T. aquaticus YT-1. TaqIII is a 1101-aa Type IIC/IIL enzyme and recognizes the 5′-CACCCA-3′ sequence previously attributed to TaqII. The cleavage site is 11/9 nucleotides downstream of the A residue. The enzyme exhibits striking biochemical similarity to TaqII. The 93% identity between their aa sequences suggests that they have a common evolutionary origin. The genes are located on two separate plasmids, and are probably paralogs or pseudoparalogs. Putative positions and aa that specify DNA recognition were identified and recognition motifs for 6 uncharacterized Thermus-family enzymes were predicted.


INTRODUCTION
The Type IIC/IIG restriction-modification (R-M) systems, in which the restriction endonuclease (REase) and methyltransferase (MTase) are part of the same polypeptide, are diverse and widely distributed in nature (1,2). They account for more than a third of all Type II R-M systems (2). Type IIC enzymes are γ-amino-MTases with N-terminal DNA-cleavage domains, able to cleave DNA at fixed distances from their recognition sequences (1). These compact systems are overabundant in mobile genetic elements (MGEs), especially in plasmids (2).
One of the most interesting members of the Thermus-family enzymes is TaqII (10,13,16,19,20). The protein was isolated from the thermophilic bacterium T. aquaticus YT-1 (20), a strain that also harbours the Type II R-M system TaqI, which recognizes 5′-TCGA-3′. The crude preparation of TaqII described in the original study had two major activities, which recognized and cleaved the 5′-GACCGA-3′ and 5′-CACCCA-3′ DNA sequences, respectively, and an additional minor contaminating REase activity with an unknown recognition sequence (20). Our previous studies revealed that recombinant TaqII is prone to classic 'star activity', caused by changes in pH, salt or enzyme concentration, and exhibits the unique 'affinity star' specificity induced by sinefungin (SIN) (10,13). Although this could explain the minor contaminating activity (20), the existence of an additional R-M system could not be excluded.
Our preliminary research confirmed that the two major activities were not separated when TaqII was prepared from the native host. However, we discovered that recombinant TaqII isolated from the mesophilic bacterium Escherichia coli (E. coli) recognized only the 5′-GACCGA-3′ sequence (10). Such an unusual change of DNA substrate recognition specificity could be caused by a number of factors, including: (i) a difference in protein tertiary or quaternary structure, (ii) different polypeptide folding in E. coli, (iii) posttranslational modifications specific for the genus Thermus, (iv) a requirement for specific Thermus chaperonins, (v) an additional TaqII subunit missing from the recombinant enzyme complex or (vi) the existence of a third R-M system.
In the present study, we designed and performed the experiments necessary to test the last of the proposed hypotheses, namely that a second Type IIC enzyme was present in the native TaqII preparation. As a result, we identified and sequenced a paralogous taqIIIRM gene and discovered a novel cleavage specificity: TaqIII, a (nonidentical) twin enzyme of TaqII.

Purification of native TaqII and TaqIII
T. aquaticus YT-1 bacteria were cultivated in a biofermentor Bioflo 310 (New Brunswick Scientific, Edison, NJ, USA) as described in Supplementary data.
The purification scheme for native TaqII/TaqIII from T. aquaticus YT-1 included the stages: cell lysis, DEAE-cellulose chromatography, precipitation of acidic proteins/nucleic acids with polyethyleneimine (PEI) (25), TaqII/TaqIII extraction from nucleic acid-acidic proteins-PEI complexes, ammonium sulfate (AmS) precipitation, Resource Q chromatography, Resource S chromatography and Heparin-Agarose affinity chromatography. The detailed purification procedure is provided in Supplementary data.
After Resource S chromatography two enzyme preparations were obtained: (i) fraction A, which cleaved only 5′-CACCCA-3′ and did not bind to the cation exchanger, and (ii) fraction B, which eluted at ∼550 mM NaCl and exhibited REase activity against both 5′-GACCGA-3′ and 5′-CACCCA-3′. Both enzyme preparations were desalted and further purified to homogeneity using Heparin-Agarose (Supplementary data).
For determination of the TaqIII recognition sequence and cleavage site, various DNA substrates were used: the 390 bp PCR products, pUC19, pBR322 and λ DNA.
All TaqIII cleavage reactions were performed and analyzed as described in Supplementary data. Research, 2017, Vol. 45, No. 15 9007

Determination of TaqIII recognition sequence and cleavage site
Sanger sequencing. For sequencing, a 497 bp substrate was prepared as described previously (10). The purified restriction fragments were sequenced as described in Supplementary data.

Western blotting
Purified proteins were separated by SDS-PAGE and electroblotted onto a PVDF membrane (24). The membrane was probed with rabbit anti-TaqII antibodies (courtesy of Dr D. Nidzworski, University of Gdansk, Poland), washed with TBS-T buffer and incubated with anti-rabbit secondary antibody conjugated with alkaline phosphatase (Santa Cruz Biotechnology, courtesy of Dr D. Nidzworski, University of Gdansk, Poland). A specific protein was visualized by adding BCIP/NBT solution. The detailed procedure is provided in Supplementary data.

LC-MS-MS/MS analysis
LC-MS-MS/MS analysis (liquid chromatography coupled to tandem mass spectrometry) was performed at the Mass Spectrometry Laboratory (IBB PAS, Warsaw, Poland) as described in Supplementary data. The raw data were processed using Mascot Distiller followed by Mascot Search (Matrix Science, UK) against the predicted TaqII-derived reference peptide masses. The search parameters are provided in Supplementary data. Peptides with a Mascot Score exceeding both the 5% False Positive Rate threshold and a value of 30 were considered to be positively identified.
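The acceptance rule stated above can be written as a simple filter. The following Python sketch is illustrative only: the record layout, the peptide labels and the numeric value passed as the 5% threshold are hypothetical placeholders, and are neither values from this study nor a reproduction of Mascot output.

```python
from typing import NamedTuple

class PeptideHit(NamedTuple):
    label: str    # hypothetical peptide identifier
    score: float  # score reported for the peptide match

def accepted_hits(hits, fpr_threshold, min_score=30.0):
    """Keep only hits whose score exceeds both cut-offs."""
    return [h for h in hits if h.score > fpr_threshold and h.score > min_score]

# Hypothetical example records (not experimental data).
example = [
    PeptideHit("pep1", 45.2),
    PeptideHit("pep2", 31.0),
    PeptideHit("pep3", 28.5),
]

kept = accepted_hits(example, fpr_threshold=33.0)
print([h.label for h in kept])
```

Because both conditions must hold, the effective cut-off is whichever of the two values is higher for a given search.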

T. aquaticus YT-1 genome sequencing and assembly
T. aquaticus YT-1 total DNA was purified from a freshly grown liquid culture using the DNeasy Blood and Tissue Kit (Qiagen, Germantown, MD, USA) with minor modifications. DNA was sheared to an average size of 10 kb using G-Tubes (Covaris, Woburn, MA, USA). The sheared DNA (consisting of both chromosome and plasmids) was sequenced using the Single-Molecule Real-Time (SMRT) Sequencing platform (Pacific Biosciences, Menlo Park, CA, USA). Two libraries ('SMRTbells') were constructed as recommended by the manufacturer using the 20 kb library preparation protocol. Libraries were quantified and analysed using a Qubit fluorimeter (Invitrogen, Eugene, OR, USA) and a 2100 Bioanalyzer (Agilent Technology, Santa Clara, CA, USA). Both libraries had mean insert lengths of 10 kb. A portion of one library was size-selected for SMRTbells larger than 6 kb using a BluePippin device (Sage Science, Beverly, MA, USA).
The libraries were sequenced on an RSII instrument with C4-P6 chemistry and 240 min movie collection times (Pacific Biosciences, Menlo Park, CA, USA). Data from four SMRT cells (1.6 Gb of sequence from two non-size-selected cells and 1.2 Gb from two size-selected cells) were used in the final analysis, resulting in a mean chromosomal coverage of ∼680×. Reads were processed, mapped, and assembled de novo using the SMRT analysis pipeline with the HGAP.3 protocol. Closure and final assemblies were performed with the assistance of BLAST (26) and RS BridgeMapper.1 from the SMRT Analysis suite (Pacific Biosciences, Menlo Park, CA, USA). Methylation analysis and determination of methylated motifs (27,28) were performed using RS Modification and Motif analysis.1 from the same suite.
Cloning of the taqIIIRM gene
PCR amplification. Due to the high similarity between the taqIIRM and taqIIIRM genes, taqIIIRM was amplified from T. aquaticus YT-1 plasmid DNA using a nested PCR approach (Supplementary data).
DNA cloning. The 3326 bp PCR fragment was cleaved with BspHI and SalI, subjected to agarose electrophoresis, gel isolated and cloned into the modified pRZ4737 vector (9), containing the PR promoter under the control of the CI repressor (Supplementary data).

Selection of positive bacterial clones. Both BsaI and BspHI/SalI cleavage of plasmid DNA as well as direct PCR from a single bacterial colony were used for the screening of clones. Plasmids from positive clones were subjected to DNA sequencing. The promoter regions and the taqIIIRM gene sequences of the recombinant plasmids were confirmed.
Recombinant gene expression. Expression of the taqIIIRM gene was performed in E. coli BL21 Star™ (DE3), as described in Supplementary data. Bacterial clones efficiently expressing the taqIIIRM gene were selected for a large-scale bacterial culture.

Purification of recombinant TaqII and TaqIII
The purification scheme for recombinant TaqII or TaqIII from E. coli included the following stages: cell lysis and heat treatment, precipitation of acidic proteins with PEI, TaqII/TaqIII extraction from nucleic acid-acidic proteins-PEI complexes, 20-40% AmS fractionation and size exclusion chromatography. The detailed purification procedure is provided in Supplementary data.

Nucleotide and protein sequence accession numbers
The nt sequences of plasmids pTAYT1 11 and pTAYT1 61 determined in this study have been annotated and deposited in GenBank with the accession numbers CP020571 and CP020572, respectively.

Cation exchange chromatography resulted in partial separation of two TaqII specificities
The initial characterization of TaqII by Barker et al. showed that a preparation of the enzyme from T. aquaticus specifically recognized two DNA sequence variants: 5′-GACCGA-3′ and 5′-CACCCA-3′ (20). In our previous work, we demonstrated that recombinant TaqII was able to recognize only 5′-GACCGA-3′ (10). We hypothesized therefore that the protein preparation obtained by Barker and co-workers was a mixture of two REases, TaqII and a hypothetical enzyme TaqIII that recognized only the 5′-CACCCA-3′ site.
To try to separate the two activities from a T. aquaticus lysate, we tested various chromatographic resins and binding conditions. Only Resource S cation exchange chromatography performed in 20 mM MES-Na buffer at pH 6.5 (slightly above the theoretically predicted pI of TaqII) separated the two activities, and then only partially (Supplementary data); nevertheless, this result supported the existence of a distinct TaqIII protein responsible for the 5′-CACCCA-3′ activity. The active fractions were further purified using Heparin-Agarose affinity chromatography and analysed by SDS-PAGE. In this manner, two enzyme preparations were obtained (Supplementary data). The first preparation (fraction A) contained proteins that did not bind to the Resource S column (concentrated flow-through and wash fraction). This preparation exhibited REase activity against 5′-CACCCA-3′ only (Figure 1, lane 1; Figure 2AB, lanes 2). The second preparation (fraction B) contained proteins that bound to Resource S resin and eluted at ∼550 mM NaCl. Fraction B exhibited REase activity against both 5′-GACCGA-3′ and 5′-CACCCA-3′ (Figure 1, lane 2; Figure 2AB, lanes 3). The presence of a 120 kDa protein was detected in both enzyme preparations (Figure 1, lanes 1,2). Interestingly, the second preparation contained two proteins, very similar in size, which could be separated only by SDS-PAGE on 6% gels (Figure 1, lane 2). Intriguingly, there is a difference in gel migration between native and recombinant enzymes. The predicted molecular weight (MW) of TaqII is 125.7 kDa, while the MW of recombinant TaqII, determined by SDS-PAGE and MALDI-TOF mass spectrometry, is 127.2 kDa (not shown). We hypothesize that the observed gel shifting may be caused by various types of post-translational modifications, occurring in E. coli and T. aquaticus. The gel shifting of homologous proteins is often discussed in the literature (35,36).
On the basis of the presented results, we assumed that fraction A could contain the presumed TaqIII only, while fraction B might be a mixture of TaqII and TaqIII.

Mass spectrometry analysis confirmed the existence of TaqIII
In order to confirm the assumption presented in the previous section, we performed mass spectrometry analysis. The homogeneous protein from fraction A was subjected to LC-MS-MS/MS mass spectrometry (37). The recombinant TaqII was used for control analysis (Supplementary data, Figure S2). The experimental peptide mass data were compared with predicted TaqII-derived peptide masses and with the NCBI non-redundant sequence database using the MASCOT search engine (http://www.matrixscience.com) (38). Both analysed proteins were identified as TaqII by database search (Supplementary data, Figure S2). However, on investigating the data obtained from the MS/MS analysis of the protein from fraction A, we found 40 different peptides matching the TaqII aa sequence perfectly (Supplementary data, Figure S2) and 14 unique peptides showing similarity to the TaqII aa sequence except for 1-3 replaced aa (Supplementary data, Table S1 and Figure S2A).
The MS-MS analysis confirmed that the REase from fraction A (presumed to be TaqIII) is a single protein, distinct from TaqII. The analysis revealed significant aa sequence similarity between the TaqII and TaqIII, especially within the PD-(D/E)XK nuclease domain, helical domain and RMF MTase domain of TaqIII (Supplementary data, Figure S2A). Nine of the fourteen peptides unique to TaqIII were located within a presumed target recognition domain (TRD) in the C-terminal part of the protein. This result indicates that the investigated proteins have variable aa sequence within this region. Such differences could explain the altered substrate specificity of TaqII and TaqIII.

Western blot immunodetection showed significant similarity between TaqII and TaqIII
To address the question concerning the level of structural/sequence similarity between TaqII and the presumed TaqIII, western blot analysis was performed. Rabbit polyclonal anti-TaqII antibodies were used for immunodetection of native and recombinant TaqII and TaqIII protein variants. Interestingly, the anti-TaqII antibody could specifically detect all of the investigated proteins, whether of T. aquaticus (Figure 3A) or E. coli origin (Figure 3B). This confirmed significant similarity between the enzymes.

Native TaqIII cleaves 5′-CACCCA-3′ only
The determination of the TaqIII DNA recognition sequence was performed using native protein obtained from fraction A (see Materials and Methods section). For that purpose three methods were used: (i) assessment of the digestion pattern on λ, pUC19 and pBR322 DNA, and custom devised substrates (10), containing both variants of the originally reported recognition sequence (20), (ii) direct DNA sequencing of TaqIII cleavage products and (iii) methylome analysis of T. aquaticus YT-1 genomic DNA.
Analysis of the native TaqIII cleavage pattern on several substrate DNA molecules confirmed that the recognition sequence is 5′-CACCCA-3′ (Figure 2; Supplementary data, Figures S3 and S4). However, all of these substrates were incompletely digested (Figure 2; Supplementary data, Figures S3 and S4), which is a characteristic feature of the Thermus-family enzymes. In contrast to recombinant TaqII, native TaqIII clearly does not cleave the 5′-GACCGA-3′ DNA sequence (Figure 2B, lane 2). Moreover, similarly to TaqII, TaqIII REase is stimulated by SAM and SIN (not shown) and is able to cleave linear molecules with a single DNA recognition sequence (Figure 2A, lane 2).
1 All parameters were computed using the ExPASy ProtParam tool (https://www.expasy.org).
2 Based on their instability indices, both enzymes are classified as unstable proteins. The stability index is determined based on the protein primary structure alone. However, the stability of a protein in vivo is a net effect of the contributions made by several factors, such as: structure dependent features, the presence of disulphide bridges, ligand binding and protease recognition mechanisms (59).
Direct confirmation of the recognition sequence came from run-off sequencing. A 497 bp PCR product, containing convergently oriented 5′-GACCGA-3′ and 5′-CACCCA-3′ sequences (10), was cleaved with native TaqIII and subjected to DNA sequencing (Figure 4). Sequencing from the 3′ end of the product clearly shows a cleavage point 9 nt downstream of the 5′-CACCCA-3′ site in the opposite strand (Figure 4).
To identify TaqII and TaqIII DNA modifications and to confirm their specific DNA recognition motifs, genomic DNA from T. aquaticus YT-1 was SMRT sequenced. As expected, both modified 5′-GACCG(m6A)-3′ and 5′-CACCC(m6A)-3′ were found (Figure 5), with 100% of the TaqII and TaqIII motifs called as methylated at mean modification QV values of 403 (TaqII) and 438 (TaqIII). The modification of both motifs occurred on just one DNA strand, which is typical for Type IIL R-M systems.
The obtained results clearly demonstrate that native TaqIII recognizes the 5′-CACCCA-3′ DNA sequence for both cleavage and modification activities. This sequence had been originally attributed to TaqII as one of two distinct activities (20).
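The geometry of an asymmetric site with downstream cleavage can be made concrete with a short script. The Python sketch below is a minimal illustration, not the analysis pipeline used in this study: it assumes the 11/9 cleavage positions reported for TaqIII (11 nt downstream of the site on the strand carrying 5′-CACCCA-3′, 9 nt on the opposite strand), scans both strands of an input sequence, and reports cut positions as boundaries in the coordinates of the input strand. The function names, the output layout and the test sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an upper- or lower-case DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def find_cut_sites(seq, motif="CACCCA", same_offset=11, opposite_offset=9):
    """Locate motif occurrences on both strands and the predicted cuts.

    Cut positions are boundaries: a value b means the cut falls between
    bases b-1 and b (0-based) of the input sequence. Sites whose cut would
    fall beyond the end of the molecule are skipped.
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        start = template.find(motif)
        while start != -1:
            end = start + len(motif)        # boundary just after the motif
            same_cut = end + same_offset    # cut on the motif-bearing strand
            opp_cut = end + opposite_offset # cut on the complementary strand
            if same_cut <= n:
                if strand == "-":
                    # Convert from reverse-complement to input coordinates.
                    site_start = n - end
                    same_cut, opp_cut = n - same_cut, n - opp_cut
                else:
                    site_start = start
                hits.append({
                    "strand": strand,
                    "site_start": site_start,
                    "motif_strand_cut": same_cut,
                    "opposite_strand_cut": opp_cut,
                })
            start = template.find(motif, start + 1)
    return hits

# Hypothetical test molecule: one site followed by 15 filler bases.
demo = "GG" + "CACCCA" + "A" * 15
print(find_cut_sites(demo))
```

Because the site is asymmetric, an occurrence on the bottom strand directs cleavage towards lower coordinates of the input strand, which is why convergently oriented sites, as in the 497 bp substrate, cut towards each other.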

taqIIRM and taqIIIRM genes are inter-plasmid paralogs
Total genomic and plasmid DNA from T. aquaticus YT-1 was SMRT sequenced to generate complete, closed assemblies of the chromosome and all plasmids, in order to determine the taqIIIRM gene sequence. The taqIIRM and taqIIIRM ORFs were identified; they are 3318 and 3306 bp in length, respectively, and are highly similar in sequence (Supplementary data, Figure S5). Both ORFs start with an ATG codon but end with different STOP codons, TGA and TAG, respectively. The two genes are encoded on two different plasmids: pTAYT1 11 (GenBank CP020571) and pTAYT1 61 (GenBank CP020572) (Figure 6; Supplementary data, Tables S2 and S3). The G+C content of both genes is 66% and corresponds well to the G+C content of the respective plasmids pTAYT1 11 (65%) and pTAYT1 61 (69%) and the T. aquaticus YT-1 genome (68%). The taqIIRM and taqIIIRM genes and their flanking regions are devoid of TaqII and TaqIII recognition sites.
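The ORF lengths quoted above are consistent with the protein lengths given in the Abstract (1105 aa for TaqII and 1101 aa for TaqIII) once the stop codon is subtracted. The following Python sketch checks this arithmetic and shows how a G+C percentage is computed; the function names and the toy sequence are hypothetical, and the script does not reproduce the G+C values reported for the genes or plasmids.

```python
def protein_length_from_orf(orf_bp):
    """Number of encoded aa for an ORF whose length includes the stop codon."""
    if orf_bp % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3")
    return orf_bp // 3 - 1  # subtract the terminating stop codon

def gc_percent(seq):
    """Percentage of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

print(protein_length_from_orf(3318))  # taqIIRM ORF length from the text
print(protein_length_from_orf(3306))  # taqIIIRM ORF length from the text
print(gc_percent("ATGGCC"))           # toy sequence, not from this study
```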
Plasmids pTAYT1 11 and pTAYT1 61 differ in size, at 11 and 61 kb, respectively. Both genes are in close proximity to a parA gene copy (Figure 6A and B). The taqIIRM gene is immediately preceded by parB and then parA, with the parB ORF overlapping the taqIIRM start codon. Based on putative promoter prediction, we speculate that taqIIRM might be part of an operon that includes the parAB genes upstream. The taqIIIRM gene is immediately preceded by a copG-like gene and then parA, with the parA ORF overlapping the copG start codon; pTAYT1 11 lacks parB. We speculate that the putative taqIIIRM promoter is localized within the copG ORF and the short intergenic region between the copG ORF and the taqIIIRM gene, as follows: 5′-aagttcTTGGCCactcaaagcgcacaaTAACCAGGGAcaggctc-3′, with the -35 and -10 consensus sequences marked in capital letters (39,40). ParA and ParB proteins are responsible for predivisional partitioning of plasmid DNA molecules. The CopG protein belongs to the family of homologous plasmid repressors and is involved in regulating the plasmid copy number. The pTAYT1 61 DNA sequence is almost identical (sequence coverage 81%, 98% identity) to plasmid pTA69 (accession number CP010825.1) from T. aquaticus Y51MC23, whose genomic sequence has recently been published (41). There are two regions of dissimilarity in DNA sequence between pTAYT1 61 and pTA69. The taqIIRM gene comprises one of these regions and is replaced in pTA69 by an unrelated Type IIC/IIG R-M system annotated as 'modification methylase BstVI'. The putative gene encodes a 1214 amino acid protein (possessing PD-(D/E)xK, NPPY and DPACGSG motifs) and is preceded by parA/parB genes. Moreover, the T. aquaticus Y51MC23 genome does not contain any gene encoding a Thermus-family R-M system, and no TaqII or TaqIII isoschizomers have been purified from other bacterial species so far (32).

Sequence comparison of TaqII and TaqIII enzymes
Despite the striking similarity of TaqII and TaqIII in terms of physical properties (Table 1), the proteins could be separated by prolonged SDS-PAGE electrophoresis in 6% polyacrylamide gels (Figure 1). The aa sequences of TaqIII and TaqII are 93.4% identical (Figure 7; Supplementary data, Figure S5), which is the highest reported similarity of Type II REases that recognize different DNA sequences. Several putative TaqII/TaqIII homologs can be found in GenBank, providing a suitable target for the engineering of TaqII/TaqIII DNA specificity (Table 2).
The majority of the 71 aa differences between the two proteins occur within the putative TRD (Figure 7; Supplementary data, Table S4, Figure S5), consistent with the LC-MS-MS/MS analysis. All peptides detected for TaqIII by LC-MS-MS/MS (Supplementary data, Table S1, Figure S2) fit the aa sequence derived from the taqIIIRM DNA sequence (not shown).
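For readers who wish to tabulate such differences from a pairwise alignment, the Python sketch below counts identical columns and lists the differing ones. It is an illustration under stated assumptions: identity is expressed relative to the number of alignment columns, with gap columns counted as differences, which is only one of several conventions in use and is not necessarily the one applied to obtain the figures above. The function names and the toy sequences are hypothetical and are not fragments of TaqII or TaqIII.

```python
def differing_columns(aln_a, aln_b):
    """Return 0-based alignment columns at which two aligned sequences differ.

    Both inputs must be rows of the same pairwise alignment, with '-' for gaps.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    return [i for i, (x, y) in enumerate(zip(aln_a, aln_b)) if x != y]

def percent_identity(aln_a, aln_b):
    """Identical columns as a percentage of all alignment columns."""
    if not aln_a:
        raise ValueError("empty alignment")
    diffs = differing_columns(aln_a, aln_b)
    return 100.0 * (len(aln_a) - len(diffs)) / len(aln_a)

# Toy alignment rows (hypothetical).
row_a = "MKT-LA"
row_b = "MKSALA"
print(differing_columns(row_a, row_b))
print(round(percent_identity(row_a, row_b), 1))
```

Restricting the column list to a coordinate window, such as the putative TRD, then gives the fraction of differences that fall within that region.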

Prediction of DNA recognition elements for TaqII and TaqIII
Having proteins with highly similar aa sequences that nevertheless recognize differing DNA sequence motifs, like TaqII and TaqIII, provides an opportunity to identify DNA recognition elements (positions and aa) through co-variation analysis (5). For the TaqII branch of the Thermus-family enzymes we currently have just three proteins for which the recognition motif is known. However, the TRDs of TaqII and TaqIII have significant similarity to the TRDs of the Type ISP enzymes LlaGI and LlaBIII, for which crystal structures have recently been determined and residues making direct DNA contacts have been identified (42) (Supplementary data, Figure S6). We therefore aligned TaqII, TaqIII and TspGWI with the Type ISP enzymes described by Kulkarni et al. (42), and find that the TaqII and TaqIII proteins match well with the observed and co-variation-predicted DNA recognition elements in the Type ISP enzymes (Figure 8; Supplementary data, Figure S6). Note that in this discussion of recognition elements, we present the TaqII/TaqIII strand that contains the adenine which the enzymes methylate, and refer to the base positions as 0, 1, 2, 3, 4, 5 (and 6), where position 5 is the methylated adenine, i.e.: TaqII 5′-GACCGA-3′ is 0 = G, 1 = A, 2 = C, 3 = C, 4 = G, 5 = (m6A), 6 = N (Figure 9). At position 0, TaqII recognizes 'G', while TaqIII recognizes 'C', and TspGWI has no specificity, or 'N' (Figures 8 and 9). We predict that TaqII likely specifies 'G' through the Arg residue (R777) that aligns with the Arg residue in LlaBIII (R1327) that contacts the G in the top strand of the G:C base pair (bp) to specify 'G' at this position. TaqII also has a Ser residue (S737) at the aligned position where LlaGI uses an Arg residue (R1286) to specify 'C' recognition at position 0 (through contacts to the G in the bottom strand of this C:G bp). The pairing of Ser (737) and Arg (777) at these alignment positions is common in Type ISP TRDs and correlates with recognition of 'G' at position 0.
TaqIII likely recognizes 'C' through these same alignment positions, with Tyr (Y777) likely contacting the 'C' base in the top strand, and Pro (P737) contacting the 'G' in the bottom strand, although this combination is uncommon. TspGWI, which does not specify a particular base at position 0, has small residues at these positions: Ser at 737 (S747) and Ala at 777 (A788), which likely allow any base to fit.
At position 1, TaqII, TaqIII and TspGWI all recognize 'A' (Figures 8 and 9), and all have the same residues, Ser and Arg, at the positions where LlaGI (R1329) and LlaBIII (D1326) make major-groove contacts to this base pair. LlaGI and LlaBIII make minor groove contacts to position 1 from residues within the MTase domain (S1056 in LlaGI; N1056 and N1058 in LlaBIII). In TaqII, TaqIII and TspGWI, the residues at these positions are larger (His at 1056 and Lys at 1058), which could potentially contribute to specificity for 'A' by excluding a G:C or C:G bp through steric clash with the guanine 2-amino group. Interestingly, LlaGI also contacts position 1 through the residue (N1327) at the alignment position where LlaBIII (R1327) specifies position 0 recognition. TaqII, TaqIII and TspGWI have different residues at this alignment position, yet all recognize 'A' at position 1, so it seems unlikely the Thermus-family enzyme residues corresponding to LlaGI position R1327 are making contacts to position 1, as is the case for LlaGI. Such subtle differences highlight the complexity and as yet poorly understood nuances of DNA recognition.
At position 2, the Thermus-family enzymes all specify 'C' (Figures 8 and 9), likely through contacts to the bottom strand 'G' by their conserved Lys (K688) at the alignment position where LlaGI specifies 'T' through contacts by Asn (N1228) to the bottom strand 'A'. Two nearby alignment positions (S683 and E688) co-vary with the base recognized at position 2 in both the Type ISP and the MmeI-family enzymes and thus may contribute to recognition as well.
For position 3 (Figures 8 and 9), the residue at the alignment position corresponding to LlaGI H1368 plays a critical role. His at this position co-varies with recognition of 'C', as in TaqII and TaqIII (H817), while a negative residue (Asp or Glu) co-varies with recognition of 'G', as in TspGWI (D828). The residues corresponding to LlaGI G1372 and Q1373 also contact this bp in the LlaGI cocrystal structure and thus may play a role in recognition for some enzymes, though for TaqII, TaqIII and TspGWI, these two alignment positions contain the same residues, Gly (G823) and Gly (G824), and so do not appear to be contributing directly to the difference in recognition.
For position 4 (Figures 8 and 9), LlaGI and LlaBIII are non-specific, and from their structures it was not obvious which aa positions might specify recognition in homologous enzymes. The Thermus-family enzymes have a small insertion of 2 (TaqII, TspGWI) or 3 (TaqIII) aa between the LlaGI position 3 contact positions H1368 and G1372, Q1373, suggesting this additional protein may contact the adjacent position 4 bp. Given the high degree of identity between the Thermus-family enzymes in this region, we predict that SXR or RXS at TaqII position S820 -R822 may specify recognition for a G:C (SXR) or a C:G (RxS) bp at position 4, wherein position 820 contacts the bottom strand base (S820 contacting 'C' in TaqII and TspGWI; R820 contacting 'G' in TaqIII), and position 822 contacts the top strand base (R822 contacting 'G' in TaqII and TspGWI, S822 contacting 'C' in TaqIII). Whether the presence of an additional aa in TaqIII within this small putative loop may affect the positioning of the proposed RxS contacts is unknown, and further emphasizes the subtle nature of recognition within these enzymes.
Figure 8. Amino acid sequence alignment of the TRD region for 10 LlaGI family Type ISP enzymes and 3 Thermus-family enzymes (TaqII, TaqIII and TspGWI) for which the recognition specificity is known. Highlight colors indicate aa positions that contact the highlighted base position within the aligned recognition motifs. Aa residues of LlaGI and LlaBIII that were observed to make direct base contacts in their crystal structures are underlined. Numbers indicate position within the protein; however note that in the text numbers are based on the LlaGI position for the Type ISP enzymes, and on TaqII for the Thermus-family enzymes. Consensus aa and Consensus ss (secondary structure) are as predicted by the promals alignment algorithm.
Specificity at position 5 (Figures 8 and 9) is provided by the MTase domain, which flips the adenine into the methyltransferase catalytic pocket for recognition and eventual modification.
The Thermus-family enzymes do not have specificity at position 6 (Figures 8 and 9), which is consistent with the Type ISP findings, where non-specific enzymes have a small aa at the position corresponding to LlaGI Q1118 (G586 in TaqII) and an Asn at the position corresponding to LlaGI K1131 (N594 in TaqII). The Thermus-family enzymes also have a deletion of 7 aa within the LlaGI loop that contacts position 6, which likely contributes to their lack of specific DNA contacts.
The Thermus-family enzyme TRD domains are thus remarkably similar to those of the Type ISP enzymes, and the TaqII and TaqIII aa residues at the positions that make base-specific contacts in the LlaGI and LlaBIII structures are consistent with their observed recognition motifs. We plan to test this model of TaqII and TaqIII recognition elements through subsequent site-specific mutagenesis studies.
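The position numbering used in this section can be reproduced with a few lines of Python. The sketch below indexes the two recognition motifs from position 0 and reports the positions at which TaqII and TaqIII differ; the function names are hypothetical, and the script only restates the motifs given in the text.

```python
def index_motif(motif):
    """Map base positions (0-based, as used in the text) to bases."""
    return dict(enumerate(motif.upper()))

def differing_positions(motif_a, motif_b):
    """Positions at which two equal-length recognition motifs differ."""
    if len(motif_a) != len(motif_b):
        raise ValueError("motifs must have equal length")
    return [i for i, (a, b) in enumerate(zip(motif_a.upper(), motif_b.upper()))
            if a != b]

TAQII = "GACCGA"   # position 5 is the methylated adenine
TAQIII = "CACCCA"

print(index_motif(TAQII))
print(differing_positions(TAQII, TAQIII))
```

The two motifs differ only at positions 0 and 4, the positions for which the aligned residues 737/777 and 820/822 are proposed above to determine specificity.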

Recombinant TaqIII protein
In order to finally establish that the identified taqIIIRM gene encodes the fully functional TaqIII protein, the gene was cloned into the modified pRZ4737 vector, with gene expression driven by a PR promoter that is inducible by a temperature shift to 42 °C, as done for TaqII (16). A novel purification procedure was developed for recombinant TaqII and TaqIII that uses PEI for selective acidic protein precipitation (25). The purified homogeneous TaqIII protein was used for DNA cleavage assays (Figure 2C). Similarly to the native protein, the recombinant TaqIII recognizes and cleaves the 5′-CACCCA-3′ DNA sequence only (Figure 2C).

DISCUSSION
T. aquaticus YT-1 has two previously characterized R-M systems, TaqI (a Type II system recognizing 5 -TCGA-3 ) and TaqII (a Type IIC system recognizing 5 -GACCGA-3 ) (10,20). We report here the discovery of the third R-M system in T. aquaticus YT-1, TaqIII, a close homologue of TaqII recognizing the novel sequence 5 -CACCCA-3 . TaqII and TaqIII are members of the Thermus enzyme family (8), comprising closely related Type IIC R-M systems. TaqII and TaqIII enzymes represent an interesting example of naturally occurring sequence specificity evolution.
Because TaqII and TaqIII have highly similar protein sequences but different recognition motifs, they are excellent candidates for covariation analyses to identify specific recognition elements (5); however having just three characterized members form the TaqII branch of the Thermusfamily severely limits covariation analysis. Interestingly, these enzymes share a conserved DNA recognition domain with the Type ISP enzymes, such as LlaGI and LlaBIII, even though these enzymes employ a completely different mechanism for DNA cutting. The excellent characterization of Type ISP DNA recognition carried out by the Szczelkun and Saikrishnan labs (42) can be applied to the TaqII family enzymes to help identify putative positions and aa that specify recognition, particularly those responsible for the differing specific recognition at two positions within the TaqII and TaqIII recognition motifs (Figure 8). We have used the analysis of recognition elements to predict recognition motifs for 6 uncharacterized Thermus-family enzymes (Supplementary data, Figures S6 and S7), which predictions can be readily tested. Based on our analysis of TaqII and TaqIII recognition elements, we anticipate it will prove possible to rationally alter their specificity at one or more positions.
The evolution of bacterial genomes is driven by HGT, mutation and genome rearrangement (43). From the other side, bacteria species are maintained by genetic isolation. R-M systems may facilitate bacterial genetic isolation by regulating the uptake of DNA from the environment (44). They also may establish favored patterns of genetic exchange between bacterial subpopulations (45).
It has been demonstrated that Thermus species acquire DNA through HGT (43). One of the major survival techniques of Thermus species is genome plasticity, which enables them to inhabit extreme temperature environments. Such unusual genome plasticity is achieved through natural transformation (46–48). Another characteristic feature of Thermus species is a remarkably high level of genome rearrangement (43,49). Frequently occurring (GC)n repeats located upstream and downstream of the breakpoints of global genome rearrangements have been proposed to play a crucial role in facilitating homologous recombination (43).
Thermus genomes consist of chromosomes, megaplasmids and small plasmids (43). However, the number of plasmids per genome varies between bacterial strains (50). Our recent sequencing data revealed that the T. aquaticus YT-1 genome contains a single chromosome of 2.08 Mb and six plasmids of sizes 8, 11, 12, 13, 61 and 71 kb (manuscript in preparation). The taqIIRM and taqIIIRM genes are localized on two separate plasmids: pTAYT1 61 and pTAYT1 11, respectively. The genes are probably paralogs, descended from an ancestral gene following a gene duplication event. An alternative scenario is that the genes are pseudoparalogs, acquired in two separate HGT events.
TaqII/TaqIII recognition sequences are absent from the taqIIRM and taqIIIRM genes but present in the T. aquaticus YT-1 genome. This is in agreement with the observation of Rusinov and co-workers that the recognition sites of Type I, IIC/IIG, IIM, III and IV R-M systems are usually not avoided (51).
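A check of this kind, scanning a sequence for both asymmetric recognition motifs on either strand, can be sketched in a few lines of Python. This is an illustrative sketch only, not the procedure used in this study; the function names and the toy sequence are hypothetical.

```python
# Illustrative sketch: count occurrences of asymmetric recognition sites on
# both strands of a DNA sequence. Function names and the toy sequence are
# hypothetical, not taken from this study.

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def count_sites(seq, motif):
    """Count overlapping occurrences of motif on the top and bottom strands."""
    seq = seq.upper()
    total = 0
    # A set avoids double counting when the motif is palindromic (e.g. TCGA).
    for pattern in {motif, reverse_complement(motif)}:
        start = seq.find(pattern)
        while start != -1:
            total += 1
            start = seq.find(pattern, start + 1)
    return total

SITES = {"TaqII": "GACCGA", "TaqIII": "CACCCA"}

if __name__ == "__main__":
    toy_sequence = "TTGACCGAAATGGGTGCC"  # toy sequence, not real data
    for enzyme, motif in SITES.items():
        print(enzyme, count_sites(toy_sequence, motif))
```

Searching for the reverse complement as well as the motif itself matters here because both recognition sequences are asymmetric, so a site on the bottom strand reads differently on the top strand.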
Several independent studies have shown that R-M systems stabilize plasmids in cells by their selfish behaviour and have an impact on the host genome (52–55). Intriguingly, Oliveira and co-workers demonstrated that R-M systems are relatively rare in plasmids and other MGEs. In contrast to 69% of the investigated chromosomes, only 10.4% of the investigated plasmids encode R-M systems (56). Oliveira et al. linked the occurrence of R-M systems in plasmids with plasmid HGT via conjugation or via mobilization in trans using the conjugation machinery of another cell (56). However, one recent study revealed that the presence of R-M systems may limit the spread of naturally occurring plasmids between related bacterial species by mobilization (57). Moreover, it was shown that bacteria that possess plasmid-encoded R-M systems could efficiently release those plasmids into the environment, enabling their spread by natural transformation (57). It remains to be clarified whether and to what extent such mechanisms exist in naturally competent Thermus species.
There are several evolutionary advantages to the movement of genes from a bacterial chromosome to a plasmid. It has been suggested that Thermus strains move evolutionary modifying genes onto plasmids to boost their level of genetic plasticity (43). Plasmids are smaller and more quickly replicating DNA molecules, supporting microorganism propagation. They also exhibit higher rates of mutation, leading to the enrichment of gene variants within a population (58).
The most common plasmid-encoded R-M systems are compact Type IIC enzymes (1,56). Such systems account for more than a third of all Type II R-M systems and show a lower degree of purifying selection than orthodox Type II or Type I systems. Their compactness, weaker structural constraints, and simpler co-evolution of recognition sites can promote the diversification of the protein sequence and faster evolution of new specificities (56). This can help them to establish themselves more efficiently in a new host (56). This is in agreement with our experience with Thermus-family enzymes. Most of the positive bacterial clones we obtained contained mutated R-M genes, resulting in protein variants with several aa substitutions. We speculate that Type IIL Thermus-family enzymes may have the ability to change specificity within the same bacterial host.