Large-Scale Molecular Evolutionary Analysis Uncovers a Variety of Polynucleotide Kinase Clp1 Family Proteins in the Three Domains of Life

Abstract
Clp1, a polyribonucleotide 5′-hydroxyl kinase in eukaryotes, is involved in pre-tRNA splicing and mRNA 3′-end formation. Nol9 and Grc3, enzymes similar in amino acid sequence to Clp1, are present in some eukaryotes and are involved in pre-rRNA processing. However, our knowledge of how these Clp1 family proteins evolved and diversified is limited. We conducted a large-scale molecular evolutionary analysis of the Clp1 family proteins in all living organisms for which protein sequences are available in public databases. The phylogenetic distribution and frequencies of the Clp1 family proteins were investigated in complete genomes of Bacteria, Archaea, and Eukarya. In total, 3,557 Clp1 family proteins were detected in the three domains of life. Many were from Archaea and Eukarya, but a few were found in restricted, phylogenetically diverse bacterial species. The domain structures of the Clp1 family proteins also differed among the three domains of life. Although the proteins were, on average, 555 amino acids long (range, 196–2,728), 122 large proteins with >1,000 amino acids were detected in eukaryotes. These novel proteins contain the conserved Clp1 polynucleotide kinase domain and various other functional domains. Of these proteins, >80% were from Fungi or Protostomia. The polyribonucleotide kinase activity of Thermus scotoductus Clp1 (Ts-Clp1) was characterized experimentally. Ts-Clp1 preferentially phosphorylates single-stranded RNA oligonucleotides (Km value for ATP, 2.5 µM), or single-stranded DNA at higher enzyme concentrations. We propose a comprehensive assessment of the diversification of the Clp1 family proteins and the molecular evolution of their functional domains.


Introduction
Polynucleotide kinases (PNKs) catalyze the transfer of a monophosphate from a nucleoside triphosphate (NTP; usually ATP) to the 5′ end of either RNA or DNA, and the kinase module belongs to the P-loop phosphotransferase superfamily (Wang et al. 2002; Weitzer and Martinez 2007). PNKs are involved in important cellular events, including DNA repair (Weinfeld et al. 2011; Tahbaz et al. 2012; Chalasani et al. 2018), RNA processing (Weitzer and Martinez 2007; Ramirez et al. 2008; Holbein et al. 2011), and RNA repair (Wang et al. 2002; Zhang et al. 2012; Das et al. 2013). They often contain other functional domains, such as phosphatase, RNA ligase, or cyclic phosphodiesterase (CPDase) domains. For example, bacteriophage T4 polynucleotide kinase/phosphatase (T4 PNKP) is a bifunctional enzyme with 5′-OH kinase and 3′-phosphatase activities, and phosphorylates either RNA or DNA in the corresponding repair pathways (Wang et al. 2002; Zhu et al. 2007). T4 PNK is used to end-label RNA or DNA in molecular biology applications (Chaconas and van de Sande 1980). Another example is a mammalian polynucleotide kinase/phosphatase (mPNKP) that creates 5′-phosphate and 3′-hydroxyl termini for ligation in the DNA repair pathway (Bernstein et al. 2005; Bernstein 2009). Bacterial Clostridium thermocellum PNKP contains three catalytic domains, a PNK domain in the N-terminal region, a phosphatase domain in the central region, and an RNA ligase domain in the C-terminal region, and is involved in the RNA repair pathway (Martins and Shuman 2005; Wang et al. 2012; Zhang et al. 2012). The plant and fungal tRNA ligase, Trl1, contains three catalytic domains (Wang and Shuman 2005; Wang et al. 2006; Englert et al. 2010): an RNA ligase domain in the N-terminal region, a PNK domain in the central region, and a CPDase domain in the C-terminal region. Yeast Trl1 is reported to be an essential enzyme in RNA repair, noncanonical pre-mRNA splicing, and pre-tRNA splicing (Phizicky et al. 1986; Ruegsegger et al. 2001; Schwer et al. 2004).
An enzymatic activity that specifically phosphorylates RNA molecules was detected in HeLa nuclear extracts approximately 40 years ago (Shuman and Hurwitz 1979). The corresponding enzyme was recently identified as Clp1, which was initially extensively characterized in yeast as a component of the mRNA 3′-end cleavage and polyadenylation factor complex (Minvielle-Sebastia et al. 1997; Gross and Moore 2001; Weitzer and Martinez 2007; Holbein et al. 2011). In 2000, the factors required for endonucleolytic cleavage and polyadenylation during 3′-end formation in mammalian pre-mRNAs were purified from HeLa nuclear extracts. Homo sapiens Clp1 (Hs-Clp1) was one of these factors and was shown to be essential for the 3′-end cleavage of mRNA but not for its polyadenylation (de Vries et al. 2000; Paushkin et al. 2004). Purified Hs-Clp1 also has RNA kinase activity and the enzyme is involved in the pre-tRNA splicing reaction (Weitzer and Martinez 2007). Hs-Clp1 also forms a complex with the tRNA splicing endonuclease (TSEN), a multisubunit enzyme involved in the removal of tRNA introns from pre-tRNAs (Hanada et al. 2013; Weitzer et al. 2015). It should be noted that most tRNA introns in eukaryotes are located within the anticodon loop (canonical position) of the pre-tRNA. However, in Archaea, many tRNA introns are also located at other sites (noncanonical positions) (Sugahara et al. 2008; Fujishima et al. 2010). Moreover, the precursor sequences of split tRNAs, which have currently only been found in archaeal species, show high sequence similarity to the tRNA intron sequences in related archaeal species (Fujishima et al. 2009). Therefore, all these pre-tRNAs in Archaea are essentially spliced by the same mechanism as those in eukaryotes (Yoshihisa 2014). Many protein factors involved in pre-tRNA splicing have common characteristics. However, the archaeal system is much simpler than the eukaryotic system. For example, human RtcB tRNA ligase binds to a set of proteins as subunits (Popow et al. 2014), whereas archaeal RtcB requires no proteins except Archaease (Desai et al. 2014).
Clp1 is conserved in many eukaryotic species, including Homo sapiens (Hs), Mus musculus (Mm), Caenorhabditis elegans (Ce), Drosophila melanogaster (Dm), Arabidopsis thaliana (At), Schizosaccharomyces pombe (Sp), and Saccharomyces cerevisiae (Sc) (Xing et al. 2008). It has been shown experimentally that both purified Hs-Clp1 and Ce-Clp1 phosphorylate RNA and weakly phosphorylate DNA (Weitzer and Martinez 2007; Dikfidan et al. 2014). It has also been reported that kinase-dead Clp1 mice accumulated a set of small RNA fragments derived from the aberrant processing of tyrosine pre-tRNA and that these fragments induced TP53-dependent cell death, resulting in the progressive loss of spinal motor neurons (Hanada et al. 2013). However, Sc-Clp1 has no kinase activity because it has accumulated mutations in its PNK domain (Ramirez et al. 2008; Dikfidan et al. 2014). Structurally, the eukaryotic Clp1 enzymes consist of three functional domains: the N-terminal domain, PNK domain, and C-terminal domain (Noble et al. 2006). Because the PNK activity is lost if either the N-terminal or C-terminal domain is deleted, both domains are important for the maintenance of its PNK activity (Dikfidan et al. 2014). The PNK domain contains four conserved motifs (Leipe et al. 2003; Orelle et al. 2003; Pillon et al. 2018): 1) the Walker A motif or phosphate-binding loop (P-loop), GxxxxGK[S/T], which is an NTP-binding motif; 2) the Walker B motif, [D/E]hhQ (h is a hydrophobic residue), in which the conserved aspartic acid residue is required for the coordination of divalent cations and its catalytic activity; 3) the Clasp motif, [T/S/L]xGW, which is important for RNA binding; and 4) the Lid motif, RxxxxR, which is required for ATP binding and to stabilize the transition state of the phosphotransferase reaction.
There is a Clp1-related enzyme in Archaea, and purified Pyrococcus horikoshii Clp1 (Ph-Clp1) has thermostable PNK activity toward the 5′-OH ends of both RNA and DNA, although it prefers RNA in a competitive situation (Jain and Shuman 2009). The eukaryotic PNKs Nol9 and Grc3 also share similarities with Clp1. These enzymes phosphorylate both RNA and DNA, and are involved in pre-rRNA processing (Braglia et al. 2010; Heindl and Martinez 2010). It has also been reported that some proteins in Bacteria share similarities with the Clp1 PNK domain (Xing et al. 2008). In this paper, we regard all these proteins as members of the Clp1 family of enzymes. Until now, Clp1 and its family of enzymes have been characterized and reported separately for all three domains of life, and there is no comprehensive evolutionary analysis of the Clp1 family enzymes (Leipe et al. 2003). Nor is the detailed evolutionary scenario that accounts for their overall diversity fully understood.
Therefore, in this study, we conducted a large-scale molecular evolutionary analysis of the Clp1 family proteins in the three domains of life and propose a possible evolutionary scenario for them. During this research, we also found a group of large proteins, each containing the conserved Clp1 PNK domain. Finally, we provide the first experimental evidence that a bacterial Clp1 protein from Thermus scotoductus (Ts-Clp1) shows PNK activity.
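As an illustration of the four PNK-domain motifs described above, the following minimal Python sketch encodes each consensus pattern as a regular expression and reports where it occurs in a sequence. It is not the procedure used in this study: the set of residues treated as hydrophobic ("h") and the toy sequence are assumptions chosen only for the example, and a plain pattern match will also return chance hits outside the PNK domain.

```python
import re

# Residues accepted at the "h" (hydrophobic) positions of Walker B.
# This particular set is an assumption made for the sketch.
HYDROPHOBIC = "AVLIMFWC"

# Consensus motifs as given in the text, written as regular expressions.
MOTIFS = {
    "Walker A": re.compile(r"G.{4}GK[ST]"),                  # GxxxxGK[S/T]
    "Walker B": re.compile(rf"[DE][{HYDROPHOBIC}]{{2}}Q"),   # [D/E]hhQ
    "Clasp": re.compile(r"[TSL].GW"),                        # [T/S/L]xGW
    "Lid": re.compile(r"R.{4}R"),                            # RxxxxR
}


def find_motifs(sequence):
    """Return {motif name: [0-based start positions]} for one protein sequence."""
    seq = sequence.upper()
    return {name: [m.start() for m in rx.finditer(seq)]
            for name, rx in MOTIFS.items()}


# Hypothetical toy sequence (not a real Clp1 protein) carrying one copy of each motif.
toy = "MAAGPKDSGKSTLLDVVQAAATAGWAAARLLEERAA"
print(find_motifs(toy))
```

Because short patterns such as RxxxxR match frequently by chance, hits from a scan like this would normally be interpreted only within an aligned PNK domain.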

Data Sources
To undertake a large-scale search for Clp1 family proteins in the three domains of life, a total of 137,772,056 coding sequences (CDSs) were obtained from the UniProtKB database (December 2018 data set; ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/; last accessed September 17, 2019) (table 1A; The UniProt 2017). For a detailed evolutionary analysis of the Clp1 family proteins, the complete genomes of prokaryotes (5,468,108 CDSs) and eukaryotes (8,534,278 CDSs), together with their species information, were obtained from the Reference Sequence (RefSeq) database (Prokaryotes, August 2018 data set, and Eukaryotes, August 2019 data set; ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/; last accessed September 17, 2019) (table 1B; O'Leary 2016). Seventy-two representative species (36 bacterial, 18 archaeal, and 18 eukaryotic species), each with a complete genomic sequence, were randomly selected according to a previous report (Ciccarelli et al. 2006). For the domain analysis, 16,712 domains, together with their reliable annotations, were obtained from the Pfam-A database (version 32; ftp://ftp.ebi.ac.uk/pub/databases; last accessed September 17, 2019) (Finn et al. 2014). The taxonomic classification was performed with the NCBI Taxonomy Database (https://www.ncbi.nlm.nih.gov/taxonomy; last accessed September 17, 2019).

Sequence Similarity Search for Clp1 Family Proteins
To comprehensively identify Clp1 family proteins in both the UniProtKB and RefSeq databases, a protein-protein BLAST (BLASTP, ver. 2.2.29+) search (Boratyn et al. 2013) was performed with an E-value of 1e−4 and query coverage of ≥70%. We used several query sequences (supplementary table S1, Supplementary Material online) to cover the diverse Clp1 family proteins, including Clp1-related proteins such as Nol9 and Grc3 in eukaryotes and a bacterial Clp1 protein that was originally annotated as a GTPase in the database. "Amino acid identity" was defined as the percentage of amino acid (aa) residues in two different sequences that were identical. "Amino acid similarity" was defined as the percentage of identical or similar aa residues, based on similar physicochemical properties. We used both identity and similarity scores calculated with the BLAST program using the BLOSUM62 matrix (Henikoff and Henikoff 1992).
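The two thresholds stated above (E-value of 1e−4 and query coverage of at least 70%) can be applied to tabular BLAST output as in the minimal Python sketch below. The column layout is an assumption: it presumes BLASTP was run with a custom tabular format of the form `-outfmt "6 qseqid sseqid pident evalue qcovs"`, which the source does not specify, and the three example hits are invented for illustration.

```python
import csv
import io

# Assumed column order of the tabular BLAST output (not stated in the source).
FIELDS = ["qseqid", "sseqid", "pident", "evalue", "qcovs"]


def filter_hits(handle, max_evalue=1e-4, min_qcov=70.0):
    """Yield hits whose E-value and query coverage pass both thresholds."""
    reader = csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t")
    for row in reader:
        if float(row["evalue"]) <= max_evalue and float(row["qcovs"]) >= min_qcov:
            yield row


# Invented example hits: hitB fails on E-value, hitC fails on query coverage.
example = io.StringIO(
    "O57936\thitA\t45.2\t1e-30\t92\n"
    "O57936\thitB\t31.0\t5e-3\t88\n"
    "O57936\thitC\t38.7\t2e-10\t55\n"
)
kept = [row["sseqid"] for row in filter_hits(example)]
print(kept)
```

Applying the coverage filter after the search, rather than relying on the E-value alone, is what excludes hits that match only a short stretch of the query.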

Amino Acid Sequence Alignment, Domain Search, and Phylogenetic Tree
The aa sequences of the Clp1 family proteins were aligned with MAFFT version 7.394, with the default parameters (Katoh and Standley 2013). The multiple-sequence alignment was used to construct a phylogenetic tree using the GTR model with the RAxML software version 8.2.11 (Stamatakis 2014). The results were visualized with either Jalview (version 2.10.3) (Waterhouse et al. 2009) or SeaView (version 4.5.4) (Gouy et al. 2010). To extract the protein domains, an HMMER (ver. 3.2) search (Potter et al. 2018) of the Pfam-A protein domain database was performed with an E-value of 1e−4. The domain structures and sequence alignment of the Clp1 family proteins were visualized with DoMosaics (version rv0.95) (Moore et al. 2014).
The synthetic Ts-clp1 gene was designed to contain NdeI and XhoI sites at its 5′ and 3′ termini, respectively, and was subcloned into these restriction sites in the pET-23b expression vector (Novagen, Madison, WI, USA). The resulting pET-Ts-Clp1 vector encoded a protein with a six-histidine (His) tag at its C-terminal end.

Expression and Purification of His-Tagged Recombinant Ts-Clp1 Protein
To express the recombinant Ts-Clp1 protein, E. coli strain BL21(DE3) was transformed with the expression vector pET-Ts-Clp1. The transformants growing logarithmically at 37 °C in Luria-Bertani (LB) medium containing 50 µg/ml ampicillin were treated with 0.4 mM isopropyl-β-D-thiogalactoside (IPTG). After further growth for 16 h at 30 °C, the cells were harvested by centrifugation (9,000 × g for 15 min at 4 °C), and the protein was extracted with sonication (3-4 min) in His-tag-binding buffer containing 20 mM Tris-HCl (pH 8.0), 500 mM NaCl, 5 mM imidazole, and 0.1% (v/v) NP-40. The extract was heat-treated at 60 °C, the growth temperature of T. scotoductus, for 15 min to denature endogenous E. coli proteins and then centrifuged at 18,000 × g for 10 min at 4 °C to remove any debris. The recombinant protein was purified with a HisTrap HP column (GE Healthcare, Piscataway, NJ, USA) and eluted with a linear gradient of imidazole (5-1,000 mM) in His-tag-binding buffer using the ÄKTA fast protein liquid chromatography (FPLC) system (GE Healthcare). The eluted protein peak was collected and dialyzed against buffer D containing 50 mM Tris-HCl (pH 8.0), 1 mM ethylenediaminetetraacetic acid (EDTA), 0.02% (v/v) Tween 20, 7 mM 2-mercaptoethanol, and 10% (v/v) glycerol.

Extraction and Distribution of Clp1 Family Proteins in the Three Domains of Life
To detect the Clp1 family proteins (Clp1 or Clp1-related proteins) on a large scale, several PNK domain regions of representative Clp1 proteins were used as query sequences (supplementary table S1 and supplementary fig. S1, Supplementary Material online). For bacterial Clp1 proteins, in the first round of extraction, in which we used the PNK domain of archaeal Clp1 (UniProt AC: O57936) as the query sequence, a bacterial Clp1 protein was identified that had been annotated as a GTPase or GTP-binding protein (UniProt AC: A5G778 in supplementary table S1, Supplementary Material online). We selected this as the query sequence for a comprehensive second round of extraction for bacterial Clp1 proteins. In summary, a BLASTP search (E-value 1e−4) was performed against the UniProtKB database (release: December 2018), which contains 137,772,065 proteins (table 1A). As a result, 3,557 Clp1 family protein sequences were obtained (the list contains those proteins detected with a metagenomic analysis and proteins with partial Clp1 aa sequences). It should be noted that T4 PNK, which is known to phosphorylate the 5′ ends of DNA and RNA (Wang et al. 2002), was not extracted under our search conditions.
To classify the types of species, 3,540 of the 3,557 sequences were used because the remaining 17 sequences were annotated as "ecological metagenomes." The species containing Clp1 family proteins are listed in supplementary table S2, Supplementary Material online. These proteins are distributed in 1,426 species of eukaryotes, 211 species of archaea, and 144 species of bacteria. In the eukaryotes, the largest number of species containing these proteins was in the Opisthokonta (1,180 species including metazoans, fungi, and protists), followed by Viridiplantae (138 species consisting of green plants). In the archaea, the largest number of these species was in the TACK superphylum (106 species, including members of the Crenarchaeota and Thaumarchaeota), followed by the Euryarchaeota (94 species). In the bacteria, the largest number of these species was in the phylum Proteobacteria (39 species), followed by the Terrabacteria group (19 species that were Gram-positive bacteria or photosynthetic bacteria). It was not possible to calculate the numbers of CDSs of Clp1 family proteins in the genomes registered in the UniProtKB database because many of these genomes were incomplete or "draft" genomes. Therefore, we counted the number of CDSs of Clp1 family proteins in the genomes of 72 representative species (36 bacterial, 18 archaeal, and 18 eukaryotic species) for which complete genomes were available (table 2), according to a previous report (Ciccarelli et al. 2006). We found Clp1 family proteins in all the representative eukaryotes (18/18; 100%) and half the archaea (9/18; 50%) (table 2A), but in only a limited number of bacteria (3/36; 8.3%) (table 2B). In the eukaryotic genomes, CDSs for Nol9 proteins were present in Metazoa (e.g., mammals, fishes, and birds) and CDSs for Grc3 were present in fungi. Both proteins are known to be involved in pre-rRNA processing.
Bacterial proteins that were annotated as translation factor GUF1, GTPase, GTP-binding protein, or uncharacterized protein were extracted (shown as "Others" in table 2B), although all these proteins are essentially considered to be Clp1 family proteins (discussed below). This analysis revealed that there are two or more CDSs for Clp1 family proteins per genome in all representative eukaryotes and in some Crenarchaeota (Archaea). In plants, both Arabidopsis thaliana and Oryza sativa subsp. japonica have six CDSs for Clp1 family proteins. However, there is usually only one CDS for a Clp1 family protein per genome in Euryarchaeota (Archaea) and Bacteria (fig. 1 and Supplementary Material online). Because there were fewer Clp1 family proteins in Archaea, we analyzed the archaeal strains without a clp1 gene (i.e., 26 strains of Halobacteria and 12 strains of Methanococci). We found that 23/26 (88%) Halobacteria and 9/12 (75%) Methanococci had RtcB genes in their genomes. Similarly, 22/26 (84%) Halobacteria and 12/12 (100%) Methanococci had archaease genes in their genomes. Therefore, almost all these archaeal species seem to splice their pre-tRNA via the RtcB-dependent ligation pathway and not via the Clp1-dependent ligation pathway. In contrast, CDSs for Clp1 family proteins were present in only 14 highly restricted species of the 1,543 species (0.9%) of Bacteria examined (supplementary table S5, Supplementary Material online). These results show that genes for bacterial Clp1 family proteins are extremely rare in bacterial genomes. To investigate the phylogenetic positions of the bacterial Clp1 family proteins, the presence or absence of these enzymes was mapped on a previously reported bacterial phylogenetic tree (Wu and Eisen 2008).
Figure 2 shows that bacterial Clp1 is not lineage specific but is distributed almost independently among bacteria. We also conducted a comprehensive molecular evolutionary analysis of the 235 archaeal and 149 bacterial Clp1 family proteins obtained from the UniProt Knowledgebase (UniProtKB), including both complete and incomplete genomic sequences. We found that the size distribution of the N-termini had two peaks corresponding to short and long forms (supplementary figs. S2 and S3, Supplementary Material online). The short peak mainly consisted of bacterial Clp1 proteins and the longer peak mainly consisted of archaeal Clp1 proteins. However, both peaks contained both archaeal and bacterial Clp1 proteins.
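The presence frequencies quoted in this section are simple ratios of genomes carrying at least one Clp1 family CDS to genomes examined. The short Python sketch below recomputes them from the counts given in the text; the dictionary labels and the helper name `percent_with_clp1` are introduced here only for illustration.

```python
def percent_with_clp1(present, total):
    """Percentage of genomes with at least one Clp1 family CDS, to one decimal."""
    return round(100.0 * present / total, 1)


# (genomes with a Clp1 family protein, genomes examined), as reported in the text.
counts = {
    "Eukarya, representative genomes": (18, 18),
    "Archaea, representative genomes": (9, 18),
    "Bacteria, representative genomes": (3, 36),
    "Bacteria, all complete genomes examined": (14, 1543),
}

percentages = {label: percent_with_clp1(*pair) for label, pair in counts.items()}
for label, value in percentages.items():
    print(f"{label}: {value}%")
```

The recomputed values (100.0, 50.0, 8.3, and 0.9%) agree with those stated above.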

Clp1 Family Proteins and Novel Large Proteins With Clp1 PNK Domains
As shown in figure 1, the Clp1 family proteins were divided into two major clades, the prokaryotic (archaeal and bacterial) type and the eukaryotic type, on the phylogenetic tree. The eukaryotic clade was further subdivided into three clades: Clp1, Nol9, and Grc3. The last two are involved in pre-rRNA processing (Braglia et al. 2010).
Table 2 note: The numbers of Clp1 or Clp1-related proteins in representative complete genomes of (A) Eukarya (18 species), Archaea (18 species), and (B) Bacteria (36 species) are shown. "Others" contains proteins annotated as "GTPase," "translation factor GUF1," or "uncharacterized protein" based on their domain similarities.
To determine the size distribution of the Clp1 family proteins, 3,332 of the original 3,557 sequences were used because the remaining 225 sequences were annotated as "ecological metagenomes" and/or "fragmented sequences." Supplementary figure S7 (Supplementary Material online) summarizes the size distribution of the Clp1 family proteins. The main peak is located between lengths of ~300 and 800 aa (mean ± standard deviation, 555 ± 225 aa) and this peak contains all known enzymes, including Clp1, Nol9, and Grc3 (Braglia et al. 2010; Heindl and Martinez 2010). The smallest Clp1 (UniProt AC: M5Q339) is a 199-aa protein [...] (fig. 3). Among the 122 large proteins, 69 were from Protostomia, 30 were from Fungi, and 13 were from the Trypanosomatidae. Figure 3 shows representative examples of these large proteins. Several of these large proteins, such as a 1,007-aa protein (UniProt AC: C3Z8N7), a 1,009-aa protein (UniProt AC: A0A146F5J7), and a 1,048-aa protein (UniProt AC: A0A0V1IFH5), contain all the eukaryotic Clp1 domains (Clp1_eN, Clp1_P, and Clp1_eC), whereas the others usually contain the Clp1_P domain, together with other functional domains. Therefore, we conclude that the large proteins are novel proteins, possessing the whole Clp1 or partial Clp1 structure (mainly the Clp1_P domain) and other functional domains.
Because some of the large proteins have similar domain architectures, such as the 1,471-aa protein (UniProt AC: A0A178U9P2) and the 1,631-aa protein (UniProt AC: A0A0B2WWM9), or the 2,385-aa protein (UniProt AC: A0A0V1HGY2) and the 2,567-aa protein (UniProt AC: A0AVIHGN9), and because some of the large proteins appear at high frequencies (e.g., 40 times for the 1,015-aa protein [UniProt AC: A0A084W2E3] and 23 times for the 1,048-aa protein [UniProt AC: A0A0V1IFH5]), many of them are not sequencing artifacts but are actually encoded in these genomes. The Clp1_P domains in the large proteins have the well-conserved motifs required for phosphorylation activity (Walker A, Walker B, and Clasp), as well as partially conserved Lid motifs (supplementary fig. S8, Supplementary Material online), suggesting that the large proteins may all have phosphorylation activity. However, the exact functions of these large proteins remain unknown. The large proteins are also annotated based on their similarities with known functional proteins. Although many of them are annotated as Clp1 homologs, Clp1-like proteins, or even uncharacterized proteins, some are annotated as specific proteins. For example, the 1,447-aa protein (UniProt AC: A0A178U9P2) is annotated as a translation factor GUF1 homolog, the 1,552-aa protein (UniProt AC: A0A0K6FVA0) as a Fanconi-associated nuclease, and the 2,385-aa protein (UniProt AC: A0A0V1HGY2) as a voltage-dependent calcium channel, unc-36. Further functional analyses are required to support these annotations (supplementary table S6, Supplementary Material online).
No experimental study has reported that a bacterial Clp1 protein actually has PNK activity. Therefore, our aim in this study was to demonstrate the biochemical activities of the enzyme. We investigated a bacterial Clp1 family protein (UniProt AC: E8PQM6) from T. scotoductus, which was isolated from a cleft in the Witwatersrand Supergroup rocks in South Africa (Gounder et al. 2011).
Because the in vivo expression of the Ts-clp1 gene was examined by RNA-Seq analysis (Cusick et al. 2016), we considered that the corresponding protein Ts-Clp1 must be functional. We designated the protein "Ts-Clp1" (based on biochemical evidence; see below), although the protein is annotated as an uncharacterized protein in the UniProt database. The Ts-Clp1 protein shows only 28% aa identity and 43% similarity to human Clp1 (Hs-Clp1), and only 30% identity and 50% similarity to Pyrococcus furiosus Clp1 (Pf-Clp1) (fig. 4A). At least three of the four motifs in the PNK domain (Walker A, Walker B, and Clasp) are well conserved across the three domains of life (fig. 4B). These three motifs are reportedly involved in phosphorylation activity (Pillon et al. 2018). Although the last motif (Lid) is less well conserved in Ts-Clp1, similar aa residues occur in the motifs. For example, Arg299 in human Clp1 is replaced with Lys146 in Ts-Clp1.

Evolution of Clp1 Family Proteins
For the biochemical analysis of Ts-Clp1 and to efficiently produce the recombinant protein in E. coli, we first constructed an expression vector for the Ts-clp1 gene with codon optimization (supplementary fig. S9, Supplementary Material online). The expressed and purified His-tagged Ts-Clp1 had a molecular mass of approximately 27 kDa, as determined with sodium dodecyl sulfate (SDS)-polyacrylamide gel electrophoresis (PAGE) (fig. 5A). This finding is consistent with the size predicted from the aa sequence deduced from the corresponding gene. Because T. scotoductus was isolated from a South African gold mine in which the ambient temperature of the rock was ~60 °C (Kieft et al. 1999), the PNK activity of Ts-Clp1 (10 ng/reaction) was examined at 60 °C for 15 min in the presence of NTPs and MgCl2. Under these conditions, Ts-Clp1 phosphorylated a single-stranded RNA (ssRNA) oligonucleotide in the presence of 2-10 µM NTPs (Km for ATP: 2.5 µM) (fig. 5B and supplementary fig. S10, Supplementary Material online). Ts-Clp1 also uses ATP in preference to other NTPs; the relative activities were ATP (1.00), CTP (0.80), GTP (0.67), and UTP (0.66). In contrast, Ts-Clp1 did not phosphorylate a single-stranded DNA (ssDNA) oligonucleotide, even with larger amounts of NTP (0.2-1.0 mM) (fig. 5C). However, Ts-Clp1 phosphorylated the ssDNA oligonucleotide when a larger amount of the recombinant enzyme was used (>50 ng/reaction) (fig. 5D). These characteristics are quite similar to those of archaeal Clp1 (Jain and Shuman 2009). We also found that Ts-Clp1 phosphorylated both a double-stranded RNA (dsRNA) oligonucleotide and a dsRNA [...] figure 5G, the PNK activity of Ts-Clp1 was very heat stable. The enzyme even showed activity at 90 °C, and the preincubation of the reaction mixture at 90 °C before the enzyme was added did not affect the specific activity of Ts-Clp1.

Discussion
We conducted a large-scale molecular evolutionary analysis of the Clp1 family proteins and systematically confirmed that this family of proteins is distributed throughout all three domains of life (table 2 and supplementary table S2, Supplementary Material online). Because our experimental data showed that bacterial Ts-Clp1 has PNK activity (fig. 5), we have also demonstrated that the Clp1 family proteins in all domains of life have associated PNK activity. In contrast, the number of species with a Clp1 family member differs across the three domains of life. All the representative eukaryote species examined had at least one Clp1 family protein, usually a Clp1 protein. Other Clp1 family proteins involved in pre-rRNA processing have evolved in a species-specific manner: Nol9 appears in the Metazoa, Plantae, and Amoebozoa, and Grc3 appears in the Fungi. These observations suggest that the ancestral clp1 gene was duplicated in the common ancestor of the eukaryotes, and diversified functionally during their evolution (fig. 1). We also speculated that proteins in the "Others" category in table 2 may have similar functions to those of Nol9 or Grc3. However, further experimental analyses are required to determine whether the Clp1 proteins regulate both tRNA splicing and rRNA processing in the eukaryotic species that express only the Clp1 protein. In contrast, there are almost no duplicated clp1 genes in either the Archaea or Bacteria. According to our analysis of complete genomes, the proportion of species with Clp1 family proteins was 85.8% in Eukarya, 41.4% in Archaea, and 0.9% in Bacteria (supplementary tables S3-S5, Supplementary Material online). Many archaeal tRNA genes contain a tRNA intron and exact pre-tRNA splicing is required to produce mature and functional tRNAs. However, approximately half the Archaea have no Clp1 enzyme. Therefore, it is suggested that the RNA ligase RtcB-dependent ligation pathway (the 3′-phosphate pathway) for tRNA exons is predominant in archaeal species without a clp1 gene, which is required for the 5′-phosphate pathway (Englert et al. 2011; Yoshihisa 2014). However, we found a limited number of bacterial clp1 genes in restricted and phylogenetically diverse bacterial species (fig. 2, table 2, supplementary tables S2 and S5, Supplementary Material online). Our research also strongly suggests that the bacterial proteins shown in table 2B are bacterial PNKs or bacterial Clp1 proteins, although they were initially annotated as GTPases or GTP-binding proteins based on the similarities of the Walker A and B motifs. Experimentally, Ts-Clp1 showed kinase activity at 90 °C, although the bacterium was isolated from an environment at 60 °C (Kieft et al. 1999; Gounder et al. 2011). Ts-Clp1 preferentially phosphorylates ssRNA oligonucleotides over ssDNA oligonucleotides, and its biochemical characteristics are very similar to those of the hyperthermophilic archaeal enzyme, Ph-Clp1 (Jain and Shuman 2009). Evolutionarily, the bacterial Clp1 proteins are distributed almost independently on the bacterial phylogenetic tree (fig. 2). On the basis of all these observations, we speculate that the bacterial clp1 genes may have been acquired by horizontal gene transfer from one of the hyperthermophilic archaea. However, several features of bacterial Clp1 do not support this possibility, including its lack of the conserved C-terminal domain of archaeal Clp1. An alternative explanation is that a massive loss of clp1 genes occurred in the early evolution of Bacteria, although the common ancestor of Bacteria had a clp1 gene.
Figure legend (fragment): Amino acid sequence alignments were visualized with Jalview. Identical amino acid residues are indicated in blue and partly conserved amino acid residues are indicated in light blue. The amino acid numbers, from the first methionine (Met) residue, are shown on the left of each line. In the consensus sequence line, "x" and "h" mean any amino acid residue and a hydrophobic amino acid residue, respectively. See figure 1 legend for the five organisms used here.
In terms of the Clp1 family protein structures, the PNK domain is highly conserved among all family members. Interestingly, there are also subfamily-specific domains. For example, eukaryotic Clp1 has conserved N-terminal (Clp1_eN) and C-terminal domains (Clp1_eC). Similarly, the Nol9 proteins, which are involved in pre-rRNA processing, have conserved N-terminal (Nol9_eN) and C-terminal domains (Nol9_eC) (supplementary figs. S5 and S6, Supplementary Material online). These findings suggest that the ancestral Clp1 and Nol9 proteins acquired their corresponding protein-specific N-terminal and C-terminal domains in the common ancestor of the eukaryotes in accordance with their specific substrates (pre-tRNA or pre-rRNA, respectively), resulting in their functional diversification during evolution (fig. 1). In the archaeal Clp1 proteins, only the C-terminal domain (Clp1_aC) is conserved (supplementary fig. S4, Supplementary Material online). There are no such conserved domains in Grc3. From an experimental perspective, both Hs-Clp1 and Ce-Clp1, which contain the eukaryotic Clp1_eN and Clp1_eC domains, mainly phosphorylate ssRNA under typical reaction conditions. It has also been reported that Ce-Clp1 very weakly phosphorylates ssDNA when incubated for a longer period (Weitzer and Martinez 2007; Dikfidan et al. 2014). In contrast, archaeal Ph-Clp1 does not contain the N-terminal domain (Jain and Shuman 2009) and bacterial Ts-Clp1 has neither the N- nor C-terminal domain. Using ssRNA as the substrate, we determined that the Ts-Clp1 Km value for ATP was 2.5 µM (supplementary fig. S10, Supplementary Material online). Prokaryotic Ts-Clp1 (Km, 2.5 µM) and Ph-Clp1 (Km, 16 µM) (Jain and Shuman 2009) phosphorylated the ssRNA at lower ATP concentrations than eukaryotic Ce-Clp1 (Km, 99 µM) (Dikfidan et al. 2014). However, these prokaryotic enzymes showed lower substrate specificity, also phosphorylating ssDNA.
Therefore, we speculate that the N-terminal domain in eukaryotes may contribute to the substrate specificity for RNA, but that the eukaryotic enzyme requires a relatively larger amount of ATP. On the basis of all these findings, the Clp1 enzymes can be broadly divided into two classes: prokaryotic and eukaryotic. After the duplication of the primitive clp1 gene, only the eukaryotic Clp1 protein acquired both the N-terminal and C-terminal domains, defining its RNA specificity. The N-terminal domain of Ce-Clp1 is required for its ATP-binding activity (Dikfidan et al. 2014).
Although prokaryotic Clp1 lacks this N-terminal domain, its phosphorylation activity is intact. Therefore, it will be necessary to analyze the reaction mechanism at the structural level. There are two types of Clp1 orthologs: one with polynucleotide kinase activity, such as human Hs-Clp1 (Weitzer and Martinez 2007), and the other without polynucleotide kinase activity, such as yeast Sc-Clp1 (Ramirez et al. 2008). Although ATPase activity is not essential for mRNA 3′-processing, ATP binding may have a critical function in this event. Because Sc-Clp1 has no PNK activity, yeast tRNA ligase (Trl1) is instead responsible for the phosphorylation of the 3′ tRNA exon during pre-tRNA splicing. It has also been reported that Hs-Clp1 complemented a lethal kinase-defective Trl1 mutation in yeast (Ramirez et al. 2008). This result suggests that Clp1 and Trl1 share a functionally common regulatory mechanism in pre-tRNA splicing. We believe that the same is true of plant Rlg1 (Nagashima et al. 2016) and Clp1. However, the number of experimentally characterized enzymes is limited, and further research is required to properly understand the functions of these enzymes during evolution.
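To make the comparison of Km values for ATP more concrete, the sketch below evaluates the Michaelis-Menten relation v/Vmax = [S]/(Km + [S]) for the three enzymes discussed above. It assumes simple Michaelis-Menten kinetics and takes the Km values in micromolar, as in the Abstract; the 10 µM ATP concentration and the function name `fractional_rate` are illustrative choices, not values or methods from the study.

```python
def fractional_rate(substrate_um, km_um):
    """Michaelis-Menten fraction of Vmax, v/Vmax = [S] / (Km + [S])."""
    return substrate_um / (km_um + substrate_um)


# Km values for ATP (micromolar) as cited in the text.
KM_ATP_UM = {"Ts-Clp1": 2.5, "Ph-Clp1": 16.0, "Ce-Clp1": 99.0}

ATP_UM = 10.0  # illustrative ATP concentration, an assumption for this example

rates = {enzyme: fractional_rate(ATP_UM, km) for enzyme, km in KM_ATP_UM.items()}
for enzyme, rate in rates.items():
    print(f"{enzyme}: v/Vmax = {rate:.2f} at {ATP_UM:g} uM ATP")
```

Under these assumptions, at 10 µM ATP the bacterial enzyme would run at about 0.80 of its maximal rate, the archaeal enzyme at about 0.38, and the eukaryotic enzyme at about 0.09, which is the sense in which the prokaryotic enzymes work at lower ATP concentrations.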
We also identified a set of large proteins containing the Clp1_P domain and other functional domains (fig. 3). As reported previously, Trl1 tRNA ligase contains three functional domains: ligase, PNK, and CPDase domains (Wang and Shuman 2005; Englert et al. 2010). Although the PNK domain in Trl1 shows no significant similarity to that in Clp1, enzymes containing the PNK domain tend to have other functional domains. We speculate that the fundamental architecture of the PNK family proteins includes multiple functional domains. As far as we know, none of these large proteins has been characterized experimentally. Therefore, any analysis of large proteins containing the Clp1_P domain must also determine the functions of these other domains experimentally.

Supplementary Material
Supplementary data are available at Genome Biology and Evolution online.