Evolution of DNA packaging in gene transfer agents

Abstract Gene transfer agents (GTAs) are virus-like particles encoded and produced by many bacteria and archaea. Unlike viruses, GTAs package fragments of the host genome instead of the genes that encode the components of the GTA itself. As a result of this non-specific DNA packaging, GTAs can transfer genes within bacterial and archaeal communities. GTAs clearly evolved from viruses and are thought to have been maintained in prokaryotic genomes due to the advantages associated with their DNA transfer capacity. The most-studied GTA is produced by the alphaproteobacterium Rhodobacter capsulatus (RcGTA), which packages random portions of the host genome at a lower DNA density than usually observed in tailed bacterial viruses. How the DNA packaging properties of RcGTA evolved from those of the ancestral virus remains unknown. To address this question, we reconstructed the evolutionary history of the large subunit of the terminase (TerL), a highly conserved enzyme used by viruses and GTAs to package DNA. We found that RcGTA-like TerLs grouped within viruses that employ the headful packaging strategy. Because distinct mechanisms of viral DNA packaging correspond to differences in the TerL amino acid sequence, our finding suggests that RcGTA evolved from a headful packaging virus. Headful packaging is the least sequence-specific mode of DNA packaging, which would facilitate the switch from packaging of the viral genome to packaging random pieces of the host genome during GTA evolution.


Introduction
Gene transfer agents (GTAs) are virus-like particles encoded and produced by certain bacteria and archaea (reviewed most recently by Lang, Westbye, and Beatty (2017) and Grull, Mulligan, and Lang (2018)). Unlike viruses, GTAs package fragments of the host genome instead of the genes that encode the GTA itself (Hynes et al. 2012). When GTA particles infect another cell, they can transfer the encapsidated genetic material to the recipient (Solioz, Yen, and Marris 1975;McDaniel et al. 2010;Hynes et al. 2012). The genomic loci that encode GTAs resemble prophages, indicating that GTAs evolved from viral ancestors. Although the function of GTAs is not firmly established, the prevailing hypothesis is that GTAs are not defective prophages, but instead are agents of horizontal gene transfer that are maintained in prokaryotic genomes due to the advantages associated with gene exchange, particularly, in stressful conditions (Lang, Westbye, and Beatty 2017;Kogay et al. 2020).
The best-studied GTA is produced by the alphaproteobacterium Rhodobacter capsulatus, and will be referred to as RcGTA. Production of the RcGTA particles is triggered by environmental factors (Westbye et al. 2017a), occurs in a small fraction of the population (Fogg, Westbye, and Beatty 2012;Hynes et al. 2012), is regulated by host proteins (Westbye, Beatty, and Lang 2017b;Ding et al. 2019; and involves expression of genes that are found in at least five loci in the R. capsulatus genome (Hynes et al. 2016;Lang, Westbye, and Beatty 2017). The largest of these loci is a 17-gene 'head-tail cluster' that encodes proteins involved in head-tail morphogenesis and DNA packaging (Lang, Westbye, and Beatty 2017). There are homologs of the RcGTA head-tail cluster genes in other alphaproteobacteria (Shakya, Soucy, and Zhaxybayeva 2017), including several Rhodobacterales for which GTA production has been observed (Fu et al. 2010;Nagao et al. 2015). Additionally, homologs of headtail cluster genes are present in numerous viruses and proviruses (Shakya, Soucy, and Zhaxybayeva 2017).
The small size of RcGTA and the low density of its packaged DNA precludes the particle from accommodating all of the genes required for its production (Bá rdy et al. 2020). Instead, RcGTA packages seemingly random portions of the host genome (Hynes et al. 2012). In double-stranded DNA viruses of the realm Duplodnaviria, genome packaging is mediated by the terminase and portal proteins (Fokine and Rossmann 2014;Rao and Feiss 2015). Viral terminases typically consist of large and small subunits (Rao and Feiss 2008). The small subunit (TerS) binds to the DNA to be packaged and then recruits the large subunit (Casjens 2011;Rao and Feiss 2015). The large subunit (TerL), which consists of ATPase and nuclease domains, cuts the concatemeric viral DNA, translocates the DNA into the viral capsid with concomitant ATPase hydrolysis and, finally, cuts the DNA again to terminate packaging (Casjens 2011;Rao and Feiss 2015). Viruses evolved different strategies for packaging DNA into their capsids, and these strategies involve different classes of TerL (Casjens and Gilcrease 2009). Some viruses employ a headful packaging strategy whereby TerL initially cuts the viral concatemeric DNA at a specific sequence (pac site) and terminates packaging when the capsid is full, rather than at a particular sequence (Rao and Feiss 2008;Casjens and Gilcrease 2009). The virions produced by headful phages usually package more than 100 per cent of the phage genome length, and as a result, have terminally redundant, circularly permuted chromosomes (Casjens and Gilcrease 2009). Viral TerLs with the same packaging mechanism tend to form clades in phylogenetic trees (Casjens et al. 2005). However, due to the diversity of TerL sequences, the relationships among the different functional classes of TerLs are not well-resolved (Casjens et al. 2005;Casjens and Gilcrease 2009;Merrill et al. 2016). Given its lack of sequence specificity, RcGTA is presumed to package fragments of the host DNA via the headful strategy (Casjens et al. 2005;Hynes et al. 2012). The RcGTA TerL and its alphaproteobacterial homologs formed a distinct group in previous phylogenies, but they did not cluster with or within the viral TerLs that are involved in headful packaging (Casjens et al. 2005;Casjens and Gilcrease 2009). Therefore, these phylogenies did not provide evidence of a headful packaging strategy in alphaproteobacterial GTAs, in part, due to limited sequence data available at the time.
In this study, we conducted a comprehensive evolutionary analysis of TerL sequences to better resolve the phylogenetic relationship of alphaproteobacterial RcGTA-like TerLs and viral TerLs with known packaging strategies. Our analyses suggest that RcGTA, and, by inference, the rest of the putative alphaproteobacterial GTAs, evolved from a virus that employed the headful packaging strategy. We also identified two amino acid substitutions that are conserved in the TerLs of the putative alphaproteobacterial GTAs and might be important for the DNA packaging properties of GTAs.

Retrieval and sequence-based clustering of large terminase homologs
Eighteen profiles covering one or both (ATPase and nuclease) domains of TerL were retrieved from the NCBI Conserved Domains Database (CDD) (Lu et al. 2020) (accessed on 11 December 2018). Two TerL profiles for distinct families of bacterial and archaeal viruses were added from Philosof et al. (2017) and Yutin et al. (2018). These 20 profiles (Supplementary Table  S1) were used as queries for PSI-BLAST searches (E-value threshold of 0.01, effective database size of 2 Â 10 7 sequences) (Altschul et al. 1997) against the NCBI non-redundant protein database (accessed in December 2018). Only the subject sequences that were taxonomically assigned to archaea, bacteria, and viruses were retained.
Partial TerL sequences were removed by ensuring the presence of both an ATPase (N-terminal) and a nuclease (C-terminal) domain using the following criteria: sequences either had to align to !75 per cent of a 'full' TerL profile that includes both TerL domains or align to different TerL profiles over their Nand C-terminal domains. A sequence was considered to align over a specific domain if it met one of two conditions: 1, if the sequence matched !75 per cent of an N-or C-terminal domainspecific profile and had at least 35 per cent of the protein length outside of the matched domain to contain the unmatched domain or 2, if a sequence aligned to just the N-or C-terminal portion of a 'full' TerL profile and had at least 35 per cent of the protein length outside of the matched domain to contain the unmatched domain.
The resulting 254,382 sequences were clustered using MMseqs2 (Steinegger and Sö ding 2017) with a similarity threshold of 0.75. From each of the obtained 11,298 clusters, a representative sequence of median length was selected for subsequent analyses.

Alignment of representative homologs and filtering out partial sequences
The representative TerL sequences were iteratively aligned and clustered using the approach described by Wolf et al. (2018). Briefly, the sequences were clustered with a similarity threshold of 0.5 using UCLUST (Edgar 2010). The clustered sequences were aligned using MUSCLE (Edgar 2004) and alignment sites that contained more than 67 per cent gaps were temporarily removed. Pairwise similarity scores between cluster alignments were calculated using HHSEARCH (Sö ding 2005), converted to a distance matrix and used to build a UPGMA tree (Sokal and Michener 1958). All of the branches of the UPGMA tree above a depth threshold of 2.3 were used to guide progressive alignment of the clusters using HHALIGN (Sö ding 2005). The removed sites were reinserted back into their original sequences after the profile-profile alignment. These alignment and clustering steps were repeated for 20 iterations, when 11,230 of the sequences formed one alignment.
Of the remaining 68 sequences that failed to align, two were clearly TerLs but contained inteins, which were manually removed. Five other sequences were also likely TerLs because they were longer than 300 amino acids, exhibited significant similarity to a TerL profile via CDD searches (E-value <0.001), and contained recognizable Walker A motif and nuclease catalytic residues. The seven sequences were profile-aligned to the alignment of 11,230 sequences using more relaxed criteria (similarity threshold of 0.01 and UPGMA depth threshold of 6). The remaining 61 sequences did not meet these criteria and were discarded.
The new alignment of 11,237 sequences contained partial sequences that lacked a Walker A motif. To remove these, each sequence's similarity was scored to the alignment's consensus sequence using a BLOSUM62 substitution matrix and the score was compared to the score of sequences with 100 per cent identity to the consensus sequence. Sequences with a score less than 10 per cent of the perfect match score were removed. Then, the alignment was used as a PSSM in a PSI-BLAST search against all of the sequences within the alignment. Only the sequences that matched to !75 per cent of the PSSM were retained. The sections of the 11,060 sequences that passed this criterion were extracted and re-aligned using the abovedescribed iterative alignment procedure with a clustering similarity threshold of 0.5 and UPGMA depth threshold of 2.3. After 23 iterations of alignment and clustering, 11,057 of the sequences aligned. The three sequences that did not align were longer than 300 amino acids, exhibited significant similarity to a TerL profile via CDD searches (E-value <0.001), and contained a recognizable Walker A motif and nuclease catalytic residues. Therefore, they were retained and aligned to the main alignment using more relaxed parameters of a clustering similarity threshold of 0.01 and UPGMA depth threshold of 6.

Alignment trimming
The alignment of 11,060 sequences was trimmed to remove all columns with more than 50 per cent gaps and less than 10 per cent amino acid homogeneity. The homogeneity value of an alignment column was defined and calculated using the following procedure. For each of the N¼11,060 sequences, columnbased sequence weights w i P N i¼1 w i ¼ 1 ! were assigned according to Henikoff and Henikoff (1994). The score of an alignment column against an amino acid x was calculated as where a i is an amino acid in the i-th sequence and S a i ;x is the BLOSUM62 substitution matrix score for a pair of amino acids a i and x (Henikoff and Henikoff 1993). As the consensus amino acid of the column c, the amino acid with the highest score Q c , that is, c ¼ argmax x Q x , was selected. An expectation of the score of the given alignment column against a randomly selected amino acid R was calculated as . The homogeneity of an alignment column was defined as H ¼ max QcÀQR Sc;cÀQR ; 0Þ . The homogeneity ranges from 0 (the alignment column score does not exceed the random expectation score Q R ) to 1 (the alignment column score is equal to the maximum possible score S c;c ).

Reconstruction of large terminase phylogeny
The trimmed alignment of 11,060 sequences was used to reconstruct an initial phylogenetic tree in FastTree v. 2.1.4 (Price, Dehal, and Arkin 2010) using the Whelan and Goldman (WAG) substitution model (Whelan and Goldman 2001) and 20 gammadistributed rate categories (Yang 1994). The initial tree was used as a guide tree to refine our alignment, which in turn was used to reconstruct an improved tree. To this end, the sequences were divided into two sets, those that formed distinct groups on the tree and the remaining ones. The sequences that formed distinct groups were re-aligned using a clustering similarity threshold of 0.01 and UPGMA depth threshold of 2.3, whereas the other sequences were aligned using more stringent parameters (similarity threshold of 0.66 and UPGMA depth threshold of 1.3). These alignments were profile-aligned using a similarity threshold of 0.5 and UPGMA depth threshold of 2.3. Nine sequences did not join the main alignment and were discarded due to low scores against the consensus, calculated with a BLOSUM62 matrix as described above. The resulting alignment of 11,051 sequences was trimmed to remove all columns with more than 50 per cent gaps and less than 10 per cent amino acid homogeneity, as defined above. The final phylogenetic tree of 11,051 TerL homologs ( Fig. 1) was reconstructed using FastTree v. 2.1.4 (Price, Dehal, and Arkin 2010) with the WAG substitution model (Whelan and Goldman 2001) and 20 gamma-distributed rate categories (Yang 1994).

Identification of RcGTA-like large terminases
In the phylogenetic tree of TerLs ( Fig. 1), a group of 616 TerLs was labeled as the 'RcGTA-like containing group'. This group includes 507 TerLs of the 526 RcGTA-like TerLs that were identified and curated by Kogay et al. (2019). Nineteen RcGTA-like TerLs from the dataset of Kogay et al. (2019) are absent in our dataset because they were not in GenBank at the time of our data collection (December 2018). The 616 TerLs were classified as either 'RcGTAlike' or 'virus-like' using a machine learning approach implemented in the GTA-Hunter program (Kogay et al. 2019).

Examination of the genomic neighborhoods of large terminases
For the 604 TerLs within the 'RcGTA-like containing' group ( Fig. 1) that originated from bacterial and archaeal genomes, the presence of 11 other RcGTA-like genes near the terL gene was examined. To this end, the RcGTA-like genes from the training set of Kogay et al. (2019) were used as queries in a BlastP search against the assemblies of the bacterial and archaeal genomes (E-value <0.001; query and subject overlap by at least 60 per cent of their length) (Altschul et al. 1997). The detected RcGTA gene homologs were classified as 'RcGTA-like' or 'virus-like' using GTA-Hunter (Kogay et al. 2019). The detected RcGTA gene homologs were also clustered into regions using DBSCAN, with a maximum distance cutoff of 8,000 bp between adjacent genes (Ester et al. 1996;Kogay et al. 2019). If a terL gene was embedded in a region containing at least 6 of the 11 RcGTA-like genes, it was classified as being in a 'large RcGTA-like element'. If a terL gene was located in a region containing 1 to 5 RcGTA-like genes, it was classified as being in a 'small RcGTA-like element'. The classification of the sequence represented in the phylogeny was assumed to be the same for the rest of the cluster members although this was not directly verified.

Assignment of packaging strategy to viral large terminases
A list of viruses with experimentally determined packaging mechanisms was compiled from the phylogenetic tree of Casjens and Gilcrease (2009). Viruses from the phylogenetic tree of Merrill et al. (2016) were also added, for most of which experimental evidence of the packaging mechanism is available. The dataset of the 252,614 TerL homologs was searched for these 87 viruses using TerL accession numbers provided by Merrill et al. (2016) and NCBI taxonomy IDs for the viruses from Casjens and Gilcrease (2009). Of the 87 viruses, 73 were present in our dataset (Supplementary Table S2). Due to the close sequence similarity, some of the 73 TerLs belong to the same MMSEQ clusters, and therefore are represented by 58 TerLs on our phylogeny of 11,051 TerLs (Fig. 1). The 58 representative viruses for which the packaging mechanism was not known were assigned the mechanism of a virus from the same cluster with a known packaging strategy, under the assumption that the similarity of their TerL amino acid sequences is sufficient to imply the same packaging mechanism.

Validation of the reconstructed phylogenetic patterns with more accurate maximum likelihood analyses
To confirm that the phylogenetic relationships obtained from the FastTree program (Price, Dehal, and Arkin 2010) were not  Table S2). The viruses with experimentally determined packaging strategies are labeled by the packaging strategy (Headful, Cohesive ends [COS], Direct Terminal Repeats [DTR], and Host Ends) followed by a prototype phage from that group (e.g., P22). Support values (aLRT) of >75 per cent denoted by red dots are only shown for a selection of branches relevant to grouping RcGTA-like TerLs within phages that employ a headful DNA packaging strategy. Scale bar, amino acid substitutions per site. Tree topology with all support values is available in NEWICK format in Supplementary Data. The patterns of this phylogeny are consistent with those in the phylogenies reconstructed using a more accurate maximum likelihood inference carried out using the IQ-TREE program (see Supplementary Dataset); in particular, the RcGTA-like TerLs continue to form a well-supported group (100 per cent of bootstrap samples), which forms a sister group to TerLs of phages that utilize a headful packaging strategy (75 per cent of bootstrap samples). impacted by its limited tree search and optimization capabilities, additional phylogenetic trees were reconstructed using IQ-TREE v 1.6.7 (Nguyen et al. 2015) from two datasets subsampled from the 11,051 TerLs. The first dataset of 342 TerLs was constructed to broadly represent the TerL diversity (Fig. 1). The dataset contains the 58 representative viral TerLs (described in Section 2.7), 50 TerLs randomly sampled from group 1 of the 'RcGTA-like containing group' (Fig. 2), all TerLs from group 2, 70 TerLs from group 3, 50 TerLs randomly sampled from the rest of the 'RcGTA-like containing group', and 100 randomly sampled TerLs from the rest of the whole TerL tree (Fig. 1). The second dataset of 346 TerLs was constructed to represent well the TerLs from the 'RcGTA-like containing group' (Figs 1 and 2). The dataset contains 12 representative viral TerLs with either P22 or Sf6-like headful packaging strategies, 70 TerLs randomly sampled from group 1, all TerLs from group 2, and 250 TerLs randomly sampled from group 3. For both datasets, the aligned sequences were retrieved from the trimmed alignment of 11,051 TerLs (see Section 2.4) and gap-only sites were removed. The optimal evolutionary model was selected using ModelFinder (Kalyaanamoorthy et al. 2017), as implemented in IQ-TREE. Support values for branches were calculated using ultrafast bootstrap approximation with 1,000 replicates (Hoang et al. 2018), as implemented in IQ-TREE. The tree reconstructed from the second dataset was rooted with the headful P22 viral TerL that, out of the headful P22 viral TerLs in Fig. 1, is the most distantly related to the 'RcGTA-like containing group'.

Annotation of prophages in regions encoding viruslike TerLs
To examine whether the bacterial TerLs classified as virus-like reside in prophages, prophages were predicted in the corresponding nucleotide genome sequences using PHASTER (Arndt et al. 2016, accessed in May 2020). The regions labeled as intact (score >90) or questionable (score 70-90) prophages were retained, while the regions labeled as incomplete prophages (score <70) were discarded. One region surrounding the viruslike TerL (accession WP_020474221) was predicted by PHASTER to be an intact prophage and was also classified as a small RcGTA-like element by GTA-Hunter, due to the presence of one RcGTA-like gene. The PHASTER prediction of this region as a prophage was considered to supersede the small RcGTA-like element classification.

Detection of conserved sites that differentiate RcGTA-like and virus-like TerLs
The amino acid sequences of the 616 TerLs in the subtree shown on Fig. 2 were re-aligned using MAFFT-linsi v. 7.305 (Katoh and Standley 2013). The alignment was scanned for sites conserved in more than 80 per cent of the TerL homologs from group 1 (virus-like) or group 3 (RcGTA-like) (Fig. 2). Of these detected sites, only the sites with at least 70 per cent betweengroup difference in the relative abundance of the most conserved amino acid were retained.
The positions of the two identified sites relative to the known TerL structural domains were determined by searching CDD (Lu et al. 2020) with RcGTA TerL RefSeq record WP_031321187 as a query (database accessed on 26 July 2020). The conservation of the two detected sites within the 'RcGTAlike containing group' was visualized using a subset of the 616 TerLs (Fig. 2) TerLs from group 2, 14 randomly sampled TerLs from group 3, and a representative TerL from the cluster that contains the RcGTA TerL. The aligned TerLs were retrieved from the alignment of 616 TerLs, and the gap-only alignment positions were removed. Secondary structure information was obtained via HHPred using RcGTA TerL as a query (Zimmermann et al. 2018).
The locations of the two sites were also visualized on a 3D structure of TerL from the Shigella phage Sf6 (PDB ID 4IDH) (Zhao et al. 2013) which, based on our phylogenetic inference, is the TerL most closely related to RcGTA-like TerLs for which a structure is available. The homologous positions of the substitutions in the Shigella phage Sf6 TerL were identified by aligning it to the RcGTA TerL using HHPred (Zimmermann et al. 2018). The visualization was carried out in PyMOL v 2.4 (Schrö dinger, LLC 2020).

RcGTA-like TerLs belong within the group of TerLs of headful packaging phages
Of the 11,051 representative TerL homologs from bacteria, archaea, and viruses, 616 are closely related to the RcGTA TerL and form a well-supported group in the phylogenetic tree (Fig. 1). Of these 616 TerLs, twelve are encoded in viral genomes, whereas the remaining 604 are found in 601 bacterial and 3 archaeal genomes. Using a machine learning approach that relies on amino acid composition, we classified the 604 bacterial and archaeal TerL homologs as either 'RcGTA-like' (527) or 'viruslike' (77) (Supplementary Table S3). By mapping 73 TerLs with experimentally determined packaging strategies onto the phylogeny, we found that RcGTA-like TerLs fall, with strong support, within a group of headful packaging phages (Fig. 1). Therefore, our phylogeny implies that the RcGTA-like TerLs evolved from a viral TerL that employed a headful packaging strategy and is thus consistent with the earlier proposed hypothesis that RcGTA-like TerLs use a headful mechanism to package host DNA (Casjens et al. 2005;Hynes et al. 2012). Of the TerLs from the viruses with experimentally determined packaging strategies, Enterobacteria phage P22-like TerLs are the closest relatives of the RcGTA-like TerLs. This affinity contrasts the previous results, from analyses of much smaller data sets, according to which T4-like (Lang and Beatty 2000;Hynes et al. 2012) or T7-like (Casjens et al. 2005) TerLs were found to be most closely related to the RcGTA-like TerLs.

Phylogenetic evidence of a single origin of GTA TerLs
Whereas the TerLs of viruses that employ headful DNA packaging specifically package the viral genome into the capsid, RcGTA TerL lacks sequence specificity and packages random segments of the bacterial genome (Hynes et al. 2012). To evaluate if non-specific DNA packaging evolved once or multiple times, we first sought to determine more accurately which of the RcGTA-like TerLs likely belong to bona fide GTAs. Shakya, Soucy, and Zhaxybayeva (2017) hypothesized that genomic regions with a smaller number of recognizable RcGTA gene homologs are more likely to be prophages than GTAs because these regions tend to have a more virus-like GC content relative to their host, evolve faster and are more often associated with viral genes. Therefore, some of the TerLs classified as 'RcGTAlike' might not belong to RcGTA-like elements in cases when the alphaproteobacterial genomes that contain these genes lack homologs of other RcGTA genes. Among the 527 RcGTA-like TerLs, we classified 391 as 'large' (containing at least six RcGTAlike genes near the terL gene) and 136 as 'small' (1-5 RcGTA-like genes) elements ( Fig. 2 and Supplementary Table S3).
Within the subtree that contains RcGTA-like TerLs (Fig. 2), all but one of the TerLs found in large elements form a wellsupported clade (group 3 in Fig. 2). The one TerL from a large element that falls outside this clade is a representative of a cluster of three TerLs that are found in the genomes of alphaproteobacteria Zavarzinia compransoris DSM 1231, Zavarzinia sp. HR-AS and Oleomonas sp. K1W22B-8. This TerL belongs to a narrow 'transition zone' (group 2 in Fig. 2) between the group 3 TerLs and the deepest branches of the subtree that include exclusively viral and 'virus-like' sequences (group 1 in Fig. 2). The transition zone also contains a mix of RcGTA-like TerLs from small elements and virus-like TerLs, including the TerL from the intact prophage predicted in the genome of a planctomycete Zavarzinella formosa DSM 19928. None of the TerLs within this transition zone come from functionally characterized viruses or GTAs. Thus, the phylogeny indicates that RcGTA-like TerLs likely evolved only once from a viral TerL, in an ancestor of group 3, by acquiring the capability to package DNA non-specifically. The positions of the TerLs from Zavarzinia's and Oleomonas' putative GTAs and the Zavarzinella prophage could be explained by horizontal gene transfer, as previously documented in some instances for other RcGTA-like genes (Yang et al. 2017) and discussed in detail below.

Viruses might mediate horizontal gene exchange of RcGTA-like genes
In addition to the above-discussed predicted prophage from Zavarzinella formosa DSM 19928, we identified two intact prophages in the genomes of the firmicute Thermoactinomyces sp. DSM 45892 and the alphaproteobacterium Methylobacterium terrae 17Sr1-28, and 16 viruses that encode TerLs that are phylogenetically most closely related to the RcGTA-like TerLs (Tables 1  and 2). Notably, the TerLs of these three prophages are even more closely related phylogenetically to the TerLs from large elements than the 16 viruses are (Fig. 2), but whether they produce functional virions is unknown.
Some of the 16 viruses encoding TerLs related to GTA TerLs infect GTA-containing alphaproteobacteria and possess genes that are more closely related to GTA genes than to their homologs in other viruses. For example, Dinoroseobacter phage vB_DshS-R5C contains homologs of four putative tail genes of the RcGTA (genes g12-g15; Yang et al. 2017). The predicted prophage in Zavarzinella formosa DSM 19928 encompasses a homolog of the adaptor gene (g6) that is RcGTA-like in amino acid composition. These observations suggest an ongoing exchange and recombination of RcGTA-like genes between viruses and alphaproteobacteria, which likely explains the presence of virus-like TerLs within groups 2 and 3 (Fig. 2). The TerL phylogeny also indicates that such gene exchange might extend beyond alphaproteobacteria because at least four bacterial TerLs within groups 2 and 3 come from non-alphaproteobacterial genomes (OYV96073, WP_020474221, WP_110156686, and OQX66442). Because the viruses with known hosts infect a wide range of bacteria that live in environments similar to those occupied by GTA-containing alphaproteobacteria (Table 1), they either might have an opportunity for gene exchange with viruses that infect GTA-containing bacteria or might be capable of infecting GTA-containing bacteria, in addition to their currently known hosts.

Two amino acid changes distinguish GTA and viral TerLs
Although viral TerLs that are most closely related to RcGTA-like TerLs have not been experimentally characterized, examination of amino acids that are conserved in RcGTA-like TerLs from large elements (group 3 on Fig. 2) but not in closely related viral TerLs (group 1 on Fig. 2), or vice versa, might help pinpoint the changes that are important for the unique packaging strategy of GTAs. We did not identify any amino acids that are conserved in group 1 TerLs but not in group 3 TerLs, but found two amino acids (located at positions 282 and 292 in the RcGTA TerL; RefSeq record WP_031321187) that are conserved in the group 3 TerLs but not in the group 1 TerLs (Supplementary Table S4 and Fig. S1). In position 292, 99 per cent of the group 3 TerLs, but only 4 per cent of the group 1 TerLs, contain cysteine, whereas 59 per cent of the group 1 TerLs contain threonine. However, given that the threonine to cysteine substitution results in a reduction of the number of carbons per side chain, selection for the reduction in the energetic cost of GTA production (Kogay et al. 2020) cannot be excluded as a driver for this substitution. In position 282, 90 per cent of the group 3 TerLs but no group 1 TerLs contain proline, whereas 64 per cent of the group 1 TerLs but only 6 per cent of the group 3 TerLs contain alanine. Proline contains two more carbons in its side chain than alanine, and therefore, this substitution cannot be selected for energetic cost savings. In TerLs from bona fide GTAs of R. capsulatus and Dinoroseobacter shibae, cysteine is found at position 292 in both proteins, whereas in position 282 the Dinoroseobacter shibae TerL has proline and RcGTA TerL has alanine.
Within the TerL protein structure, the two amino acids are located in the nuclease domain (Rao and Feiss 2015) and lie in close proximity at the opposite ends of a loop that extends toward the translocating DNA ( Supplementary Fig. S2 and Figure  3B in Zhao et al. (2013)). The importance of these residues with respect to the functionality of the GTA TerLs remains to be elucidated.

Discussion
RcGTA is hypothesized to package DNA via a headful mechanism because RcGTA particles encapsidate random pieces of R. capsulatus' DNA, which would likely be facilitated by a nonsequence-specific TerL (Casjens et al. 2005;Hynes et al. 2012). Previous experiments support the hypothesis that RcGTA utilizes headful packaging because the packaged DNA fragments have different sequences at the ends (Hynes et al. 2012). The large dataset of available TerL sequences allowed us to obtain phylogenetic evidence that the RcGTA-like TerLs evolved from the large terminases of headful-packaging phages (Fig. 1). Our findings suggest that RcGTA either continues to employ the headful packaging strategy of its ancestor or modified it into a unique DNA packaging strategy. Previous studies have reported that the RcGTA TerL was most closely related either to the TerLs of T7-like viruses, which use a sequence-specific packaging mechanism (Casjens et al. 2005), or to the TerLs of T4-like viruses, which employ the headful packaging mechanism (Lang and Beatty 2000;Hynes et al. 2012). However, we found that the RcGTA-like TerLs are more similar to the TerLs of headful-packaging P22-like viruses. This discrepancy is likely due to the vastly expanded set of viral sequences now available in GenBank and the more sensitive search method that we used to identify viral TerL homologs.
In further support of the origin of RcGTA from a virus that employed a headful packaging mechanism, several structural proteins of RcGTA have the highest sequence and secondary structure similarity to the corresponding proteins in viruses that also utilize a headful packaging strategy (Bá rdy et al. 2020). Specifically, the tail tape measure protein of RcGTA is homologous to the tail-needle protein of bacteriophage P22, domains of the RcGTA hub and megatron proteins are homologous to their counterparts in bacteriophage T4, and the RcGTA stopper and tail terminator proteins are homologous to those from bacteriophage SPP1 (Bá rdy et al. 2020).
We identified TerLs from several viruses and predicted prophages that are phylogenetically closer relatives of the RcGTA TerL than P22-like TerLs. These viruses and predicted prophages either infect alphaproteobacteria or at least are found in the same environments as GTA-containing alphaproteobacteria. Because the specific mechanisms of headful packaging differ among phages (Casjens et al. 1992(Casjens et al. , 2004Bhattacharyya and Rao 1993), experimental characterization of packaging in viruses that are closely related to GTAs could offer further insight into the origin of the GTA TerLs and their packaging mechanism.
The TerL phylogeny presented here supports the single origin of the RcGTA head-tail cluster in alphaproteobacteria because large RcGTA-like elements grouped together. With the newfound support for a single origin of RcGTA-like TerLs from a TerL of a headful-packaging phage, we propose that a headfulpackaging TerL in the RcGTA ancestor underwent key changes that resulted in the switch from packaging the GTA genome to packaging random, small pieces of the host genome (Hynes et al. 2012) with a substantially lower density of DNA in the capsid (Bá rdy et al. 2020). The selection for reduced energy cost of GTA protein production that apparently occurred after the origin of RcGTA-like elements in alphaproteobacteria (Kogay et al. 2020) makes it challenging to pinpoint amino acid changes in GTA TerLs that contribute to this transition.
The loss of specificity for the GTA genome and the reduction in the DNA packaging density rule out self-propagation of the GTA genome mediated by virions. Strikingly, the RcGTA genes appear to be actively precluded from packaging, being the least frequently packaged region in the alphaproteobacterial genome that is incorporated into the GTA particles (Hynes et al. 2012). The mechanism behind this exclusion is unknown, one possibility being that intensive expression of the RcGTA genes interferes with their packaging (Hynes et al. 2012).
In addition to TerL, other GTA proteins that are involved in DNA packaging, including TerS and portal, might contribute to the unique DNA packaging features of RcGTA. In particular, TerS proteins determine where packaging initiates and control the specificity of packaging through the recognition of pac sites (Leavitt et al. 2013). However, no discrete packaging start sites were found in the RcGTA genome (Hynes et al. 2012). The RcGTA gene g1, which is adjacent to the terL gene (g2) in the RcGTA genome, has been recently shown to encode TerS (Sherlock, Leong, and Fogg 2019). Sherlock, Leong, and Fogg (2019) demonstrated that the RcGTA TerS binds non-specifically to DNA with low affinity due to the absence of a specific DNAbinding domain and the retention of non-specific DNA binding activity. A TerS protein with altered DNA-binding characteristics and a modified headful TerL might together underlie the random packaging that is characteristic of RcGTA.

Data Availability
The following data is provided in our Supplementary Data available via FigShare (https://doi.org/10.6084/m9.figshare. 12191691): GenBank accession numbers of the 254,382 RcGTA TerL protein homologs that are taxonomically assigned to bacteria, archaea, or viruses and likely include both ATPase and nuclease domains, GenBank accession numbers of the 252,614 RcGTA TerL protein homologs that are represented by 11,051 TerLs in the tree in Fig. 1, GenBank accession numbers of the 11,051 amino acid sequences used for reconstruction of the tree in Fig. 1, alignment of 11,051 TerL amino acid sequences (untrimmed and trimmed), phylogenetic tree of 11,051 TerLs shown in Fig. 1, alignment of amino acid sequences of 616 TerLs from the subtree shown in Fig. 2, alignments and phylogenetic trees of 342 and 346 TerLs used in IQ-TREE phylogenetic reconstructions. All alignments are in FASTA format and all phylogenetic trees are in NEWICK format.