Machine-Learning Classification Suggests That Many Alphaproteobacterial Prophages May Instead Be Gene Transfer Agents

Abstract Many of the sequenced bacterial and archaeal genomes encode regions of viral provenance. Yet, not all of these regions encode bona fide viruses. Gene transfer agents (GTAs) are thought to be former viruses that are now maintained in genomes of some bacteria and archaea and are hypothesized to enable exchange of DNA within bacterial populations. In Alphaproteobacteria, genes homologous to the “head–tail” gene cluster that encodes structural components of the Rhodobacter capsulatus GTA (RcGTA) are found in many taxa, even if they are only distantly related to Rhodobacter capsulatus. Yet, in most genomes available in GenBank RcGTA-like genes have annotations of typical viral proteins, and therefore are not easily distinguished from their viral homologs without additional analyses. Here, we report a “support vector machine” classifier that quickly and accurately distinguishes RcGTA-like genes from their viral homologs by capturing the differences in the amino acid composition of the encoded proteins. Our open-source classifier is implemented in Python and can be used to scan homologs of the RcGTA genes in newly sequenced genomes. The classifier can also be trained to identify other types of GTAs, or even to detect other elements of viral ancestry. Using the classifier trained on a manually curated set of homologous viruses and GTAs, we detected RcGTA-like “head–tail” gene clusters in 57.5% of the 1,423 examined alphaproteobacterial genomes. We also demonstrated that more than half of the in silico prophage predictions are instead likely to be GTAs, suggesting that in many alphaproteobacterial genomes the RcGTA-like elements remain unrecognized.


Introduction
Viruses that infect bacteria (phages) are extremely abundant in biosphere (Keen 2015). Some of the phages integrate their genomes into bacterial chromosomes as part of their infection cycle and survival strategy. Such integrated regions, known as prophages, are very commonly observed in sequenced bacterial genomes. For example, Touchon et al. (2016) report that 46% of the examined bacterial genomes contain at least one prophage. Yet, not all of the prophage-like regions represent bona fide viral genomes (Koonin and Krupovic 2018). One such exception is a gene transfer agent, or GTA for short (reviewed most recently by Lang et al. [2017] and Grull et al. [2018]). Many of genes that encode GTAs have significant sequence similarity to phage genes, but the produced tailed phage-like particles generally package pieces of the host genome unrelated to the "GTA genome" (Hynes et al. 2012;Tomasch et al. 2018). Moreover, the particles are too small to package complete GTA genome (Lang et al. 2017). Hence, GTAs are different from lysogenic viruses, as they do not use the produced phage-like particles for the purpose of their propagation.
Currently, five genetically unrelated GTAs are known to exist in bacteria and archaea (Lang et al. 2017). The best studied GTA is produced by the alphaproteobacterium Rhodobacter capsulatus and is referred hereafter as the RcGTA. Since RcGTA's discovery 45 years ago (Marrs 1974), the genes for the related, or RcGTA-like, elements have been found in many of the alphaproteobacterial genomes (Shakya et al. 2017). For a number of Rhodobacterales isolates that carry RcGTA-like genes, there is an experimental evidence of GTA particle production (Fu et al. 2010;Nagao et al. 2015;Tomasch et al. 2018). Seventeen of the genes of the RcGTA "genome" are found clustered in one locus and encode proteins that are involved in DNA packaging and head-tail morphogenesis ( fig. 1 and supplementary table S1, Supplementary Material online). This locus is referred to as a "head-tail cluster." The remaining seven genes of the RcGTA genome are distributed across four loci and are involved in maturation, release, and regulation of RcGTA production (Hynes et al. 2016). Because the head-tail cluster resembles a typical phage genome with genes organized in modules similar to those of a k phage genome (Lang et al. 2017), and because many of its genes have homologs in bona fide viruses and conserved phage gene families (Shakya et al. 2017), the cluster is usually designated as a prophage by algorithms designed to detect prophage regions in a genome (Shakya et al. 2017). The RcGTA's classification as a prophage raises a possibility that some of the "in silico"-predicted prophages may instead represent genomic regions encoding RcGTA-like elements.
Presently, to distinguish RcGTA-like genes from the truly viral homologs, one needs to examine evolutionary histories of the RcGTA-like and viral homologs and to compare gene content of a putative RcGTA-like element to the RcGTA "genome." These analyses can be laborious and often require subjective decision making in interpretations of phylogenetic trees. An automated method that could quickly scan thousands of genomes is needed. Notably, the RcGTA-like genes and their viral homologs have different amino acid composition ( fig. 1 and supplementary fig. S1, Supplementary Material online). Due to the purifying selection acting on the RcGTA-like genes at least in the Rhodobacterales order ) and of their overall significantly lower substitution rates when compared with viruses (Shakya et al. 2017), we hypothesize that the distinct amino acid composition of the RcGTA-like genes is preserved across large   2 kb   g1  g2  g3  g4  g5  g6  g7 g8  g9 g10  g11  g12  g13  g14  g15   0   5   10   15 Amino acid abundance (%) The "head-tail" cluster of the Rhodobacter capsulatus GTA "genome" and the amino acid composition of viral and alphaproteobacterial homologs for some of its genes. Genes that are used in the machine-learning classification are highlighted in gray. For those genes, the heatmap below a gene shows the relative abundance of each amino acid (rows) averaged across the RcGTA-like and viral homologs that were used in the classifier training (columns). The amino acids are sorted by the absolute difference in the average relative abundance between RcGTA-like and viral homologs, which was additionally averaged across 11 genes. The heatmaps of the amino acid composition in the individual homologs are shown in supplementary figure S1, Supplementary Material online. evolutionary distances, and therefore the RcGTA-like genes can be distinguished from their bona fide viral homologs by their amino acid composition. Support vector machine (SVM) is a machine-learning algorithm that can quickly and accurately separate data into two classes from the differences in specific features within each class (Cortes and Vapnik 1995). The SVM-based classifications have been successfully used to delineate protein families (e.g., DNA binding proteins [Bhardwaj et al. 2005], G-protein coupled receptors [Karchin et al. 2002], and herbicide resistance proteins [Meher et al. 2019]), to distinguish plastid and eukaryotic host genes (Kaundal et al. 2013), and to predict influenza host from DNA and amino acid oligomers found in the sequences of the flu virus (Xu et al. 2017). During the training step, the SVM constructs a hyperplane that best separates the two classes. During the classification step, data points that fall on one side of the hyperplane are assigned to one class, whereas those on the other side are assigned to the other class. In our case, the two classes of elements in need of separation are phages and GTAs, whereas their distinguishing features are several metrics that capture the amino acid composition of the encoding genes.
In this study, we developed, implemented, and crossvalidated an SVM classifier that distinguishes RcGTA-like head-tail cluster genes from their phage homologs with high accuracy. We then applied the classifier to 1,423 alphaproteobacterial genomes to examine prevalence of putative RcGTA-like elements in this diverse taxonomic group and to assess how many of the RcGTA-like elements are mistaken for prophages in the in silico predictions.

Materials and Methods
The SVM Classifier and Its Implementation Let us denote as u a homolog of an RcGTA-like gene g that needs to be assigned to a class y, "GTA" ðy ¼ À 1Þ or "virus" ðy ¼ 1Þ. The assignment is carried out using a weighted soft-margin SVM classifier, which is trained on a data set of m sequences T g ¼ fT g 1 ; . . . ; T g m g that are homologous to u (see "SVM Training Data" section). The basis of the classification is the n-dimensional vector of features x associated with sequences u and T g i (see "Generation of Sequence Features" section). Each sequence T g i is known to belong to a class y i .
Using the training data set T g , we identify hyperplane that separates two classes as an optimal solution to the objective function: subject to: where n i ! 0; i ¼ 1; . . . ; m; (2) where w and b define the hyperplane f x ð Þ ¼ wx i þ b that divides the two classes, n i is the slack variable that allows some training data points not to meet the separation requirement, and C is a regularization parameter, which is represented as an m Â m diagonal matrix. The C matrix determines how lenient the soft-margin SVM is in allowing for genes to be misclassified: Larger values "harden" the margin, whereas smaller values "soften" the margin by allowing more classification errors. The product Cn represents the cost of misclassification. The most suitable values for the C matrix were determined empirically during crossvalidation, as described in "Model Training, Cross-Validation, and Assessment" section.
To solve equation (1), we represented this minimization problem in the Lagrangian dual form LðaÞ: subject to: where K represents a kernel function. The minimization problem was solved using the convex optimization (CVXOPT) quadratic programming solver (Andersen et al. 2012). The pseudocode of the algorithm for the weighted soft-margin SVM classifier training and prediction is shown in figure 2.

SVM Training Data
To train the classifier, sets of "true viruses" (class y ¼ 1) and "true GTAs" (class y ¼ À 1) were constructed separately for each RcGTA-like gene g. To identify the representatives of "true viruses," amino acid sequences of 17 genes from the RcGTA head-tail cluster were used as queries in BlastP (Evalue <0.001; query and subject overlap by at least 60% of their length) and PSI-BLAST searches (E-value < 0.001; query and subject overlap by at least 40% of their length; maximum of six iterations) of the viral RefSeq database release 90 (last accessed in November 2018; accession numbers of the viral entries are provided in supplementary table S2, Supplementary Material online). BlastP and PSI-BLAST executables were from the BLAST v. 2.6.0þ package (Altschul et al. 1997). The obtained homologs are listed in supplementary table S3, Supplementary Material online. Due to few or no viral homologs for some of the queries, the final training sets T g were limited to 11 out of 17 RcGTA-like head-tail cluster genes (g2, g3, g4, g5, g6, g8, g9, g12, g13, g14, and g15; see supplementary table S1, Supplementary Material online, for functional annotations of these genes). As the representatives of the "true GTAs," we used the RcGTA-like regions that were designated as such via phylogenetic and genome neighborhood analyses by Shakya et al. (2017). To make sure that our "true GTAs" do not contain any other regions, we created a database of the 235 complete alphaproteobacterial genomes that were available in the RefSeq database prior to January 2014 (supplementary table S4, Supplementary Material online). To identify the representatives of "true GTAs" in this database, amino acid sequences of 17 genes from the RcGTA head-tail cluster (Lang et al. 2017) were used as queries in BlastP (E-value < 0.001; query and subject overlap by at least 60% of their length) and PSI-BLAST searches (E-value < 0.001; query and subject overlap by at least 40% of their length; maximum of six iterations) of the database. For each genome, the retrieved homologs were designated as an RcGTA-like head-tail cluster if at least 9 of the homologs had no more than 5,000 base pairs between any two adjacent genes. The maximum distance cutoff was based on the observed distances between the neighboring RcGTA head-tail cluster genes. This assignment was determined by clustering of the obtained homologs with the DBSCAN algorithm (Ester et al. 1996) using an in-house Python script (available in a GitHub repository; see "Software Implementation" section). The resulting set of 88 "true GTAs" is provided in supplementary table S5, Supplementary Material online, and was verified to represent a subset of RcGTA-like elements that were identified by Shakya et al. (2017).
Because GTA functionality has been extensively studied only in R. capsulatus SB1003 (Lang et al. 2017) and horizontal gene transfer likely occurred multiple times between the putative GTAs and bacterial viruses (Hynes et al. 2016;Zhan et al. 2016), the bacterial homologs that were both too divergent from other bacterial RcGTA-like homologs and more closely related to the viral homologs were eliminated from the training sets to reduce possible noise in classification. To do so, for each of the 11 trainings sets T g , all detected viral and bacterial homologs were aligned using MUSCLE v3.8.31 (Edgar 2004) and then pairwise phylogenetic distances were estimated under PROTGAMMAJTT substitution model using RAxML version 8.2.11 (Stamatakis 2014). For each bacterial homolog in a set T g , the pairwise phylogenetic distances between it and all other bacterial homologs were averaged. This average distance was defined as an outlier distance (o) if it satisfied the inequality: where Q 1 and Q 3 are the first and third quartiles, respectively, of the distribution of the average distances for all bacterial homologs in the training set T g . If an outlier distance was greater than the shortest distance from it to a viral homolog in the set T g , the bacterial homolog was removed from the data set. The alignments, list of removed sequences, and the associated calculations are available in the FigShare repository. Additionally, for each gene g, the sequences that had the same RefSeq ID (and therefore 100% amino acid identity) were removed from the training data sets. The final number of sequences in each training data set is listed in table 1.

Assignment of Weights to the Training Set Sequences
Highly similar training sequences can have an increased influence on the position of the hyperplane, as misclassification of two or more similar sequences can be considered less optimal than misclassification of only one sequence. This could be a problem for our classifier, because there is generally a highly unequal representation of taxonomic groups in the RefSeq database. To correct for this taxonomic bias, a weighting scheme was introduced into the soft-margin of the SVM classifier during training. To do so, sequences in each training set FIG. 2.-The pseudocode of the SVM classifier algorithm that distinguishes RcGTA-like genes from the "true" viruses. The algorithm is implemented in the GTA-Hunter software package (see "Software Implementation" section).
T g ¼ fT 1 ; . . . ; T m g were aligned in MUSCLE v3.8.31 (Edgar 2004) (the alignments are available in the FigShare repository). For each pair of sequences in a training set T g , phylogenetic distances were calculated in RAxML version 8.2.11 (Stamatakis 2014) under the best substitution model (PROTGAMMAAUTO; the selected substitution matrices are listed in supplementary table S6, Supplementary Material online). The farthest-neighbor hierarchical clustering method was used to group sequences with distances below a specified threshold t. Weight d i of each sequence in a group was defined as a reciprocal of the number of genes in the group. These weights are used to adjust the cost of misclassification by multiplying C ii for each sequence T i by d i . The most suitable value of t was determined empirically during crossvalidation, as described in "Model Training, Cross-Validation, and Assessment" section.

Generation of Sequence Features
To use amino acid sequences in the SVM classifier, each sequence was transformed to an n-dimensional vector of compositional features. Three metrics that capture different aspects of sequence composition were implemented: Frequencies of "words" of size k (k-mers), pseudo-amino acid composition (PseAAC), and physicochemical properties of amino acids.
In the first feature type, amino acid sequence of a gene is broken into a set of overlapping subsequences of size k, and frequencies of these n unique k-mers form a feature vector x. Values of k equal to 1-6 were evaluated for prediction accuracy (see "Model Training, Cross-Validation, and Assessment" section).
The second feature type, PseAAC, has n ¼ (20 þ k) dimensions and takes into account frequencies of 20 amino acids, as well as correlations of hydrophobicity, hydrophilicity, and sidechain mass of amino acids that are k positions apart in the sequence of the gene (after Chou [2001]), More precisely, PseAAC feature set x of a sequence of length L consisting of amino acids R 1 R 2 . . .R L is defined as follows: where r i is the frequency of the ith amino acid (out of 20 possible), x is a weight constant for the order effect that was set to 0.05, and s k (k ¼ 1, . . ., k) are sequence ordercorrelation factors. These factors are defined as and H 1 ðR i Þ, H 2 ðR i Þ, and MðR i Þ denote the hydrophobicity, hydrophilicity, and side-chain mass of amino acid R i , respectively. The H 1 ðR i Þ, H 2 ðR i Þ, and MðR i Þ scores were subjected to a conversion as described in the following equation: where H 0 1 i ð Þ is the original hydrophobicity value of the ith amino acid, H 0 2 i ð Þ is hydrophilicity value, and M 0 i ð Þ is the  Kaundal et al. [2013]). For a given sequence, each of its encoded amino acids was counted toward one of the 19 classes, and the overall scores for each class were normalized by the length of the sequence to form n ¼ 19-dimensional feature vector x.

Model Training, Cross Validation, and Assessment
For each GTA gene, parameter, and feature type, the accuracy of the classifier was evaluated using a 5-fold crossvalidation scheme, in which a data set was randomly divided into five different subsamples. Four parts were combined to form the training set, whereas the fifth part was used as the validation set and its SVM-assigned classifications compared with the known classes. This step was repeated five times, so that every set was tested as a known class at least once.
For each class y ("GTA" and "Virus"), the results were evaluated by their accuracy scores, defined as the number of correctly classified homologs divided by the total number of homologs that were tested. The cross-validation procedure was repeated ten times to reduce the partitioning bias, and the generated results were averaged, resulting in an Average Accuracy Score (AAS) for each gene and each class. To ensure that "GTA" and "Virus" classes had equal impact on the accuracy assessment, each class was assigned a weight of 0.5. The final, Weighted Accuracy Score (WAS) was calculated as WAS g ¼ 100 Â ðAAS g GTA Â 0:5 þ AAS g Virus Â 0:5Þ: (9) The most suitable "softness" of the SVM margin was determined by trying all possible combinations of several raw diagonal values of the matrix C (0.01, 0.1, 1, 100, 10,000) and the threshold t (0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.1). The set of parameters and features that resulted in the highest WAS was defined as the optimal set for a gene g. If multiple parameter and feature sets resulted in the equally highest WAS, we applied the following parameter selection criteria, in the priority order listed, until only one parameter set was left: First, we selected parameter set(s) with k-mer size that on average performed better than other k-mer sizes; second, we avoided parameter set(s) that included PseAAC and physicochemical composition features; third, we selected parameter set(s) with the value of C that gives the highest average accuracy across the remaining parameter sets; and finally, we opted for the parameter set with the value of t that also gives the highest WAS across the remaining parameter sets. Additionally, we evaluated classifier accuracy using the Matthews correlation coefficient (MCC) (Matthews 1975).

Selection of Alphaproteobacterial Genomes for Testing the Presence of RcGTA-Like Genes
From the alphaproteobacterial genomes deposited to the RefSeq database between January 2014 and January 2019, we selected 636 complete and 789 high-quality draft genomes, with the latter defined as genome assemblies with N50 length >400 kb. The taxonomy of each genome was assigned using the GTDB-Tk toolkit (Parks et al. 2018).
The GTDB assignment is based on the combination of Average Nucleotide Identity (ANI) (Jain et al. 2018) and phylogenetic placement on the reference tree (as implemented in the pplacer program [Matsen et al. 2010]

Detection of RcGTA-Like Genes and Head-Tail Clusters in Alphaproteobacteria
The compiled training data sets of the RcGTA-like genes (see the "SVM Training Data") were used as queries in BlastP (Evalue < 0.001; query and subject overlap by at least 60% of their length) searches of amino acid sequences of all annotated genes from the 1,423 alphaproteobacterial genomes. Acquired homologs of unknown affiliation (sequences u) were then assigned to either "GTA" or "Virus" category by running the SVM classifier with the identified optimal parameters for each gene g (table 2). The proximity of the individually predicted RcGTA-like genes in each genome was evaluated by running the DBSCAN algorithm (Ester et al. 1996) implemented in an inhouse Python script (available in a GitHub repository; see "Software Implementation" section). The retrieved homologs were designated as an RcGTA-like head-tail cluster only if at least 6 of the RcGTA-like genes had no more than 8,000 base pairs between any two adjacent genes. The maximum distance cutoff was increased from the 5,000 base pairs used for the clustering of homologs in the training data sets (see "SVM Training Data" section) because the SVM classifier evaluates only 11 of the 17 RcGTA-like head-tail cluster homologs and therefore the distances between some of the identified RcGTA-like genes can be larger.
To reduce the bias arising from the overrepresentation of particular taxa in the estimation of the RcGTA-like cluster abundance in Alphaproteobacteria, the 1,423 genomes were grouped into Operational Taxonomic Units (OTUs) by computing pairwise ANI using the FastANI v1.1 program (Jain et al. 2018) and defining boundaries between OTUs at the 95% threshold. Because not all OTUs consist uniformly of genomes that were either all with or all without the RcGTAlike clusters, each RcGTA-like cluster in an OTU was assigned a weight of "1/(number of genomes in an OTU)." The abundance of the RcGTA-like clusters in different alphaproteobacterial orders was corrected by summing up the weighted numbers of RcGTA-like clusters.

Software Implementation
The above-described SVM classifier, generation of sequence features, and preparation and weighting of training data are implemented in a Python program called "GTA-Hunter." The source code of the program is available via GitHub at https:// github.com/ecg-lab/GTA-Hunter-v1. The repository also contains training data for the detection of the RcGTA-like headtail cluster genes, examples of how to run the program, and the script for clustering of the detected RcGTA-like genes using the DBSCAN algorithm.

Assessment of Prevalence of the RcGTA-Like Clusters among Putative Prophages
Putative prophages in the 1,423 alphaproteobacterial genomes were predicted using the PHASTER web server (Arndt et al. [2016]; accessed in January 2019). The PHASTER program was chosen due to its solid performance in benchmarking studies (de Sousa et al. 2018) and its useful scoring system that ranks predictions based on a prophage region completeness (Song et al. 2019). To restrict our evaluation to likely functional prophages, only predicted prophages with the PHASTER score >90 (i.e., classified as "intact" prophages) were retained for further analyses. The proportion of these predicted intact prophages classified by the GTA-Hunter as "GTA"s was calculated by comparing the overlap between the genomic locations of the predicted intact prophages and the putative RcGTA-like regions.

Construction of the Alphaproteobacterial Reference Phylogeny
From the set of 120 phylogenetically informative proteins (Parks et al. 2017), 83 protein families that are present in a single copy in >95% of 1,423 alphaproteobacterial genomes were extracted using hmmsearch (E-value < 10 À7 ) via modified AMPHORA2 scripts (Wu and Scott 2012) (supplementary table S9, Supplementary Material online). For each protein family, homologs from Escherichia coli str. K12 substr. DH10B and Pseudomonas aeruginosa PAO1 genomes (also retrieved using hmmsearch, as described above) were added to be used as an outgroup in the reconstructed phylogeny. The amino acid sequences of each protein family were aligned using MUSCLE v3.8.31 (Edgar 2004). Individual alignments were concatenated, keeping each alignment as a separate partition in further phylogenetic analyses (Chernomor et al. 2016). The most suitable substitution model for each partition was selected using ProteinModelSelection.pl script downloaded from https://github.com/stamatak/standard-RAxML/ tree/master/usefulScripts; last accessed April 2019. Gamma distribution with four categories was used to account for rate heterogeneity among sites (Yang 1994). The maximum likelihood phylogenetic tree was reconstructed with IQ-TREE v 1.6.7 (Nguyen et al. 2015). One thousand ultrafast bootstrap replicates were used to get support values for each branch (Minh et al. 2013;Hoang et al. 2018). The concatenated sequence alignment in PHYLIP format and the reconstructed phylogenetic tree in Newick format are available in the FigShare repository.

Examination of Conditions Associated with the Decreased Fitness of the Knock-Out Mutants of the RcGTA-Like Head-Tail Cluster Genes
From the three genomes that are known to contain RcGTAlike clusters (Caulobacter crescentus NA100, Dinoroseobacter shibae DFL-12, and Phaeobacter inhibens BS107), fitness experiments data associated with the knock-out mutants of the RcGTA-like head-tail cluster genes were retrieved from the Fitness Browser (Price et al. [2018]; accessed in May 2019 via http://fit.genomics.lbl.gov/cgi-bin/myFrontPage.cgi). Price et al. (2018) defined gene fitness as the log 2 change in abundance of knock-out mutants in that gene during the experiment. For our analyses, the significantly decreased fitness of each mutant was defined as a deviation from the fitness of 0 with a t À score j j! 4. The conditions associated with the significantly decreased fitness were compared across the RcGTA-like head-tail cluster genes in all three genomes.

GTA-Hunter Is an Effective Way to Distinguish RcGTA-Like Genes from Their Viral Homologs
The performance of the developed SVM classifier depends on values of parameters that determine type and composition of sequence features, specify acceptable levels of misclassification, and account for biases in taxonomic representation of the sequences in the training sets. To find the most effective set of parameters, for each of the 11 RcGTA-like head-tail genes with the sufficient number of homologs available ( fig. 1; also, see Materials and Methods for details), we evaluated the performance of 1,435 different combinations of the parameters using a cross-validation technique (supplementary table S10, Supplementary Material online).
Generally, the classifiers that only use k-mers as the feature have higher median WAS values than the classifiers that solely rely either on physicochemical properties of amino acids or on PseAAC (supplementary fig. S2 and table S10, Supplementary Material online), indicating that the conservation of specific amino acids blocks is important in delineation of RcGTA-like genes from their viral counterparts. However, the WAS values are lower for the large k-mer sizes (supplementary fig. S2, Supplementary Material online), likely due to the feature vectors becoming too sparse. Consequently, parameter combinations with values of k above 6 were not tested. The WAS values are also lower for k ¼ 1, likely due to the low informativeness of the feature. The lowest observed WAS values involve usage of physicochemical properties of proteins as a feature (supplementary fig. S2 and table S10, Supplementary Material online), suggesting the conservation of physicochemical properties of amino acids among proteins of similar function in viruses and RcGTA-like regions despite their differences in the amino acid composition. The more sophisticated recoding of physicochemical properties of amino acids as the PseAAC feature performs better, but for all genes its performance is worse than the best-performing kmer (supplementary fig. S2 and table S10, Supplementary Material online).
For several genes, the highest value of WAS was obtained with multiple combinations of features and parameter values (supplementary table S10, Supplementary Material online). Based on the above-described observations of the performance of individual features, we preferred parameter sets that did not include PseAAC and physicochemical composition features, and selected k-mer size that on average performed better than other k-mer sizes (see Materials and Methods for the full description of the parameter selection procedure).
For individual genes, the WAS of the selected parameter set ranges from 95.6% to 100% (table 2), with 5 out of 11 genes reaching the WAS of 100%. The two genes with the highest WAS below 99% (g6 and g12) have the smallest number of viral homologs available for training (table 2). Additionally, several viral homologs in the training data sets for g6 and g12 genes have smaller phylogenetic distances to "true GTA" homologs than to other "true virus" homologs (supplementary table S11, Supplementary Material online). As a result, the SVM classifier erroneously categorizes some of the RcGTA-like g6 and g12 genes as "viral," resulting in the reduced classifier efficacy (supplementary table S10, Supplementary Material online).
Assessment of accuracy using the MCC generally agrees with the results based on WAS (table 2 and supplementary  table S10, Supplementary Material online). For 10 out of 11 genes, the set of parameters with the highest WAS also has the highest MCC. For gene g6, there are sets of parameters with higher MCC than the MCC for set of parameters with the highest WAS, but the differences among the MCC values are small (supplementary table S10, Supplementary Material online). Therefore, the combinations of features and parameters chosen using the WAS scheme (table 2)

GTA-Hunter Predicts Abundance of RcGTA-Like Head-Tail Clusters in Alphaproteobacteria
The 1,423 examined alphaproteobacterial genomes contain 7,717 homologs of the 11 RcGTA genes. The GTA-Hunter classified 6,045 of these homologs as "GTA" genes (supplementary table S12, Supplementary Material online). However, many genomes are known to contain regions of decaying viruses that may be too divergent to be recognizably "viral" and there is at least one known case of horizontal gene transfer of several GTA genes into a viral genome (Zhan et al. 2016), raising a possibility that some of the predicted "GTA" genes may not be part of "true GTA" genomic regions. To minimize such false positives, we imposed an extra requirement of multiple predicted RcGTA-like genes to be in proximity on a chromosome. Specifically, we called a genomic region the putative RcGTA-like cluster only if it consisted of at least six genes classified as "GTA." We found that the RcGTAlike clusters defined that way are present in one (and only one) copy in 818 of the 1,423 ($57.5%) examined alphaproteobacterial genomes (supplementary table S13,  Supplementary Material online, and table 3). Uneven taxonomic representation of Alphaproteobacteria among the analyzed genomes may inflate this estimation of the abundance of the GTA-harboring genomes within the class. To correct for this potential bias, 1,423 genomes were grouped into 797 OTUs based on the ANI of their genomes (supplementary table S14, Supplementary Material online). Although indeed some taxonomic groups are overrepresented in the set of 1,423 genomes, in 450 of the 797 OTUs (56.4%) all OTU members contain the putative RcGTA-like clusters (supplementary table S14, Supplementary Material online).

RcGTA-Like Clusters Are Widely Distributed within a Large Subclade of Alphaproteobacteria
The 818 genomes with the RcGTA-like gene clusters detected in this study are not evenly distributed across the class (table 3) but are found only in a clade that includes seven orders (clade 4 in fig. 3). Overall, 66% of the examined OTUs within the clade 4 are predicted to have an RcGTA-like cluster (table 3).
RcGTA-like clusters are most abundant in clade 6 ( fig. 3), a group that consists of the orders Rhodobacterales and Caulobacterales (table 3).
Although the two unclassified orders that contain RcGTAlike clusters are represented by only two genomes (clades 2 and 3 in fig. 3), their position on the phylogenetic tree of Alphaproteobacteria suggests that the RcGTA-like element may have originated earlier than was proposed by Shakya et al. (2017) (clade 5 in fig. 3). Given that RcGTA-like headtail cluster genes are readily detectable in viral genomes, it is unlikely that the RcGTA-like clusters remained completely undetectable in the examined genomes outside the clade 4 due to the sequence divergence. Therefore, an RcGTA-like element was unlikely to be present in the last common ancestor of all Alphaproteobacteria (clade 7 in fig. 3), which was suggested when only a limited number of genomic data were available (Lang and Beatty 2007).

Most of the Detected RcGTA-Like Clusters Can Be Mistaken for Prophages
Among the 818 detected RcGTA-like clusters, the functional annotations of the 11 examined genes were similar to the prophages and none of them refer to a "gene transfer agent" (data not shown). Because at least 11 of the 17 RcGTA headtail cluster genes have detectable sequence similarity to viral genes (supplementary table S3, Supplementary Material online), it is likely that, if not recognized as GTAs, many of the putative RcGTA-like clusters will be designated as "prophages" in genome-wide searches of prophage-like regions. To evaluate this hypothesis, we predicted prophages in the set of 1,423 alphaproteobacterial genomes, and limited our analyses to the predicted prophage regions that are more likely to be functional integrated viruses ("intact" prophages; see Materials and Methods for the criteria). Indeed, of the 1,235 "intact" prophage regions predicted in the clade 4 genomes, 664 (54%) coincide with the RcGTA-like clusters ( fig. 4). Conversely, 664 out of 818 of the predicted RcGTAlike clusters (81%) are classified as intact prophages. Of the 351 RcGTA-like clusters that contain all 11 examined genes, 323 (92%) are classified as intact prophages. Interestingly, within 818 genomes that contain RcGTA-like clusters, the average number of predicted intact prophages is 1.23 per genome ( fig. 5), which is significantly higher than 0.51 prophages per genome in genomes not predicted to contain RcGTA-like clusters (P value < 0.22 Â 10 À17 ; Mann-Whitney U test). If the 664 RcGTA-like regions classified as intact prophages are removed from the genomes that contain them, the average number of predicted "intact" prophages per genome drops to 0.42 ( fig. 5) and the difference becomes insignificant (P value ¼ 0.1492; Mann-Whitney U test). This analysis suggests that an elevated number of the observed predicted prophage-like regions in some . The branches of the reference tree are collapsed at the taxonomic rank of "order," and the number of OTUs within the collapsed clade is shown in parentheses next to the order name. Orange and brown bars depict the proportion of OTUs with and without the predicted RcGTA-like clusters, respectively. The orders that contain at least one OTU with an RcGTA-like cluster are colored in green. Nodes 1-3 mark the last common ancestors of the unclassified orders. Node 4 marks the lineage where, based on this study, the RcGTA-like element should have already been present. Nodes 5 and 7 mark the lineages that were previously inferred to represent last common ancestor of the RcGTA-like element by Shakya et al. (2017) and Lang and Beatty (2007), respectively. Node 6 marks the clade where RcGTA-like elements are the most abundant. The tree is rooted using homologs from Escherichia coli str. K12 substr. DH10B and Pseudomonas aeruginosa PAO1 genomes. Branches with ultrafast bootstrap values >¼95% are marked with black circles. The scale bar shows the number of substitutions per site. The full reference tree is provided in the FigShare repository. alphaproteobacterial genomes may be due to the presence of unrecognized RcGTA-like elements.

Discussion
Our study demonstrates that RcGTA-like and bona fide viral homologs can be clearly separated from each other using a machine-learning approach. The highest accuracy of the classifier is achieved when it primarily relies on short amino acid kmers present in the examined genes. This suggests that the distinct primary amino acid composition of the RcGTA-like and truly viral proteins is what allows the separation of the two classes of elements ( fig. 1). However, the cause of the amino acid preferences of the RcGTA-like genes, and especially enrichment of the encoded proteins in alanine and glycine amino acids ( fig. 1), remains unknown. Given the structure of the genetic code, the skewed amino acid composition may be the driving force behind the earlier described significantly higher %G þ C of the genomic region encoding the RcGTA-like head-tail cluster than the average %G þ C in the host genome (Shakya et al. 2017). Regardless of the cause of the skewed amino acid composition, the successful identification of the putative RcGTA-like elements in alphaproteobacterial taxa only distantly related to R. capsulatus (clade 4 in fig. 3) suggests that the selection to maintain these elements likely extends beyond the Rhodobacterales order. Nevertheless, whether these putative elements indeed encode GTAs, as we currently understand them, remains to be experimentally validated.
The benefits associated with the GTA production that would underlie the selection to maintain them remain unknown. In a recently published high-throughput screen for phenotypes associated with specific genes (Price et al. 2018), knockout of the RcGTA-like genes in the three genomes that encode the RcGTA-like elements resulted in decreased fitness of the mutants (in comparison to the wild type) under some of the tested conditions (supplementary table S15, Supplementary Material online). Interestingly, the conditions associated with the most statistically significant decreases in fitness correspond to the growth on nonglucose sugars, such as D-raffinose, b-lactose, D-xylose, and m-inositol. Overall, carbon source utilization is the most common condition that elicits statistically significant fitness decreases in the mutants. The RcGTA production was also experimentally demonstrated to be stimulated by carbon depletion . Further experimental work is needed to identify the link between the RcGTA-like genes expression and carbon utilization. Conversely, absence of the RcGTA-like elements in some of the clade 4 genomes ( fig. 3) indicates that in some ecological settings RcGTA-like elements are either deleterious or "useless" and thus their genes were either purged from the host genomes (if RcGTA-like element evolution is dominated by vertical inheritance) or not acquired (if horizontal gene transfer plays a role in the RcGTA-like element dissemination).
Previous analyses inferred that RcGTA-like elements had evolved primarily vertically, with few horizontal gene exchanges between closely related taxa (Lang and Beatty 2007;Hynes et al. 2016;Shakya et al. 2017). Under this hypothesis, the distribution of the RcGTA-like head-tail clusters in alphaproteobacterial genomes suggests that RcGTA-like element originated prior to the last common ancestor of the FIG. 5.-The number of predicted "intact" prophages in alphaproteobacterial genomes. The 1,423 genomes were divided into two groups: those without GTA-Hunter-predicted RcGTA-like clusters (in brown) and those with these RcGTA-like clusters (in dark orange). For the latter group, the number of prophages was recalculated after the RcGTA-like clusters that were designated as prophages were removed (in light orange). The distribution of the number of predicted intact prophages within each data set is shown as a violin plot with the black point denoting the average value. The data sets with significantly different average values are denoted by asterisks (P < 0.001; Mann-Whitney U test).