OrthoPhyl—streamlining large-scale, orthology-based phylogenomic studies of bacteria at broad evolutionary scales

Abstract There are a staggering number of publicly available bacterial genome sequences (at writing, 2.0 million assemblies in NCBI's GenBank alone), and the deposition rate continues to increase. This wealth of data begs for phylogenetic analyses to place these sequences within an evolutionary context. A phylogenetic placement not only aids in taxonomic classification but informs the evolution of novel phenotypes, targets of selection, and horizontal gene transfer. Building trees from multi-gene codon alignments is a laborious task that requires bioinformatic expertise, rigorous curation of orthologs, and heavy computation. Compounding the problem is the lack of tools that can streamline these processes for building trees from large-scale genomic data. Here we present OrthoPhyl, which takes bacterial genome assemblies and reconstructs trees from whole genome codon alignments. The analysis pipeline can analyze an arbitrarily large number of input genomes (>1200 tested here) by identifying a diversity-spanning subset of assemblies and using these genomes to build gene models to infer orthologs in the full dataset. To illustrate the versatility of OrthoPhyl, we show three use cases: E. coli/Shigella, Brucella/Ochrobactrum and the order Rickettsiales. We compare trees generated with OrthoPhyl to trees generated with kSNP3 and GToTree along with published trees using alternative methods. We show that OrthoPhyl trees are consistent with other methods while incorporating more data, allowing for greater numbers of input genomes, and more flexibility of analysis.


Introduction
Underpinning every aspect of an organism's biology is its evolutionary history.Phylogenetic methods strive to reconstruct these histories using genetic data.With the advent of "Next-" and "Third-gen" sequencing methodologies, the amount of genetic data in public databases has increased dramatically.At writing, there are currently 2,018,201 bacterial genome assemblies available through NCBI alone (ncbi.nlm.nih.gov/assembly/?term= bacteria).The wealth of available genetic data has transformed the fields of bacterial phylogenetics and taxonomy (Konstantinidis and Tiedje 2007;Varghese et al. 2015;Jain et al. 2018).Unfortunately, most whole genome phylogenetic analysis methods require specialized bioinformatic data processing and expertise (Smith 2013;Lozano-Fernandez 2022).There are few very large-scale trees to place these sequences into an evolutionary framework.The available large-scale trees are under-resolved and built by reconciling many disparate phylogenetic studies, such as NCBI taxonomy (Schoch et al. 2020) or are generated with methods unsuited for narrow and broad evolutionary distances (Hördt et al. 2020) reducing their utility at multiple evolutionary scales.This flood of genomic data necessitates an easy-to-use phylogenetic analysis pipeline to help reveal the evolutionary context for the myriad of sequenced bacteria.

Single and multi-locus
A classic way to generate bacterial trees is to compare single homologous loci (e.g.16S ribosomal, recA, gyrA, rpoB, or dnaK genes).While largely replaced by other methods, single gene trees have some advantages, which include the ability to use conserved primer sites to amplify sequences, very low computational intensity, and the availability of curated databases (SILVA, RDP, NCBI GenBank, etc.).Using genes other than rDNA allows for greater resolution at certain evolutionary distances but still lacks broad resolution (i.e.Near resolution for swiftly evolving genes, far for conserved) (Yang 1998).However, these trees are prone to misleading topologies due to horizontal gene transfers, incomplete lineage sorting, paralogous gene conversions, and limited phylogenetic information (Huerta-Cepas et al. 2007).Using multi-locus alignments to build phylogenetic trees increases the total amount of phylogenetic information (more sequence used), reduces the effects of incongruent gene tree topologies due to recombination or horizontal gene transfer, and widens the breadth of evolutionary resolution (Gontcharov et al. 2004).However, in multi-locus analyses of taxa without well-vetted multi-locus methods, locus selection is non-trivial and the effort and cost of primer design and amplification optimization scales with the number of sequences used, which can rapidly become prohibitive for many labs.Additionally, strains of many taxa, like Brucella, are not differentiated well by available multi-locus methods, leading to unknown evolutionary histories (Sankarasubramanian et al. 2019).

Whole genome
Classically, whole genome alignments were required to identify Single Nucleotide Polymorphisms (SNPs) of assemblies (see Shakya et al. 2020;Treangen et al. 2014 for method details).These methods add considerable phylogenetic data for tree building and can produce different topologies than multi-locus alignments, illustrating their potential utility (Baltrus et al. 2014).A major drawback of whole genome alignments is that they tend to produce short alignments or fail outright, for more distantly related genomes (Darling et al. 2004;Angiuoli and Salzberg 2011;Chung et al. 2018;Shakya et al. 2020).Thus, phylogenetic information drops quickly as a function of evolutionary distance.The practical result is a lack of resolution at even moderate evolutionary distances.For alignments to a reference, having the "missing" data enriched for the more divergent sites will possibly lead to an underestimation of the sequence to reference evolutionary distances for more divergent sequences (Spencer et al. 2007;Shavit Grievink et al. 2013;Bertels et al. 2014).Alternatively, one can use all-vs-all whole genome alignments, however, they scale poorly and are limited in the number of input taxa that can be aligned (Angiuoli and Salzberg 2011).

K-mer overlap
An emergent method to quickly estimate evolutionary histories from genome assemblies is to use k-mer content.This involves identifying subsequences of length k within each assembly and calculating the shared k-mer content (for details see Bussi et al. 2021).The advantage of this class of methods is that it is extremely fast relative to alignment-based methods.The comparison of k-mer overlap between sequences naturally leads to pairwise distance matrices, thus neighbor-joining trees can be generated directly from the results (Saitou and Nei 1987).Recent work has integrated more complicated distance statistics with success (Tang et al. 2023).Crucially, k-mer distances become non-linearly associated with true percent identity, leading to erroneous estimated branch lengths (Jain et al. 2018).
An alternative k-mer-based approach is kSNP (Gardner and Hall 2013;Gardner et al. 2015;Hall and Nisbet 2023), which identifies SNPs by flanking conserved k-mers.This approach allows the use of tree-building methods including parsimony, maximum likelihood, and Bayesian.Like whole genome alignments and k-mer overlap methods, kSNP suffers from a rapid decline in phylogenetic information as sequences diverge.Additionally, the signal/noise ratio degrades quickly because non-homologous k-mers are increasingly common as sequences diverge (Gardner and Hall 2013).

Predicted gene alignments
To use the wealth of data generated by NGS methods and simultaneously simplify the problem of whole genome alignments, CDS or protein sequences derived from annotations or transcriptome sequencing can be compared.This method breaks the problem of identifying homologous loci into two tractable parts: gene identification, then identification of homologous sequences.This approach identifies many phylogenetically informative sites (like whole genome), but alignments are more tractable (like single loci).However, gene-based alignments are negatively affected by inferred paralogs, i.e. duplicated genes or contamination.Thus, most methods filter out gene families with paralogs.However, see ASTRAL-Pro for a method that is "paralog aware" (Zhang et al. 2020).A widely used pipeline that accomplishes this task is GToTree, which annotates assemblies, uses precomputed hidden Markov models to identify single copy orthologs, aligns predicted amino acid sequences, then generates trees with FastTree2 or IQTree2 from concatenated alignments (Lee 2019).While focused on protein alignments, GToTree can also be run in nucleotide mode to use CDS alignments for phylogenetic inference instead.
A trade-off arises while using predicted coding or protein sequences to build species trees.Coding sequences (CDS) have the most phylogenetic information due to codon degeneracy but are hard to align at high divergencies (States et al. 1991).Proteins remain alignable at high divergences but lack the information content of CDS alignments.This can be addressed by converting protein alignments to corresponding codon alignments, leveraging nucleotide phylogenetic information and protein alignment accuracy (Wernersson and Pedersen 2003;Bininda-Emonds 2005).However, at increasing evolutionary distances, even with accurate alignments, mutational saturation in nucleotide data can negatively affect phylogenetic estimates.Evolutionary model selection, avoiding compositional bias, and dense taxon sampling can largely alleviate these issues, see (Kapli et al. 2023) for details.

Benefits of an automated workflow
The skills required to generate bacterial phylogenetic trees from whole genomes are extensive.Some of the many steps include gathering and annotating assemblies, identifying and filtering orthologs, sequence alignment, trimming and concatenation, and finally, tree inference (Ashford et al. 2020;Lozano-Fernandez 2022).Each one of these steps, especially for large data sets, requires expertise in picking parameters, file management, and data format manipulation and filtering.Many of these steps also require familiarity with UNIX command line.
Beyond taxonomic studies, the phylogenetic context of organisms is becoming increasingly important for standard molecular and evolutionary studies.Even with the available data and a clear need for the phylogenetic placement of focal study species, technical barriers preclude many researchers from inferring their evolutionary histories.An easy-to-use, accurate phylogenomic pipeline will encourage wide adoption of phylogenomic methods to complement ongoing environmental, evolutionary, and clinical microbiological research, and perhaps help standardize the estimation of phylogenetic trees within and between research labs.
To meet this demand, we present OrthoPhyl, a phylogenomic pipeline that takes bacterial genomes as input, annotates them, identifies orthologs, converts protein to nucleotide alignments, and builds species trees with both concatenated alignments and gene tree to species tree reconciliation.The workflow accepts an arbitrarily large number of input genomes (tested up to 1,200 here).To accelerate analysis, a subset of samples representing the whole dataset's diversity are identified, their proteomes are used to identify orthologs and build hidden Markov models, which are then expanded to the full input dataset by iterative searches.This strategy allows the generation of trees for 689 Brucella assemblies in ∼48 hrs using 30 cpus and 58.3GB memory with no hands-on time required.This pipeline is designed to be an easy-to-install, scalable, turn-key solution for generating high-resolution bacterial trees from diverse clades.

OrthoPhyl's workflow
The structure of OrthoPhyl can be broken down into four main steps: Genome assemblies are annotated (Fig. 1a), resulting proteins are assigned to orthogroups (Fig. 1b), orthogroup proteins are aligned and converted to codon alignments (Fig. 1c), and finally, concatenated codon alignments are used to infer phylogenetic trees (Fig. 1d).All processes below use default parameters unless specified otherwise.

Structural annotation
OrthoPhyl starts by generating gene calls from input assemblies (Fig. 1a) with Prodigal (Hyatt et al. 2010).Redundant CDSs (> 99.9% nucleotide identity) found in a genome are removed using bbmap's dedupe (BBMap (2022) Oct 6) with the logic they are likely the result of recent gene duplications and provide no or little phylogenetic information.This removes unnecessary paralogs from the dataset to preserve the usefulness of the containing orthogroup for downstream analysis.

Orthogroup assignment
To infer orthogroups to use for tree generation, OrthoFinder is used with default parameters except for using multiple sequence alignments based on gene tree inference (Emms and Kelly 2015;Emms and Kelly 2019).When species trees are being generated for large numbers (default: > 30) of genome assemblies, finding orthogroups directly becomes intractable due to the exponential increase in computational time of the all-vs-all homology search.To overcome this, OrthoPhyl identifies a diversity-spanning subset of assemblies to identify orthogroups.Briefly, average nucleotide identity (ANI) is estimated for all assembly pairs with FastANI (Jain et al. 2018).A custom algorithm is used to cluster assemblies based on pairwise ANI values.Briefly, through successive rounds, samples with the highest ANI are merged into clusters.The merged samples' ANI to other samples and clusters are averaged.Rounds of merging are performed until N number (default: 30) of clusters are formed.Single representative samples for each of the N groups are chosen as input for OrthoFinder to infer a reduced set of orthogroups.FastANI fails to give an ANI percent for genome comparisons with ANI less than ∼75%.For these instances, an arbitrary value of 50% is used.The representative genome picking will be unaffected if there are less than 30 clusters at this or greater distances to all other clusters.
To expand orthogroups to the full assembly dataset, proteins from each orthogroup are realigned with Mafft (Katoh and Standley 2013), then hmmer (Eddy 2008(Eddy , 2009(Eddy , 2011) ) generates hidden Markov models and searches against all predicted proteins using default parameters.OrthoPhyl then finds the minimum HMM hit score cutoff which removes all paralogs, the remaining hits above this score (if any) are designated as the final orthogroup, ensuring all potential paralogs are removed and the greatest number of taxa are represented in the orthogroup.Each orthogroup is then realigned and HMMs are again generated and searched against all proteins to capture additional, more divergent orthologues.The workflow is divided into four main tasks a) annotate assemblies and remove identical CDSs.If more than "N" assemblies are being analyzed, b1) identify a subset of diversity-spanning assemblies, b2) pass them through OrthoFinder to generate orthogroups, and b3) expand the OrthoFinder-identified orthogroups to the full dataset of assemblies through iterative HMM searches.c) Align full orthogroup protein sets, generate and trim matching codon alignments, then filter orthogroups by taxon representation.Finally, d) estimate species tree topologies with concatenated codon alignment supermatrices along with a gene tree to species tree consensus method.

Codon alignment generation
Once orthogroups are identified for the full set of assemblies, protein sequences are realigned with Mafft (Katoh and Standley 2013).These alignments are then used as the basis for codon alignment using PAL2NAL with codon table 11 (Suyama et al. 2006).Codon alignments are then trimmed with trimAL using arguments "-resoverlap .5 -seqoverlap 50 -gt .80-cons 60 -w 3″ (Capella-Gutiérrez et al. 2009).Users can change trimming options with the "control_files.user"if required.Alignment_Assessment (Portik et al. 2016), is used to visualize the quality of codon alignments, including their phylogenetic signals, taxa per alignment, and alignments per taxa (Fig. 2).This is critical for ensuring highquality data are used for generating trees and can aid in troubleshooting phylogenies with low branch support.
Our tool then classifies orthogroups as relaxed or strict singlecopy orthologs (SCOs).Strict SCOs are genes found in every taxon with no paralogs, while relaxed SCOs are found in a subset of assemblies still with no paralogs.The percent of assemblies with the SCO to count in this set is tunable with the default set as 30% (shown in Fig. 2a: red dashed line).See (Wiens and Morrill 2011) for analysis on the effects of missing data.These two sets of orthogroups are used in parallel moving forward to generate species trees.

Species tree estimation
Two general methods are used to infer species trees for the input assemblies: gene-tree to species-tree estimation with ASTRAL-III (Zhang et al. 2018) and maximum likelihood methods based on codon alignment supermatrices generated with catfasta2phyml (https://github.com/nylander/catfasta2phyml).First, to use ASTRAL, ML gene trees are generated with either "FastTree2 -gtr -gamma" (Price et al. 2010) or IQtree2 with default settings (Nguyen et al. 2015;Kalyaanamoorthy et al. 2017;Hoang et al. 2018) per user input.Then ASTRAL will generate species trees with the strict and relaxed SCO gene tree datasets.For the maximum likelihood species tree methods, the user can specify any combination of IQtree, FastTree2 and/or RAxML (Stamatakis 2014) to infer tree structure(s).Default parameters are used for each tree method except for RAxML and FastTree2 being set to use the GTR + gamma model and 100 and 1,000 bootstraps (with Shimodaira-Hasegawa test), respectively, and IQtree2 set to run ModelFinder to identify a best fit evolutionary model and run 1,000 ultrafast bootstraps.
Finally, the trees generated during the OrthoPhyl run are compared with generalized Robinson-Foulds metrics provided by ETEtoolkit (Huerta-Cepas et al. 2007) indicating the stability of the resultant tree structure when different inference methods and matrix completenesses are used.All default parameters used with these tools can be found in the OrthoPhyl repository at OrthoPhyl/control_file.defaults.

Datasets
As a proof of concept, OrthoPhyl was used to build trees for three bacterial clades.We analyzed well-characterized E. coli/Shigella strains.These included a list of 34 complete genomes analyzed in (Shakya et al. 2020), obtained from NCBI.
The genus Brucella was used to show OrthoPhyl's ability to resolve very closely related sequences while dealing with very long relative branch lengths in the same tree (Ochrobactrum group).Assemblies were acquired using NCBI Taxon number 234 (Schoch et al. 2020) with the utility script "gather_genomes.sh"(Supplementary Methods -Gathering and Filtering Assemblies and Supplementary Fig. 1) which is packaged with OrthoPhyl (github.com/eamiddlebrook/OrthoPhyl/utils/).This resulted in 689 NCBI Genbank or RefSeq assemblies after filtering for >98% completion, < 0.1% duplication, and < 1% contamination with CheckM (Benson et al. 2013;O'Leary et al. 2016;Parks et al. 2015).Here, duplication is the total number of loci identified as marker genes divided by the number of identified marker genes (see Supplementary methods "Gathering and Filtering Assemblies" and "utils/checkm_assemblies.slurm" within the github repository).It is important to note that Ochrobactrum species were recently moved to the genus Brucella (Hördt et al. 2020).However, for clarity and due to controversy within the field, we chose to keep the Ochrobactrum labeling for these species.An assembly of Mycoplana dimorpha (GCA_003046475.1)was added to the dataset as an outgroup.
Finally, a tree for all NCBI Rickettsiales assemblies was generated to illustrate OrthoPhyl's utility in dealing with bacterial order level divergences.For Rickettsiales assemblies, we again used gather_genomes.sh,this time with NCBI taxon number 766.To illustrate a more challenging scenario, filtering of Rickettsiales assemblies was less stringent with cut-offs at > 95% completion, < 0.2% duplication, and < 1.5% contamination, again using CheckM (Parks et al. 2015) and our independent calculation of duplication (Supplementary methods "Gathering and Filtering Assemblies".This resulted in 1,201 assemblies.Seven Pelagibacter assemblies were added to this dataset to serve as an outgroup.Accessions and stats for all genome assemblies used in this manuscript are available in Supplementary Tables 2-4.

Small Tree of Closely Related Assemblies: E. coli and Shigella
To illustrate Orthophyl's ability to build trees from moderate numbers of closely related samples on a local Linux computer, we ran our E. coli analysis (34 genomes) on a desktop with a 12-core Intel processor with 16 GB RAM running RHEL8 (centOS8).Because the assemblies are of high quality and the stains are closely related, OrthoPhyl was used to generate trees from only the strict SCO dataset.Run statistics are provided in Table 1.The full pipeline took 3.25 hours to build trees with FastTree2 and ASTRAL.OrthoPhyl identified 2,142 strict SCOs and alignment of these sequences constitutes 2.105 MB with 6% phylogenetically informative sites for the SCO (Table 1).
The resulting tree built by FastTree2 from the SCO-strict dataset shows successful differentiation of the E. coli phylotypes (Fig. 3).This tree shows B1 and A as sister phylotypes, with S. sonnei, S. boydii, and S. flexneri between them.That clade is in turn next to phylotype E and S. dysenteriae, then D1 and D2.Finally, B2 is the sister to all of them.This topology is identical to the reported tree from (Shakya et al. 2020) using whole genome alignments, except for the placement of the root (E.fergusonii), where they have the root placed between D2/B2 (red arrow) and the rest while OrthoPhyl's tree indicates it should be placed between B2 and the other phylotypes.

Genus level Species trees of 689 assemblies: Brucella/Ochrobactrum
For the 689 Brucella/Ochrobactrum assemblies analyzed (plus Mycoplana dimorpha outgroup), OrthoPhyl was run on a RedHat8 OrthoPhyl -Bacterial Phylogenomics | 5 compute node with 30 CPUs and 500 GB of ram.The full analysis (generating strict and relaxed SCO trees with FastTree2 and ASTRAL) took just over 2 days (48 hrs 31 min).Total ram usage was moderate at 58GB.CPU usage efficiency was 24%, with a total CPU time of approx.373 hours (Table 1).
OrthoPhyl identified 785 strict and 1,635 relaxed SCOs in the full dataset of assemblies.The distribution of taxon number represented in each alignment shows that most orthogroup alignments have all taxa represented, with approximately 1,400 orthogroups having 680-689 taxa (Fig. 2a).The orthogroup alignments have every little missing data (gaps or ambiguous bases) with the vast majority showing less than 10% (Fig. 2b). Figure 2c shows the number of informative sites vs each alignment's length.A slope of 0.42, indicates the alignments are far from the value that would constitute noisy, erroneous alignments.
The maximum likelihood (ML) Brucella phylogeny generated by FastTree2 based on the 1,635 relaxed SCOs (Fig. 4 4a and Supplementary Fig. 2, orange arrows).A major distinction between our tree topology and the one presented in Ashford et al. is that our tree shows the monophyletic Brucella clade splitting the Ochrobactrum clade between O. endophytica and all other Ochrobactrum with >90% support (Supplementary Fig. 2, black arrow).In contrast, the Brucellae in Ashford et al. split Ochrobactrum between Group A and B, as in (Hördt et al. 2020).This alternative placement of Brucellae is shown in Fig. 4a and Fig. 3. Strict single copy ortholog based maximum likelihood phylogeny of 34 Escherichia and Shigella assemblies inferred by FastTree2 using the GTR gamma model with the default 1,000 bootstraps.The tree was constructed with 2,217 genes totaling 2.04MB of sequence.Tips are labeled with the given species and strain from NCBI's BioSample database (details in Supplementary Table 2).All bootstrap support values are 100% except for the split leading to E. coli strains APEC O1 and S88 (black arrow), which is 98.7%.Bars next to tree indicate E. coli phylotypes [Shakya 2017].The tree was rooted at E. fergusonii then root node was removed.The blue and red arrows indicate the alternative placement of E. coli IAI1 and the root from Shakya 2017, respectively.multiple alignments for about 80 input assemblies (Darling et al. 2004;Angiuoli and Salzberg 2011).OrthoPhyl can generate alignments for >1,000 assemblies, allowing large trees to be inferred.It does this by identifying orthogroups in a diversity-spanning subset of assemblies (see Methods), then assigning proteins to these orthogroups for the full set of predicted proteins by iterative HMM searches.This reduces the computationally demanding all-vs-all protein searching that would make these large analyses intractable for many researchers.

User friendly
OrthoPhyl manages the formatting and organization of intermediate files, which is a non-trivial task in a phylogenomic workflow such as this.As with any phylogenetic method, gene sets, alignment, and tree estimation parameters will likely need to be tuned to obtain robust results, thus we exposed these variables for altering at runtime.If OrthoPhyl fails due to input error or lack of memory recourses, it can restart at the last completed step.Being robust to restarting also allows users to tune parameters without rerunning the entire pipeline.

Discussion
OrthoPhyl successfully reconstructs phylogenetic trees for clades with broad evolutionary divergences.This is achieved with little user input; only requiring one command line input pointing the pipeline to the assembly and output directories.OrthoPhyl can generate trees for >30 bacterial assemblies on a laptop or >1,000 assemblies on a workstation or single compute node with moderate resources (30 cpus and 100GB RAM).Many useful options are provided by OrthoPhyl, including alignment trimming parameters, number of assemblies from which to identify initial SCOs, tree-building software to use, and evolutionary models for tree estimation.
With few exceptions, the trees estimated for the E. coli/Shigella, Brucella/Ochrobactrum, and Rickettsiales datasets are consistent Fig. 5. a) Maximum likelihood tree for Rickettsiales generated by FastTree2 within OrthoPhyl.Branch colors show reported genera for each sample according to NCBI metadata.To aid in identification, labels are also provided.Samples with no genus-level identification are labeled in grey.Internal branch colors are given based on majority rule of offspring.Note the red Wolbachia clade has been compressed vertically (1/5 ratio) thus clade widths are not proportional to number of samples across the tree.Branch lengths are relative probability of a mutation per site.The tree was rooted with a Pelagibacter outgroup that was subsequently removed.Candidate genera (Candidatus) are labeled Can.The single, genus-level polytomy (Canidatus-Jidibacter) is labeled with red arrows.b) The Rickettsia subtree is shown with monophyletic species identifiers collapsed.Following species tip labels, in parentheses is the number of collapsed tips.The two most basal branches are omitted, as they are identified only to genus.All assemblies with species-level classification show monophyly apart from R. conorii (red arrows) and R. massiliae/R.rhipicephali (orange arrow).
OrthoPhyl -Bacterial Phylogenomics | 9 with published topologies.This illustrates OrthoPhyl's ability to resolve broad evolutionary relationships: from highly similar genomes with ANIs of >99% for E. coli/Shigella and Brucella/ Ochrobactrum to highly dissimilar assemblies with ANIs of 57% for the Rickettsiales.With OrthoPhyl being able to generate whole genome-based phylogenies for more than 1,000 assemblies, it fills a large gap in phylogenomic analysis software, providing an easy-to-use, assembly-to-tree tool that will help many researchers incorporate evolutionary analysis into their ongoing bacterial studies.
The Brucella/Ochrobactrum and Rickettsiales trees are, to our knowledge, the densest phylogenies published for these clades, let alone using whole genome methods.While more work is necessary to validate the topologies presented here, they are a starting point for assessing metadata-based species labels and expected clade monophyly.Future work will test the robustness of topologies to different evolutionary models, gene sets, tree inference methods, and sample incorporation.

Workflow considerations
There are many parameters to tune during the various steps of phylogenomic inference.Since we focus on usability there is an inevitable tradeoff with optimization, mainly through automated or hard-coded parameter choice.OrthoPhyl does not produce publication-ready trees without user inspection of analysis metrics such as alignment quality, single copy ortholog numbers, percent phylogenetically informative sites, mutational saturation, and of course branching support values.However, caution should be exercised when interpreting bootstrap support from phylogenomic methods as support values can coalesce on erroneous topologies because of mutational saturation and compositional bias and/or model misspecification.Ultimately, trees should be compared to existing data to ensure major clades are consistent and evidence for novel branching should be closely scrutinized.
Users should take different approaches to inferring trees depending on their final goals.If one wishes to infer an initial phylogenetic tree to inform experimental design or taxon sampling, running OrthoPhyl with default parameters to generate a tree with FastTree2 will likely be sufficient.This will produce a tree using the widely applicable GTR + gamma model.To infer a robust tree for publication, researchers should consider using IQtree2, which runs ModelFinder to choose the evolutionary model that best fits their data.They should also ensure even taxon sampling, with a focus on breaking up long branches.For trees aimed at revising the taxonomy of a studied group, users should consider screening out genes with high saturation and inspecting gene trees for signals of horizontal gene transfer.Additionally, building trees using multiple "best" models, tree software, and gene sets and then comparing results will add robustness to the inferred topologies.We refer readers to (Lozano-Fernandez 2022) for an indepth discussion on the subject.

Future
Future versions of this software package will include several additional evolutionary analyses.Bayesian tree inference methods will be incorporated into the pipeline to allow more flexibility in tree estimation.Since codon alignments of single-copy orthologs are generated while running this pipeline, we also plan on integrating a software package, e.g.ETE3 (Huerta-Cepas et al. 2016), to test evolutionary models such as positive/negative selection and neutrality.A "forest of life" based horizontal gene transfer analysis will be added using gene tree comparisons.This analysis will look at k-means clustering of gene trees to build multiple consensus trees to identify if blocks of genes do not agree with a single consensus species tree hypothesis (Puigbò et al. 2019).
In the current iteration of OrthoPhyl, very ancient paralogs (originating before the most recent common ancestor of the dataset species) might be clustered into the same orthogroup, leading to the orthogroup's removal during filtering even though 2 sub-trees of the orthogroup could be consistent with the species tree.Future iterations of this software will filter the orthogroups in a deep paralog-aware manner, using a rapidly generated neighborjoining species tree to identify and rescue such orthogroups.
Although this pipeline is geared towards prokaryotes, only two changes are needed to adapt it to eukaryotic phylogenetics: 1) allowing users to input precomputed annotations directly, avoiding the complex task of Eukaryotic annotation within the workflow itself and 2) taking into consideration paralog subtrees which are consistent with a consensus species tree.The next version of OrthoPhyl will incorporate these changes to expand the use cases to Eukaryotic phylogenetics.
OrthoPhyl identifies gene family groups for the dataset provided.While this allows the workflow to adapt to any set of genome assemblies, it requires compute resources that could be saved if predefined gene family models could be used as a starting point.Un-clustered protein sequences would then be fed into the standard OrthoPhyl ortholog inference pipeline to capture additional gene families.

Conclusions
Phylogenetic reconstruction is critical to understanding the evolution of pathogenicity, novel traits, horizontal gene transfer, and metabolic potentials of bacteria.Unfortunately, large-scale phylogenomic analyses of diverse bacterial groups require a deep understanding of bioinformatic methods and great care in dealing with the myriad intermediate files and formats used during the analysis.To our knowledge, OrthoPhyl is the only software to take >1,000 bacterial assemblies with order level divergence, generate ortholog codon alignments, then reconstruct accurate phylogenetic trees without user input or management of intermediate files being necessary.Thus, OrthoPhyl will allow many research groups, including those with modest computing resources and knowledge, to leverage the wealth of publicly available genomic data to enrich their ongoing analyses with robust phylogenomic inferences across a broad swath of bacterial diversity.

Web resources
Code used in this manuscript is freely available at https://github.com/eamiddlebrook/OrthoPhyl/ under branch OrthoPhyl_1.0while the default branch holds the current OrthoPhyl version.Installation and execution instructions are provided in the associated github README.mdfiles.OrthoPhyl requires a Linux OS.Third-party software versions and OrthoPhyl execution files will remain static in the OrthoPhyl_1.0branch.For versions of software dependencies see Supplementary Table 1.To aid in usability, a Singularity container Is available at https://cloud.sylabs.io/library/earlyevol/default/orthophyl or with the command, singularity pull library://earlyevol/default/orthophyl:1.0_ms.Also, see the GitHub page for Singularity usage guide.

Fig. 1 .
Fig. 1.Workflow diagram of OrthoPhyl.Grey boxes indicate processes.Programs used in each step are listed unless a custom script was used.Orange, tan, and purple boxes represent user input, intermediate files, and species tree outputs, respectively.Purple arrows show iterative approaches.The workflow is divided into four main tasks a) annotate assemblies and remove identical CDSs.If more than "N" assemblies are being analyzed, b1) identify a subset of diversity-spanning assemblies, b2) pass them through OrthoFinder to generate orthogroups, and b3) expand the OrthoFinder-identified orthogroups to the full dataset of assemblies through iterative HMM searches.c) Align full orthogroup protein sets, generate and trim matching codon alignments, then filter orthogroups by taxon representation.Finally, d) estimate species tree topologies with concatenated codon alignment supermatrices along with a gene tree to species tree consensus method.

Fig. 2 .
Fig. 2. Alignment metric visualization generated by OrthoPhyl.a) Histograms of the number of taxa in each alignment is shown.Dotted line indicates the >30% taxa cutoff for alignments used to generate the "SCO relaxed" trees.b) Amount of missing data (including gaps and ambiguous bases) per alignment is shown with a histogram.This is exclusive of genomes without detected orthologs.c) The number of phylogenetically informative sites (sites with at least 2 different states in at least 2 taxa each) compared to alignment length for each orthogroup are shown.A linear regression is shown with a solid line with associated slopes.A one-to-one relationship is shown as a black dotted line.
a ANI percentage is calculated with CEANIA (github.com/eamiddlebrook/CEANIA) on the final concatenated codon alignments excluding gapped positions.See Supplementary Methods for details.