MultiPhATE2: code for functional annotation and comparison of phage genomes

Abstract To address a need for improved tools for annotation and comparative genomics of bacteriophage genomes, we developed multiPhATE2. As an extension of multiPhATE, a functional annotation code released previously, multiPhATE2 performs gene finding using multiple algorithms, compares the results of the algorithms, performs functional annotation of coding sequences, and incorporates additional search algorithms and databases to extend the search space of the original code. MultiPhATE2 performs gene matching among sets of closely related bacteriophage genomes, and uses multiprocessing to speed computations. MultiPhATE2 can be re-started at multiple points within the workflow to allow the user to examine intermediate results and adjust the subsequent computations accordingly. In addition, multiPhATE2 accommodates custom gene calls and sequence databases, again adding flexibility. MultiPhATE2 was implemented in Python 3.7 and runs as a command-line code under Linux or MAC operating systems. Full documentation is provided as a README file and a Wiki website.


Introduction
As the era of reliable antibacterial treatment draws to a close, bacteriophage (phage) therapy is gaining ground in Western countries as an alternative treatment for antibiotic resistant and chronic recalcitrant bacterial infections, with several clinical trials having recently been initiated (Furfaro et al. 2018;Altamirano and Barr 2019;Gó rski et al. 2020;Voelker 2019;Duplessis and Biswas 2020;Pires et al. 2020). These efforts depend on reliable, actionable biological information, much of which is generated today using bioinformatics tools for analyzing the increasingly large quantities of genomic sequencing data. Moreover, it is essential that the genomes are sequenced and accurately annotated to avoid introducing unwanted genes (such as toxins or antibiotic resistance genes) into the patient (Luong et al. 2020).
To support these medical research efforts, open-source tools for phage genome annotation are prerequisite. Although nextgeneration sequencing efforts have generated increasingly large quantities of phage genomic data (Russell and Hatfull 2017;Carrol et al. 2018), there still remains an urgent need for tools that enable annotation and evaluation of phage genomic data and that inform research efforts toward developing novel phage therapies and therapeutics based on phage products (Yang et al. 2014;Gó rski et al. 2020). A number of advanced bioinformatics tools and computational systems have been developed for evaluation of phage sequence, including detection of prophage sequences within bacterial genomes (Akhter et al. 2012;Reis-Cunha et al. 2019;and others), evaluation of phage sequence for therapeutic goals (Philipson et al. 2018), and working with phage metagenomics data (Kieft et al. 2020). Several microbial genome annotation tools have been used for annotation of phage genomes, including EDGE (Li et al. 2017), PROKKA (Seemann 2014), DFAST (Tanizawa et al. 2019), RAST (Aziz et al. 2008), and PATRIC (Davis et al. 2020). Collectively, these tools offer a broad array of capabilities ranging from genome assembly to gene prediction to functional annotation and standardized formatting of data for submission to public databases. For example, EDGE is a bioinformatics platform that enables comprehensive processing of metagenomic data, including assembly and annotation, taxonomic classification, phylogenetic analysis, and primer analysis. DFAST and PROKKA perform protein and noncoding gene predictions, functional annotation targeted at prokaryotes, and are largely intended for rapid analysis and submission to databases. PATRIC offers comprehensive genome analysis using vast data sets comprising curated data for pathogen species and RAST annotation tools "under the hood." It is important to note that these tools are primarily aimed at evaluation of bacterial and/or archaeal genome sequence.
As of this writing, we are not aware of any comprehensive, open-source annotation system that is tailored for phage annotation. We developed multiPhATE (Ecale  to address the need for a downloadable, command-line annotation tool suited for use by phage research laboratories with few or no bioinformatics specialists on staff, and limited resources dedicated to constructing annotation and comparative genomics pipelines from individual tools and databases. Here, we describe updates that we have made to the original multiPhATE code, resulting in an even more versatile and flexible tool. Updates to be found in multiPhATE2 include additional database search algorithms and supported databases, parallel processing to speed computations, controls that add flexibility to the workflow, and new code for comparing across related genomes.

MultiPhATE2 system overview
MultiPhATE2 comprises a comprehensive, high-throughput functional annotation and genome comparison system for analysis of newly sequenced phage genomes. An accounting of multiPhATE2 features, and a comparison to those of the original multiPhATE code can be found in the multiPhATE2 GitHub wiki (see "What are the differences between the original multiPhATE code and multiPhATE2?"; URL is provided below). MultiPhATE2 performs gene finding followed by computational functional annotation of user-specified phage genomes, then performs gene-by-gene comparisons among the genomes. A system driver script takes a single argument consisting of a configuration file, then invokes up to four computational subsystems: the Gene Calling and PhATE annotation subsystems are run for each genome, and, if two or more genomes are specified by the user, multiPhATE will identify corresponding genes among the genomes using the Compare Gene Profiles (Tkavc et al. 2017) and Genomics subsystems (Figure 1).
Recognizing a compelling need for flexible and rapid computations, we have included process control features within the multiPhATE2 workflow ( Figure 2). We believe that multiPhATE2 scales well in performing multi-genome annotation and comparison, while offering a flexible toolset for tailoring analyses according to the user's needs.

Process control
Parallel processing is optionally applied at several stages of computation ( Figure 2). Furthermore, the user may stop or re-start computations at several points within the workflow. These options may be selected in the multiPhATE2 system configuration file. Specifically,

1) Parallel processing is applied in the PhATE and Compare
Gene Profiles subsystems. Each genome input to PhATE may be processed as a separate, parallel process, and each binary genome comparison in the Compare Gene Profiles subsystem may be processed likewise in parallel.
2) The user may specify the desired number of threads with which to invoke Blastp (Camacho et al. 2009) in the PhATE subsystem.
3) The user may opt to process the multiPhATE2 code through the Gene Finding subsystem or the PhATE subsystem and stop at either point. 4) The user may opt to re-start processing at three points in the multiPhATE2 system. Checkpoints (re-starting points) may be selected at the beginning of the PhATE, Compare Gene Profiles, or Genomics subsystem processing.

Input
MultiPhATE2 is invoked using a single input parameter consisting of a configuration file, as follows: $python multiPhate.py multiphate.config. The configuration file allows the user to specify (a) genomes to be processed, (b) gene finder(s) to be used, (c) PhATE annotation search algorithms and databases, (d) blast cutoffs, (e) locations of databases, (f) Compare Gene Profiles matching cutoff, (g) PhATE, Compare Gene Profiles, and Blastþ multiprocessing, (h) stopping points, (i) checkpoints, and (j) console messaging verbosity. Concise instructions for creating a multiphate configuration file are provided in the project README, the project wiki, and in the sample.multiphate.config file itself, provided with the multiPhATE2 distribution. Code execution can be tailored to run specific gene finders and to search for homologous sequences in specific phage-and virus-centric data sets, in addition to more generic protein data sets.

Gene calling subsystem
The Gene Calling subsystem of multiPhATE was updated to include user-provided GFF-formatted custom gene calls, in addition to the already-supported Glimmer (Delcher et al. 2007 Overview of multiPhATE2 system and workflow. User-specified configurations (configuration file) are input to the multiPhATE2 system, which invokes four subsystems: Gene Calling, the PhATE annotation pipeline, Compare Gene Profiles, which performs binary genome-to-genome comparisons of genes and proteins, and Genomics, which consolidates binary comparisons into gene-gene and protein-protein correspondences among all input genomes.
from web-only based services, Genbank gene calls, or hand-curated gene-call data sets (see README in the project repository for format specifications). The side-by-side comparison among gene callers (Compare Gene Calls module, Figure 3) was expanded to include output data sets that either merge the results of multiple gene callers (i.e., a nonredundant superset), or that recognize agreement among callers: a consensus gene-call set, comprising calls that were in agreement among two or more callers, and a common-core gene-call set representing calls that were made by all the gene callers. Any one of the gene-call outputs, custom calls, or multiPhATE-generated gene-call super/ subsets may be forwarded on for PhATE functional annotation.

PhATE annotation subsystem
PhATE is a fully automated computational pipeline for functional annotation of phage genes within a genome sequence, and was originally written as part of multiPhATE (

Comparative genomics subsystems
MultiPhATE2 accomplishes comparisons among input genomes in a two-step process whereby each genome is compared to each other, and then gene-gene and protein-protein correspondences are identified from among all of the input genomes. The Compare Gene Profiles subsystem performs NxN reciprocal blast of the genes from each genome against the genes from every other genome provided by the user (Figure 1). The code then identifies for each gene its mutual and nonmutual (singular) best hits against corresponding genes from each of the other genomes, or reports if no corresponding gene is found. For each binary genome-to-genome comparison, hits are ordered with respect to the query genome. The Genomics subsystem inputs the binary blast result files from Compare Gene Profiles and computes genes and proteins that correspond across all the input genomes with respect to the reference genome (i.e., the first genome listed). Ultimately, homology groups are generated, comprising each reference gene and its corresponding genes, plus its paralog's corresponding genes. This analysis is also performed for protein sequences. PhATE annotations are carried through the Compare Gene Profiles and Genomics computations so that the user can readily identify gene/protein function among the identified homolog groups.

Figure 2
System overview and configurable process control features of multiPhATE2. Large blue arrows: multiPhATE2 subsystems; CGP ¼ Compare Gene Profiles; curved grey arrows: process controls (stop ¼ stopping point; checkpoint ¼ point at which processing may be restarted); "parallel processing" indicates multiprocessing applied to functional annotation of input genomes and binary genome-to-genome comparisons; "parallel blast" indicates multithreading option provided by BLASTþ. Figure 3 Gene callers and gene-call comparison in the Gene Calling subsystem of multiPhATE2. The user may select any or all of the supported gene callers and/or provide their own gene calls (custom). The Compare Gene Calls module computes a set of calls that are common among all selected callers (common calls), a consensus set comprising gene calls produced by at least two callers (consensus calls), and a nonredundant superset of gene calls (superset). The user may select the results of one gene caller or a super/subset for input to the PhATE subsystem.

MultiPhATE system output
Directories and files that are produced by multiPhATE2 are detailed in the README. In brief, an output subdirectory is created for each input genome to hold results of the Gene Finding and PhATE subsystems, and subdirectories are created to hold results of the Compare Gene Profiles and Genomics subsystems. A sample multiPhATE2 system main output directory with contents for the Bacteriophage P2 genome (Christie and Calendar 2016) can be found in the multiPhATE2 supplementary data repository. In summary, the following subdirectories are created:

Data availability
MultiPhATE2 is freely available under an open-source GPL3 license at https://github.com/carolzhou/multiPhATE2. Instructions for downloading, installing, and using multiPhATE2, as well as instructions for acquiring databases and third party codes used by multiPhATE2, are found in the README file included with the distribution. Supplementary materials, which demonstrate the outputs of multiPhATE2, are available in a GitHub repository at https://github.com/carolzhou/multi PhATE2_supplementaryData/. Additionally, use cases illustrating many of the capabilities of the software, as well as a chart depicting additional features of multiPhATE2 compared to is predecessor, can be found on the project wiki pages, at https://github. com/carolzhou/multiPhATE2/wiki (or select the "Wiki" tab in the project repository). Users may report software issues on the project's GitHub repository webpage (select the "Issues" tab) or by sending an email to multiphate@gmail.com.

Results and discussion
MultiPhATE2 represents a significant advance in bacteriophage genome annotation in that it streamlines gene calling, functional sequence annotation, and comparative genomics for sets of newly sequenced draft or finished genomes. MultiPhATE is straight forward to install, with full instructions in the README file in the project's github repository. Running multiPhATE2 as a command-line program taking a single argument (i.e., the multiphate.config file) facilitates launching jobs comprising annotation and comparison of potentially large sets of genomes. Furthermore, built-in flexibility ( Figure 2) allows the user not only to install components in a step-wise manner, but also to run and re-run each subsystem with different parameters, so that the user may determine, for example, (a) which gene caller or callers are preferred and which one may be best to carry through to functional annotation; (b) which search algorithms and databases (including custom) are most appropriate for the genomes under study; (c) which sequence identity cutoffs produce the desired stringency in terms of homolog identification or gene-gene correspondence. MultiPhATE2 offers a wide range of gene calling and sequence search algorithms and databases, including the options of providing user-defined custom gene calls and blast databases (Figures 3 and 4). Parallel processing within the PhATE and CGP subsystems (Figure 2) enables computations for large input data sets, which might not otherwise be feasible when processing in serial. Furthermore, hits to entries in each of the pVOG and VOG databases are combined with the query protein sequence to produce fasta grouping to facilitate potential follow-on analyses (beyond the scope of multiPhATE2), such as multiple sequence alignment and common motif identification. We know of no other Figure 4 Functional annotation options supported within the PhATE subsystem of multiPhATE2. The user may select any or all of the algorithms to search any or all of the databases over genome, gene, and/or protein sequences. dbxrefs ¼ database external references, which comprise additional information about a given database entry.
phage-tailored genome annotation system that provides breadth and flexibility that are comparable to multiPhATE2.