FlaGs and webFlaGs: discovering novel biology through the analysis of gene neighbourhood conservation

Abstract Summary Analysis of conservation of gene neighbourhoods over different evolutionary levels is important for understanding operon and gene cluster evolution, and predicting functional associations. Our tool FlaGs (standing for Flanking Genes) takes a list of NCBI protein accessions as input, clusters neighbourhood-encoded proteins into homologous groups using sensitive sequence searching, and outputs a graphical visualization of the gene neighbourhood and its conservation, along with a phylogenetic tree annotated with flanking gene conservation. FlaGs has demonstrated utility for molecular evolutionary analysis, having uncovered a new toxin–antitoxin system in prokaryotes and bacteriophages. The web tool version of FlaGs (webFlaGs) can optionally include a BLASTP search against a reduced RefSeq database to generate an input accession list and analyse neighbourhood conservation within the same run. Availability and implementation FlaGs can be downloaded from https://github.com/GCA-VH-lab/FlaGs or run online at http://www.webflags.se/. Supplementary information Supplementary data are available at Bioinformatics online.


Description
FlaGs (for Flanking Genes) analyses genomic context around genes encoding a protein of interest using a Python3 script combined with Jackhmmer (Eddy 2011) and, optionally, ETE-toolkit (Huerta-Cepas, et al. 2016). FlaGs finds the flanking genes upstream and downstream of the genes of interest, clusters them based on homology and represents them visually with distinct identifiers (colour and number).
At initiation the tool verifies the input file which is either a list of NCBI protein accessions (-p option), or a tab delimited list of both protein accessions and corresponding genome assembly identifers (-a). The latter option is to inform FlaGs which specific genome it should be searching with the protein accession; since NCBI protein accessions are non-redundant, identical protein sequences can be present in multiple genomes with the same accession number. If FlaGs is given protein accessions alone using the -p option, it takes each accession and finds the corresponding list of assembly identifiers (eg. GCF_000001635.23 or GCA_000001635.5). Then it can either report all results for all identifiers (-r option), or for one representative identifier (default). The -r option should be used with caution as identical proteins can sometimes be found in thousands of genomes from closely related species, strains or isolates. See Arguments below for guidance. Figure 1. The FlaGs workflow. The user inputs a list of protein accession numbers -optionally with GCF assembly identifiers -and can specify the number of adjacent flanking genes to consider and the sensitivity of the Jackhmmer search though changing the E value cut-off and number of iterations. The web version of FlaGs (webFlaGs) can optionally use a single protein sequence or NCBI accession and begin by executing a BLASTP search against the RefSeq database (excluding eukaryotes) or a representative genome database to generate the input list of accessions. The output always includes a to-scale figure of flanking genes, a description of the flanking gene identities as a legend, and optionally, a phylogenetic tree annotated with colour-and number-coded pennant flags. Unconserved proteins are uncoloured.
After verification of the input query list, FlaGs uses the assembly identifier to retrieve the Genomic Feature Format (gff3) file for each protein query and uses this to identify annotated flanking genes from upstream and downstream regions. Then it retrieves the sequences for all flanking genes and searches for homologues within the set using Jackhmmer from the Hmmer package (hmmer.org, (Eddy 2011)). Flanking genes are then clustered and each cluster is assigned a specific number. Sets of homologues found with the Jackhmmer searches are joined together to make one cluster (the union) if there is one or more common protein among the Jackhmmer search results ( Figure  1). The numbering of clusters begins with 1. The lower the cluster number the more conserved the protein is, i.e the more frequently it is found among all the flanking genes. Finally, FlaGs generates a visual representation of the output using the tkinter Python module, and this is saved as a postscript file (Figure 2). The black block arrow represents the query proteins and the rest represent the flanking genes. The figure is proportional to actual gene lengths and intergenic space. If the query protein is encoded on the negative strand, the entire context is then flipped to make it easily comparable. The number and colour represent the cluster, and the "…_outdesc.txt" output file provides the legend for interpretation. Unconserved proteins are uncoloured and unnumbered: , RNA coding genes are outlined in green: , and pseudogenes are grey, outlined in dark blue: Figure 2. The to-scale postscript output. The black gene is the query, and otherwise colours and numbers represent the clusters. The "…_outdesc.txt" output file provides the legend for interpretation. Genes and intergenic spaces are to scale.
An optional feature of FlaGs is the generation of a phylogenetic tree of the query sequences, on which the flanking genes are annotated ( Figure 3). This is achieved using the ETE-toolkit (etetoolkit.org, (Huerta-Cepas, et al. 2016)). The ETE "mafft_default-trimal01-none-fasttree_full" workflow is used to generate the tree (see ETE user manual, http://etetoolkit.org/cookbook/). The flanking genes and the queries are represented as triangular pennant flag-like shapes, with colours and numbers representing clusters. The "…_outdesc.txt" output file again provides the legend for interpretation. Figure 3. The annotated phylogenetic tree output. The tree is made with ETE3 using the mafft_default-trimal01none-fasttree_full method and is saved as an SVG vector image file.

Prerequisites and use of the easy install script
FlaGs.py is available for download here: https://github.com/GCA-VH-lab/FlaGs/ FlaGs has been primarily tested in Mac and Linux environments. Users have reported that running on Windows is possible with some workarounds (updates in this direction are planned). In the meantime, Windows users are directed to use the server version at webflags.se instead.
For Mac and Linux users, a bash script install.sh is bundled with the download, which installs the programs required for running FlaGs. If you are using a Mac and do not yet have Xcode command line tools installed, you will be prompted to do so, following the on-screen instructions. Navigate to the FlaGs folder and use the following command to run the quick install script (answering y or yes when required):

bash ./install.sh
If this does not work due to insufficient permissions, and you have an admin/root password you can use sudo (But! Always safety first when using sudo: by all means check the install.sh file first to make sure you're comfortable with the commands that will be run (the list of tools that will be installed follow below)): sudo bash ./install.sh Tools that will be installed by the script (if they are not already installed): Homebrew (to install Wget) Wget (to install Anaconda) Anaconda (to install the tools below) Hmmer Biopython ETE (which itself installs a number of phylogenetic tools) Be aware that this assumes Bash is the default shell. In newer Macs, this may instead be Zsh. To get round this, just copy the required Conda and ETE PATHs (check .profile, .bashrc, and .bash_profile) over to the ~/.zshrc file. The copied text should look something like this: export PATH="/Users/username/anaconda3/bin:$PATH" export PATH="/Users/username/anaconda3/bin/ete3_apps/bin:$PATH" Alternatively, users can install the prerequisite programs manually following the instructions below:

Python 3
We recommend the use of Anaconda Python to simplify installation of other required and optional dependencies. It can be downloaded from https://www.anaconda.com/download/ . If this works and you don't want to install a specific Anaconda version, you can proceed to step 2.
Recommendation for specific Anaconda version: Anaconda3 (version 5.2.0) is well tested with FlaGs. It is available for both Linux and Mac Users in the anaconda archive https://repo.anaconda.com/archive. The following command can be used to download anaconda 5.2.0.

Biopython
Using Anaconda, Biopython can be installed using the following command: conda install -c conda-forge/label/gcc7 biopython

Jackhmmer
Using Anaconda, hmmer can be installed using the following command:

Running the example files:
N.B. a valid email address is required by NCBI to monitor the number of requests to their server per second per user. You will not receive emails from the local version of FlaGs.
Input File Example: Protein Accession WP_047256880.1 in a text Input file separated by newline. If these are not RefSeq accessions (WP_), the program will attempt to convert to a RefSeq accession.
Input File Example: 3. "-l", "--localGenomeList" FlaGs v1.0.5 can use local genomes too. For that, the user needs to provide the following files: i. Compressed (gzip) FASTA format of the predicted protein products annotated on the genome assembly (eg. Assembly_1.faa.gz, file should contain ".faa.gz" suffix) With the "-l" flag set, the user specifies the input text file with assembly ID and the query protein accession.
Input file example: In this case, both of the Fasta and gff file should contain the Assembly Identifier (the GCF number) as the prefix. For example, for assembly identifier GCF_000001765.3, the compressed Fasta file should be named GCF_000001765.3.faa.gz and the GFF file should be named GCF_000001765.3.gff.gz Or input file example with user-defined assembly names: Assembly_1 WP_047256880.1 #tab separated Assembly_2 WP_012725678.1 For assembly identifier Assembly_1, the compressed Fasta file should be named as Assembly_1.faa.gz and the GFF file should be named as Assembly_1.gff.gz 4. "-ld", "--localGenomeDirectory" This flag allows the user to specify the path or directory where .faa.gz and .gff.gz are stored. By default, FlaGs script will look for the .faa.gz and .gff.gz files in the same directory where the script is located or running from.
The user can only use this flag when the '-l' flag is being used.
For example: If the files are stored in ~/FlaGs/local directory user can specify by following: -ld ~/FlaGs/local 5. "-r", "--redundant" This will initiate a search for flanking genes in all available GCFs that encode this identical protein sequence (using -r A or -r a) for each query, or a specific number of GCFs (for example to search a maximum of 5 genomes encoding identical protein sequences, the user would write -r 5).
Warning: the -r A option should be used with caution as identical proteins can sometimes be found in thousands of genomes in RefSeq. Using -r with a number to limit is highly recommended. 6. "-e", "--ethreshold" This e-value is used by Jackhmmer as a cutoff parameter to detect homology among all flanking genes. By default it is 1e-10. 7. "-n", "--number" This number represents number of Jackhmmer iterations that allows to find more distant homolog, by default it is set as 3.
8. "-g", "--gene" Using this parameter, the user can define the number of upstream and downstream genes to look for each query in the input list. By default, it is 4, which means for each query it will try to find 4 upstream and 4 downstream flanking genes and then process further. 9. "-t", "--tree" This requires an ETE3 installation. The option enables showing flanking genes along with a phylogenetic tree.
10. "-ts", "--tshape" This requires an ETE3 installation and thus this option only works when -t is used. This parameter can increase or decrease the size of triangle shapes that represent flanking genes, by default it is 12.
11. "-tf", "--tfontsize" This also requires an ETE3 installation and thus this option only works when -t is used. This parameter can increase or decrease the size of font inside triangles that represent flanking genes, by default it is 4. 12. "-to", "--tree_order" In combination with -t, it will first generate the tree output, and then use the tree order to generate the other (without tree) output file. This is very useful option for stitching the two output files together in (for example) Illustrator to make a figure with both the tree, and the to-scale neighbourhood output.

"-u", "--user_email"
A valid email address is required by NCBI to monitor the number of requests to their server per second per user. (there is a limit of 3 queries per second, which may be exceeded if you run multiple instances of FlaGs in parallel; but see the API-key advice below). You will not receive emails from the local version of FlaGs.
14. "-api", "--api_key" Valid API-key allows 10 queries per seconds, which makes the tool run faster. This may sometimes allow the user to run multiple instances of FlaGs in parallel, but in general this is not well tested or advised.
16. "-k", "--keep" If the user wants to keep the intermediate files eg. gff3, this option is used (warning: these are large files). By default, intermediate files will be removed.

"-v", "--version"
Retrieve the version number of the program.
18. "-vb", "--verbose" This option will show the work progress for each query as STDOUT. This is a very useful option, which we recommend using.

Output files
In addition to the figure output files described in 1. Description above, FlaGs generates the following output files: 1. Information about the flanking genes for each of the queries retrieved from databases is stored in a tab delimited file named with the "_operon.tsv" suffix. For example, information about two example queries WP_028660106.1 and WP_091126147.1 is shown in Figure 4. . This tab delimited text file contains the information retrieved from NCBI for each query and their flanking genes. In this example, the query accession WP_028660106.1 is renamed as WP_028660106.1#1|Nocardioides_insulae_DSM_17944, where #1 indicates the input order (can be different if tree order option is selected). Nocardioides_insulae_DSM_17944 is the species name and strain information that is retrievable from the NCBI RefSeq Database.

2.
A general summary report is also generated for each query, for example if the query accession is valid or invalid, if the query accession was converted or updated to another accession associated with an identical RefSeq sequence, if the query failed, or if flanking gene information is found or not. This is a tab delimited file with the "_QueryStatus.txt" suffix 3. A general flanking gene summary report is also generated for each valid query if flanking gene information is found or not in a tab delimited file with "_flankgene_Report.log" suffix 4. A file with the suffix "_jackhits.tsv" contains the Jackhmmer output 5. After the clustering steps a file named with the suffix "_clusters.tsv" is generated with detailed information of the cluster. Here you can see in column 1, it starts with cluster number 1 (this is the number shown within genes in the figure files) and "(2)" means the protein WP_001229260.1 has been found twice in the cluster and the last column is the protein description retrieved from the database. The protein description gives information about the proteins in a cluster (accessions and title from NCBI).
If wanted, this file can be used as an input for the additional bundled python script descriptionCloud.py to save a wordcloud visualization. descriptionCloud.py has additional dependencies: pandas, PIL, wordcloud and matplotlib. If these are not already installed, they are available through anaconda, eg: conda install -c conda-forge wordcloud descriptionCloud.py is run as follows: python descriptionCloud.py -i example_outdesc.txt A png format output image will be created, for example see Figure 5. 7. Discarded protein ids with improper accession are listed in a text file with "_NameError.txt" suffix. 8. Discarded protein ids lacking proper information in RefSeq DB are listed in a text file with "_Insufficient_Info_In_DB.txt" suffix 9. If the -t option used to make a phylogenetic tree, ETE3 saves additional results files to a folder ending with "[output name]_tree_". This includes a Newick format version of the tree.
FlaGs also generates some intermediate files to proceed the overall process but only the files mentioned above are useful for interpretation. Intermediate files can be kept with the -k option (warning: these are large files).

Running online with webFlaGs
FlaGs can be run through the web server at www.webflags.se. WebFlaGs is a more user friendly version of FlaGs (as it does not require installation) with a web interface where user can paste the input in the specific text area or upload as text file. webFlaGs accepts input datasets that contains up to 200 accessions. If more than 200 queries are required, it is recommended to use the local version of FlaGs. FlaGs and webFlaGs both use the same input file format which is autodetected by webFlaGs.
An additional input allowed by webFlaGs is a single accession or sequence (raw or FASTA format). In this case, a BLASTP search is carried out to retrieve homologues that are then used as input for webFlaGs. The user decides how many hits to return (up to 200) and the E-value cut-off, as well as the query database to be searched. The query database can be either the RefSeq database (excluding eukaryotes), or a variation of the eukaryote-excluded RefSeq database that is reduced to one representative genome per bacterial and archaeal species, plus an additional 711 genomes from microbes represented in the COG database and 63 non-eukaryotic NCBI reference genomes.
All viral genomes from RefSeq are included. The list of 23403 included taxa can be found following the link in the interface or directly at http: //130.239.193.227/html/List.txt. In webFlaGs, the user can set the E-value cut-off that is used by Jackhmmer to detect homology among all flanking genes. By default, it is 1e-10. The user can also choose the number of Jackhmmer iterations that allows it to find more distant homologs. By default this is set to 3. Like FlaGs, in webFlaGs the user can define the number of upstream and downstream genes to look for flanking each query in the input list. By default, it is 4.
When running webFlaGs, a valid email address is required, which is only used for sending results (we use our own email address for accessing the NCBI API). The user will receive a download link for a zipped file which contains: 1. The one or two figure files described above based on the output type selected in the webFlaGs webpage by the user. There are three output type options in webFlaGs. If the option 'With phylogenetic tree and query showed as tree order' is selected, the user will get the annotated phylogenetic tree output and the to-scale postscript output where queries will be presented as the same order that of the annotated phylogenetic tree output. For option 'With phylogenetic tree but query showed as input order', queries in the to-scale postscript output will be presented as the same order as submitted by the user, also the user will get the annotated phylogenetic tree output in svg format. If 'No phylogenetic tree and query showed as input order' is selected, the user will get only the to-scale postscript output along with the converted PDF file (see below).

2.
A converted PDF version of the to-scale postscript output. This is provided for Windows users who can not easily open postscript files.
3. A description file, which is generated after clustering, with the suffix "_outdesc.txt" (described above). This is the main legend file for the figures.
4. The query input file with the suffix "_query.flagsIn" 5. A text file with the suffix "_QueryStatus.txt" suffix that records the success (or not) of each query, and importantly informs if the input accession failed but was converted or updated to an identical RefSeq sequence that worked.
6. Information about the flanking genes for each of the queries retrieved from databases is stored in a tab delimited file named with "_operon.tsv" suffix. (described above)

(Optional)
If the user selected 'With phylogenetic tree' as the Output type option, they will receive the multiple sequence alignment FASTA file with the suffix "_tree.fasta" which was used for making tree file, and the resulting Newick format version of the tree, having a suffix ".nw".

Recommendations and tips
When the results are received, it is recommended that users check the tab delimited results file with the "_QueryStatus.txt" suffix. This records the success (or not) of each query, and importantly informs whether the input accession failed but was converted or updated to an identical RefSeq sequence that worked. When using the -p option, the query status file is also useful for identifying which genome (identified with a GCF number) was used to retrieve the flanking genes.
When using the -to option, the two output figures show input taxa listed in the same order (by the tree). Therefore, in Adobe Illustrator (or similar) it is very easy to collate these two vector images together, in order to make a figure of the to-scale flanking gene output mapped on to the tree.
FlaGs can handle relatively large input datasets, although of course the smaller, the faster. Since FlaGs retrieves information from NCBI (multiple times for each input accession), it is dependent on both a stable local internet connection and the stability of the NCBI server. Our example input file "GCF_accession_input.txt" with 30 accessions takes around 7 minutes to run with the -a, -t,to options on a 2015 MacBook Pro with an internet connection of around ~75 mbps. Without the -t option it takes around 6 minutes. We have also tested input files containing around 1000 protein accessions (four flanking genes each side) and they worked successfully over the course of a few hours.
If you have >1000 entries, you may want to reduce the number of input accessions. One way to do this is by reducing redundancy with a tool such as Usearch (Edgar 2010).
The -vb (verbose) option is very useful for following the progress of the run, and we recommend its use unless you want to limit what is displayed via STDOUT.
FlaGs is designed for use with prokaryotic and bacteriophage genomes where gene clusters can be conserved over vast evolutionary distances, and the space between genes is relatively small. While it will run with eukaryotic nuclear genomes, the running time is much longer and the resulting images can be very wide. For this kind of data, we recommend initial testing with small datasets and few flanking genes (the -g option). FlaGs does however work very well with eukaryotic organellar genomes that are more bacteria-like in their organisation and conservation.
As mentioned above, the '_outdesc.txt' file can be used as an input for the additional bundled python script descriptionCloud.py to save a wordcloud visualisation. The list of accessions can be pasted into a text file to use as FlaGs input. You're done and ready to run FlaGs! You can stop reading here.

References
3b: If you have >100 sequences selected (or just want to try a different way), you can click "Download", then "Hit Table (

CSV)"
This is a table that can be opened in a spreadsheet program such as Excel, Numbers or Google Sheets

Open your table, and split the text into columns
In Excel and Google Sheets this is done through the Data > Text to columns menu option Follow the wizard to split by commas Click finish and you will see your list of accessions in the second column The list of accessions can be copied and pasted into a text file to use as FlaGs input. You're done and ready to run FlaGs!