Roary: rapid large-scale prokaryote pan genome analysis

Summary: A typical prokaryote population sequencing study can now consist of hundreds or thousands of isolates. Interrogating these datasets can provide detailed insights into the genetic structure of prokaryotic genomes. We introduce Roary, a tool that rapidly builds large-scale pan genomes, identifying the core and accessory genes. Roary makes construction of the pan genome of thousands of prokaryote samples possible on a standard desktop without compromising on the accuracy of results. Using a single CPU Roary can produce a pan genome consisting of 1000 isolates in 4.5 hours using 13 GB of RAM, with further speedups possible using multiple processors. Availability and implementation: Roary is implemented in Perl and is freely available under an open source GPLv3 license from http://sanger-pathogens.github.io/Roary Contact: roary@sanger.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.


Input
Roary accepts as input annotated de novo assemblies in GFF3 (Stein, 2013) format, as produced by Prokka (Seemann, 2014), where there is one file per sample. The typical input is normally fragmented draft de novo assemblies from short reads, or near complete draft assemblies and all of the data should be from the same species. If annotation is not available, FASTA files of amino acids can also be used as input but not all functionality is available. These are the only mandatory input files to the software.

Method description
Sup. Fig. 13: A flowchart of the steps in the application. The pipeline takes as input GFF3 files created by Prokka (Seemann, 2014) and clusters the predicted proteins to allow for the full genomic variation of the input set to be explored. The basic method is to filter and precluster the proteins, perform an all against all comparison using BLASTP, and cluster with MCL.
Predicted coding regions are extracted and converted to protein sequences. Sequences where more than 5% of nucleotides are unknown, or that are less than 120 nucleotides, are excluded from further analysis ( Common Gene Annotation Process, Broad Institute, WUGC, JCVI and Baylor , 2011). Sequences must have a start or stop codon, and any without them are filtered out. Since the input sequences are all from the same species, there will be large numbers of near identical genes, so we use a quick clustering method to reduce the computational cost of the later all against all BLAST. This substantially reduces the running time. CDhit (Fu, Niu, Zhu, Wu, & Li, 2012) is used to iteratively perform a first pass clustering. Beginning with a sequence identity of 100% and a matching length of 100%, the protein sequences are clustered. If a sequence is found in every isolate, it is said to be a core gene and the cluster is added to the final results. All of these sequences are then removed and not considered for analysis with BLAST. CDhit is repeated with a lower threshold, reducing by 0.5% down to the user defined threshold (defaults to 98%), with core genes removed at each stage. One final clustering step is performed with CDhit, with a sequence identity of 98% leaving one representative sequence for each cluster in a protein FASTA file.
A BLAST database is created from this FASTA file. Low complexity regions are first masked out with SegMasker (Camacho et al., 2008), and a protein blast database is created with makeblastdb (Camacho et al., 2008). The FASTA file is chunked up (without splitting individual sequences) and compared to the blast database to perform an all against all blast. The combined blast results are then provided as input to MCL (Enright, Van Dongen, & Ouzounis, 2002), which clusters the input sequences. It uses a normalised bit store (bit scores normalized by length of the HSP). The clusters are then reinflated with the final CDhit clustering results, and with the iterative CDhit core genes. The clusters are labelled with the most commonly occurring gene names assigned to the sequences in the cluster. If there is no annotated gene name, a unique identifier is generated. The functional annotation is also recorded for each cluster.
All against all comparison with BLAST in combination with MCL clustering will produce homologous groups of genes. While useful, these groupings can often contain paralogous genes (i.e. homologs from the same genome), which could skew the results by producing falsely large and welldistributed gene groups. In theory, as pan genomes are only truly informative in closely related species, orthologous genes (homologs from different genomes) will be highly likely to also share their surrounding genes. In groups where paralogs are detected, Roary will try to split these into orthologous groups by using the conserved gene neighbourhood (CGN) of each gene.
A form of guided kmeans clustering is performed using a metric for CGN sharing, assigning each of the paralogs as an initial means. Each gene in the group is assigned to the paralog with the greatest proportion of shared surrounding genes. This clustering is performed iteratively until there are no more paralogous groups left. In cases where multiple sets of paralogs are present in the group, Roary assigns the smallest set as the means for each iteration. All results presented here uses a neighbourhood radius of 5 (5 genes before and 5 genes after).
The presence and absence of genes in the accessory genome is utilised to create a binary tree using FastTree (Price et al., 2010). This gives a rough picture of the underlying data, taking less than a second on a dataset with over 100 isolates. Whilst nowhere near as accurate as a core SNP tree, which requires substantially more computation, it does nevertheless provide useful insights and very often groups clonal isolates together. CDhit is also used in this manner to cluster the isolates (90% identity).
The ordering of each protein, on each contig, in each de novo assembly, is noted. It is then used to create a graph of the ordering of the clusters, which provides a relative ordering of the clusters. Each edge is weighted by 1 divided by the number of isolates that the ordering is found in and a node corresponds to a cluster. The contribution of an isolate to the weight of an edge is based on the number of isolates in the CDhit cluster so that low frequency genotypes are not overwhelmed by oversampled clonal isolates. Multiple graphs are created for each of the connected groups (e.g. chromosome and plasmids would be on separate graphs). To reduce the impact of spurious edges, which can be caused by misassemblies and misclustering, the graph is filtered to remove weak edges. For a given node, all edges whose weight is not within 90% of the strongest edge are removed. This greatly improves the contiguation of mobile elements, but it has the downside of disconnecting low frequency variation found within those elements. The graphs are then simplified to minimum spanning trees and traversed using depth first search. The graphs are ordered by the mean edge weight for the resulting path. The same process is repeated, but with the conserved clusters removed from the graph, to generate an overview of the accessory regions. This groups globally syntenous regions together, giving context to the genes.
Quality information can be derived about the clusters from the graphs. For example, if there are a large number of disconnected graphs, each with a small number of genes and found in low frequency in the isolates, it can indicate low copy number contamination of the isolate. A small number of these can be biologically very interesting, for example a drug resistance gene being inserted at an IS element. Also,if there are very large numbers of genes on contiguous blocks it can also indicate contamination by another organism. A further quality control step is performed where the quality of the predicted proteins is accessed based on the predicted annotation. If two genes are overlapping by more than 10% (minimum 4 nucleotides) in different reading frames it could indicate a misprediction. If one of the proteins is marked as a hypothetical protein, and the other has a predicted function, the hypothetical protein is flagged as potentially erroneous.
Each cluster is outputted to a separate multiFASTA file as nucleotide sequences. They codon aligned using PRANK (Löytynoja, 2014). Genes that occur exactly once in every isolate are combined into a multiple FASTA alignment to allow for a phylogenetic tree to be constructed using the core genes.

Output
The pan genome application Roary creates multiple output files. This includes a spreadsheet detailing the presence and absence of each gene in each isolate, number of isolates a gene is found in, frequency of genes per isolate, functional annotation, QC information and sorting information. A multiple sequence file of all the nucleotide sequences for each cluster is also created and aligned using PRANK to create a multiple sequence alignment file.
Tab delimited files are created for visualizing with R (Sup. Fig. 1013). We randomly sort the isolates files and plot what happens to the pan genome when they are added iteratively. Multiple iterations are performed because the order in which genomes are added can change the results, with the number of iterations set to the number of input files (minimum 100). This allows for bounds to be placed on the values. The number of conserved genes shows the size of the core genome (where a gene occurs at least once in every isolate) and it generally stabilises quite quickly. The total number of clusters includes the core and the accessory genome and the slope of the curve varies depending on if the pan genome is open or closed (Medini et al., 2005). The number of new, previously unseen, genes found as each isolate is added to the plot allows you to see how likely you are to find new genes as you sequence more data. Finally there is a plot for the number of unique genes overall that have been observed exactly once.
The application includes the ability to query the clusters, so that you can find out what is different about two sets of isolates and what they have in common (union, intersection, complement

Core gene alignment
If you pass in the flag 'e' a multiFASTA file containing the aligned core genes is created. From this you can use an application like RAxML or FastTree to construct a phylogenetic tree based on SNPs in the core genes. The individual genes are codon aligned with Prank[prank]. The file is always called core_gene_alignment.aln.

Multifasta files of each gene
If you pass in the flags 'e dont_delete_files' multifasta files containing nucleotide sequences will be generated for each gene. They will all reside in a sub directory called 'pan_genome_sequences'. You can use seaview to open the sequence and take a look at it. The sequences are codon aligned with Prank [prank]. Thousands of files get created so its a feature you should use with caution.

Pan Genome Reference FASTA
If you pass in the flag 'e' multifasta a FASTA file will be created called pan_genome_reference.fa. It contains a single representative nucleotide sequence for each gene/cluster in the pan genome. This can be used for aligning reads to, for passing into a reference aware assembler and for rapidly adding a new isolate to the pan genome.

Querying the pan genome
You can query the pan genome in two ways, 1.) open up the gene_presence_absence.csv spreadsheet in Excel and filter the rows and columns yourself, or 2.) use the query_pan_genome script. It takes in the groups file (clustered_proteins) and a list of the GFF files that were used to create the pan genome in the first place.
To get help run: query_pan_genome h

Create multi-FASTA files for a list of genes
This action will create a multiFASTA file of protein sequences for the list of genes passed in. The data is extracted directly from the GFF files and you can choose nucleotide sequences or amino acid sequences. The sequences are not aligned.

Union (pan genome)
Get the union of the genes for the GFF files passed in. This will give you all of the genes that are found in any of the isolates used to create the pan genome. This is particularly useful if you have created a pan genome, and you spot a clade of interest. It allows you to then zoom in on the clade and find out what genes they have in common.

query_pan_genome a union g clustered_proteins *.gff
To get the list of genes found in a subset of GFF files: query_pan_genome a union g clustered_proteins file1.gff file2.gff file3.gff

Intersection (core genes)
Get the intersection of the genes for the GFF files passed in. This will give you all the genes that are found in all isolates (e.g. the core genes) for the given list of GFF files. This allows you to find out what the core genes are for a subset of isolates.

query_pan_genome a intersection g clustered_proteins *.gff
To get the core genes for a subset of GFF files: query_pan_genome a intersection g clustered_proteins file1.gff file2.gff file3.gff

Complement (accessory genes)
Get the complement, otherwise known as the accessory genes, of a subset of isolates. It is the Union minus the Intersection.

query_pan_genome a complement g clustered_proteins *.gff
To get the accessory genes for a subset of GFF files: query_pan_genome a intersection g clustered_proteins file1.gff file2.gff file3.gff