Microbial genomic island discovery, visualization and analysis

Abstract Horizontal gene transfer (also called lateral gene transfer) is a major mechanism for microbial genome evolution, enabling rapid adaptation and survival in specific niches. Genomic islands (GIs), commonly defined as clusters of bacterial or archaeal genes of probable horizontal origin, are of particular medical, environmental and/or industrial interest, as they disproportionately encode virulence factors and some antimicrobial resistance genes and may harbor entire metabolic pathways that confer a specific adaptation (solvent resistance, symbiosis properties, etc). As large-scale analyses of microbial genomes increases, such as for genomic epidemiology investigations of infectious disease outbreaks in public health, there is increased appreciation of the need to accurately predict and track GIs. Over the past decade, numerous computational tools have been developed to tackle the challenges inherent in accurate GI prediction. We review here the main types of GI prediction methods and discuss their advantages and limitations for a routine analysis of microbial genomes in this era of rapid whole-genome sequencing. An assessment is provided of 20 GI prediction software methods that use sequence-composition bias to identify the GIs, using a reference GI data set from 104 genomes obtained using an independent comparative genomics approach. Finally, we present guidelines to assist researchers in effectively identifying these key genomic regions.


A tale of genomic islands in bacterial evolution
Bacteria and archaea reproduce clonally, typically by binary fission/bipartition, but are also able to acquire foreign genetic material from other living organisms via horizontal gene transfer (HGT; also called lateral gene transfer). HGT has played an extensive role in microbial genome evolution, leading researchers to tentatively represent their evolution as a 'Web of Life' [1] or a 'Rhizome of life' [2] rather than a tree reflecting vertical descent. HGT is a major source of novel microbial genes, providing and maintaining diversity at the population level [3,4]. Clusters of consecutive genes likely acquired by HGT are commonly described as genomic islands (GIs).
Several types of mobile genetic elements fall within this broad definition of GIs, including integrons, transposons, integrative and conjugative elements (ICEs) and prophages (integrated phages) [5]. These can be distinguished based on the mechanism of GI acquisition, mainly conjugation, transformation or transduction, and associated mobile selfish elements (integrases, transposases and insertion sequences) that promote GI mobilization and transfer [1,6]. Once integrated into the genome, GIs evolve through mutations, genome rearrangements, gene loss or further acquisition of mobile genetic elements. The ability for a GI to transfer further to other microbial hosts strongly depends on the type of mobile genetic element, the host background and tightly regulated stochastic processes [6,7].
Two groups of GIs can be distinguished with different roles: replacement and additive GIs [3]. Replacement GIs acquired by homologous recombination maintain the genomic diversity in the population and can be found conserved in distant relatives as well as in closely related strains [8]. Additive GIs are acquired by non-homologous recombination in preferential insertion sites such as transfer RNAs (tRNAs) or transfer-messenger RNAs (tmRNAs) or in the vicinity of highly conserved core genes, commonly leaving behind 16-20-bp direct repeats [9]. Additive GIs enable the bacterium to integrate multiple cassettes of different origin in a single genomic region, leading to their mosaic nature. The resulting genome flexibility enables rapid transfer of useful phenotypes in bacterial populations sharing a niche [10].
Genes that provide an advantage in a selective environment have been found to be associated with GIs, including those of significant medical interest and 'novel' genes (the latter reflecting a large, undersampled gene pool likely associated with these genomic regions) [11]. The spread of antimicrobial resistance (AMR) in some cases is the result of such HGT via bacterial conjugation [12] or phage transduction [13]. Virulence factors were also shown to be significantly disproportionately associated with GIs [14]. In particular, prophages can carry virulence factors that are often associated with increased bacterial virulence [15]. A recent study modeled how prophages, through recombination with actively infecting phages, play a key role in maintaining the phage population and expanding phage types [16]. They promote a diverse phage-host ecosystem with its associated large pool of genes driving microbial evolution. Indeed, lysogenic phages represent a major source of GIs [17,18], as about 50% of characterized bacteria harbor one to over a dozen prophages [19].
The gene content of GIs was traditionally used to classify GIs into several subtypes, for example (i) pathogenicity islands (PAIs) that encode genes important for bacterial pathogenicity/ virulence [20]; (ii) resistance islands that encode AMR genes [21]; (iii) symbiosis islands, as first coined for strains of Mesorhizobium meliloti, that are able to form nodules on plants after the acquisition of a 500-kb element [22];or (iv) metabolic islands that encode for adaptive metabolic abilities [6], for example the degradation of aromatic compound in Pseudomonas knackmussii [23]. However, the modular evolution of GIs and their composite nature enable single GIs to encode proteins with multiple functions. Hence, this simplistic classification reflects functional types driven by human interests for key gene functions rather than defined categories of evolutionary and mechanistic relevance.
As a direct consequence of their origin and evolution, features of GIs can include (i) a local nucleotide composition bias (guanine cytosine (GC) content, GC skew, codon usage, or k-mer signature), differing from the chromosome average; (ii) the presence of mobility genes and insertion sequences that can rapidly decay after GI integration; (iii) a high prevalence of phagerelated genes; (iv) a high prevalence of hypothetical proteins; and (v) the presence of direct repeats [24].

Computational prediction of GIs
GI detection, capitalizing on the features mentioned earlier, can be roughly classified into (1) sequence composition-based approaches and (2) comparative genomics approaches, based on the two most distinctive features associated with the horizontal origin of GIs: their sequence composition bias and sporadic phylogenetic distribution [5]. Despite the presence of well-known features, GI prediction is challenged by their mosaic nature and propensity to evolve rapidly through further gene acquisition that mixes nucleotide bias signatures, gene loss that can remove mobility genes or genome rearrangements. GI prediction has become an increasingly important component of bacterial genome investigation, and the development of new, more accurate tools has attracted major attention in the community, given the release of several new computational methods for GI prediction in the past decade (Table 1). Previous reviews of GI prediction methods have provided a good overall classification of tools available at the time of publication [5,25,26]. Hence, we review here new computational methods and new releases of existing software, in comparison with older methods.

Sequence composition-based approaches
Assuming that mutational and selection pressures acting relatively homogeneously on microbial genomes create genome-wide signatures of nucleotide composition specific to each microbial species, genomic regions acquired by HGT can be distinguished from the rest of the genome in some cases by their atypical nucleotide composition. Methods adapted to the analysis of single genome sequences are generally based on the detection of such biases in sequence composition, eventually coupled with the analysis of gene content and further characteristics that are detailed in Table 2. Available tools usually examine k-mer frequencies or GC content at the genome or the gene level either using sliding windows or windowless methods. The wide variety of implementations, scoring algorithms and refinement methods (Tables 1  and 2) complicates the division of existing methods in intricate categories. Because of the large number of available methods, we will summarize here their similarities, advantages and drawbacks rather than precisely describing algorithms, as offered in previous reviews [5,25].
Window-based and windowless methods analyzing genome-level nucleotide sequence composition Many methods use sliding windows at the genome level (rather than analyzing individual genes) to calculate nucleotide composition biases, including MTGIpick [27], GI-SVM [28], SigHunt [29], Centroid [30], INDeGenIUS [31], AlienHunter [32] and Design-Island [33]. As reviewed in [25], these methods differ in the size of the sliding window, the scoring scheme as well as rules for the determination of regions with atypical composition. AlienHunter and GI-SVM use overlapping windows of fixed size, whereas MTGIpick, INDeGenIUS and Centroid use non-overlapping windows. Recently published, MTGIpick implements a t-test with selection of features to quantify compositional differences of tetranucleotides using kurtosis. It was suggested that it provides much higher recall, but lower precision, versus other window-based methods [27].
MSGIP [34], MJSD [35], GC-profile [36] and ZislandExplorer [37] are windowless methods using top-down approaches to determine break points in the genome sequence composition bias. GC-profile calculates GC content distribution, leaving the determination of GIs to the user. Its further development, named ZislandExplorer, refines regions identified using GC-profile by analyzing the codon and amino acid usage to define potential GIs. A recursive segmentation method is used by MJSD to identify changes in genome sequence atypicality, based on measures of Markov Jensen-Shannon divergence.
Most of these methods are sensitive but have low precision because of the limited information used for prediction (also see the benchmarking of GI predictors later). It is interesting to note that by further taking into account codon and amino acid usage, ZislandExplorer achieves a much improved precision compared with GC-profile. Overall, windowless methods supposedly better identify GI boundaries, as they do not rely on any predefined window size but can in principle determine boundaries as precisely as a single base. For window-based methods, the boundaries will depend on the size of the window used for computations. However, most methods do not provide specific implementations to improve the determination of GI boundaries, except MTGIpick that includes a derived version of MJSD to refine GI boundaries after an initial prediction step.
Window-based and windowless methods at the gene level Genes represent an important functional unit of microbial genomes, which differ in sequence composition from noncoding regions of the genome, and were, therefore, used to identify composition biases or to refine predictions in many other methods (Table 2), including IslandPath-DIMOB, SIGI-HMM and SIGI-CRF, Wn-SVM, SVM-AGP, PredictBias, GIHunter, GIDetector Comparative genomics method and so excluded from the GI predictor assessment analysis.
() are used when the GUI, webserver or database has been published as a separate too.  Could not be successfully used for the GI predictor assessment analysis (Supplementary Table S1). b Comparative genomics method, and so excluded from the GI predictor assessment analysis. and RVM. IslandPath-DIMOB and PredictBias both use a sliding window of six genes to calculate sequence composition bias, which essentially smooths the signal and, for example, avoids the detection of abnormal compositions because of high gene expression levels of small sets of genes. The use of criteria such as the presence of mobility genes in addition to sequence composition bias was shown to significantly improve the prediction accuracy of IslandPath-DIMOB by eliminating many false positives (FPs) [38].
A recently released version of IslandPath-DIMOB, with improved sensitivity in the score for nucleotide composition bias as well as novel HMM profiles to detect mobility genes, showed a highly increased recall with little loss of precision [39]. SIGI-HMM and SIGI-CRF, available as part of the Colombo package [40], measure codon usage bias in each gene as a signature for HGT. Individual genes that are predicted as having unusual codon usage using a Hidden Markov Model form potential GIs when closely positioned on the genome.
Following an assessment of structural features associated with GIs by Vernikos et al [41], machine learning approaches were developed, making use of the following GI features to train the models: insertion point, size of the region, gene density, presence of repeats, phage proteins, integrases and tRNAs ( Table 2). GIDetector [42], a stand-alone software for Windows, and GIHunter [43], a Linux-compatible command line equivalent, have shown promising results in their ability to accurately predict GIs. Machine learning methods are able to integrate multiple GI features, improving the accuracy of GI predictions overall, as is observed for GIHunter [43].
Other sequence composition-based approaches Finally, Islander [44] does not fall within any of the aforementioned categories, as it uses the presence of tRNA sequences and a 'displacement fragment' to predict the position of GIs. Further information, including a measure of GC content, and absence of a tyrosine integrase gene are used to filter out false positives (FPs). This method therefore only predicts a subset of GIs, though it predicts GI boundaries more effectively for those it does predict [44].

Comparative genomics approaches
Comparative genomics approaches predict GIs based on the sporadic distribution of genomic regions acquired by HGT in closely related bacterial and archaeal genomes. Conversely, genomic regions conserved in large groups of monophyletic organisms are less likely to have horizontal origins. Comparative genomic methods have the advantage of determining precise GI boundaries based on multiple-sequence alignments or local alignments. However, they are dependent on the availability of appropriate closely related reference genomes used for comparison, and their prediction may vary according to the set of comparative reference genomes selected. Furthermore, they may be sensitive to prediction of genomic regions because of gene loss as well as HGT.
IslandPick [38] is the only method that performs an automatic selection of suitable genomes for comparison. Genomes are aligned using Mauve [45], and a secondary filter using BLAST [46] ensures that the predicted GI region is not simply a recent duplication or genomic rearrangement not aligned by Mauve. IslandPick predictions were shown to have a higher similarity to published GIs from the literature than other sequence composition-based methods [38], thus demonstrating the high accuracy of this comparative genomics approach. IslandPick is now available within the IslandViewer 4 webserver, which integrates multiple methods and provides a database of reference genomes for comparison [47]. IslandPick can also be customized, allowing a user to manually select the genomes to use for a comparative genomics-based prediction.
tRNAcc [48] and its webserver MobilomeFinder [49] use Mauve [45] multiple-sequence alignment to identify strainspecific regions downstream of tRNA genes that are predicted as GIs. The method is robust because of the combined requirement of a site-specific integration in a tRNA gene and the comparative genomic approach. However, it only identifies a subset of rather canonical GIs and is not able to predict GIs integrated in other genomic locations.
DarkHorse [50] has been designed to identify proteins likely acquired by HGT rather than large GIs. Protein sequences are subjected to BLAST analysis, comparing the sequences against the NCBI non-redundant database, and assigning a score reflecting atypical phylogenetic distance between the BLAST query and subject organisms. The combined approach using comparative genomics and phylogenetics identifies potential HGT candidates at different taxonomic levels. The use of amino acid sequences also enables analysis of more distantly related organisms, compared with IslandPick or tRNAcc.

Hybrid and composite approaches
The various comparative genomics and sequence compositionbased approaches, with their inherent advantages and drawbacks, have the potential to identify different regions acquired by HGT. Therefore, a few methods have implemented composite or hybrid approaches to improve GI prediction in bacterial and archaeal genomes ( Table 2). The performance of such methods is tightly linked to the choice of individual tools combined as well as further decision rules for the integration of predictions.
EGID [51] and its graphical interface GIST [52] combine five sequence composition-based approaches: AlienHunter, COLOMBO SIGI-HMM, INDeGenIUS, IslandPath and PAI-IDA. To ponder the weight of each method, individual results are filtered to form non-overlapping regions in which gene scores exceed a threshold. These are further used to generate a consensus GI prediction by merging closely positioned regions. This composite approach was shown to achieve a balanced recall and precision of 0.63 when assessed using a reference data set identified by comparative genomics [51].
IslandViewer 4 integrates three composition-based methods (IslandPath-DIMOB, SIGI-HMM and Islander) as well as the comparative genomics method IslandPick. Related strains for genome comparison are automatically selected among the reference organisms available in the database, but users can also rerun the analysis using a different set of manually picked genomes. A database of precomputed genome analyses is also provided. However, there are some limitations for usersubmitted genome analysis, versus precomputed genome analysis: IslandPick predictions are at the time of this publication restricted to the analysis of complete genomes, as the impact of poorly reordered contigs from draft genomes on its performance has not yet been assessed. Also, Islander predictions are so far only available for a subset of precomputed genomes, and not for custom user-supplied genome analyses. Overlapping predictions of the integrated tools in IslandViewer are merged, yielding a significant increase in recall (0.73) while maintaining good precision (0.9), thanks to limiting the methods chosen to be within IslandViewer as only those with relatively high precision [5, see also the analysis below].
PIPS [53] and its graphical user interface (GUI) GIPSy [54] determine the genomic signature deviation using custom scripts and SIGI-HMM, the presence of transposase genes and flanking tRNA genes and the absence in other organisms of the same genus or closely related species. A score assigned to each region, combining the presence of these characteristics, results in predictions at four levels of confidence. GIs are further classified into four subtypes based on the presence of factors for virulence, metabolism, antibiotic resistance or symbiosis. This tool is unique in that it combines sequence composition-based GI predictions with information on the conservation of genomic regions in user-provided custom genomes. However, the manual process in the graphical interface is tedious and reference genomes for comparison cannot be determined automatically, hindering its use for larger-scale analyses, which are becoming increasingly common.
VRprofile [55] performs sequence homology searches to a database of mobile genetic elements MobilomeDB, including notably prophages, ICEs, integrons and insertion sequences, taking into account gene order and clusters. It also uses and merges SIGI-HMM and IslandPath-DIMOB v0.2 predictions. Homology-based gene cluster and composition-based predictions are integrated, giving priority to gene clusters if they are larger than predicted GIs, and removing regions smaller than 5 kb. tRNAs and direct repeats are also predicted to better identify insertion sites and GI boundaries.

Databases of GIs
Several websites hosting webservers and source code of GI predictors offer the possibility to download precomputed predictions for microbial genomes (indicated by a D in Table 1). As detailed later, some resources have further developed databases of predicted or curated GIs in publicly available genome sequences. Updated in 2015, the Islander [44] website provides 3919 GIs identified in 2168 prokaryotic genomes. Detailed information is available for each GI, including the position and sequence of tRNA integration site, the displaced tRNA fragment associated with the integration, biases in nucleotide composition and the gene content highlighting the integrase. Because it focuses on integrases of the tyrosine recombinase family, the Islander database also serves as a resource for the study of integrase site specificity and its evolution.
IslandViewer 4 [47] offers a database of precomputed GI predictions using three methods (IslandPath-DIMOB, SIGI-HMM and IslandPick) for 6193 bacterial isolates as well as Islander predictions for a subset of these isolates. These can be browsed in its interactive visualization interface, further highlighting the presence of AMR genes identified using the Comprehensive Antibiotic Resistance Database (CARD) [56], virulence factors (curated data set) as well as pathogen-associated genes [57]. Results can be downloaded in various text and graphical formats. The database is updated roughly semiannually, to include new genome sequences publicly available in RefSeq.
VRprofile [55] offers precomputed predictions for 2428 complete genomes. However, we were only able to download the list of 'candidate genes', including mobility genes, AMR genes and virulence factors, as well as insertion sequences.
For each GI, a graphical and tabular representation of the gene content, tRNA and insertion sequences/direct repeats provides direct links to the sequence. The database also lists publications associated with the discovery of a GI, as well as other genomes exhibiting sequence similarity to the GI identified.
ICEberg [59] is a database of ICEs in 363 bacterial isolates with experimental data extracted from the literature (n ¼ 211), predicted by bioinformatics methods (n ¼ 219) or directly extracted from GenBank (n ¼ 30). An organism or ICE familybased browser displays detailed information on 460 ICEs, including a genome context view, sequence information as well as related publications.
Finally, the DarkHorse database [60] identifies phylogenetically atypical proteins probably acquired by HGT with various degrees of phylogenetic granularity (strain, species and genus) in 1456 bacterial and archaeal genomes (last updated in 2009). Results are returned in a tabular format with links to the protein sequence.

Benchmark of GI predictors
To enable researchers to select the optimal tool for their needs, it is essential to have large-scale assessments and comparisons of available methods. In 2008, a first assessment of GI predictors by Langille et al. [38] using a data set of GIs identified by the comparative method IslandPick represented a landmark in the field, as it enabled researchers to compare the performance of several popular methods for GI prediction. Since then, numerous methods to predict GIs have been published, claiming significant improvements in recall and accuracy compared with existing methods. However, different positive reference data sets of GIs are used in each publication to demonstrate the predictor accuracy, precluding a direct comparison of the methods. A major limitation of most articles is that the reference data set is often limited to one or only a few genomes to provide a proof of concept for the new method, which does not represent the breadth/variety of microbial genomes. To provide guidance for researchers who need to analyze one to hundreds of bacterial genomes, we have performed an assessment of GI prediction tools, using one comparative data set of islands from 104 genomes, as well as a literature data set of six genomes, plus a negative data set [39].
Among the 37 methods identified for this review (Table 1), including command line tools, GUI-based tools and webservers, we could not find the web resources or source code for only two tools. Eight tools could not be run because of various reasons (Supplementary Table S1), mostly software errors, some of which might be linked to our operating systems but given the number of tools considered here and time constraints could not all be individually solved. The six tools that include comparative genomics approaches (Supplementary Table S1) have been excluded from this analysis because their results largely depend on the genomes chosen as a reference and most tools do not provide an automatic selection of phylogenetically closely related genomes for the analysis. Altogether, 20 compositionbased prediction tools were successfully used to predict GIs on a data set of 104 bacterial genomes described in [39].

Methods
The reference data sets described in [39] comprise a positive data set of GIs predicted by comparative genomics using IslandPick combined from Langille et al [38], IslandViewer 3 [61] and IslandViewer 4 [47] as well as a negative data set of core conserved genomic regions [38]. It includes 104 genomes, representing 22 different bacterial genera and 53 species that cover several common pathogens and environmental bacteria. Each genome encodes between 1 and 77 predicted GIs (mean ¼ 17.7, SD ¼ 16.6), for a total of 1845 GIs. The full data set is available for download in tabular and fasta format at the following URL: http://www.pathogenomics.sfu.ca/islandviewer/download/. The literature positive data set includes curated GIs recovered in the literature for six bacterial genomes: Escherichia coli O157: H7 str. Sakai (NC_002695.1), Escherichia coli CFT073 (NC_004431.1), Salmonella enterica subsp. enterica serovar Typhi str. CT18 (NC_003198.1), Staphylococcus aureus str. MW2 (NC_003923.1), Streptococcus pyogenes str. MGAS315 (NC_004070.1) and Vibrio parahaemolyticus RIMD 2210633 (NC_004603.1). Recall, precision, accuracy, F-score and Matthews correlation coefficient (MCC) were calculated as in [39], based on the number of true positives (TPs) and (FPs) measured as the number of nucleotides in predicted GIs overlapping with the positive and negative data sets, respectively, and true negatives (TNs) and false negatives (FNs) measured as the number of nucleotides outside predicted GIs overlapping with the negative and positive data sets, respectively: When TP and FP were equal to 0, the precision was counted as equal to 1 because the software is being conservative making no prediction. The MCC varies between -1 (complete disagreement between the prediction and the reference data set) and 1 (perfect prediction). An MCC of 0 indicates a prediction no better than random. MCC was considered as 0 when the denominator was equal to 0. The code used to assess GI predictors is available at https://github.com/brinkmanlab/gi_predictor_assessment. All stand-alone software was installed on Debian 3.2.41 x86_64 machines, CentOS 7 machines for GI-SVM and GCProfile and Windows 10 for GIDetector and Design-Island. Webservers were accessed using the URLs in Table 1. Default parameter values were used for all software, as built into the program, as described in the readme document or in the corresponding publication. In the case of MJSD, a significance threshold of 0.99, segmentation model order of 1 and atypicality assessment model order of 2 were used, as in the article. In the case of Wn-SVM [62], parameters -nu of 0.2 and -n of 6 to run the analysis were used, as suggested by Dr Tsirigos. MJSD [35], PAI-IDA [63], SigHunt [29], GC-Profile [36] and INDeGenIUS [31] returned a numerical score as a result and thus required a cutoff to distinguish GIs. For MJSD, Salmonella enterica strain NC_003198.1 was analyzed and compared with the results described in the article [35]. Classifying segments with scores equal or greater than 0.99999 as GIs was found to have the strongest resemblance. The cutoffs 3.9 and 5 were used for PAI-IDA and SigHunt, as suggested in their respected publications. No recommended cutoff was available for INDeGenIUS and GC-Profile, as these softwares produce scores dependent on the composition of the genome being analyzed. For INDeGenIUS, regions with scores exceeding 2 SDs from the mean were considered GIs. For GC-Profile, regions with scores beyond 2 weighted SDs from the weighted mean, in either direction, were classified as GIs. The weights were based on the length of the genomic regions considered.

Results
GI predictors differ greatly in their ability to accurately predict GIs, with consistent trends among the 104 genomes used here in the reference comparative data set ( Figure 1). All tools also show significant variation in their recall, precision and accuracy among the 104 genomes, stressing the need to use a diverse and large data set of genomes to assess GI predictors to avoid biases because of the selection of specific bacterial taxa on which some tools might perform better. Notably, average values for some tools (Table 3)  However, when choosing a predictor, researchers also need to consider the precision, which reflects the software tendency to make FP predictions. Some tools with very low recall to intermediate recall (IslandPath-DIMOB, SIGI-CRF, SIGI-HMM, ZislandExplorer, VRprofile and Islander) have high precision, with Islander at the extreme exhibiting a very low recall but a constantly high precision thanks to its strict criteria to predict a subset of GIs integrated in tRNAs or tmRNAs [44]. Several predictors (PredictBias, AlienHunter, MTGIpick, MJSD and SigHunt) with high recall exhibit a poor precision (0.46-0.59). Overall, sequence composition-based approaches working at the gene level generally present more robust predictions than those analyzing genome-level sequences (greater accuracy; P < 0.01), with lower recall but much higher precision than the aforementioned genome-based methods.
Only IslandViewer 4, a composite method that includes IslandPick comparative genomic approach, SIGI-HMM and IslandPath-DIMOB [47], as well as GIHunter, a single method that incorporates multiple features to build a decision tree for GI prediction [43], show both a high recall and precision. This is reflected in the high overall accuracy, F-score and especially MCC values obtained by those two tools compared with the others. MCC is widely used for software benchmarking and is considered a balanced measure of the correlation between the reference data sets and the observed predictions, robust to different sizes in the confusion table. However, it must be emphasized that for some users, high precision is critical, whereas for others, high recall is essential, and so, a collection of measures is shown.
The positive data set identified by comparative genomics comprises many genomes from different bacterial species and includes some well-known GIs identified in the literature [38,39]. However, GIs in this comparative genomics data set have not been manually curated. GI predictors were thus also evaluated using a positive data set of GIs retrieved from the literature in six bacterial genomes (Table 4). This data set also allows an unbiased assessment of IslandViewer and avoids the circularity linked to the use of IslandPick to build the reference data set and its integration in IslandViewer 4 webserver predictions. The small number of genomes considered here and their biased representation reduces the statistical significance of the analysis, and results should be taken with caution. Nevertheless, note the general congruence of the results (Pearson's correlation coefficient of 0.86). Several methods show highly improved MCC, such as PredictBias, VRprofile, MTGIpick, Figure 1. Accuracy assessment of genomic island prediction methods. Accuracy of genomic island (GI) predictors was assessed using a data set of GIs identified by comparative genomics analysis of 104 genomes [39]. Each genome is represented by a value, with the median, and the first and third quartiles represented in the boxplot as the lower and upper hinges, respectively. Outliers are shown as black dots, if they exceed 1.5 times the interquartile range. Islander and MSGIP, notably thanks to the much higher recall observed. The better performance of tools that require the presence of features of GIs, such as Islander, VRprofile or GIHunter, which were trained on such a literature data set, could partially reflect this training, and the tendency to describe large canonical GIs in the literature rather than smaller regions that are equally well identified by a comparative genomics approach such as IslandPick. As the negative data set and thus FPs remain the same as with the comparative data set, changes in precision are only because of the increase of TPs (also visible from the increased recall) and the choice of a small number of genomes in this literature data set. The creation of a larger data set of literature-based curated GIs, and corresponding negative data sets, would be key to better benchmark GI predictors and assess their ability to predict large multi-modular GIs. GIHunter and IslandViewer 4 remain among the methods with the highest MCC together with PredictBias, although more closely followed by other GI predictors using this literature data set for evaluation.

Ease of use, common issues and solutions
Researchers interested in using some of these GI predictors may face difficulties accessing the software code and installing it, obtaining the correct input files and interpreting the results-in particular when requiring predictions for large numbers of isolates with minimal manual handling. Early GI predictors were developed as command line software in various coding languages, with a tendency to publish GUIs, webservers or databases providing an easier access for non-bioinformaticians in the following years (Table 1). GI predictors developed recently all include a GUI and/or a webserver as do some other older successful predictors and databases with good long-term maintenance such as Colombo [40], PAIDB [58] and IslandViewer [47].
To help researchers access existing and adapted tools for their needs, we have summarized key features of the software investigated in Table 1. More detailed information is available in Supplementary Table S1. An important criterion for choosing a suitable GI predictor can be the type of input file required. Whereas, some tools require simple nucleotide fasta files (.fna and .ffn for genes), or standard flat file formats from nucleotide repositories such as GenBank (.gbk, .gbff and .gb) or Embl (.embl), others require tabular formats that are no longer provided by default in nucleotide repositories (.ptt and .rnt; Table 1). These latter tables can be inferred from standard flat file formats using custom scripts, but they therefore require some knowledge of a coding language.
Also, with the democratization and the popularity of shortread sequencing, most sequences currently submitted to public databases are draft/incomplete genomes (i.e. they include several contigs). Most tools only accept complete genomes as input, except (theoretically) five webservers. IslandViewer 4 is one such tool, which provides the possibility to select a reference genome for contig reordering before GI prediction [47]. According to its publication, MTGIpick accepts fasta files with multiple contigs [27], but the corresponding website indicates that only files with one nucleotide sequence are accepted. PAI-Finder accepts nucleotide fasta files of genes and can, therefore, handle unfinished genomes but with a limit of 1000 sequences per submission [58]. GI-POP, a webserver previously accepting draft genomes, has not been accessible since publication [64]. Finally, VRprofile website provides a tool called CDSeasy to annotate and reorder contigs according to a complete reference genome that provides a gbk file with a pseudochromosome that can be used as input to VRprofile [55]. To overcome limitations and use other GI predictors on draft genomes, a common strategy is to reorder contigs according to a closely related complete reference genome using tools such as Mauve [45] or ABACAS [65] and concatenate them in a single pseudochromosome. Contigs that could not be reordered are generally placed at the end of the pseudochromosome. Predictions on such pseudochromosomes should be interpreted with caution, considering the position of gaps between contigs and the possibility that some smaller contigs belonging to the GI could not be placed correctly. Indeed, transposases and integrases that are features of GIs and used by many prediction software are often found in single small contigs because such mobile elements exist in several identical copies in the genome, leading to difficulties in genome assembly. Finally, the quality of the draft genome (fewer contigs is better) and the genetic distance with the reference genome (the closest reference is generally better) may affect the reordering and the quality of GI predictions.
Several predictors have flexible input parameters to perform the analysis. Although this is desirable, to customize predictions for particular genomes of interest, this can become problematic when defaults are lacking or recommended parameters are not clearly stated. Similarly, some GI predictors such as INDeGenIUS, MJSD, PAI-IDA, SigHunt and GC-Profile return a numerical score or bias value that needs to be interpreted to infer the position of GIs along the genome. This flexibility often comes at the expense of software ease of use-particularly for larger-scale analyses. Indeed, in the absence of closely related reference genomes with highly curated GIs from literature that can be used as benchmark, optimal parameters and cutoffs may be difficult to choose, given the wide variation in prediction accuracy among different bacterial species. All parameters and cutoffs used in this analysis (defaults, suggested with the software or stated in the publication) are mentioned in Supplementary Table S1.
Finally, most current microbial genome projects now involve the sequencing of many isolates. These large-scale analyses require adapted tools to minimize manual handling time that is notably usually required for submissions to webservers. Command line tools, which comprise most methods currently available, are well adapted to batch submissions for researchers with some basic knowledge and coding ability. Other tools available as webservers only, such as PredictBias [66], can be automatized using packages such as Selenium Python, but this requires more advanced coding knowledge. The recently released IslandViewer 4 [47] offers an HTTP API for submission of many genomes and retrieval of results using various coding languages or simple cURL commands.

Choosing a GI predictor
Given the variety of GI predictors and their diverse features, the choice of a tool will depend on characteristics of the genome to be analyzed as well as on requirements of the planned research. The availability of closely related genomes is necessary to use methods based on comparative genomics that generally enable a more precise definition of GI boundaries. Adapted to single genomes, the numerous methods based on sequence composition are commonly used. Depending on the research focus, methods with lower recall but very high precision may be of interest, to ensure robust predictions, or conversely, high recall may be desired to reduce the chance of missing predicting a key GI region. For example, focusing on canonical GIs integrated in tRNAs would require software such as Islander [44] or tRNAcc [49] that use the presence of a tRNA integration site as a requirement to issue a GI prediction (Table 2); however, this method will miss the majority of islands, which are non-canonical. Depending on the tool, composition-based prediction methods may have less accurate determination of GI boundaries and will require careful inspection. In addition, the availability of genome annotation as well as the status of the genome assembly, complete or incomplete/draft including several contigs, largely reduce the choice of possible tools. To facilitate the selection of an appropriate tool, we have summarized the characteristics of all tools and provide a multi-entry decision table in Figure 2.
Because of the varying sensitivity and criteria used by GI predictors, predictions obtained using different tools generally only partially overlap. As previously observed [5] and used for the development of composite methods such as IslandViewer [67], VRprofile [55] and EGID [51], combining several methods results in a significant increase in the recall of GIs. In general, the selection of highly precise methods is advisable to preserve a good prediction accuracy, as shown earlier for IslandViewer 4 and VRprofile. Finally, the combination of both composition-based and comparative genomics methods has the potential to identify GIs with different characteristics.

GI visualization and future needs
Although there is a notable trend toward providing GUIs and webservers to enable non-bioinformaticians to run GI predictions on user-provided genomes, most interfaces only have limited ability to visualize GIs. GIPSy [54], GC-profile [36] and PredictBias [66] only provide GIs as tabular results. PredictBias further displays a tabular list of genes encoded in the predicted GIs, as well as potential VFs. ZislandExplorer [37] provides static images with GIs highlighted, whereas Islander [44] and VRprofile [55] display static graphs of genomic features as well as tabular results. Che et al. have developed a Cþþ tool named GIV [68] to plot AlienHunter or GIHunter predictions and GI-associated features in a circular representation using Circos [69].
Only a few resources provide more advanced visualization techniques. According to the publication, MTGIpick [27] presents a dynamic GI visualization in the form of a circular representation of genomes with predicted GIs that allow the user to zoom in and out. However, we were unable to successfully run MTGIpick stand-alone version. PAIDB [58] displays linear views of precomputed GI predictions featuring genes as arrows on the two strands. Clicking on any arrow displays additional information, including gene function, sequences and related publications. IslandViewer provides by far the most extensive, integrated and interactive visualization of GIs and their genetic content, highlighting virulence factors (VFs), AMR determinants and pathogen-associated genes [47]. Gene content can be browsed in a pop-up table with direct links to public databases such as NCBI for publicly available genomes, as well as CARD [56] for AMR, and VFDB [70], Victor's virulence (http:// www.phidias.us/victors/) and PATRIC [71] for VFs. Since Figure 2. Summary of GI predictor characteristics with a multi-entry decision table. GI predictors were classified according to their interface and the requirements. Sensitivity (recall) and precision are color coded using a gradient from low-to-high as assessed using the comparative genomics data set. Methods that were not assessed (comparative genomics) or that we were unable to assess are shown in white. IslandViewer 3 release [61], synchronized circular and linear views with intuitive zoom and navigation features represent GIs broken down by a prediction method using GenomeD3Plot [72]. Moreover, IslandViewer allows the independent visualization of two genomes side by side, as a first step to enable GI comparison between bacterial isolates.
Further development of advanced visualization techniques is required to facilitate the analysis of the key role of GIs in bacterial adaptability. Indeed, integrated views with AMR genes and VFs are a critical step to investigate bacterial evolution, particularly in the context of infectious disease outbreaks, which are now becoming routinely investigated using whole-genome sequencing (termed genomic epidemiology) [73,74]. For environmental strains, automatic identification of metabolic pathways or transporters that could enable the microbe to thrive on different food sources would be valuable. Further identification and highlighting of interesting features such as phage genes, mobility genes (integrases, transposases), preferential insertion sites such as tRNAs as well as bordering direct repeats (when existing) would facilitate the analysis of the genome dynamic and the origin of these horizontally acquired genomic regions. Finally, and perhaps most pressingly, there is a high need to visualize comparative genomic analyses at a larger scale. Specific interfaces must be developed to accommodate the need to smoothly and interactively visualize up to hundreds of microbial genomes and their GI predictions, identifying clusters of related islands and rearrangements of potential interest.

Moving toward integrated analyses beyond the single genome
There has been a resurgence of interest in GIs in recent years, as large-scale microbial genome analyses, including genomic epidemiological analyses, further reveal the importance of GIs in microbial adaptation to environmental conditions-including medically relevant adaptations by pathogens. The prediction of GIs and the analysis of their gene content have become an essential component of microbial investigations, from genome reports to detailed comparative genomics studies. In particular, in clinical microbiology and epidemiology, it is increasingly needed to rapidly identify recently acquired genomic regions that may be unique to a disease outbreak, or associated with a pathogen strain that has a particular phenotype of interest, such as increased persistence, transmissibility, AMR or resistance to solvents used for their control, as investigated for Listeria monocytogenes [75], Pseudomonas aeruginosa [76] and E. coli [77]. Many genomes can now be sequenced daily and made available rapidly using a single short-read sequencer or certain longer-read technologies, such that sequence analysis represents the major bottleneck. Among the many tools currently available to predict GIs presented in this review, each method has its strengths and weaknesses depending on the type of genomic data available as well as research requirements, but a combination of methods is often most suitable and desirable. Predicted GIs should be manually checked and their gene content searched for genes of medical or environmental interest, boundaries eventually refined, and GIs can be further compared using homology searches. Further analysis of integration sites and GI conservation with further modular gene acquisition or loss across larger numbers of samples are desirable to obtain a better picture of bacterial genome evolution. Tools to predict, analyze and visualize GIs need to accommodate the need for better-integrated evaluation of microbial genomes. The development of more comprehensive methods to identify GIs, their origin and their encoded features will enable researchers to implement real-time tracking of microbial genome evolution in answer to changing environmental conditions and selection pressures such as antibiotic use.

Key Points:
• GIs are clusters of microbial genes that were likely horizontally acquired. They are key players in genome evolution, facilitating microbial adaptability of wide environmental, industrial and medical interest.
• GI predictors show highly varying levels of recall and precision with consistent trends across diverse microbial species.
• Benchmarking of new GI predictors should preferably be performed on large and varied data sets of bacterial/ archaeal isolates.
• More integrated interactive analysis and visualization of GIs in multiple genomes is now required, as the demand for large-scale analyses increases.

Supplementary Data
Supplementary data are available online at https://academ ic.oup.com/bib.