Insyght: navigating amongst abundant homologues, syntenies and gene functional annotations in bacteria, it's that symbolic!

High-throughput techniques have considerably increased the potential of comparative genomics whilst simultaneously posing many new challenges. One of those challenges involves efficiently mining the large amount of data produced and exploring the landscape of both conserved and idiosyncratic genomic regions across multiple genomes. Domains of application of these analyses are diverse: identification of evolutionary events, inference of gene functions, detection of niche-specific genes or phylogenetic profiling. Insyght is a comparative genomic visualization tool that combines three complementary displays: (i) a table for thoroughly browsing amongst homologues, (ii) a comparator of orthologue functional annotations and (iii) a genomic organization view designed to improve the legibility of rearrangements and distinctive loci. The latter display combines symbolic and proportional graphical paradigms. Synchronized navigation across multiple species and interoperability between the views are core features of Insyght. A gene filter mechanism is provided that helps the user to build a biologically relevant gene set according to multiple criteria such as presence/absence of homologues and/or various annotations. We illustrate the use of Insyght with scenarios. Currently, only Bacteria and Archaea are supported. A public instance is available at http://genome.jouy.inra.fr/Insyght. The tool is freely downloadable for private data set analysis.


Parallel linked track or trapezoid
Lines are drawn to join homologous regions between two or more stacked-up genomes. The user visualizes the genomic context together with the rearrangements for each comparison (Supplementary Fig. 1-E).
- Well adapted to visualising simple genomic reshaping occurring at a few loci.
- Scattered and highly segmented rearrangements result in a tangle of lines that is very difficult to comprehend and interact with.
- Circos (65): genomes are laid out in a circular arrangement. This minimizes the cross-over of lines connecting multiple genomic regions.
- SyMAP (4): uses a 3D approach where a reference stands in the middle and multiple two-by-two comparisons revolve around it. This layout also minimizes the cross-over of lines.

Symbolic representation
Symbols of uniform size are used instead of a representation proportional to the genomic sizes. In other words, the scale becomes the annotation events, and the metric no longer depends on the genomic base pair coordinate system but on legibility by human eyes.
- The display scale need not be adjusted to the size of the features of interest.
- Legibility by human eyes.
- Not possible to achieve a genome-wide overview of the conserved regions.
Two approaches have been explored to implement this graphical paradigm:
- A parallel linked-track representation where the size of genes is standardized: VisGenome with CartoonPlus (96), AutoGRAPH (63). Although the genes are more easily identified, it does not address the risk of confusion due to the jumble of lines.
- Table chart where columns are delimited by genes and rows by compared genomes (Supplementary Figure 1-F). This strategy is used by Genomicus (78).

Supporting Table 3. Distribution of the dispensable gene set of V583 in relation to 20 loci of interest regarding horizontal gene transfer. These loci were reported in various studies (42,45,46) (columns 3 to 5). The dispensable gene set of V583 is retrieved using Insyght by comparing homologies between strains V583, 62, OG1RF, and Symbioflor1 (columns 6 to 12). V583 is a strain of clinical origin, 62 is a commensal isolate from a baby, and OG1RF is a derivative of a human isolate which harbours known virulence traits such as gelatinase (GelE) and the adhesin for collagen (Ace), and exhibits virulence in mice.
The other two strains are Symbioflor1, a probiotic of human origin, and D32, which was isolated from pig faeces. Strain D32 was left aside because its pathogenicity phenotype is not yet well characterised. A "+" in the header refers to the presence of a homolog within the designated strain; a "-" refers to the absence of a homolog. For example, the column V583+ / 62+ / OG1RF+ / Symbioflor1- refers to the gene set from V583 that has homologs in 62 and OG1RF but not in Symbioflor1. For most of the 20 loci (presented as rows), a significant number of genes are retrieved when analysing the overall dispensable gene set. The p-value (binomial law) is presented in parentheses to show that the distribution of the dispensable genome is significantly biased toward those loci and is not random. Pp2 was the only prophage that did not appear in our results, which correlates with it being part of the core genome (42). A previous study (47) found that OG1RF appears not to contain the region homologous to the putative pathogenicity island EF_0479-EF_0628; this is consistent with our data, as this locus appears to share 81 homologs with strain 62 and only 4 with OG1RF.
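As an illustration of the two ideas in this legend (the "+"/"-" presence/absence profile and the binomial p-value), a minimal Python sketch follows. It is not the Insyght implementation: the function names, the toy gene identifiers and the counts are hypothetical, and the one-sided upper-tail test with a null proportion equal to the locus's share of the genome is an assumption, since the legend does not specify how the binomial test was parameterised.

```python
from math import comb


def genes_matching_profile(homologs, present_in, absent_from):
    """Return reference genes that have a homolog in every strain of
    `present_in` and in no strain of `absent_from` (hypothetical helper)."""
    selected = []
    for gene, strains in homologs.items():
        if present_in <= strains and not (absent_from & strains):
            selected.append(gene)
    return sorted(selected)


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Toy homology calls for four reference genes (made-up data, not from the paper):
# each gene maps to the set of compared strains in which a homolog was found.
homologs = {
    "geneA": {"62", "OG1RF", "Symbioflor1"},
    "geneB": {"62", "OG1RF"},
    "geneC": set(),
    "geneD": {"62"},
}

# Profile corresponding to a column such as V583+ / 62+ / OG1RF+ / Symbioflor1-
profile = genes_matching_profile(homologs, {"62", "OG1RF"}, {"Symbioflor1"})
print(profile)

# Assumed null model: a locus holding 10% of the genome's genes should receive
# about 10% of the selected genes by chance. Here 8 of 20 selected genes fall in it.
p_value = binomial_upper_tail(k=8, n=20, p=0.10)
print(f"one-sided binomial p-value: {p_value:.2e}")
```

A small p-value under this assumed null model would indicate that the selected genes fall in the locus more often than its size alone would predict, which is the kind of bias the legend reports.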
Name of loci in V583