Defining syntenic relationships among orthologous gene clusters is a frequent undertaking of biologists studying organismal evolution through comparative genomic approaches. With the increasing availability of genome data made possible through next-generation sequencing technology, there is a growing need for user-friendly tools capable of assessing synteny. Here we present SimpleSynteny, a new web-based platform capable of directly interrogating collinearity of local genomic neighbors across multiple species in a targeted manner. SimpleSynteny provides a pipeline for evaluating the synteny of a preselected set of gene targets across multiple organismal genomes. An emphasis has been placed on ease-of-use, and users are only required to submit FASTA files for their genomes and genes of interest. SimpleSynteny then guides the user through an iterative process of exploring and customizing genomes individually before combining them into a final high-resolution figure. Because the process is iterative, it allows the user to customize the organization of multiple contigs and incorporate knowledge from additional sources, rather than forcing complete dependence on the computational predictions. Additional tools are provided to help the user identify which contigs in a genome assembly contain gene targets and to optimize analyses of circular genomes. SimpleSynteny is freely available at: http://www.SimpleSynteny.com.
Understanding patterns of conserved synteny from the genomes of different organisms is a central undertaking in the field of molecular biology. Originally, synteny was defined through cytogenetics, and referred to the presence of two or more loci located on a single chromosome (1). With the widespread application of next-generation sequencing and the ability to routinely assemble whole genome datasets, it is now also common to describe synteny interchangeably with collinearity, or the conservation of gene order and orientation. Qualitatively, synteny can take different forms, depending on the scale involved. Macrosynteny refers to collinearity of gene order at the whole-chromosome scale, microsynteny describes a small number of genes exhibiting collinearity across a given sub-chromosomal region and mesosynteny is characterized by the conservation of gene content within a chromosome in the absence of collinearity (2). Because the degree of synteny breaks down over time through various processes, including chromosomal rearrangements, gene losses and gains, and chromosomal duplications and losses, assessments of synteny allow biologists to address questions related to the evolutionary divergence of organisms and gene families. The focus of syntenic exploration may be limited to genes on a single chromosome or expanded to consider the organization of an entire genome, and comparisons may be performed between multiple species depending on the question at hand (3).
Computational tools for assessing synteny are becoming increasingly important components of comparative genomic studies, particularly with the proliferation of draft genome assemblies containing large numbers of contigs. A few examples of programs for detecting novel syntenic regions include Proteny (4), i-ADHoRe (5) and DRIMM-Synteny (6). A recent summary of additional predictive software tools is provided in (7), while a review on visualizing genomic comparisons is available in (8). Generally speaking, each program for estimating synteny offers a trade-off with respect to the number of genomes accommodated, method of scoring, scale of analysis (micro- versus macro-syntenic) and user interface (command-line versus graphical/web-based). When selecting a program for a particular project, it has been noted that some perform better when dealing with closely related taxa, while others do better with more divergent ones (4,7). Accordingly, results on the same data set can vary widely between programs (9).
Programs typically provide visual assessments of macrosynteny in the form of dot or circle plots, which often lack the fine-scale detail necessary to easily observe individual genes. For example, the web-based CoGe Comparative Genomics Platform (http://genomevolution.org) provides the SynMap tool (10) to generate a dot plot, which allows the user to click on a region of a chromosome to zoom in. It is even possible to click on a chromosome segment and be taken to CoGe's Genome Evolution Analysis (GEvo) viewer to display individual gene information if an additional GFF file has been provided. However, when a user hovers their mouse over a region of the dot plot in SynMap, no gene names are displayed. A user looking for an individual gene is required to estimate its general location and repeatedly jump back and forth between SynMap and GEvo until it is found. Another web-based visualization tool with greater focus on displaying syntenic information for individual genes across one or more genomes is the Multi-Genome Synteny Viewer (mGSV) (11). The program expects the user to identify their genes of interest prior to use and requires them to provide the exact location of each gene for all genomes. To help with this task, the authors provide Perl scripts to export results from BLAST (12). In turn, this requires the user to be familiar with the command line and could entail writing a custom script if there is a need to convert results from an unsupported format. The program also does not currently allow the user to save a figure of their final analysis. Since gene names are only displayed when the mouse hovers over them, taking a screenshot is not an ideal solution.
Keeping the above issues in mind, we present here a new web-based tool called SimpleSynteny to provide informative visuals of microsynteny. In contrast to the higher-level view provided by dot and circle plots, our pipeline provides a more detailed perspective for researchers exploring a preselected set of gene targets. Unlike dedicated browsers that display synteny for a subset of curated genomes, such as the Yeast Gene Order Browser (13), we allow the user to upload a limited number of contigs from any genome of their choosing. An emphasis has been placed on accessibility so that the tool is readily usable to those without advanced computer or scripting skills. In the following section we detail how the server works and highlight some features and additional tools before showing two examples of SimpleSynteny analyses. First, we show a side-by-side comparison of SimpleSynteny and mGSV by recreating an analysis of a secondary metabolite cluster in two fungi. The second example details a more complex analysis, using mating genes found across eight different fungi.
MATERIALS AND METHODS
The standard pipeline for SimpleSynteny consists of three primary steps: (1) data input, (2) contig editing and (3) customization of graphical output. Users are also provided with two optional tools: one that identifies contigs of interest from FASTA files, and an advanced mode that allows for custom image manipulation.
The SimpleSynteny pipeline begins on the main ‘Step 1’ page where individual genome and gene target files in FASTA format are uploaded. At least one genome file and one gene file are required, but a single gene target file can be assigned to multiple genomes as discussed below. FASTA definition lines are used to label all contigs and genes; however, the user can manually edit the names of genomes during the upload process. Each genome file can contain up to ten contigs (or supercontigs, scaffolds, chromosomes, etc.). For cases where the user does not know which contig(s) in a complete genome assembly contain their genes of interest, the genome assembly can be preprocessed using the optional ‘Contig Finder’ tool (detailed below) to quickly identify and export only the sequences containing the target region(s) into a single merged FASTA file. Each gene target file can contain up to 60 nucleotide or protein sequences. When comparing multiple genomes, SimpleSynteny draws connections between genes with identical names. Accordingly, consistent naming and spelling in gene definition lines are important if multiple gene target files are used. After completing file uploads, the user can adjust additional settings using the ‘Advanced Settings for Gene Matches,’ which allow for customization of basic BLAST parameters, a threshold to exclude target sequences that do not have a minimum percentage of positions contained within BLAST hits and an optimization setting for aligning circular genomes.
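The coverage threshold described above can be illustrated with a short sketch. This is not the server's code: the function names, the 1-based inclusive coordinates and the default cutoff of 50% are assumptions made for illustration only. The sketch merges overlapping hit intervals on a target so that no position is counted twice, then compares the covered percentage against the cutoff.

```python
from typing import List, Tuple


def covered_fraction(query_len: int, hits: List[Tuple[int, int]]) -> float:
    """Fraction of target positions covered by at least one BLAST hit.

    `hits` holds (start, end) coordinates on the target sequence,
    assumed here to be 1-based and inclusive.
    """
    if query_len <= 0:
        raise ValueError("query_len must be positive")
    covered = 0
    last_counted = 0  # highest position already counted
    # Normalise each interval so start <= end, then sweep left to right.
    for start, end in sorted((min(s, e), max(s, e)) for s, e in hits):
        start = max(start, last_counted + 1)  # skip positions already counted
        end = min(end, query_len)             # ignore anything past the target
        if end >= start:
            covered += end - start + 1
            last_counted = end
    return covered / query_len


def passes_threshold(query_len: int,
                     hits: List[Tuple[int, int]],
                     min_percent: float = 50.0) -> bool:
    """True if the covered percentage meets the (hypothetical) cutoff."""
    return 100.0 * covered_fraction(query_len, hits) >= min_percent
```

For a 100-residue target with hits spanning positions 1-40 and 30-60, the overlapping region is counted once, so 60 positions are covered; the target passes a 50% cutoff but fails a 70% cutoff.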
Discovery versus visualization
The SimpleSynteny visualization pipeline does not explicitly score or evaluate syntenic relationships between targets. Accordingly, users should be aware of the difference between assigning a single gene target file to multiple genomes, versus the use of individual gene target files that uniquely correspond to each genome. In the former case, when BLAST is searching using sequences from a different genome, novel discovery is taking place. Such evaluations can be a fast and convenient first step for researchers to visually explore syntenic relationships between genomes. However, confirmation of results using additional tools may be required, particularly when comparing distantly related species. In contrast, when specific target files are supplied for all genomes, SimpleSynteny functions strictly as a visualization tool, as the orthologous relationships between targets have already been confirmed by the user.
In ‘Step 2’ of the SimpleSynteny pipeline, BLASTN or TBLASTN is used to align nucleotide or protein targets, respectively, onto the contigs of the first genome. The target-mapping process typically takes less than 30 s, and the hit with the best E-value for a given target on a particular contig is used to set the strand direction when drawing annotations. A warning will appear if a given target gene does not map to the genome, along with the reason that mapping was not completed, giving the user the opportunity to go back to Step 1 and adjust threshold settings as appropriate. A preview figure will then load on screen, showing all contigs for the first genome in a horizontal layout, including gene locations and orientations. Genes are uniquely color-coded for ease of visualization. The user can adjust the position and orientation of each contig, for example by moving it left or right, and can remove individual genes at their own discretion, before repeating this process for any remaining genomes. Clicking the ‘Show Other Genomes’ button allows the user to see previously edited genomes alongside the current selection to aid in decision making. When editing multiple genomes, the ‘Try to Optimize Contig Order’ button will attempt to automatically arrange contigs in the same order as the first edited genome. The editing of contig positions and orientation continues until all genomes are processed.
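The rule of using the best E-value hit per target and contig to set strand direction can be sketched as follows. The server's internal BLAST output format is not described here, so this example assumes the standard 12-column tabular output of BLAST+ (`-outfmt 6`), in which a hit on the minus strand of the subject is reported with `sstart` greater than `send`. The `Hit` type and the `best_hits` function are hypothetical names.

```python
import csv
import io
from typing import Dict, NamedTuple, Tuple


class Hit(NamedTuple):
    target: str    # qseqid: name of the gene target
    contig: str    # sseqid: name of the contig it maps to
    sstart: int    # start of the alignment on the contig
    send: int      # end of the alignment on the contig
    evalue: float

    @property
    def strand(self) -> str:
        # In BLAST+ tabular output, minus-strand hits have sstart > send.
        return "+" if self.sstart <= self.send else "-"


def best_hits(tabular_text: str) -> Dict[Tuple[str, str], Hit]:
    """Keep the lowest E-value hit for each (target, contig) pair.

    Assumes the default column order of `-outfmt 6`:
    qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore
    """
    best: Dict[Tuple[str, str], Hit] = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        hit = Hit(row[0], row[1], int(row[8]), int(row[9]), float(row[10]))
        key = (hit.target, hit.contig)
        if key not in best or hit.evalue < best[key].evalue:
            best[key] = hit
    return best
```

With this approach, a weaker secondary hit on the opposite strand of the same contig does not change the direction in which the gene is drawn, because only the top-scoring hit is consulted.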
When all genomes have been edited, ‘Step 3’ of SimpleSynteny allows the user to adjust an array of image settings, save the final completed figure, and generate summary data from the analysis. The ‘Basic Image Settings’ section allows the user to select from several standard image formats (PNG, JPG, GIF, EPS, TIFF or PDF), adjust the width and height of the image, and customize image resolution. Images can be drawn at low resolution for rapid visualization, or the user can increase the resolution up to 1200 dots per inch for publication-quality graphics. Under ‘Genome Adjustments,’ the user may alter the size and placement of genome names. The program also provides an option to automatically attempt to declutter the syntenic diagram by reordering genomes to minimize either the Euclidean distance of lines connecting genes or the number of arrows indicating changes in gene direction. The section ‘Drawing Style’ provides options for converting a figure to gray scale, adjusting the manner in which gene labels are displayed, or toggling a gene shading option to highlight regions along the full-length sequence covered or excluded by significant BLAST hits. This latter feature can be useful as a quick visual indicator of sequence homology, for example when mapping proteins from one taxon onto another. The user can repeatedly adjust any of the above settings and generate a new preview image before deciding to download a final image file. They can also bookmark the URL of the ‘Step 3’ page and revisit their project for up to 72 h before it is deleted from the server. Generating a preview image typically takes less than a minute but requires more time as image dimensions and resolution settings are increased. Final full-resolution figures are provided for download in a ZIP file. The archive also contains other useful documentation, including server settings, lists of any unmapped genes, results provided from BLAST and a set of human-readable ‘contig mapping’ (CMAP) files for each genome for use with Advanced Mode as described below.
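As an illustration of the decluttering option based on Euclidean distance, the sketch below searches all vertical orderings of a small set of genomes and keeps the one with the shortest total length of connecting lines. The server's actual procedure is not specified here. The sketch assumes that genomes are drawn as evenly spaced rows, that lines join same-named genes in adjacent rows only, and that each gene is reduced to a single x coordinate; `total_line_length`, `best_order` and `row_gap` are hypothetical names.

```python
from itertools import permutations
from math import hypot
from typing import Dict, Sequence, Tuple

# One genome: gene name -> x coordinate of the gene in the drawing.
Genome = Dict[str, float]


def total_line_length(order: Sequence[str],
                      genomes: Dict[str, Genome],
                      row_gap: float = 1.0) -> float:
    """Sum of straight-line lengths between same-named genes in adjacent rows."""
    total = 0.0
    for upper, lower in zip(order, order[1:]):
        shared = genomes[upper].keys() & genomes[lower].keys()
        total += sum(
            hypot(genomes[upper][g] - genomes[lower][g], row_gap)
            for g in shared
        )
    return total


def best_order(genomes: Dict[str, Genome],
               row_gap: float = 1.0) -> Tuple[str, ...]:
    """Exhaustively try every row order; practical only for a few genomes."""
    return min(
        permutations(sorted(genomes)),
        key=lambda order: total_line_length(order, genomes, row_gap),
    )
```

An exhaustive search is feasible here because at most ten genomes are involved, but the number of orderings grows factorially. Note also that an ordering and its mirror image always score the same, and that this simple objective can favor placing genomes with few shared genes next to each other, so a practical implementation would likely need additional tie-breaking rules.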
Contig Finder is an accessory tool included with SimpleSynteny to allow easy identification and extraction of contigs containing gene targets within a genome. The user first uploads a single genome in FASTA format, up to 250 MB in size. Larger genome files will need to be split into parts and processed separately. After the file is uploaded, a text area appears where the user can paste nucleotide or protein target sequences to search the genome using BLAST. If any hits are found, results are sorted in descending order so that the contigs containing the most hits are shown first. The user can then add up to 10 contigs to an export list to save the sequences in FASTA format.
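The ranking and export behavior of Contig Finder can be approximated offline with a few lines of code. The following is a sketch under stated assumptions, not the tool's implementation: it takes a list of contig names (one per BLAST hit), ranks contigs by hit count in descending order, and extracts up to 10 chosen contigs from FASTA text. The function names are hypothetical, and contig names are assumed to be the first whitespace-delimited word of each FASTA definition line.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def rank_contigs(hit_contigs: Iterable[str]) -> List[Tuple[str, int]]:
    """Contigs sorted by number of hits, most hits first (ties by name)."""
    counts = Counter(hit_contigs)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def extract_contigs(fasta_text: str,
                    wanted: Set[str],
                    max_contigs: int = 10) -> str:
    """Return a merged FASTA string holding only the wanted contigs."""
    if len(wanted) > max_contigs:
        raise ValueError(f"at most {max_contigs} contigs can be exported")
    kept_lines = []
    keep = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            # The contig name is taken to be the first word after '>'.
            keep = bool(fields) and fields[0] in wanted
        if keep:
            kept_lines.append(line)
    return "\n".join(kept_lines) + ("\n" if kept_lines else "")
```

For example, `rank_contigs(["c2", "c1", "c2"])` lists `c2` (two hits) ahead of `c1` (one hit), and the names selected from that ranking can then be passed to `extract_contigs` to build a single merged FASTA file.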
Advanced Mode allows the user to utilize CMAP files to directly interface with the SimpleSynteny figure generation engine. This allows the user to fully customize gene shading, or make additional edits to figures generated in Regular Mode using the CMAP files provided with user output as a starting template. Additional genes or knowledge obtained from other programs, for example those to search for tandem repeats, can also be incorporated. In brief, each CMAP file describes a single genome, with each line detailing an individual contig. Gene entries each delineate the name, direction and start/stop coordinates for both gene and shading boxes. Complete details on the CMAP format are provided in the site documentation. The Advanced Mode interface is designed to auto-correct the spacing of elements and to provide basic hints when invalid CMAP lines are submitted. Once 1–10 valid CMAP files have been submitted through the Advanced Mode interface, the user can immediately advance to ‘Step 3’ (described above) to generate their figure.
Demo mode example: recreating a syntenic analysis of a secondary metabolite protein cluster
O'Connell et al. recently highlighted the organization of a polyketide synthase secondary metabolite gene cluster (Colletotrichum graminicola Cluster 18) in two fungal plant pathogens, as shown in Figure 1B (15). Cluster 18 contains 15 genes, most of which are upregulated during host infection by the Arabidopsis pathogen Colletotrichum higginsianum, but not by the maize pathogen C. graminicola. To recreate this syntenic analysis using SimpleSynteny, genomes and relevant proteins were obtained from the Broad Institute Colletotrichum Database (http://www.broadinstitute.org/annotation/genome/colletotrichum_group) for C. graminicola and C. higginsianum. Results generated using SimpleSynteny are shown in Figure 1A. For purposes of comparison, the same analysis was performed using mGSV, after genes were mapped using BLAST to help generate the required mGSV synteny and annotation files per the site's documentation (Figure 1C). Both programs reproduced the gene cluster organization; however, user time varied greatly, with SimpleSynteny taking ~10 min of user time versus ~45 min to prepare files using the mGSV pipeline. In addition, the quality of the output varied considerably, with SimpleSynteny yielding a customizable, publication-quality graphic in either color or gray scale versus the mGSV screenshot.
A second example comparing the MAT1 locus between eight fungal taxa shows a conserved core group of genes in the region
In this example we show how SimpleSynteny is able to generate a more complicated syntenic comparison between the genomes of multiple, divergent organisms (Figure 2). Shown is a syntenic comparison of a genome region containing the fungal mating type gene MAT1 and 16 surrounding genes, encompassing ~100 kb. These 17 genes were mapped to the genomes of eight filamentous fungi in the ascomycete sub-phylum Pezizomycotina, incorporating members of groups that last shared a common ancestor approximately 302 MYA (16). Depending on the organism, the 17 target genes were contained within 2–7 contigs. Detailed information describing this analysis and the datasets used to generate it are provided in supplemental materials.
We have presented here SimpleSynteny, which we hope will be a useful tool for biologists in a range of disciplines, including those without expertise working with command line software. The evaluation of structural changes among species is a fundamental step in many comparative genomics studies, and the visualization of intact, disrupted or duplicated gene regions is integral to many analyses. The SimpleSynteny pipeline is designed to provide a fast new method to visualize syntenic gene regions mapped across one or more genomes. We envision the tool being used either independently, or in conjunction with other programs that specialize in broadly comparing entire genome assemblies. Researchers dealing with circular assemblies such as organelles or bacterial genomes may find SimpleSynteny particularly helpful, as our circular genome option automatically aligns genomes to start at the same gene without the need for editing files. In the future, we hope to incorporate additional features into SimpleSynteny, such as more ways to customize gene shading. We also hope to eventually add an option to highlight introns and exons through the server's Regular Mode. Additional details on how to use the SimpleSynteny tool are available in the website documentation and the server can be freely accessed without any login requirement at: http://www.SimpleSynteny.com.
Supplementary Data are available at NAR Online.
We thank Y. Rivera, C. Salgado Salazar, L. Beirn, J. Demers and the reviewers for their valuable feedback and suggestions to improve server functionality. Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the USDA. USDA is an equal opportunity provider and employer.
2013-2015 United States Department of Agriculture (USDA)—Animal and Plant Health Inspection Service Farm Bill 10201 and 10007 Program funds (to J.A.C.); USDA—Agricultural Research Service (ARS) [project 8042-22000-279-00D]; USDA-ARS Floriculture and Nursery Research Initiative [project 0500-00059-001 to J.A.C.]. Inter-agency fellowship agreement between the United States Department of Energy (DOE) and the USDA through the Oak Ridge Institute for Science and Education ARS Research Participation Program Fellowship [DOE contract DE-AC05-06OR23100]; Class of 2013 USDA-ARS Headquarters Research Associate Award (to J.A.C.). Funding for open access charge: U.S. Department of Agriculture, Agricultural Research Service.
Conflict of interest statement. None declared.