With an increasing number of vertebrate genomes being sequenced in draft or finished form, unique opportunities for decoding the language of DNA sequence through comparative genome alignments have arisen. However, novel tools and strategies are required to accommodate this large volume of genomic information and to facilitate the transfer of predictions generated by comparative sequence alignment to researchers focused on experimental annotation of genome function. Here, we present the ECR Browser, a tool that provides easy and dynamic access to whole genome alignments of human, mouse, rat and fish sequences. This web-based tool (http://ecrbrowser.dcode.org) provides the starting point for discovery of novel genes, identification of distant gene regulatory elements and prediction of transcription factor binding sites. The genome alignment portal of the ECR Browser also permits fast and automated alignments of any user-submitted sequence to the genome of choice. The interconnection of the ECR Browser with other DNA sequence analysis tools creates a unique portal for studying and exploring vertebrate genomes.
Received February 11, 2004; Accepted February 25, 2004
Several vertebrate genomes, including those of human, mouse, rat and several fish, have recently been sequenced and assembled, and with the exponential increase of sequencing performance and capabilities, DNA sequences of several other vertebrate genomes are expected to emerge in the near future. A number of studies have underscored the value of comparative genome alignments in the functional annotation of such complex genomes, demonstrating clearly that DNA sequence conservation can serve as a faithful guide to the identification of sequence elements with critical biological functions. This strategy has been validated with the identification of both novel genes and functional noncoding elements (1–3). While comparisons between human and rodent sequences yield informative results in many cases and have been exploited extensively (4–6), many genomic segments are not usefully annotated in such comparisons due to the non-uniform structure and evolutionary rate across vertebrate genomes (7). Moreover, several cardinal features in the human genome are likely to have been acquired or shaped more recently than the human–mouse evolutionary separation (8,9). These examples underscore the need for alternative comparative strategies that can accommodate the evolutionary asymmetries and architectural uniqueness of the human genome.
Several strategies have been devised recently to overcome these difficulties. In particular, the use of multiple species sequence comparisons has been proposed as an alternative to standard pairwise comparisons, aiming at the identification of a subset of sequences that are conserved in multiple species. Using this premise, a new method of multiple comparative sequence analysis was developed (10) based on the identification of an optimal dataset of species to compare that results in the best correlation between multiple conserved sequences (MCSs) and biologically relevant regions. A similar prioritization strategy, using comparisons between human sequence and that of a single distantly related species, the puffer fish, was recently used to identify evolutionarily conserved regions (ECRs) corresponding to critical regulatory elements in large (>1 Mb), highly conserved gene desert regions flanking the human DACH gene (11). This and other similar studies have emphasized the identification of ECRs that arose prior to the divergence of fish and primate lineages and have been conserved since that time (11–13). Another strategy, dubbed phylogenetic shadowing, has been developed to detect and identify more recently or rapidly evolving ECRs including primate-specific functional elements (14), which would not be detected in sequence comparisons of humans and rodents or more distant vertebrates. Taken together, these studies illustrate the fact that no single comparative genomic strategy suffices for genome-wide comparative studies. Rather, there is a pressing need for the development of tools that can integrate sequences of multiple genomes in a custom-made fashion, allowing for a dynamic overlay of orthologous sequences from a selected number of species, as deemed necessary on a case-by-case basis.
To fulfill this need, we have created a genome browser displaying multiple alignments of genomic sequence of various sequenced species including human, rodents and fish. This tool, called the ECR Browser, presents a dynamic representation of sequence comparisons, allowing the user to specify optimal parameters for alignment and display in analysis of genomic regions with different divergence rates. Two main goals have driven the creation of the ECR Browser: (i) to permit genome alignments to be generated, retrieved and displayed quickly, and (ii) to provide maximum flexibility in genome alignments by allowing the user to dynamically adjust alignment and display parameters. These parameters include the number and types of species to be included in the comparison, the sequence to be used as the ‘base’ against which other genomes are compared, types of annotation to be displayed, thresholds to define significantly conserved sequence elements (e.g. sequence lengths and percentage identity with comparable sequences in the base genome) and other features that permit the user to tailor comparisons specifically to regions characterized by different evolutionary rates. Furthermore, the ECR Browser is designed to permit the incorporation of novel genomes immediately as their sequences become available in public databases.
Several strategies have recently been developed to analyze large segments of genome sequence, from whole microbial genomes to homology regions in the chromosomes of higher vertebrates (15–18). For the creation of the ECR Browser, we employed a strategy of genome alignment that is based on four consecutive sequence management steps. Briefly, after masking of repetitive elements, all the genomes were mapped pairwise, to establish large-scale syntenic relationships. Subsequently, each syntenic orthologous pair of sequences was aligned. Finally, data were collected and stored in a central database that is then utilized by the ECR Browser to construct conservation profile graphs at the user's specification.
The sequences and annotation data that are utilized by the ECR Browser are taken from the UCSC Genome Browser (17). In addition to the human, mouse and rat sequences obtained from this source, we have augmented the genome dataset with sequences of three fish genomes, namely Fugu rubripes (http://www.jgi.doe.gov/fugu/), Tetraodon nigroviridis (http://www.genoscope.cns.fr/externe/tetraodon/Ressource.html) and Danio rerio (http://www.sanger.ac.uk/Projects/D_rerio/). Repetitive elements in each genome were identified and masked, using precomputed data downloaded from UCSC where available, or by a local run of the RepeatMasker program (http://www.repeatmasker.org).
Over millions of years of evolution, multiple large-scale rearrangements have altered gene order dramatically in the genomes of fish, rodent, human and other vertebrate lineages. To identify related syntenic blocks in these divergent species, each genome was mapped to all others in pairwise fashion. The dramatically different evolutionary history that separates primates and rodents from fish, compared with the evolutionary separation between different fish or within the primate and rodent lineages, required the application of different approaches to genome alignments in each type of pairwise mapping. For mapping syntenic homologies between more closely related species, such as humans and rodents, we used a locally installed version of the BLAT tool (19). For comparative mapping of more distantly related species, such as humans and fish (or rodents and fish), the more sensitive but slower BLAST tool was employed (20). In the last step of synteny mapping, neighboring short hits of similarity were joined into large blocks of synteny (see Supplementary materials for details). Finally, pairs of orthologous sequences from each syntenic block were aligned with the blastz alignment tool, and long alignments were cleaned of spurious non-diagonal hits (Supplementary materials).
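The joining of neighboring similarity hits into synteny blocks can be illustrated with a minimal sketch. The actual procedure is described only in the Supplementary materials, so the greedy merging rule, the `Hit` record, the `join_hits` function and the `max_gap` threshold below are all assumptions made for illustration and are not the authors' implementation.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass
class Hit:
    """One similarity hit between a base genome and a target genome (hypothetical record)."""
    base_start: int
    base_end: int
    target_chrom: str
    target_start: int
    target_end: int
    strand: str  # "+" or "-"


def interval_gap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Distance between two intervals; 0 if they overlap or touch."""
    return max(0, max(a_start, b_start) - min(a_end, b_end))


def join_hits(hits: List[Hit], max_gap: int = 100_000) -> List[Hit]:
    """Greedily merge hits, sorted along the base genome, into larger blocks.

    A hit is merged into the current block when it lies on the same target
    chromosome and strand and is within max_gap of the block in both genomes.
    max_gap is an arbitrary illustrative value.
    """
    blocks: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.base_start):
        last = blocks[-1] if blocks else None
        if (
            last is not None
            and last.target_chrom == hit.target_chrom
            and last.strand == hit.strand
            and interval_gap(last.base_start, last.base_end,
                             hit.base_start, hit.base_end) <= max_gap
            and interval_gap(last.target_start, last.target_end,
                             hit.target_start, hit.target_end) <= max_gap
        ):
            # Extend the current block to cover the new hit in both genomes.
            last.base_end = max(last.base_end, hit.base_end)
            last.target_start = min(last.target_start, hit.target_start)
            last.target_end = max(last.target_end, hit.target_end)
        else:
            blocks.append(replace(hit))  # copy so the input list is not mutated
    return blocks
```

Because each hit is compared only with the most recent block, interleaved hits from different target chromosomes would split blocks in this sketch; a production pipeline would need a more robust chaining step.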
From a technical viewpoint, alignments of the human and mouse genomes (as an example) utilize 50 Mb of disk space (i.e. significantly less than the size of the original genome FASTA files) and require less than a week on a P4-processor machine to be created. This is significantly faster than any other genome alignment strategy previously reported (16,18). This scale-up in performance and significant savings of disk space allow us to have multiple genome alignments on hand with a relatively short response time to update the ECR Browser as new assemblies of genomes are released.
VISUALIZATION AND DATA BROWSING SCHEME
The conservation-profile visualization scheme of the ECR Browser tool is based on an idea originally implemented in the PipMaker tool (21) and later adopted by both Vista (22) and zPicture (23). In this model, the base genome sequence is schematically displayed as the horizontal axis of a 2D graph, while the vertical axis represents the percentage identity between the base sequence and the sequence being compared (Figure 1). ECRs are differentiated from the neutrally evolving background and are colored according to their classification as protein coding exons, UTRs, introns, repetitive elements or conserved intergenic regions.
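As an illustration of the kind of data behind such a plot, the sketch below converts a pairwise alignment into gap-free segments, each with a position on the base sequence and a percentage identity, in the spirit of a pip-plot. The function name `pip_segments`, its parameters and the half-open coordinate convention are assumptions made for this example and are not taken from the ECR Browser code.

```python
from typing import List, Tuple


def pip_segments(base_aln: str, other_aln: str,
                 base_offset: int = 0) -> List[Tuple[int, int, float]]:
    """Return (base_start, base_end, percent_identity) for each gap-free segment.

    base_aln and other_aln are the two rows of a pairwise alignment, with '-'
    marking gaps. Coordinates are half-open and refer to the ungapped base
    sequence, starting at base_offset.
    """
    if len(base_aln) != len(other_aln):
        raise ValueError("alignment rows must have equal length")

    segments: List[Tuple[int, int, float]] = []
    pos = base_offset          # coordinate of the next base nucleotide
    start = pos                # start of the current gap-free segment
    length = matches = 0

    for b, o in zip(base_aln.upper(), other_aln.upper()):
        if b == "-" or o == "-":
            # A gap closes the current segment, if one is open.
            if length:
                segments.append((start, start + length, 100.0 * matches / length))
                length = matches = 0
            if b != "-":
                pos += 1       # base nucleotide aligned to a gap still consumes a position
            continue
        if length == 0:
            start = pos
        length += 1
        matches += (b == o)
        pos += 1

    if length:
        segments.append((start, start + length, 100.0 * matches / length))
    return segments
```

Each returned segment corresponds to one horizontal stroke in the plot: its extent along the horizontal axis is the base-sequence interval, and its height is the percentage identity.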
The ECR Browser dynamically constructs graphical conservation profiles for any region in the genome, which can be specified either by a gene name or by absolute genomic coordinates of the region of interest. Depending on user preferences the browser augments the conservation profile with an annotation of different genomic features, such as known genes, gene predictions, repetitive elements and single nucleotide polymorphisms, with annotations downloaded directly from the UCSC Genome Browser. Other browser features such as zooming, shifting and re-centering allow for the rapid conversion of the genomic size and coordinates being analyzed (Figure 1).
To accommodate the non-uniform evolutionary structure of vertebrate genomes, a flexible definition of ECR parameters was implemented in the browser. Display variability allows the user to require high stringency parameters in detecting ECRs in slowly evolving genomic regions or less stringent parameters to identify barely distinguishable, short ECRs in other alignments, e.g. in rapidly diverging regions or in comparative analysis of distantly related species. Users can customize the display so that a subset of the available genomes is selected for comparative analysis. Thus, for example, alignments involving sequences from only closely related species might be chosen to analyze rapidly evolving genomic loci. Other custom features include the format of the displayed conservation plot (either a pip-plot or a smooth graph), a selection of different types of gene annotation and selection of picture display parameters (Figure 2).
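The effect of adjusting ECR thresholds can be shown with a small sketch. It assumes a simple sliding-window definition (a window of `min_length` bases whose identity is at least `min_identity`, with overlapping qualifying windows merged); this definition, the function name `find_ecrs` and the default values are illustrative assumptions and may differ from the browser's actual criteria.

```python
from typing import List, Sequence, Tuple


def find_ecrs(match_flags: Sequence[int], min_length: int = 100,
              min_identity: float = 70.0) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals of candidate ECRs.

    match_flags holds one entry per base-sequence position: truthy where the
    aligned species has an identical nucleotide, falsy for mismatches and gaps.
    """
    n = len(match_flags)
    if min_length <= 0:
        raise ValueError("min_length must be positive")
    if n < min_length:
        return []

    # Prefix sums give the number of matches in any window in O(1).
    prefix = [0]
    for flag in match_flags:
        prefix.append(prefix[-1] + (1 if flag else 0))

    ecrs: List[List[int]] = []
    for start in range(n - min_length + 1):
        end = start + min_length
        identity = 100.0 * (prefix[end] - prefix[start]) / min_length
        if identity >= min_identity:
            if ecrs and start <= ecrs[-1][1]:
                ecrs[-1][1] = end          # extend the previous ECR
            else:
                ecrs.append([start, end])  # open a new ECR
    return [(s, e) for s, e in ecrs]
```

With a perfectly conserved 10-base core flanked by unconserved sequence, a stringent setting (`min_length=10`, `min_identity=100`) reports only the core, whereas relaxing the identity threshold to 80% widens the reported element, mirroring the trade-off between stringent and permissive display parameters described above.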
The ECR Browser readily provides user access to DNA sequences corresponding to the genomic region being displayed. The browser also provides access to the sequences of ECRs detected under a specified set of alignment conditions, and a list of their positions in the displayed region; ECR sequences and positions are readily updated as the user alters the parameters of the alignment and display. To provide ready access to individual ECRs, we have introduced the ‘Grab ECR’ option, which allows ‘one-click’ access to any selected ECR in the conservation plot. This option connects to a detailed ECR description page containing ECR sequences from both species in any pairwise comparison, and a display of the underlying DNA sequence alignment. In addition, sequence characteristics such as length, percentage identity, G+C content and genomic coordinates are listed and accompanied by links to the analysis of potential transcription factor binding sites (TFBS) inside the ECR, through the rVista program (24) (Figure 3). Primer selection and design can be performed for any ECR with the ‘primers/oligos’ tool (http://www.primers.dcode.org) that is integrated into the ECR Browser. This tool selects primer sequences of specified length and GC content for the user to choose, and verifies uniqueness for the designed primers by counting the number of times the primer sequence is encountered in the human and mouse genomes. These tools are designed to facilitate the transfer of data and predictions generated in comparative sequence alignments to the laboratory, where the sequences can be tested experimentally for biological function.
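Two of the quantities mentioned above, G+C content and the number of times a primer sequence occurs in a genome, are simple to compute on a small scale. The sketch below is a naive in-memory version; the function names are hypothetical, counting the reverse complement as well as the forward sequence is an assumption of this example, and a genome-scale tool would rely on an index rather than a linear string search.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_content(seq: str) -> float:
    """Percentage of G and C nucleotides in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence over the alphabet A, C, G, T."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def count_occurrences(genome: str, primer: str) -> int:
    """Count occurrences of primer on either strand, including overlapping ones."""
    genome, primer = genome.upper(), primer.upper()
    if not primer:
        raise ValueError("primer must not be empty")

    def count_one(pattern: str) -> int:
        count, i = 0, genome.find(pattern)
        while i != -1:
            count += 1
            i = genome.find(pattern, i + 1)  # step by one to allow overlaps
        return count

    total = count_one(primer)
    rc = reverse_complement(primer)
    if rc != primer:  # avoid double counting palindromic primers
        total += count_one(rc)
    return total
```

Under this scheme a primer would be considered unique when `count_occurrences` returns 1 for the genome of interest.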
As a by-product of the synteny mapping that is required for accurate comparative alignments (described above), the ECR Browser is able to locate and reconstruct syntenic linkage maps, establishing relationships that can be later utilized to navigate between different genomes (Figure 4A). Using the ‘Synteny/Alignments’ link, the user can jump from the display of a specific locus in one genome directly to a visualization of the syntenically homologous locus in another species (Figure 4B). This option permits users to compare the size, organization and conserved features at the same locus in divergent genomes. It also permits users to compare ECRs arising in comparisons between human and mouse, for example, with those detected at the same genomic locus in rat and mouse genome comparisons.
The identity of the base genome in the display of a particular region can also be readily changed with the use of the ‘Base Genome’ feature. Selecting this option will result in the generation of a new alignment of the same sequences previously displayed, but with a different base genome used as reference.
While pre-computed alignments of the genomes available in the ECR Browser are sufficient for many tasks, the ability to add additional sequences to a comparison (e.g. user-generated targeted sequence from additional genomes) can be valuable in a variety of applications. For this purpose, we created custom-defined alignment options within the browser that allow the instantaneous alignment of any user-defined sequence. Such queries may be either submitted directly, in FASTA format, or automatically downloaded from GenBank using the accession number of the sequence to be aligned; the sequence is then forwarded to the ‘Genome Alignment’ portal in the browser. Upon receiving the user-submitted sequence, the ECR Browser will rapidly map this sequence to the selected genome (human, mouse, rat or Fugu) using the BLAT tool. When the orthologous region is identified in the selected base genome, it is extracted along with the corresponding RefSeq gene annotation. Blastz alignments of the two sequences are made and a dynamic graphical visualization of the alignments is generated (Figure 5). A dynamic zPicture (23) conservation plot, a portal to the rVista tool (24), an alignment dot-plot and a tool for dynamic annotation of ECRs in the alignment are also automatically provided.
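For readers preparing a direct submission, the sketch below shows how FASTA-formatted text can be split into named sequences before being passed to an alignment step. It is a generic, minimal parser written for illustration; the function name `parse_fasta` is hypothetical and the code does not reflect how the ECR Browser itself processes submissions.

```python
from typing import List, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into a list of (header, sequence) pairs.

    Header lines start with '>'; sequence lines are concatenated and
    upper-cased. Blank lines are ignored.
    """
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            if header is None:
                raise ValueError("sequence data found before the first '>' header")
            chunks.append(line.upper())

    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

A call such as `parse_fasta(">query\nACGT\nacgt\n")` yields a single record whose sequence is the concatenation of both lines in upper case.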
INTEGRATION WITH OTHER TOOLS
We intend to maintain the ECR Browser as a constantly updated tool that not only incorporates newly deposited and annotated sequences, but also provides direct connections to the growing set of publicly available external sequence analysis tools. Presently, an extensive annotation of known genes, gene predictions, experimental RNA evidence and many other features is available through the direct interface between the ECR Browser and the Genome Browser at UCSC (17) and the Ensembl Genome Browser (25,26). This portal permits the user to examine any non-genic conservation pattern against the UCSC evidence database on putative novel genes and noncoding RNAs. Also, the ‘Synteny/Alignments’ link of the ECR Browser directs the user to the zPicture analysis web page, described above, offering an easy and fast way to distill a chosen pairwise alignment out of the multiple genome alignments. Using the zPicture tool, various modifications can be applied to the alignment. For example, the zPicture annotation feature permits manual curation of genes and other features that are not annotated in public databases (e.g. incorporating user-generated data), or editing of the public annotation to add features retrieved from other experimental or computational sources. In addition to these external tools, the ECR Browser is dynamically connected to the GALA annotation database (27), the Rat Genome Browser (http://www.hpc.mcw.edu/mod_perl_gbrowse; when rat is selected as a base genome) and the JGI Fugu Genome Browser (28) (when Fugu is selected as a base genome). We are also planning to incorporate new methods of analysis and simultaneous scoring of multiple sequence alignments (29,30) into the ECR Browser tool in order to provide a high sensitivity interface toward identification of functional domains in differentially evolving genomic loci.
As mentioned previously, the ECR Browser is also interconnected with the rVista tool (24). rVista is capable of filtering out up to 95% of false-positive TRANSFAC (31) predictions of TFBS while preserving high sensitivity of the search. The rVista portal provides a unique opportunity to predict the function of a noncoding element. By identifying evolutionarily conserved TFBS in an ECR, the rVista portal provides a basis for experimental testing and for applying the known function of the conserved transcription factors toward understanding the function of a neighboring gene. Any pairwise alignment from the ECR Browser can be automatically submitted for rVista analysis via the ‘Synteny/Alignments’ link. Also, any ECR retrieved with the use of the ‘Grab ECR’ function can be submitted directly to rVista for binding-site analysis.
The ECR Browser tool is designed to highlight candidate functional coding and noncoding elements and to visualize their genomic positions relative to the known gene features in the genome. By permitting comparisons of genomic sequence from species representing different, user-selected evolutionary clades, the ECR Browser provides flexibility in assessing evolutionary fates of noncoding sequences, allowing for comparisons that reflect sequence conservation over a wide range of timescales and in species with both shared and lineage-specific biological features. Comparisons between distant organisms, such as primates and fish, will likely uncover the fundamental building blocks shared by all vertebrates, while the comparative sequence analysis with closer comparisons, such as those between mice and rats, can highlight the functional structure of rapidly diverging genomic regions, including those that are specific to certain lineages and dictate lineage-specific traits.
Because it provides links to orthologous regions in other publicly available sequence analysis tools, the ECR Browser offers the user easy, automated access to resources, permitting a thorough annotation of functional elements in the genome (through the portal to the UCSC Genome Browser) and the annotation of TFBS (through the portal to the rVista tool). Because the underlying algorithms and tools that power the ECR Browser are designed to permit rapid updates, the tool will be constantly updated with new sequence and new links to other relevant sequence analysis sites. These features, the ease with which conservation parameters and included datasets can be changed by the user, and the immediate dynamic display of alignment results make the ECR Browser a powerful new addition to the computational toolkit for annotating functional features in the human sequence and in other genomes sequenced now or in future years.
Supplementary Material is available at NAR Online.
The work was performed under the auspices of the US Department of Energy, Office of Biological and Environmental Research, by the University of California, Lawrence Livermore National Laboratory Contract No. W-7405-Eng-48.
¹Genome Biology Division and ²Energy, Environment, Biology and Institutional Computing, Lawrence Livermore National Laboratory, Livermore, CA 94550, USA and ³Department of Genome Sciences, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA