Summary: The large number of genomes that will be sequenced will need to be annotated with genes and other functional features. Aligning gene sequences from a related species to the target genome is an economical and highly reliable method to identify genes; unfortunately, existing tools have been lacking in sensitivity and speed. A program we reported,
Supplementary information: Supplementary data are available at Bioinformatics Online.
The number and variety of sequenced genomes will continue to grow spectacularly. The 10 000 Genomes Project (G10KCOS, 2009) alone will catalog and sequence more than 10 000 species from across the spectrum of vertebrate evolution, and there are numerous ongoing projects to sequence plants and animals of agricultural importance, or of more specialized scientific interest (Plant and Animal Genome Conference; http://www.intl-pag.org/). As sequencing becomes increasingly accessible, one can expect many more groups and even individual investigators to sequence the genomes of the organisms they study. These genomes will need to be annotated with genes and other functional features. The cDNA (mRNA, EST) sequences already in the databases are a key source of gene information, and they can be readily aligned to a target genome to produce gene models. To further facilitate this comparative annotation approach, an increasing number of projects are producing mixed collections of resources from several related species, which are then used to analyze each of those genomes (The Fagaceae Genomics Project, http://www.fagaceae.org; The Genome Database for Rosaceae, http://www.rosaceae.org).
Most spliced alignment tools were designed for comparing highly similar sequences and perform poorly on cross-species comparisons, where sequence similarity drops. A few programs have been adapted for cross-species alignment, most notably BLAT (Kent, 2002) and GMAP (Wu and Watanabe, 2005). Their output, however, is often less accurate than required, and increasingly so as the distance between species grows. Other tools, reviewed in Zhou et al. (2009), employ probabilistic or exact dynamic programming methods and can align sequences across species, but they lack the speed required for whole-genome annotation and are limited to comparisons between closely related species. The main difficulty in cross-species alignment is detecting weakly similar regions; when such regions are missed, the resulting gene models are incomplete and exon boundaries are incorrect. Differences in the gene models of orthologs caused by evolutionary block insertion and deletion events are a further challenge. We recently developed a program,
We developed an optimized framework for running batch
A key feature of
We also compared the run time of
We described a utility,
Funding: National Institutes of Health (R01-LM006845 to Steven L. Salzberg).
Conflict of Interest: none declared.