Phylo-grammars, probabilistic models that combine Markov chain substitution models with stochastic grammars, are powerful tools for annotating structured features in multiple sequence alignments and for analyzing the evolution of those features. In the past, these methods have been cumbersome to implement and modify.
Accurate automated annotation of biological sequences is an increasingly important problem in the biological sciences. Recent releases of high-quality multiple sequence alignment data, such as those from the Drosophila 12 Genomes Consortium (1, 2), have only underscored this fact. Phylo-grammars have had great success in this arena, with diverse applications such as the prediction of exons in DNA (3, 4), the prediction of secondary structure in proteins (5, 6) and the detection of noncoding RNA (7). However, despite this broad range of applications, implementations of phylo-grammars have often been limited to a single model and have lacked fast and accurate training algorithms, which has limited more widespread adoption.
THE XRATE PROGRAM
THE xREI WEBSERVER
State transition diagrams
State transition diagrams are generated from the transformation rules within the
Diagrams can be rendered in one of two ways. The
Rate matrix visualization
The rate matrices of individual substitution chains are displayed as ‘bubble plots’. A grid is drawn with the grammar alphabet along both axes. At each grid vertex, a circle is drawn whose size is proportional to the substitution rate between the corresponding residues. The circles are initially scaled by an arbitrary function that generally produces good results; the user can also adjust this scale factor manually.
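The bubble-plot layout described above can be illustrated with a minimal sketch. It is not the server's own rendering code: the function name `bubble_radii`, the `scale` and `max_radius` parameters, the choice to make circle area (rather than radius) proportional to the rate, and the example rate matrix are all assumptions made for illustration.

```python
import math

# Hypothetical example: a Kimura-style DNA rate matrix in which
# transitions (a<->g, c<->t) occur at rate 4 and transversions at rate 1.
# Diagonal entries are set so that each row sums to zero.
ALPHABET = ["a", "c", "g", "t"]
KIMURA_LIKE = [
    [-6.0, 1.0, 4.0, 1.0],
    [1.0, -6.0, 1.0, 4.0],
    [4.0, 1.0, -6.0, 1.0],
    [1.0, 4.0, 1.0, -6.0],
]


def bubble_radii(rates, alphabet, scale=1.0, max_radius=0.45):
    """Return {(source, target): radius} for every off-diagonal rate.

    Assumption: circle AREA is proportional to the rate, so the radius
    grows with the square root of the rate. With scale=1.0 the largest
    rate is drawn with radius max_radius (in units of one grid cell);
    `scale` stands in for the user-adjustable scale factor.
    """
    n = len(alphabet)
    off_diagonal = [rates[i][j] for i in range(n) for j in range(n) if i != j]
    largest = max(off_diagonal, default=0.0)

    radii = {}
    for i, source in enumerate(alphabet):
        for j, target in enumerate(alphabet):
            if i == j:
                continue  # diagonal entries are not substitution rates
            if largest <= 0.0:
                radii[(source, target)] = 0.0
            else:
                fraction = rates[i][j] / largest
                radii[(source, target)] = max_radius * scale * math.sqrt(fraction)
    return radii


if __name__ == "__main__":
    for (source, target), radius in sorted(bubble_radii(KIMURA_LIKE, ALPHABET).items()):
        print(f"{source} -> {target}: radius {radius:.3f}")
```

Normalizing by the largest off-diagonal rate keeps neighbouring circles from overlapping at the default scale, and raising `scale` enlarges every circle uniformly, which is one plausible way to expose a manual scale factor.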
The XRate menu item provides access to
While there is no hard upper limit to the size of alignments submitted to
As part of the initial
We have also included several alignment databases over which both these and user-supplied grammars may be trained and run. RFAM (25) is a collection of non-coding RNA multiple alignments. PANDIT (26) contains codon multiple sequence alignments covering many common protein-coding domains. TreeFam (27) is a database of protein alignments along with curated and semi-curated trees.
Code reusability and flexibility were both major goals in implementation. As such any functionality currently in
Phylo-grammars have a broad range of applications in the biological sciences.
The authors were funded by NIH/NHGRI grant 1R01GM076705-01 and by the 2007 Google Summer of Code [National Evolutionary Synthesis Center (NESCent) group, NSF #EF-0423641].
Conflict of interest statement. None declared.