The analysis of the taxonomic distribution and lineage-specific variation of domains and domain combinations is an important step in assessing their functional roles and potential interoperability. In studies of eukaryote sequence sets with many multi-domain proteins, evaluating the phylogenetic context of the many domains and their mutual relationships can become laborious. PhyloDome addresses this problem. It provides a fast overview of the taxonomic distribution and potential interrelation of domains that are either given as a list of names and PFAM/SMART accessions or derived from a user-defined set of sequences. This taxonomic distribution analysis can be helpful in assigning protein function and interactions, as the comparative study of potential Hedgehog pathway members in C. elegans shows. An implementation of PhyloDome is accessible for public use as a WWW service at http://mendel.imp.univie.ac.at/phylodome/ . Software components are available on request.
Globular domains of proteins (in addition to non-globular segments) were recognized early on as the fundamental building blocks of protein structure and function ( 1 ). As a consequence, the evolution of protein complexity has often been rationalized in the context of domain evolution ( 2 ). In this respect, two phenomena were observed to accompany the rise in complexity during organism evolution: an increase in combinatorial diversity through rearrangements of domain architectures ( 3 , 4 ) and the expansion of already existing domain families ( 5 ). Domain expansion and the subsequent functional diversification have been recognized as major factors in the extension of protein functions and the implementation of adaptive tasks ( 5 ). Therefore, taxonomic analysis and the assessment of lineage-specific variation are important aspects of evaluating the potential functional role of single domains as well as of their combinations.
Although the taxonomic distribution of domains occurring in eukaryote multi-domain proteins can provide important indications of function and domain interrelation, this part of the sequence-analytic procedure is not well supported by available tools. Therefore, we developed PhyloDome, which provides a fast overview of the lineage distribution of domains that are either found in a user-defined set of eukaryote proteins or supplied as a list of names and PFAM ( 6 ) or SMART ( 7 ) accessions.
DESCRIPTION OF PhyloDome
In principle, taxonomic distributions can be described in two ways. The first is a phylogenetic distribution, $A(\Delta) = (a_1, a_2, \ldots, a_i, \ldots, a_N)$, where domain $\Delta$ occurs $a_i$ times ($a_i \geq 0$) in the proteome derived from genome $i$ ($i = 1, \ldots, N$; $N$ is the number of genomes). The second is a phylogenetic profile of domain $\Delta$, $\operatorname{sign}(A(\Delta)) = (\alpha_1, \alpha_2, \ldots, \alpha_i, \ldots, \alpha_N)$, where $\alpha_i = 1$ if $a_i > 0$ and $\alpha_i = 0$ if $a_i = 0$.
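The two representations can be illustrated with a minimal sketch. The function name and the counts below are invented for illustration and are not PhyloDome code or data; the sketch only assumes that a distribution is stored as a list of per-genome counts.

```python
from typing import List


def phylogenetic_profile(distribution: List[int]) -> List[int]:
    """Return sign(A): 1 where the domain occurs at least once, else 0."""
    if any(a < 0 for a in distribution):
        raise ValueError("domain counts must be non-negative")
    return [1 if a > 0 else 0 for a in distribution]


# Hypothetical counts of one domain in N = 5 proteomes (illustrative only).
distribution = [0, 3, 12, 0, 1]
print(phylogenetic_profile(distribution))  # [0, 1, 1, 0, 1]
```

The profile discards the copy numbers, which is why it supports only presence/absence reasoning, whereas the distribution retains the information needed for expansion and correlation analyses.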
Two modes of visualization (graphical and tabular) are used to present the taxonomic distribution of one or multiple domains across the 25 studied species ( Figure 1 ). Beyond visualizing the taxonomic distribution, PhyloDome also assists in evaluating the results, both for single domains and for multiple domains.
Focusing on single domains, PhyloDome aims at classifying these according to evolutionary scenarios that are known to have functional significance. A distinction is made between uniformly distributed domains ($\chi^2$ test), lineage-specifically expanding domains (Dixon's outlier test) and several types of lineage-specific domains (e.g. fungi, arthropoda, worm, chordata, etc.). The assignment to these categories allows further interpretation in line with the observation that uniformly distributed domains are mostly involved in basic biological mechanisms, while taxon-specific and lineage-specifically expanding domains probably serve adaptive functions ( 5 ).
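The text names the two tests but not how they are parameterized, so the following is only a sketch of the test statistics under stated assumptions: counts are compared against an equal expectation in every genome (no normalization by proteome size), and Dixon's statistic is taken in its simplest gap-over-range form for the largest value. Significance thresholds are deliberately left out because they depend on the sample size and the chosen level; all function names are hypothetical.

```python
from typing import List


def chi_square_uniform(counts: List[float]) -> float:
    """Chi-square statistic against equal expected counts in all genomes."""
    n = len(counts)
    total = sum(counts)
    if n == 0 or total == 0:
        raise ValueError("need at least one genome and one domain occurrence")
    expected = total / n
    return sum((c - expected) ** 2 / expected for c in counts)


def dixon_q_highest(counts: List[float]) -> float:
    """Dixon's Q (gap divided by range) for the largest value in the sample."""
    if len(counts) < 3:
        raise ValueError("Dixon's test needs at least three values")
    ordered = sorted(counts)
    value_range = ordered[-1] - ordered[0]
    if value_range == 0:
        return 0.0  # all values identical, so there is no outlier
    return (ordered[-1] - ordered[-2]) / value_range


# Hypothetical counts: one genome carries far more copies than the others.
counts = [4, 5, 4, 6, 40]
print(chi_square_uniform(counts))
print(dixon_q_highest(counts))
```

A small chi-square statistic is consistent with a uniform distribution, while a Q value close to 1 flags the largest count as a candidate lineage-specific expansion; both values must be compared with tabulated critical values before drawing conclusions.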
Besides single-domain classification, PhyloDome also supports the taxonomic comparison between domains. The simplest possible assessment is based on phylogenetic profiles and allows only very limited reasoning about domain interrelation. The following interpretable scenarios can be distinguished: (i) if domains co-occur in different proteomes, they fulfill the minimal requirement for a functional interaction; (ii) if two domains never occur together in the same taxon, they are, as a rule, not functionally linked. In exceptional cases, two domains with exclusive profiles might represent functional equivalents, corresponding either to highly sequence-divergent, taxon-specific instances of a domain superfamily or to non-orthologous replacement, a concept that has previously been developed for whole genes of prokaryotes ( 10 , 11 ). In PhyloDome, the user is alerted to the occurrence of overlapping as well as exclusive patterns within a given query set.
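The two profile scenarios can be expressed as a short comparison routine. This is a sketch rather than the PhyloDome implementation; the function name and the "undefined" label for a domain that is absent from every genome are assumptions made for the example.

```python
from typing import List


def compare_profiles(profile_1: List[int], profile_2: List[int]) -> str:
    """Classify two presence/absence profiles as overlapping or exclusive."""
    if len(profile_1) != len(profile_2):
        raise ValueError("profiles must cover the same genomes")
    # Scenario (i): at least one genome contains both domains.
    if any(a and b for a, b in zip(profile_1, profile_2)):
        return "overlapping"
    # Scenario (ii): both domains occur somewhere, but never in the same genome.
    if any(profile_1) and any(profile_2):
        return "exclusive"
    return "undefined"  # at least one domain is absent from all genomes


# Hypothetical profiles over five genomes (illustrative only).
print(compare_profiles([1, 1, 0, 0, 0], [0, 1, 1, 0, 0]))  # overlapping
print(compare_profiles([1, 1, 0, 0, 0], [0, 0, 1, 1, 0]))  # exclusive
```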
As an alternative, the phylogenetic distributions of two domains ($\Delta_1$ and $\Delta_2$) can also be compared based on the correlation coefficient $r$ between the respective vectors, $A(\Delta_1)$ and $A(\Delta_2)$. In order to confirm that a functional relationship between domains is associated with a high correlation of their respective taxonomic distributions, we used two measures of domain interrelatedness: first, the physical association of domains in one protein; second, the functional distance between domains based on their Gene Ontology ( 12 ) classification.
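The mathematical form of r is not fixed by the text. As one common choice, the sketch below computes the Pearson product-moment coefficient between two count vectors; this choice, the function name and the example counts are assumptions made for illustration.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("vectors must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical per-genome counts of two domains (illustrative only).
domain_1 = [0, 2, 10, 1, 0]
domain_2 = [0, 3, 12, 1, 1]
print(pearson_r(domain_1, domain_2))
```

Because r is computed on copy numbers rather than on presence/absence, a single strongly expanded lineage can dominate the coefficient; a rank-based coefficient would be a possible alternative if that is undesirable.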
We observed that physically linked domains also tend to have a high taxonomic correlation coefficient (up to 52% of physical links between domains are found with $r \geq 0.8$ when counted on a sequence basis, Figure 2a ). The converse holds as a tendency: a high correlation coefficient is indicative of a functional link. The fraction of physically associated domains among the taxonomically correlating domains (with at least one domain from a multi-domain protein) increases with the correlation coefficient and amounts to ~11% for $0.8 \leq r \leq 1$, regardless of the mathematical form of $r$ ( Figure 2b ). Among domains with an available Gene Ontology assignment, the average functional distance between domains (counted as the minimal number of separating vertices in the GO tree) tends to drop with increasing taxonomic correlation ( Figure 2c ).
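The functional distance used here can be sketched as a breadth-first search. The sketch assumes that "separating vertices" means the vertices lying strictly between the two terms on a shortest path and that the ontology is treated as an undirected graph; the toy graph and its term names are invented and do not correspond to real GO identifiers.

```python
from collections import deque
from typing import Dict, List, Optional


def separating_vertices(graph: Dict[str, List[str]],
                        start: str, goal: str) -> Optional[int]:
    """Count the vertices strictly between start and goal on a shortest path."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])  # (vertex, edges walked from start)
    while queue:
        node, edges = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor == goal:
                # The path has edges + 1 edges, hence `edges` inner vertices.
                return edges
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, edges + 1))
    return None  # the two terms are not connected


# Toy undirected term tree (hypothetical names, not real GO terms).
toy_tree = {
    "root": ["a", "b"],
    "a": ["root", "a1", "a2"],
    "b": ["root", "b1"],
    "a1": ["a"],
    "a2": ["a"],
    "b1": ["b"],
}
print(separating_vertices(toy_tree, "a1", "a2"))  # 1
print(separating_vertices(toy_tree, "a1", "b1"))  # 3
```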
The input to PhyloDome can be (i) one sequence or a set of sequences entered in FASTA or raw sequence format or (ii) one or more domain names and PFAM or SMART accessions. Although it can be used for the analysis of a single domain or protein, PhyloDome shows its real strength in the interpretation of sequence sets in which functional relationships are the target of interest. Such sets can, for example, contain sequences or domains that have been shown genetically to be members of the same pathway or that appear to form a complex in a model organism, possibly together with their homologs in other taxa.
If a sequence set is given as input, PhyloDome searches the query proteins with RPS-BLAST against the PFAM-A library ( 13 ) and derives a list of domains with significant hits. These domains, or the user-supplied domain list, are subjected to further analysis. On the results page, a graphical representation of all relevant domains (if appropriate, as domain architecture diagrams of the queries) is returned ( Figure 1 ). Color-coded bars reflect the distribution of the domains in the (almost complete) proteomes of 25 fully sequenced eukaryotes. For ease of interpretation, a phylogenetic tree of eukaryote taxa is supplied with the same color coding. For each domain, a mouse-over function supplies the domain name, the PFAM accession number, a link to the PFAM domain annotation and the numerical data for the phylogenetic profile.
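The step from search hits to a domain list can be sketched as follows. The sketch assumes that the hits are available as tab-separated lines in the standard 12-column BLAST tabular layout (query, subject, identity, length, mismatches, gap openings, query start/end, subject start/end, E-value, bit score) and that the subject column holds the PFAM accession. The E-value cutoff of 0.01 is an arbitrary placeholder, since the threshold used by PhyloDome is not stated, and the example hit lines are invented.

```python
from typing import Dict, Iterable, Set


def significant_domains(lines: Iterable[str],
                        max_evalue: float = 0.01) -> Dict[str, Set[str]]:
    """Collect, per query, the domains whose hits pass the E-value cutoff."""
    hits: Dict[str, Set[str]] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a complete tabular hit line
        query, domain, evalue = fields[0], fields[1], float(fields[10])
        if evalue <= max_evalue:
            hits.setdefault(query, set()).add(domain)
    return hits


# Invented hit lines for illustration; only the first passes the cutoff.
example = [
    "query1\tPF01079\t35.0\t200\t130\t5\t10\t209\t1\t195\t1e-30\t120",
    "query1\tPF02460\t22.0\t90\t70\t3\t300\t389\t5\t94\t2.5\t30.1",
]
print(significant_domains(example))  # {'query1': {'PF01079'}}
```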
In addition to the graphical display described above, the evolutionary distribution is tabulated numerically for all domains ( Figure 1 ). Gray background shading, with an intensity proportional to the number of occurrences, visualizes the domain counts. For pairs of domains, overlaps and exclusions of taxonomic profiles as well as significant correlation coefficients based on their phylogenetic distributions are reported. With these data, the user can rapidly identify those domains in a protein set that evolve in a distinct and correlated fashion and might therefore be functionally linked.
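Proportional shading of table cells can be sketched as a linear mapping from counts to gray levels. How PhyloDome scales the intensity (for example per domain or over the whole table) is not stated, so the reference maximum is passed in explicitly; the function name and the white-to-black range are assumptions.

```python
def gray_shade(count: int, max_count: int) -> str:
    """Map a count to a hex gray: white for 0, black for max_count."""
    if count < 0 or max_count < 0 or count > max_count:
        raise ValueError("require 0 <= count <= max_count")
    if max_count == 0:
        return "#ffffff"  # nothing to shade when the domain never occurs
    level = round(255 * (1 - count / max_count))
    return "#{0:02x}{0:02x}{0:02x}".format(level)


print(gray_shade(0, 4))  # #ffffff
print(gray_shade(1, 4))  # #bfbfbf
print(gray_shade(4, 4))  # #000000
```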
The following sources of possible error should be taken into account when interpreting PhyloDome outputs. The results depend on the accuracy of the domain models and the completeness of the domain library. With regard to phylogenetic distributions, the accuracy of the measured domain occurrences $a_i$ in the proteomes of model organisms is critical. Incomplete genome sequencing, assembly errors and inaccurate gene structure determination will, most often, lead to lower domain occurrences, since protein sequences might be (i) absent from the derived proteome, (ii) shortened or (iii) partially replaced by, or fused to, nonsense sequences. On the other hand, false-positive domain assignments (especially for small domains and domains with compositional bias) might artificially increase domain occurrences.
Application example: Hedgehog signaling in Caenorhabditis elegans?
The occurrence of about a dozen proteins with Hint (PFAM accession PF01079) ( 14 ) and Patched (PF02460) ( 15 ) domains encoded in the worm genome has led to the conclusion that a Hedgehog-related pathway might exist in Caenorhabditis elegans . Many Hint domain-containing worm sequences (e.g. NP_500347 and NP_501673) are even annotated as Hedgehog-like in the GenBank protein database. As studied mainly in fly, Hedgehog is known to be expressed as a precursor protein and to be auto-processed via its catalytic C-terminal domain. Attack of a thioester intermediate by cholesterol releases the cholesterol-modified N-terminal signaling peptide and the C-terminal cysteine peptidase domain ( 16 ). The Patched receptor (Ptc) is thought to sequester Hedgehog by a cholesterol-dependent process. Ptc signals onward through Smoothened and a complex including Fused, Costal, Suppressor-of-Fused and Cubitus interruptus ( 17 ).
Taking the so-called Hedgehog-related proteins in C. elegans ( Figure 1 ) as an example, we show that analyzing the phylogenetic distribution of the domains that constitute homologous multi-domain proteins in different species helps to transfer functional annotations correctly. For this purpose, the known Hedgehog pathway proteins Hedgehog (Q02936), Suppressor-of-Fused (NP_536750.2) and Patched (P18502) from fly as well as the Hedgehog-like protein (NP_507923.1) and Patched homolog 1 (Q09614) from worm were submitted to PhyloDome ( Figure 1 ). Focusing on single-domain evolutionary scenarios, it becomes clear that cholesterol-based signaling is of enhanced importance in worm. The two domains of the pathway known to be involved in cholesterol-dependent processes ( 18 ), Hint (cholesterol modification) and Patched (sterol sensing), are lineage-specifically expanded in worm. Other domains found in Hedgehog signaling (Hh_signaling/PF01085 and Sufu/PF05076) are not detectable in worm and are, apparently, coelomata specific. These observations indicate that it is not Hedgehog signaling itself but only the cholesterol modification and sensing mechanism that is expanded in worm.
Further support comes from the pairwise co-evolution analysis. Whereas the C-terminal cysteine peptidase domain Hint is shared, the coelomata-specific domains involved in Hedgehog signaling (Hh_signaling and Sufu) and the worm-specific Ground-like domain (PF04155) occur in taxonomically exclusive groups. Therefore, and because of the absence of significant sequence similarity, the two groups of domains are most likely functionally unrelated. The phylogenetic patterns of the cholesterol-signaling domains Hint and Patched show a high correlation coefficient, supporting their potential functional linkage. These data support the following model of protein function evolution in the different taxonomic ranges: divergent cholesterol-dependent signaling processes appear to have evolved in the coelomata and pseudo-coelomata lineages (with a vast expansion in pseudo-coelomata).
The visualization tool PhyloDome is intended to facilitate studies of eukaryote protein sets in the context of the taxonomic distribution of their domains. The program supports the easy identification of typical evolutionary scenarios of single domains and domain pairs and enhances their indicative value for functional annotation.
The authors are grateful for the generous support from the Austrian Academy of Science and Boehringer Ingelheim. This project has been partly funded by the Austrian Science Fund (FWF), Austrian Gen-AU bioinformatics integration network sponsored by the Austrian Ministry of Education, Science and Culture (BM-BWK) and by a grant from the Austrian Economy Ministry (BM-WA). Funding to pay the Open Access publication charges for this article was provided by the Research Institute of Molecular Pathology.
Conflict of interest statement . None declared.