Ngoc-Vinh Tran, Bastian Greshake Tzovaras, Ingo Ebersberger, PhyloProfile: dynamic visualization and exploration of multi-layered phylogenetic profiles, Bioinformatics, Volume 34, Issue 17, September 2018, Pages 3041–3043, https://doi.org/10.1093/bioinformatics/bty225
Abstract
Phylogenetic profiles form the basis for tracing proteins and their functions across species and through time. Novel genome sequences now often represent species from the remotest corners of the tree of life. Thus, phylogenetic profiling becomes increasingly important for functionally annotating these data and for integrating them into a comprehensive view of organismal evolution. To strengthen the link between the sharing of a gene across species and of the corresponding function, it has become common to complement phylogenetic profiles with additional information, such as domain architecture similarities between orthologs, or pairwise similarities of other protein features. However, there are few visualization tools that facilitate an intuitive integration of these various information layers. Here, we present PhyloProfile, an R-based tool to visualize, explore and analyze multi-layered phylogenetic profiles.
PhyloProfile is available as open source code under the MIT license at https://github.com/BIONF/phyloprofile. An online version for testing PhyloProfile and for small to medium-scale analyses is available at http://applbio.biologie.uni-frankfurt.de/phyloprofile.
1 Introduction
Phylogenetic profiles capture the presence–absence pattern of genes across species (Pellegrini et al., 1999). The presence of an ortholog in a given species is often taken as evidence that the corresponding function is also represented (Lee et al., 2007). Moreover, if two genes agree in their phylogenetic profiles, this can suggest that they functionally interact (Pellegrini et al., 1999). Phylogenetic profiles are therefore commonly used for tracing functional protein clusters or metabolic networks across species and through time. However, orthology inference is not error-free (Altenhoff et al., 2016), and orthology does not guarantee functional equivalence of two genes (Studer and Robinson-Rechavi, 2009). Therefore, phylogenetic profiles are often integrated with accessory information layers, such as sequence similarity, domain architecture similarity or semantic similarity of Gene Ontology term descriptions. Various approaches exist to visualize such enriched phylogenetic profiles. For example, public ortholog databases often provide the domain architectures of the identified orthologs (e.g. Altenhoff et al., 2015); DoMosaics (Moore et al., 2014) and the ETE3 toolkit (Huerta-Cepas et al., 2016) facilitate the display of domain architectures at the leaves of a gene tree; and Aquerium was recently developed to display domain-based protein occurrences on taxonomically clustered genome trees (Adebali and Zhulin, 2017). However, there is still a shortage of tools that provide a comprehensive set of functions for the display, filtering and analysis of multi-layered phylogenetic profiles comprising hundreds of genes and taxa. PhyloProfile serves to close this methodological gap.
2 Features and capabilities
2.1 Input
PhyloProfile expects as its main input the phylogenetic distribution of orthologs or, more generally, of homologs. This information can be complemented with domain architecture annotation and data for up to two additional annotation layers. The tool accepts tab-delimited text and sequences in FASTA format as input. The stand-alone version additionally supports orthoXML (Schmitt et al., 2011). To ease the generation of custom input, we provide several example datasets and a number of helper scripts, e.g. to extract phylogenetic profiles directly from the OMA database (Altenhoff et al., 2015). The wiki accompanying PhyloProfile gives a comprehensive guide on how to format input data and additionally reports on performance and on the scaling of run time and memory usage.
2.2 Interactive visualization and dynamic exploration of phylogenetic profiles
PhyloProfile provides an interactive visualization implemented with the Shiny package for R (https://CRAN.R-project.org/package=shiny). Species are automatically linked to the NCBI taxonomy and are ordered by increasing taxonomic distance from a user-specified reference taxon. Alternatively, a custom phylogeny can be uploaded for this purpose. Input taxa can be collapsed at higher-order systematic ranks to rapidly change the resolution from the comparative analysis of proteins in individual species to that across classes, phyla or entire kingdoms.
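The ordering of taxa relative to a reference can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not PhyloProfile's R code. It assumes that each taxon's lineage is available as a root-to-tip list of ranks (the toy lineages below are abbreviated and hand-written, not retrieved from NCBI) and that taxonomic distance is the number of ranks in the reference lineage below the last shared ancestor. The function names and this distance definition are assumptions of the sketch; the tool's actual ordering may differ in detail.

```python
def shared_depth(lineage_a, lineage_b):
    """Count the leading ranks two root-to-tip lineages have in common."""
    depth = 0
    for a, b in zip(lineage_a, lineage_b):
        if a != b:
            break
        depth += 1
    return depth


def order_by_distance(lineages, reference):
    """Sort taxa by increasing taxonomic distance from the reference taxon.

    Distance (an assumption of this sketch) is the number of ranks in the
    reference lineage below the last ancestor shared with the other taxon.
    Ties are broken alphabetically.
    """
    ref = lineages[reference]

    def distance(taxon):
        return len(ref) - shared_depth(ref, lineages[taxon])

    return sorted(lineages, key=lambda taxon: (distance(taxon), taxon))


# Toy lineages (hypothetical and abbreviated).
LINEAGES = {
    "Homo sapiens": ["Eukaryota", "Metazoa", "Chordata", "Mammalia", "Homo sapiens"],
    "Mus musculus": ["Eukaryota", "Metazoa", "Chordata", "Mammalia", "Mus musculus"],
    "Danio rerio": ["Eukaryota", "Metazoa", "Chordata", "Actinopteri", "Danio rerio"],
    "Drosophila melanogaster": ["Eukaryota", "Metazoa", "Arthropoda", "Insecta",
                                "Drosophila melanogaster"],
    "Saccharomyces cerevisiae": ["Eukaryota", "Fungi", "Ascomycota", "Saccharomycetes",
                                 "Saccharomyces cerevisiae"],
}

ordered = order_by_distance(LINEAGES, "Homo sapiens")
```

With human as the reference, the mouse (same class) sorts before the zebrafish (same phylum), which in turn sorts before the fly and the yeast.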
The phylogenetic profile is represented by a dot matrix (Fig. 1). Cell color, as well as dot size and dot color, can accommodate further information about the shared genes. Plotting takes about 10 s for 200 genes and 200 species and scales linearly with the size of the data matrix. The protein sequences, together with complementary information, can be accessed by clicking on a dot.
PhyloProfile can represent the entire data matrix or visualize only a subset of genes and taxa for detailed inspection, without modifying the input data. Furthermore, the software provides various options to dynamically filter the data. For example, increasing the fraction of species in a systematic group that must harbor an ortholog before the gene is considered present in this group reduces the impact of spurious ortholog identification on evolutionary interpretations. Likewise, filtering genes based on the similarity of their domain architectures (if given as an information layer) can either highlight or hide orthologs suspected of having changed their function.
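The effect of the presence-fraction filter described above can be sketched as follows. The Python code is an illustration only, not PhyloProfile's implementation; the function name `group_presence`, its parameters and the toy data are hypothetical.

```python
def group_presence(profile, groups, min_fraction):
    """Call a gene present in a group only if enough member species carry it.

    profile:      set of species in which an ortholog of the gene was found
    groups:       mapping of group name -> list of member species
    min_fraction: minimum share of member species that must harbor an ortholog
    """
    calls = {}
    for group, members in groups.items():
        fraction = sum(1 for species in members if species in profile) / len(members)
        calls[group] = fraction >= min_fraction
    return calls


# Hypothetical toy data: one gene, two systematic groups.
GROUPS = {
    "Mammalia": ["human", "mouse", "rat", "dog"],
    "Fungi": ["yeast", "fission_yeast", "neurospora", "aspergillus"],
}
GENE_PROFILE = {"human", "mouse", "rat", "dog", "yeast"}

lenient = group_presence(GENE_PROFILE, GROUPS, min_fraction=0.25)
strict = group_presence(GENE_PROFILE, GROUPS, min_fraction=0.5)
```

Under the lenient threshold, the single fungal hit is enough to call the gene present in Fungi. Under the stricter threshold, that isolated and possibly spurious ortholog no longer counts, while the call for Mammalia is unaffected.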
2.3 Analysis functions
PhyloProfile provides several functions for dynamically analyzing phylogenetic profiles.
Profile clustering: The identification of proteins with similar phylogenetic profiles is a crucial step in the identification and characterization of novel functional protein interaction networks (Pellegrini, 2012). PhyloProfile offers the option to cluster genes according to the distance of their phylogenetic profiles.
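As an illustration of clustering by profile distance, the sketch below groups genes whose presence patterns are similar. The text does not specify which distance measure or clustering method PhyloProfile uses, so the Jaccard distance and the single-linkage merging used here are assumptions of this example, as are all names and the toy data.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |A & B| / |A | B| for two sets of species."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_profiles(profiles, max_distance):
    """Single-linkage clustering of genes by profile distance.

    Two clusters are merged whenever any pair of their member profiles
    lies within max_distance of each other.
    """
    clusters = [{gene} for gene in sorted(profiles)]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(clusters)), 2):
            close = any(
                jaccard_distance(profiles[g], profiles[h]) <= max_distance
                for g in clusters[i]
                for h in clusters[j]
            )
            if close:
                clusters[i] |= clusters[j]
                del clusters[j]
                merged = True
                break  # cluster indices changed, restart the scan
    return sorted(sorted(cluster) for cluster in clusters)


# Hypothetical profiles: gene -> set of species with an ortholog.
PROFILES = {
    "geneA": {"human", "mouse", "fly", "yeast"},
    "geneB": {"human", "mouse", "fly"},
    "geneC": {"yeast", "ecoli"},
    "geneD": {"ecoli", "yeast", "bsubtilis"},
}

clusters = cluster_profiles(PROFILES, max_distance=0.4)
```

Here geneA and geneB (distance 0.25) end up in one cluster and geneC and geneD (distance 1/3) in another, because every cross-pair distance exceeds the 0.4 cutoff.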
Gene age estimation: PhyloProfile can estimate the evolutionary age of a gene from the phylogenetic profiles using a last common ancestor (LCA) algorithm (Capra et al., 2013). Specifically, the last common ancestor of the two most distantly related species displaying a given gene serves as the minimal gene age. Age estimates are dynamically updated upon filtering of the data.
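The LCA rule can be sketched in a few lines. The example assumes that each species' lineage is given as a root-to-tip list of ranks of equal depth (hand-written toy lineages, not NCBI data). On a tree, the LCA of the two most distantly related species carrying a gene coincides with the LCA of all species carrying it, so the sketch computes the latter. Function and variable names are hypothetical and not taken from PhyloProfile.

```python
def last_common_ancestor(lineages):
    """Return the deepest rank shared by all root-to-tip lineages, or None."""
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return shared[-1] if shared else None


def minimal_gene_age(species_with_gene, lineage_of):
    """Minimal gene age: the LCA of all species in which the gene is present."""
    return last_common_ancestor([lineage_of[s] for s in species_with_gene])


# Toy lineages (hypothetical and abbreviated).
LINEAGE_OF = {
    "human": ["Eukaryota", "Opisthokonta", "Metazoa", "Mammalia", "Homo sapiens"],
    "mouse": ["Eukaryota", "Opisthokonta", "Metazoa", "Mammalia", "Mus musculus"],
    "fly": ["Eukaryota", "Opisthokonta", "Metazoa", "Insecta", "Drosophila melanogaster"],
    "yeast": ["Eukaryota", "Opisthokonta", "Fungi", "Saccharomycetes",
              "Saccharomyces cerevisiae"],
}

age_all = minimal_gene_age({"human", "mouse", "fly", "yeast"}, LINEAGE_OF)
age_filtered = minimal_gene_age({"human", "mouse", "fly"}, LINEAGE_OF)
```

Removing the yeast ortholog, for instance because it fails a filter, shifts the estimate from Opisthokonta to Metazoa. This is the kind of update that occurs when the data are filtered.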
Core gene identification: Phylogenomic reconstructions are typically based on a collection of core genes (Daubin et al., 2002), i.e. genes that are shared among all genomes in a taxon collection. PhyloProfile enables users to select a set of taxa and returns their core genes.
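Under the definition given above, core gene identification reduces to a subset test: a gene is core if the selected taxa are all contained in its profile. The following sketch illustrates this with hypothetical names and toy data; it is not PhyloProfile's own code.

```python
def core_genes(profiles, selected_taxa):
    """Return the genes that have an ortholog in every selected taxon."""
    selected = set(selected_taxa)
    return sorted(gene for gene, taxa in profiles.items() if selected <= taxa)


# Hypothetical profiles: gene -> set of taxa with an ortholog.
PROFILES = {
    "geneA": {"human", "mouse", "fly", "yeast"},
    "geneB": {"human", "mouse", "fly"},
    "geneC": {"human", "yeast"},
}

core_animals = core_genes(PROFILES, ["human", "mouse", "fly"])
core_all = core_genes(PROFILES, ["human", "mouse", "fly", "yeast"])
```

Widening the taxon selection can only shrink the core set: geneB is core for the three animals but drops out once yeast is included.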
Distribution analysis: The interpretation of phylogenetic profiles and the results of downstream analyses can change substantially upon filtering the data. To help users decide on reasonable filtering thresholds, PhyloProfile provides a function to plot the distributions of the values in the integrated information layers.
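To illustrate how a value distribution can guide the choice of a threshold, the sketch below bins the scores of one information layer and reports the share of orthologs that a given cutoff would retain. It assumes scores between 0 and 1; the scores, names and bin count are hypothetical, and the code produces counts rather than a plot.

```python
def histogram(values, bins=5, low=0.0, high=1.0):
    """Count values per equal-width bin; the top edge falls into the last bin."""
    counts = [0] * bins
    width = (high - low) / bins
    for value in values:
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts


def fraction_retained(values, cutoff):
    """Share of values that pass a 'greater than or equal to cutoff' filter."""
    return sum(1 for value in values if value >= cutoff) / len(values)


# Hypothetical similarity scores of eight orthologs for one information layer.
SCORES = [0.05, 0.12, 0.45, 0.83, 0.88, 0.91, 0.97, 1.0]

counts = histogram(SCORES)
kept = fraction_retained(SCORES, cutoff=0.5)
```

In this toy example the scores are bimodal, so a cutoff anywhere between the two modes, such as 0.5, separates the low-scoring orthologs from the rest and retains five of the eight values.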
2.4 Interoperable output
Filtered data and the corresponding protein sequences can be exported for downstream analysis, such as phylogenomic tree reconstruction or metabolic pathway analysis. All graphics generated by PhyloProfile can be downloaded as publication-ready PDF files.
Acknowledgement
The authors thank Arpit Jain for valuable discussion.
Funding
This work was supported by the Deutsche Forschungsgemeinschaft [DFG FOR 2251; Project Grant EB 285/2-1].
Conflict of Interest: none declared.
References