-
PDF
- Split View
-
Views
-
Cite
Cite
Jacob L Steenwyk, Thomas J Buida, Abigail L Labella, Yuanning Li, Xing-Xing Shen, Antonis Rokas, PhyKIT: a broadly applicable UNIX shell toolkit for processing and analyzing phylogenomic data, Bioinformatics, Volume 37, Issue 16, August 2021, Pages 2325–2331, https://doi.org/10.1093/bioinformatics/btab096
- Share Icon Share
Abstract
Diverse disciplines in biology process and analyze multiple sequence alignments (MSAs) and phylogenetic trees to evaluate their information content, infer evolutionary events and processes and predict gene function. However, automated processing of MSAs and trees remains a challenge due to the lack of a unified toolkit. To fill this gap, we introduce PhyKIT, a toolkit for the UNIX shell environment with 30 functions that process MSAs and trees, including but not limited to estimation of mutation rate, evaluation of sequence composition biases, calculation of the degree of violation of a molecular clock and collapsing bipartitions (internal branches) with low support.
To demonstrate the utility of PhyKIT, we detail three use cases: (1) summarizing information content in MSAs and phylogenetic trees for diagnosing potential biases in sequence or tree data; (2) evaluating gene–gene covariation of evolutionary rates to identify functional relationships, including novel ones, among genes and (3) identify lack of resolution events or polytomies in phylogenetic trees, which are suggestive of rapid radiation events or lack of data. We anticipate PhyKIT will be useful for processing, examining and deriving biological meaning from increasingly large phylogenomic datasets.
PhyKIT is freely available on GitHub (https://github.com/JLSteenwyk/PhyKIT), PyPi (https://pypi.org/project/phykit/) and the Anaconda Cloud (https://anaconda.org/JLSteenwyk/phykit) under the MIT license with extensive documentation and user tutorials (https://jlsteenwyk.com/PhyKIT).
Supplementary data are available at Bioinformatics online.
1 Introduction
Multiple sequence alignments (MSAs) and phylogenetic trees are widely used in numerous disciplines, including bioinformatics, evolutionary biology, molecular biology and structural biology. As a result, the development of user-friendly software that enables biologists to process and analyze MSAs and phylogenetic trees is an active area of research (Kapli et al., 2020).
In recent years, numerous methods have proven useful for diagnosing potential biases and inferring biological events in genome-scale phylogenetic (or phylogenomic) datasets. For example, methods that evaluate sequence composition biases in MSAs (Phillips and Penny, 2003), signatures of clock-like evolution in phylogenetic trees (Liu et al., 2017), phylogenetic treeness (Lanyon, 1988; Phillips and Penny, 2003), taxa whose long branches may cause variation in their placement on phylogenetic trees (Struck, 2014), and others have assisted in summarizing the information content in phylogenomic datasets and improved phylogenetic inference (Doyle et al., 2015; Felsenstein, 1978; Liu et al., 2017; Philippe et al., 2011; Salichos and Rokas, 2013; Smith et al., 2018; Walker et al., 2019).
Other methodological innovations include identifying significant gene–gene covariation of evolutionary rates, which has been shown to accurately and sensitively identify genes that have shared functions, are coexpressed, and/or are part of the same multimeric complexes (Clark et al., 2012; Sato et al., 2005). Furthermore, gene–gene covariation serves as a powerful evolution-based genetic screen for predicting gene function (Brunette et al., 2019). Lastly, a recently developed method has enabled the identification of unresolved internal branches or polytomies in species trees (One Thousand Plant Transcriptomes Initiative, 2019; Sayyari and Mirarab, 2018); such branches can stem from rapid radiation events or from lack of data (Rokas and Carroll, 2006).
Despite the wealth of information in MSAs and phylogenetic trees, there is a dearth of tools that enable researchers to conduct these analyses in a unified framework. For example, to utilize the functions mentioned in the previous paragraphs, a combination of web-server applications, ‘hard-coded’ scripts available through numerous repositories and supplementary material, standalone software and/or extensive programming in languages including R, Python or C is currently required (Brown et al., 2017; Cock et al., 2009; Hernández et al., 2018; Huerta-Cepas et al., 2016; Junier and Zdobnov, 2010; Kück and Longo, 2014; One Thousand Plant Transcriptomes Initiative, 2019; Revell, 2012; Struck, 2014; Talevich et al., 2012; Wolfe and Clark, 2015). As a result, integrating these functions into bioinformatic pipelines can be challenging, reducing their accessibility to the scientific community.
To facilitate the integration of these methods into bioinformatic pipelines, we introduce PhyKIT, a UNIX shell toolkit with 30 functions (Supplementary Table S1) that have broad utility for analyzing and processing MSAs and phylogenetic trees. Exemplary functions implemented in PhyKIT include measuring topological similarity of phylogenetic trees, creating codon-based MSAs, concatenating sets of MSAs into phylogenomic datasets, editing and/or viewing alignments and phylogenetic trees and identifying putatively spurious homologs in MSAs. We highlight three uses of PhyKIT: (1) calculating diverse statistics that summarize the information content and potential biases (e.g. sequence- or phylogeny-based biases) in MSAs and phylogenetic trees; (2) creating a gene–gene covariation network and (3) inferring the presence of polytomies from phylogenomic data. The diverse functions implemented in PhyKIT will likely be of interest to bioinformaticians, molecular biologists, evolutionary biologists and others.
2 Materials and methods
PhyKIT is a command line tool for the UNIX shell environment written in the Python programming language (https://www.python.org/). PhyKIT requires few dependencies [Biopython (Cock et al., 2009) and SciPy (Virtanen et al., 2020)] making it user-friendly to install and integrate into existing bioinformatic pipelines. Online documentation of PhyKIT comes complete with tutorials that detail use cases for various functions. Lastly, PhyKIT is modularly designed to allow straightforward integration of additional functions in future versions.
PhyKIT has 30 different functions that help process and analyze MSAs and phylogenetic trees (Supplementary Table S1). The 30 functions can be grouped into broad categories that assist in conducting analyses of MSAs and phylogenies or in processing/editing them. For example, ‘analysis’ functions help examine information content biases, gene–gene covariation and polytomies in phylogenomic datasets; ‘processing/editing’ functions help prune tips from phylogenies, collapse poorly supported bipartitions in phylogenetic trees, concatenate sets of MSAs into a single data matrix or create codon-based alignments from protein alignments and their corresponding nucleotide sequences.
Detailed information about each one of PhyKIT’s functions and tutorials for using the software can be found in the online documentation (https://jlsteenwyk.com/PhyKIT). Here, we focus on three specific groups of functions implemented in PhyKIT that enable researchers to summarize information content in phylogenomic datasets, create gene–gene evolutionary rate covariation networks and identifying polytomies in phylogenomic data.
2.1 Evaluating information content and biases in phylogenomic datasets
MSAs and phylogenetic trees are frequently examined to evaluate their information content and potential biases in characteristics such as sequence composition or branch lengths (Doyle et al., 2015; Liu et al., 2017; Phillips and Penny, 2003; Philippe et al., 2011; Shenet al., 2016a; Smith et al., 2018; Struck, 2014). PhyKIT implements numerous functions for doing so. We demonstrate the application of 14 functions:
(1)Alignment length. The length of anMSA, which is associated with robust bipartition support and tree accuracy (Shenet al., 2016a; Walker et al., 2019).
(2)Alignment length with no gaps. The length of anMSA after excluding sites with gaps, which is associated with robust bipartition support and tree accuracy (Shenet al., 2016a).
(4)Internal branch lengths. Summary statistics of internal branch lengths in a phylogenetic tree are reported including mean, median, 25thpercentile, 75th percentile, minimum, maximum, standard deviation and variance values. Examination of internal branch lengths is useful in evaluating phylogenetic tree shape.
(6)Pairwise identity. Pairwise identity is a crude approximation of the evolutionary rate of a gene and is calculated by determining the average number of sites in an MSA that are the same character state between all pairwise combinations of taxa. This can be used to group genes based on their evolutionary rates (e.g. faster-evolving genes versus slower-evolving ones) (Chen et al., 2017).
(7)Patristic distances. Patristic distances refer to all distances between all pairwise combinations of tips in a phylogenetic tree (Fourment and Gibbs, 2006), which can be used to evaluate the rate of evolution in gene trees or taxon sampling density in species trees.
(8)Parsimony-informative sites. Parsimony-informative sites are those sites in an MSA that have a least two character states (excluding gaps) that occur at least twice (Kumar et al., 2016); the number of parsimony-informative sites is associated with robust bipartition support and tree accuracy (Shenet al., 2016a; Steenwyk et al., 2020).
(9)Variable sites. Variable sites are those sites in an MSA that contain at least two different character states (excluding gaps) (Kumar et al., 2016); the number of variable sites is associated with robust bipartition support and tree accuracy (Shenet al., 2016a).
(11)Saturation. Saturation refers to when an MSA contains many sites that have experienced multiple substitutions in individual taxa. Saturation is estimated from the slope of the regression line between patristic distances and pairwise identities. Saturated MSAs have reduced phylogenetic information and can result in issues of long branch attraction (Lake, 1991; Philippe et al., 2011).
(14)Treeness divided by relative composition variability. This function combines two metrics to measure both composition bias and other biases that may negatively influence phylogenetic inference. High treeness divided by relative composition variability values have been shown to be less susceptible to sequence composition biases and are associated with robust bipartition support and tree accuracy (Phillips and Penny, 2003; Shenet al., 2016a).
2.2 Calculating gene–gene evolutionary rate covariation or coevolution
Genes that share similar rates of evolution through speciation events (or coevolve) tend to have similar functions, expression levels, or are parts of the same multimeric complexes (Clark et al., 2012; Sato et al., 2005). Thus, identifying significant coevolution between genes (i.e. identifying genes that are significantly correlated in their evolutionary rates across speciation events) can be a powerful evolution-based screen to determine gene function (Brunette et al., 2019).
To measure gene–gene evolutionary rate covariation, PhyKIT implements the mirror tree method (Pazos and Valencia, 2001; Sato et al., 2005), which examines whether two trees have correlated branch lengths. Specifically, PhyKIT calculates the Pearson correlation coefficient between branch lengths in two phylogenetic trees that share the same tips and topology. To account for differences in taxon representation between the two trees, PhyKIT first automatically determines which taxa are shared and prunes one or both such that the same set of taxa is present in both trees. PhyKIT requires that the two input trees have the same topology, which is typically the species tree topology inferred from whole genome or proteome data. Thus, the user will typically first estimate a gene’s branch lengths by constraining the topology to match that of the species tree. When running this function, users should be aware that many biological factors, such as horizontal transfer (Doolittle and Bapteste, 2007), incomplete lineage sorting (Degnan and Salter, 2005) and introgression/hybridization (Sang and Zhong, 2000), can lead to gene histories that deviate from the species tree. In these cases, constraining a gene’s history to match that of a species may lead to errors in the covariation analysis.
Due to factors including time since speciation and mutation rate, correlations between uncorrected branch lengths result in a high frequency of false positive correlations (Chikina et al., 2016; Clark et al., 2012; Sato et al., 2005). To ameliorate the influence of these factors, PhyKIT first transforms branch lengths into relative rates. To do so, branch lengths are corrected by dividing the branch length in the gene tree by the corresponding branch length in the species tree. Previous work revealed that one or a few outlier branch length values can be responsible for false positive correlations and should be removed prior to analysis (Clark et al., 2012). Thus, PhyKIT removes outlier data points defined as having corrected branch lengths greater than five (i.e. removing gene tree branch lengths that are five or more times greater than their corresponding species tree branch lengths). Lastly, values are converted into relative rates using a Z-transformation. The resulting relative rates are used when calculating Pearson correlation coefficients.
2.3 Identifying polytomies in phylogenomic data
Rapid radiations or diversification events have occurred throughout the tree of life including among mammals, birds, plants and fungi (Jarvis et al., 2014; Li et al., 2020; Liu et al., 2017; One Thousand Plant Transcriptomes Initiative, 2019). Polytomies correspond to internal branches whose length is 0 (or statistically indistinguishable from 0) and can be driven either by biological (e.g. rapid radiations) or analytical (e.g. low amount of data) factors. Thus, polytomies are useful for inferring rapid radiation or diversification events and exploring incongruence in phylogenies (Li et al., 2020; One Thousand Plant Transcriptomes Initiative, 2019; Sayyari and Mirarab, 2018).
To identify polytomies, a modified approach to a previous strategy was implemented (Sayyari and Mirarab, 2018). More specifically, the support for three alternative topologies is calculated among all gene trees from a phylogenomic dataset. For example, in species tree ((A, B),C), D);, if examining the presence of a polytomy at the ancestral bipartition of tips A, B and C, PhyKIT will determine the number of gene trees that support ((A, B),C);, ((A, C),B); and ((B, C),A); using the rooted gene trees provided by the user. Equal support for the three topologies (i.e. the presence of a polytomy) among a set of gene trees is assessed using a Chi-squared test. Failing to reject the null hypothesis is indicative of a polytomy (Sayyari and Mirarab, 2018). Note that this approach is distinct from the approach of Sayyari and Mirarab to identify polytomies because PhyKIT uses a gene-based signal rather than a quartet-based signal. The difference between the two methods is that each gene contributes equally to the inference of a polytomy when a gene-based signal is used, whereas genes with greater taxon representation (which contain a greater number of quartets) will contribute a greater signal during polytomy identification when a quartet-based signal is used. From a technical perspective, both approaches are simple to implement and require only a single line of code in the commandline.
3 Results
We outline three example uses of PhyKIT: (1) summarizing information content and identifying potential biases in animal, plant, yeast and filamentous fungal phylogenomic datasets (Laumer et al., 2019; One Thousand Plant Transcriptomes Initiative, 2019; Shenet al., 2016b; Steenwyk et al., 2019), (2) constructing a network of significant gene–gene covariation, which reveals genes of shared functions from empirical data spanning ∼550 million years of evolution among fungi (Shen et al., 2020) and (3) illustrating how to identify polytomies using simulated and empirical data (Steenwyk et al., 2019).
3.1 Summarizing information content and biases in phylogenomic data
Examining information content in phylogenomic datasets can help diagnose potential biases that stem from low signal-to-noise ratios, multiple substitutions, nonclock-like evolution and other biological or analytical factors. To demonstrate the utility of PhyKIT to summarize the information content in phylogenomic datasets, we calculated 14 different metrics known to help diagnose potential biases in phylogenomic datasets or be associated with accurate and well supported phylogenetic inferences (Doyle et al., 2015; Felsenstein, 1978; Liu et al., 2017; Phillips and Penny, 2003; Philippe et al., 2011; Shenet al., 2016a; Smith et al., 2018; Struck, 2014) using four empirical phylogenomic datasets from animals (201 tips; 2891 genes) (Laumer et al., 2019), budding yeast (332 taxa; 2408 genes) (Shen et al., 2018), filamentous fungi (93 taxa; 1668 genes) (Steenwyk et al., 2019) and plants (1124 taxa; 403 genes) (One Thousand Plant Transcriptomes Initiative, 2019) (Figure 1 and Supplementary Table S1).

Summary of information content in four empirical phylogenomic datasets.Fourteenexemplary metrics implemented in PhyKIT help summarize the information content and identify potential biases in phylogenomic datasets. Each graph displays a violin plot with a black point representing the mean. Error bars indicate one standard error above and below the mean; however, these are difficult to see in nearly all graphs because they were often near the mean. Abbreviations are as follows: Aln. len.: alignment length; Aln. len. no gaps: alignment length excluding sites with gaps; DVMC: degree of violation of a molecular clock; Internal branch len.: average internal branch length; patristic distances: average patristic distance in a gene tree; percent PI sites: percentage of parsimony-informative sites in an MSA; percent var. sites: percentage of variable sites in an MSA; RCV: relative composition variability
Examination of the distributions of the values of the 14 different metrics revealed inter- and intra-dataset heterogeneity (Fig. 1). For example, inter-dataset heterogeneity was observed among animal and plant datasets, which had the lowest and highest average pairwise identity across alignments, respectively; intra-dataset heterogeneity was observed in the uniform distribution of pairwise identities in the budding yeast datasets. Similarly, inter-dataset heterogeneity was observed in estimates of saturation where the budding yeast and filamentous fungal MSAs were less saturated by multiple substitutions than the plant and animal datasets; intra-data heterogeneity was also observed in all four datasets. Varying degrees of inter- and intra-dataset heterogeneity was observed for other information content statistics, which may be due biological (e.g. mutation rate) or analytical factors (e.g. taxon sampling, distinct alignment, trimming and tree inference strategies).
In summary, PhyKIT is useful for examining the information content of phylogenomic datasets. For example, the generation of different phylogenomic data submatrices by selecting subsets of genes or taxa with certain properties (e.g. retention of genes with the highest numbers of parsimony-informative sites or following removal of taxa with high long branch scores) can facilitate the exploration of the robustness of species tree inference or estimating time since divergence (Li et al., 2020; Liu et al., 2017; Salichos and Rokas, 2013; Shen et al., 2018,2020; Steenwyk et al., 2019; Walker et al., 2019).
3.2 A network of gene–gene covariation reveals neighborhoods of genes with shared function
Genes with similar evolutionary histories often have shared functions, are coexpressed or are parts of the same multimeric complexes (Clark et al., 2012; Sato et al., 2005). Using PhyKIT, we examined gene–gene covariation using 815 genes spanning 1107 genomes and ∼563 million years of evolution among fungi (Shen et al., 2020). By examining 331 705 pairwise combinations of genes, we found 298 strong signatures of gene–gene covariation (defined as r > 0.825). The two genes with the strongest signatures of covariation were SEC7 and TAO3 (r = 0.87), suggesting that their protein products have similar or shared functions. Supporting this hypothesis, Sec7 p contributes to cell-surface growth in the model yeast Saccharomyces cerevisiae (Novick and Schekman, 1979) and genes with the Sec7 domain are transcriptionally coregulated with yeast-hyphal switches in the human pathogen Candida albicans (Song et al., 2008). Similarly, Tao3p in both S. cerevisiae and C. albicans is part of a RAM signaling network, which controls hyphal morphogenesis, polarized growth and cell-cycle related processes including cell separation, cell proliferation and phase transitions (Bogomolnaya et al., 2006; Song et al., 2008).
Complex relationships of gene–gene covariation can be visualized as a network (Fig. 2). Examination of network neighborhoods identified groups of genes that have shared functions and are parts of the same multimeric complexes. For example, the proteins encoded by NDC80 and NUF2 are part of the same kinetochore-associated complex termed the NDC80 complex—which is required for efficient mitosis (Sundin et al., 2011)—and significantly covary with one another (r = 0.84). Similarly, multiple genes that encode proteins involved in DNA replication and repair (i.e. POL2, MSH6, RAD26, CDC9 and EXO1) were part of the same network neighborhood, consistent with previous work suggesting an intimate interplay between DNA replication and multiple DNA repair pathways (Boiteux and Jinks-Robertson, 2013; Lujan et al., 2012; Tsubouchi and Ogawa, 2000). Other network neighborhoods of genes with shared function such as ribosome biogenesis, Golgi apparatus-related transport and control of DNA replication were identified (Fig. 2).

Gene–gene covariation network inferred from ∼550 million years of evolution across 1107 fungi.A network of significant gene–gene coevolution identifies network neighborhoods representative of associated functional categories. For example, the NDC80 and NUF2 genes (toward the top right of the network) were identified to be significantly coevolving with one another (r = 0.84, P < 0.01, Pearson’s correlation test); they both encode proteins that are part of the same multimeric kinetochore-associated complex (green). Similarly, genes that are DNA replication factors (orange), contribute to DNA replication and repair processes (yellow), participate in Golgi apparatus-related transport (brown) or ribosome biogenesis (pink) were found to be neighbors in the network. Network visualization was done with the igraph package, v1.2.4.2 (Hunter and Cohen, 2007), in R, v3.6.2 (https://www.r-project.org/)
Taken together, these results indicate PhyKIT is a useful tool for evaluating gene–gene covariation and predicting genes’ functions (Brunette et al., 2019; Clark et al., 2012; Sato et al., 2005). Thus, we anticipate PhyKIT will be helpful for evaluating gene–gene covariation and conducting evolution-based screens for gene functions across the tree of life.
3.3 Identifying polytomies in phylogenomic datasets
Rapid radiations or diversification events have occurred throughout the tree of life (Jarvis et al., 2014; Li et al., 2020; Liu et al., 2017; One Thousand Plant Transcriptomes Initiative, 2019). One approach to identifying rapid radiations is by testing for the existence of polytomies in species trees (Li et al., 2020; One Thousand Plant Transcriptomes Initiative, 2019; Sayyari and Mirarab, 2018). Polytomies can also arise when the amount of data at hand is insufficient for resolution (Walsh et al., 1999). To demonstrate the utility of PhyKIT to identify polytomies, we examined the ability of our approach to identify a simulated polytomy (Fig. 3A). PhyKIT was able to conservatively identify the simulated polytomy demonstrating the efficacy of our approach.

Identifying polytomies from phylogenomic data. (Ai) A cladogram of a simulated species phylogeny with tip names A–G. The red branch has a very short branch length of 2 × 10 − 5 substitutions per site. (Aii) Phylogram of the same phylogeny shows that all other branches are much longer (≥1.0 substitutions per site). (Aiii) After reconstructing the evolutionary history from 1000 alignments simulated from the phylogeny in Aii, the hypothesis of a polytomy was tested using gene-support frequencies for three alternative rooted topologies defined by the clades of green, orange and purple taxa. Failure to reject the null hypothesis of equal support among genes for each topology is indicative of a polytomy (χ2 = 5.97, P-value = 0.05, Chi-squared test). (B–D) The same approach was then used to examine if there is evidence for a polytomy at three different branches in a phylogeny of filamentous fungi. (D) Support for a polytomy (χ2 = 0.514, P-value = 0.77, Chi-squared test) was observed for the relationships between three different sections of Penicillium fungi. These results demonstrate the utility of gene-support frequencies for evaluating polytomies and examining incongruence in phylogenomic datasets
We next examined if there is evidence of polytomies in the evolutionary history of filamentous fungi from the genera Aspergillus and Penicillium. We examined three branches. The first two branches—one dating back ∼110 million years ago (Fig. 3B) and another dating back ∼25 million years ago (Fig. 3C)—were not polytomies. In contrast, examination of a ∼60 million-year-old branch involving Lanata-divaricata, Citrinaand Exilicaulis (Fig. 3D), which are major lineages (or sections) in the genus Penicillium, was consistent with a polytomy. Given the large number of gene trees used in our analysis (n = 1668), these results are consistent with a rapid radiation or diversification event in the history of Penicillium species.
In summary, these results suggest that PhyKIT is useful in identifying polytomies in simulated and empirical datasets. More broadly, these results support the notion that polytomies can be used to identify rapid radiation events. Beyond polytomy identification, PhyKIT can be used for exploring incongruence in phylogenies by calculating gene-support frequencies. Calculations of gene-based support among different topologies can be used in diverse applications, including identifying putative introgression/hybridization events and conducting phylogenetically-based genome-wide association (PhyloGWAS) studies (Pease et al., 2016; Steenwyk et al., 2019).
4 Conclusion
We developed PhyKIT, a comprehensive toolkit for processing and analyzing MSAs and trees in phylogenomic datasets. Executing functions implemented in PhyKIT would otherwise require extensive programming, multiple software and/or web-based applications (Supplementary Table S1); thus, PhyKIT offers users a way to streamline approaches and pipelines by relying on only one software. PhyKIT is freely available on GitHub (https://github.com/JLSteenwyk/PhyKIT), PyPi (https://pypi.org/project/phykit/) and the Anaconda Cloud (https://anaconda.org/JLSteenwyk/phykit) under the MIT license with extensive documentation and user tutorials (https://jlsteenwyk.com/PhyKIT). PhyKIT is a fast and flexible toolkit for the UNIX shell environment, which allows it to be easily integrated into bioinformatic pipelines. We anticipate PhyKIT will be of interest to biologists from diverse disciplines and with varying degrees of experience in analyzing MSAs and phylogenies. In particular, PhyKIT will likely be helpful in addressing one of the greatest challenges in biology, building, understanding and deriving meaning from the tree of life.
Acknowledgement
We thank the Rokas lab for helpful discussion and feedback.
Funding
J.L.S. and A.R. were funded by the Howard Hughes Medical Institute through the James H. Gilliam Fellowships for Advanced Study program. X.X.S. was supported by National Natural Science Foundation of China [32071665] and the Fundamental Research Funds for the Central Universities [2020QNA6019].
Conflict of Interest: none declared.
References
One Thousand Plant Transcriptomes Initiative (