Integrating phylogenetic and functional data in microbiome studies

Abstract Motivation Microbiome functional data are frequently analyzed to identify associations between microbial functions (e.g. genes) and sample groups of interest. However, it is challenging to distinguish between different possible explanations for variation in community-wide functional profiles by considering functions alone. To help address this problem, we have developed POMS, a package that implements multiple phylogeny-aware frameworks to more robustly identify enriched functions. Results The key contribution is an extended balance-tree workflow that incorporates functional and taxonomic information to identify functions that are consistently enriched in sample groups across independent taxonomic lineages. Our package also includes a workflow for running phylogenetic regression. Based on simulated data we demonstrate that these approaches more accurately identify gene families that confer a selective advantage compared with commonly used tools. We also show that POMS in particular can identify enriched functions in real-world metagenomics datasets that are potential targets of strong selection on multiple members of the microbiome. Availability and implementation These workflows are freely available in the POMS R package at https://github.com/gavinmdouglas/POMS. Supplementary information Supplementary data are available at Bioinformatics online.


Exploring the parameter space of the MAG-based simulations
The selection simulation framework that we describe in the main text represents an extremely strong selection acting upon taxa encoding the focal gene in each profile. Specifically, this simulated selective advantage included (in the sample group experiencing selection) adding a pseudocount of one to the abundance of all taxa that encoded the focal gene and then multiplying the resulting abundance by 1.5.
To confirm that POMS is not only performing better under this extreme condition, we also created and tested simulated profiles over a range of weaker selection settings. These altered settings included not adding a pseudocount and/or multiplying the abundances by lower factors (1.05 and 1.25, in addition to the original value of 1.5). In addition to either adding a pseudocount or not, we also varied what proportion of relevant metagenome-assembled genomes (MAGs) would receive a pseudocount (the pseudocount itself was 1 in all cases), testing specifically proportions of 0, 0.3, 0.7, and 1. These proportions represent the proportion of randomly sampled MAGs, of the subset that encode the focal gene, whose abundance is increased by one. We additionally tested the impact of different numbers of MAGs in these simulations, to investigate how dataset size affects the POMS framework. We varied the number of MAGs between these values: 100, 250, 500, 1000, and 1595 MAGs. To keep this analysis manageable, we considered only 10 independent replicates for each of the 60 combinations of these simulation settings. To enable clearer comparisons between replicates with differing number of MAGs, we considered the same 10 KEGG orthologs to be the focal genes for all settings combinations. When subsampling MAGs, this was done randomly except that the proportion of MAGs encoding the focal KEGG ortholog was kept as similar as possible (to a minimum of five MAGs). This analysis clearly demonstrated that across most settings the focal genes were substantially more highly ranked by the phylogenetic methods compared with the Wilcoxon test approach (Supp. Figure 2). The Wilcoxon test was chosen as the representative differential abundance test for these additional experiments due to its high speed of computation.
The same overall trends for the random taxa and clade-based analyses presented in the main text were also observed consistently under these varied parameter settings.

2
The exception to the above observation was in simulation settings where no pseudocount was added, in which the focal gene was largely non-significant in the POMS output due to insufficient statistical power. This drop in statistical power is reflected by lower numbers of balance-significant nodes (BSNs) on average for these settings (Supp. Figure 3). In contrast, the focal genes were highly ranked on average based on the Wilcoxon test for these simulation settings.
Our analysis also demonstrated a marked drop in the number of BSNs in simulations with 250 or fewer MAGs (Supp. Figure 3). POMS was unable to call any focal genes as significant in these simulations due to the small number of BSNs, while many were called as significant by other approaches.
Finally, these analyses also demonstrated that two other observations reported in the main text-the overall tendency of phylogenetic regression to result in a non-negligible proportion of significant hits under the random taxa and clade-based simulations, as well as the observation that the Wilcoxon test identifies a higher proportion of significant hits overall-were robust to the simulation settings (Supp. Figure 8).

Reference-genome-based simulations
Our observations based on the MAG-based simulations are valuable, but one caveat is that the quality of published MAGs is often questionable 1 . To ensure that misassembled MAGs were not driving our results, we repeated our simulation approach on 500 reference genomes.
These genomes were taken from the Integrated Microbial Genomes database 2 and were previously parsed for use with PICRUSt2 3 . Per-genome KEGG ortholog annotations were taken from the default PICRUSt2 database. To clarify, these annotations are based on complete genomes (i.e., with no imputation) and were obtained from the PICRUSt2 database as it provided a convenient and easy-to-use format for our analyses. We created a de novo phylogenetic tree based on a set of universal single-copy genes with GToTree 4 v1.4.16. This approach parses out universal single-copy genes from genome sequences and wraps several tools to build a phylogenetic tree. The tool was run with the bacterial hidden Markov model setting and with FastTree 5 v2.1.10. GToTree also returns estimates of the percent completeness and redundancy for each genome. We excluded all genomes with completeness below 95% and/or redundancy 3 above 5%. We then randomly sampled 500 of the remaining high-quality genomes for the subsequent analyses.
We next simulated random abundances of these genomes across 1000 samples based on the zero-inflated beta distribution implemented in the rBEZI function of the gamlss.dist v5.1.7 R package 6 . Simulations under this model can be modified with three key metrics: mu (the mean), nu (the probability of zero abundance), and sigma (the standard deviation). For these simulations we maintained values of mu and sigma of 0.1 and 1, respectively, throughout. We generated four simulated datasets based on nu values set to 0.5, 0.65, 0.8, and 0.95. When generating these simulated datasets we required that each sample contain a minimum of five genomes (i.e., simulated profiles were re-run if fewer than five genomes had non-zero abundance). These altered nu values had a large impact on the sparsity and inter-sample overlap of each dataset (Supp. Figure 4a).
We then re-ran the key steps of our simulation analysis on these simulated abundance tables and genomes. Specifically, we repeated the focal gene-based simulation, including applying POMS, both phylogenetic regression approaches, and the Wilcoxon tests (USCGcorrected), over 500 replicate profiles for each of the four datasets. In all cases, we filtered out genes found in <5 genomes prior to running analyses. The Wilcoxon test was applied after taking the cross-product of the taxonomic abundances and taxa gene copy numbers to get the samplewise gene abundances, as in the main text. The focal gene ranks varied substantially depending on the abundance table simulation approach (Supp. Figure 4b). The focal genes were ranked significantly highest for the Wilcoxon approach under the low sparsity settings (the distributions were significantly different [Wilcoxon test P < 0.01] in all cases). For example, when nu = 0.5, the median gene ranks were one and 13.5 for the Wilcoxon test and POMS, respectively. This We also investigated the relationship between focal gene ranks and the number of genomes encoding the focal gene. Like in the MAG-based simulation results, those significant focal genes in the POMS output that were not amongst the most significant genes (i.e., focal genes at higher ranks) were encoded by fewer genomes (Supp. Figure 5). This was true for the other tested approaches as well, overall. However, the focal gene ranks based on Wilcoxon test p-values displayed a positive linear relationship with the number of encoding genomes for the sparsest simulated dataset (nu = 0.95). There was a significant positive correlation in this case (Pearson R = 0.756; P < 10 -15 ). This result was similar to the positive correlation between focal gene rank and number of encoding genomes that we also observed in the corresponding analysis on the MAG-based simulations (Supp. Figure 6). Overall, these reference genome-based simulation results are consistent with the key observations from the MAG-based simulations, at least when taxa are sparsely distributed, which is characteristic of microbiome datasets.