phylogenize: a web tool to identify microbial genes underlying environment associations

Summary: Microbes differ in prevalence across environments, but in most cases the causes remain opaque. Phylogenetic comparative methods have emerged as powerful, specific methods to identify microbial genes underlying differences in community composition. However, applying these methods currently requires computational expertise and either sequenced isolates or shotgun metagenomes, limiting their wider adoption. We present phylogenize, a web server that allows researchers to apply phylogenetic regression to 16S amplicon as well as shotgun sequencing data and to visualize the results. Using data from the Human Microbiome Project, we show that phylogenize draws similar conclusions from 16S and from shotgun sequencing. Additionally, we apply phylogenize to 16S data from the Earth Microbiome Project, revealing both known and candidate pathways involved in plant colonization. phylogenize has broad applicability to the analysis of both human-associated and environmental microbiomes.
Availability: phylogenize is available at https://phylogenize.org with source code available at https://bitbucket.org/pbradz/phylogenize.
Contact: kpollard@gladstone.ucsf.edu


Introduction
Shotgun and amplicon sequencing have enabled previously intractable microbial communities to be characterized and compared. However, while these communities have the potential to yield clinical (Moayyedi et al., 2015) and agricultural tools (Mendes et al., 2011), translating microbe-to-environment correlations into gene-level mechanisms remains difficult.
Phylogenetic regression is a powerful, underutilized technique (Washburne et al., 2018) that can help interpret these correlations by accounting for the confounder of common descent. Previously, we demonstrated that applying this technique to shotgun metagenomic data can identify microbial genes linked to human body sites without the high false-positive rate of standard regression (Bradley et al., 2018).
Here, we present phylogenize, a web tool that makes this technique available to researchers without specific expertise in this area by allowing them to upload and analyze their own data. We also provide the source code of phylogenize, allowing more experienced users to run it locally.
In addition to shotgun metagenomic data, phylogenize also allows researchers to analyze abundances derived from 16S amplicon sequencing. 16S data is much less expensive to generate and already exists for many environments, allowing researchers to get more from their data.

Overview
phylogenize (Figure 1) takes the following basic inputs. First, users provide a table of taxon abundances across a set of samples. These taxa should be ASVs from DADA2 (Callahan et al., 2016) or Deblur (Amir et al., 2017) for 16S data, or MIDAS species for shotgun data. Second, users provide a table of sample annotations matching sample IDs to environments and datasets. The abundances and sample annotations can be provided separately or as a single BIOM-format (McDonald et al., 2012) file.
Next, the user selects one environment out of those represented in the sample annotations. Finally, the user chooses whether to link gene presence to prevalence (how frequently a microbe is observed in samples from the selected environment) or to specificity (how specific a microbe is for the chosen environment compared to all others; see Bradley et al., 2018).
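To make the prevalence phenotype concrete, the following minimal Python sketch computes a naive prevalence: the fraction of samples from the chosen environment in which a taxon has a nonzero count. The function name, the dictionary layout, and the toy data are illustrative assumptions, not part of phylogenize, and the estimator actually used by phylogenize may differ from this unregularized fraction (see Bradley et al., 2018).

```python
def prevalence(abundances, sample_env, environment):
    """Naive prevalence: fraction of samples from `environment` in which
    each taxon has a nonzero count. Hypothetical helper for illustration."""
    samples = [s for s, env in sample_env.items() if env == environment]
    if not samples:
        raise ValueError(f"no samples annotated as {environment!r}")
    result = {}
    for taxon, counts in abundances.items():
        # A taxon counts as present in a sample if its count is above zero.
        present = sum(1 for s in samples if counts.get(s, 0) > 0)
        result[taxon] = present / len(samples)
    return result


# Toy abundance table (taxon -> sample -> count) and sample annotations.
abundances = {
    "species_A": {"s1": 10, "s2": 0, "s3": 4, "s4": 3},
    "species_B": {"s1": 0, "s2": 0, "s3": 7, "s4": 2},
}
sample_env = {"s1": "gut", "s2": "gut", "s3": "skin", "s4": "gut"}

gut_prev = prevalence(abundances, sample_env, "gut")
print(gut_prev)
```

In this toy example, species_A is present in two of the three gut samples and species_B in one of the three.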
phylogenize uses the fast mapper BURST (Al-Ghalith and Knights, 2017) to map sense or anti-sense ASVs to individual PATRIC genomes (Wattam et al., 2014), then matches these genomes to MIDAS species (which are clusters of PATRIC genomes). Reads for sequences mapping to the same species are summed within samples.
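The summation step can be sketched as follows. This is an illustration only: the `asv_to_species` lookup stands in for the result of the BURST mapping and genome-to-species matching, the identifiers are invented, and dropping ASVs with no mapped species is an assumption made here for simplicity.

```python
from collections import defaultdict


def sum_by_species(asv_counts, asv_to_species):
    """Collapse an ASV-by-sample count table into a species-by-sample table
    by summing, within each sample, the reads of ASVs that map to the same
    species. Hypothetical helper for illustration."""
    species_counts = defaultdict(lambda: defaultdict(int))
    for asv, per_sample in asv_counts.items():
        species = asv_to_species.get(asv)
        if species is None:
            continue  # assumption: ASVs without a mapped species are dropped
        for sample, n in per_sample.items():
            species_counts[species][sample] += n
    return {sp: dict(counts) for sp, counts in species_counts.items()}


# Toy data: asv1 and asv2 map to the same species; asv4 is unmapped.
asv_counts = {
    "asv1": {"s1": 5, "s2": 1},
    "asv2": {"s1": 3, "s2": 0},
    "asv3": {"s1": 2, "s2": 8},
    "asv4": {"s1": 9, "s2": 9},
}
asv_to_species = {"asv1": "sp_X", "asv2": "sp_X", "asv3": "sp_Y"}

species_table = sum_by_species(asv_counts, asv_to_species)
print(species_table)
```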
The web front-end for phylogenize is written in Python using the Flask framework with a Beanstalk-based queueing system. For each job, phylogenize uses RMarkdown (Allaire et al., 2018) and knitr (Xie, 2014) to generate an HTML report. This report includes interactive trees showing the phenotype's phylogenetic distribution, heatmaps of significantly positively-associated genes, and tables showing which SEED subsystems (Overbeek et al., 2005) were significantly enriched at a 25% FDR. phylogenize also provides tab-delimited files containing the calculated phenotype, p-values and effect sizes for all FIGfams tested, and protein annotations for the significant, positively-associated hits.
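As an illustration of what thresholding at a 25% FDR involves, the sketch below applies the Benjamini-Hochberg adjustment to a handful of enrichment p-values. The text does not state which FDR procedure phylogenize uses, so Benjamini-Hochberg is an assumption here, and the subsystem names and p-values are invented for the example.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented enrichment p-values for four hypothetical subsystems.
pvals = {
    "subsystem_A": 0.001,
    "subsystem_B": 0.04,
    "subsystem_C": 0.20,
    "subsystem_D": 0.60,
}
names = list(pvals)
qvals = benjamini_hochberg([pvals[n] for n in names])

# Keep subsystems whose adjusted p-value is at or below the 25% threshold.
enriched = [n for n, q in zip(names, qvals) if q <= 0.25]
print(enriched)
```

With these toy values, subsystem_A and subsystem_B pass the threshold, while subsystem_C narrowly fails (its adjusted value is about 0.27).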

Example Applications
Human Microbiome Project comparison: We first used phylogenize to associate gene presence-absence with microbial prevalence in the gut. To do so, we used 454 16S amplicon sequencing data from the Human Microbiome Project (HMP) (Human Microbiome Project Consortium, 2012). We downloaded 6,577 samples from 192 individuals across 16 sites from the Sequence Read Archive and denoised them with DADA2 (Callahan et al., 2016). Reads were combined for all samples from the same individual and site.
Previously, we performed a similar analysis using HMP's shotgun sequencing data (Bradley et al., 2018), which we use here as a benchmark. Despite differences in read depth and technology, species prevalence estimates obtained by mapping 16S ASVs to MIDAS genomes were similar to those from shotgun sequencing (r = 0.6), and the effect sizes calculated for genes associated with gut prevalence were also broadly similar (0.339 ≤ r ≤ 0.601, Figure S1). When we compared the significantly-associated genes, we also observed shared pathway enrichments, including for genes in the SEED subsystems "Sporulation gene orphans" in Firmicutes (q_shotgun = 2.7 × 10^−22, q_16S = 0.019), and "Type III, Type IV, Type VI, ESAT secretion systems" in Proteobacteria (q_shotgun = 1.69 × 10^−11, q_16S = 2.23 × 10^−6).

Earth Microbiome Project: The Earth Microbiome Project (EMP) (Thompson et al., 2017) comprises 16S data sampled across many biomes and habitats. Using the balanced subset of 2,000 samples processed using Deblur (Amir et al., 2017), we calculated a specificity score for being plant-associated, as opposed to being animal-associated or free-living. phylogenize identified genes enriched in processes known to be relevant to a plant-associated lifestyle, such as nitrogen fixation (Mylona et al., 1995), the metabolism of opines (metabolites whose biosynthesis in plants is induced by parasitic Agrobacterium species; Schell et al., 1979), and xylose metabolism (xylose is a plant cell wall component; Liu et al., 2015).

Conclusion
Phylogenetic regression offers a computational way to identify genes potentially involved in site colonization, even for clinically or ecologically important microbes that are poorly characterized and/or experimentally intractable. Previously, applying this method to microbiome data required specialized computational expertise and either shotgun metagenomic data (Bradley et al., 2018) or a large collection of sequenced isolates (Levy et al., 2018). By making it significantly easier to analyze either 16S or shotgun data with phylogenetic regression, phylogenize expands the toolkit for researchers studying microbial communities.

Figure S1: phylogenize makes similar inferences from 16S and shotgun data. On the x-axis are effect sizes of genes associated with gut prevalence using shotgun data from HMP; the y-axis has effect sizes derived from 454 16S data. Only genes significant at q ≤ 0.05 in at least one dataset are shown.