-
PDF
- Split View
-
Views
-
Cite
Cite
Zheng-Zheng Tang, Guanhua Chen, Alexander V Alekseyenko, Hongzhe Li, A general framework for association analysis of microbial communities on a taxonomic tree, Bioinformatics, Volume 33, Issue 9, May 2017, Pages 1278–1285, https://doi.org/10.1093/bioinformatics/btw804
- Share Icon Share
Abstract
Association analysis of microbiome composition with disease-related outcomes provides invaluable knowledge towards understanding the roles of microbes in the underlying disease mechanisms. Proper analysis of sparse compositional microbiome data is challenging. Existing methods rely on strong assumptions on the data structure and fail to pinpoint the associated microbial communities.
We develop a general framework to: (i) perform robust association tests for the microbial community that exhibits arbitrary inter-taxa dependencies; (ii) localize lineages on the taxonomic tree that are associated with covariates (e.g. disease status); and (iii) assess the overall association of the whole microbial community with the covariates. Unlike existing methods for microbiome association analysis, our framework does not make any distributional assumptions on the microbiome data; it allows for the adjustment of confounding variables and accommodates excessive zero observations; and it incorporates taxonomic information. We perform extensive simulation studies under a wide-range of scenarios to evaluate the new methods and demonstrate substantial power gain over existing methods. The advantages of the proposed framework are further demonstrated with real datasets from two microbiome studies. The relevant R package miLineage is publicly available.
miLineage package, manual and tutorial are available at https://medschool.vanderbilt.edu/tang-lab/software/miLineage.
Supplementary data are available at Bioinformatics online.
1 Introduction
A promising area of genomic research focuses on the sequencing and analysis of the genomes of microorganisms (Gilbert et al., 2016; Human Microbiome Project Consortium., 2012). Many human microbiome sequencing studies have been conducted to survey the bacterial composition in different body sites and investigate associations with various covariates (e.g. disease status). Research in this area has begun to uncover relationships between human microbiota and diseases such as obesity and psoriasis (Alekseyenko et al., 2013; Sanderson et al., 2006). These studies suggest that microbes play an important role in human health and disease, and that understanding the microbiome composition provides insight into the functions of microbes and their effects on the human host.
Microbiome data consist of quantification of abundances of microorganisms with specific taxonomic identity. Typically, the data comes from high-throughput sequencing of a marker gene, such as 16S ribosomal ribonucleic acid (rRNA) gene, and assignment of the sequencing reads to taxonomy through pre-processing pipelines such as QIIME (Caporaso et al., 2010). A taxonomic tree is a directed acyclic graph representing the hierarchical structure through which organisms classify into taxonomic groups (taxa). The main levels in the taxonomic hierarchy listed from lowest to highest are species, genus, family, order, class, phylum and kingdom. A node in the taxonomic tree represents a taxon at a certain level. A lineage contains a cluster of taxa that form a line of descent. It is desirable to incorporate the taxonomic information into the analyses of the microbiome composition because the ecological roles played by the microbial communities are different at various taxonomic levels, and changes in the structure of the microbiota that are associated with covariates can occur at any taxonomic levels and along any relevant lineages of the tree (Gilbert et al., 2016; Liu et al., 2008).
Association analysis of the microbiome composition proves challenging for several reasons: the number of taxa is large relative to the sample size, the count data are sparse for many taxa with excessive zero observations, and total counts across taxa are constrained by the sequencing depth (Li, 2015). Two types of methods exist to test the association of the microbiome composition at the community level. The first type models microbiome non-parametrically and are often used to assess the overall association of the microbial community with covariates. PERMANOVA (Anderson, 2001), PERMANOVA-S (Tang et al., 2016) and MiRKAT (Zhao et al., 2015) summarize the taxa counts into a distance matrix and utilize that information to evaluate associations. MiSPU (Wu et al., 2016) generates the generalized taxon proportions that combine taxon abundance and tree information, and adaptively weights those proportions in the association testing. These methods are not designed to pinpoint the relevant taxa that drive the association. The second type directly models the compositional taxa counts using Dirichlet-multinomial (DM) distribution (Chen and Li, 2013; La Rosa et al., 2012). DM distribution was shown to fit the overdispersed count data better than the multinomial distribution by assuming that the proportion parameters follow a Dirichlet distribution. The DM test is often applied to taxa at high taxonomic levels (e.g. phylum) because it is underpowered when testing the large numbers of taxa present at low levels (e.g. genus). The DM assumption may not be appropriate for microbiome data because such models intrinsically impose a negative correlation among taxa counts, while the actual microbiome data display both positive and negative correlations (Mandal et al., 2015). In addition, current implementation of the DM association test cannot accommodate continuous covariates or potential confounders (Rosa et al., 2016).
In this article, we provide a general framework to test associations between covariates and microbial communities on a taxonomic tree. Our methods model multivariate taxa counts but do not rely on any distributional assumptions, which provides flexibility to various correlation structures among taxa. Our methods allow for confounder adjustment, which reduces false positive findings particularly for the current microbiome studies with more complex designs and heterogenous samples. Our methods handle excessive zero counts, which potentially increases power for the detection of changes in presence/absence frequencies of multiple taxa. Harnessing the taxonomic tree and the proposed methods, our framework powerfully and robustly identifies covariate-associated microbial lineages and evaluates the overall association of the whole microbial community. The advantages of these new methods over existing ones are demonstrated through extensive simulation studies. Application to two real microbiome datasets shows the usefulness of our framework in the detection of associations between the human microbiome and two disorders, psoriasis and obesity.
2 Methods
2.1 Quasi-conditional association test (QCAT)
The term in the middle of the test statistic (3) is the empirical covariance estimator of the score statistics. This estimator is robust to arbitrary inter-taxa relationships. The vector of the score statistics among taxa might generally be expected to exhibit negative correlations because the total taxa count is bounded by the sequencing depth. This negative correlation, which is implied by the information matrix of the multinomial regression, does not necessarily hold under the mean assumption (1). Therefore, the empirical covariance estimator that reflects the true dependence among taxa in the community is advantageous. In addition, as in any multivariate tests, the covariance matrix is not always invertible when the sample size is smaller than the number of taxa. This problem can be greatly alleviated by performing tests on each lineage of a taxonomic tree such that the number of taxa involved in each test will be significantly reduced (details in section 2.3).
2.2 Two-part distribution-free association test
For the zero part, we employ the generalized estimating equation (GEE) method (Zeger and Liang, 1986) for the vector of the binary indicators of absence . The covariates for the zero part are not necessarily the same as those for the positive part. But for the ease of presentation, we keep them the same in this paper. The detailed derivation of the GEE-based generalized score test is shown in Supplementary Data Method C. We refer to the test as the zero-part test and its statistic as .
The and are independent under the two-part model. Thus, we can combine them by direct summation and obtain the P-value based on the chi-square distribution with the total degrees of freedom from both parts. We refer to the resulting combined test as the two-part test.
The asymptotic approximation of the test statistics Q1 and may not be accurate with very rare taxa when most of the observations are zeros. On the other hand, the asymptotic approximation of may not be accurate with very common taxa when almost all observations are positive. In these scenarios, we resort to resampling techniques to obtain P-values. Our methods adopt score-type statistics, which are computationally efficient because the null model does not involve covariates of interest and needs to be fit only once in the resampling procedure. To further improve the computational speed, we employ a multi-stage procedure that uses a small number of resampling steps in the initial stage and sequentially increases the number if the estimated P-value is small.
2.3 Localization of covariate-associated lineages on the taxonomic tree
Suppose we have a rooted taxonomic tree with nodes representing the taxonomic units of the tree. Without loss of generality, we assume that the first K nodes are internal nodes (nodes with at least one child). We let denote the count of reads assigned to the node vk and denote a vector of counts of reads assigned to the set of nodes . For each split from a parent node to a child node, the reads on the parent node are either assigned to the child node or remain unassigned. We let denote the set of all direct child nodes of the internal node vk. Then, is the vector of counts of reads assigned to its direct child nodes, and is the count of unassigned reads. For each lineage represented by its root node vk, the observed counts for subject i consist of . See Supplementary Figure S1 for an example on a simple mock tree.
2.4 Assessment of the overall association of the whole microbial community
We first compute P-values for testing the individual lineages . The one-part test is conditional on the total read counts of the lineage (the count on its root node), so the Q1P-values over the lineages on the tree are mutually independent under H0. We can obtain the P-value of the global test using standard P-value combination methods. Simes’ method (Simes, 1986) is powerful when the association signals concentrate in a few lineages. In contrast, Fisher’s method (Fisher et al., 1950) is more powerful when the signals spread over multiple lineages. In our simulation studies and real data analyses, we combine the Q1P-values on the tree using Fisher’s method and Simes’ method, and we refer to the resulting global tests as Fisher-Q1 and Simes-Q1, respectively. Similar to the one-part test, we can combine the DM P-values across lineages to obtain the corresponding global tests, which we refer to as Fisher-DM and Simes-DM. The Q2P-values over the lineages are not independent because the zero-part test is not a conditional test. Fisher’s method does not apply for dependent P-values, but Simes’ method is largely valid (Sarkar and Chang, 1997). We refer to the global two-part test as Simes-Q2.
3 Results
We conducted extensive numerical studies (simulation studies and real data analyses) to investigate the performance of the proposed and existing methods. The DM test was performed using the R package HMP (v1.4.3) (Rosa et al., 2016). The global tests PERMANOVA, MiRKAT, MiSPU tests were performed using the R packages vegan (v2.2-1) (Oksanen et al., 2015), MiRAKT (v0.02) and MiSPU (v1.0), respectively. The PERMANOVA and MiRKAT used L1 Kantorovich-Rubinstein (K-R) distance with unit branch length (Evans and Matsen, 2012). For each pair of samples, this distance is calculated as the sum of the absolute difference between two compositional vectors over each branch of the tree. For MiSPU, we used a range of values () for the parameter γ and reported the one that yields the highest power. All the global tests were performed based on the same taxonomic tree.
3.1 Simulation studies
We simulated taxa counts for two groups with an equal number of samples in each group and tested differential abundances in multiple taxa between the two groups. We considered total sample sizes of 50 and 100 in all simulation studies.
Our first series of simulation studies was designed to evaluate the performance of the asymptotic tests on a certain set of taxa. We simulated taxa counts from parametric distributions to control the relative abundance and the presence/absence frequency of each taxon. This allowed us to clearly see how the different methods perform under various scenarios. In particular, we first generated the vector of proportions for 7 taxa from Dirichlet, log-normal, or zero-inflated log-normal (ZILN) distributions, and then simulated counts from multinomial distribution based on those proportions.
Supplementary Table S1 summarizes the simulation strategies under different parametric models. In the view of the fact that the sum of the proportions over taxa is bounded by 1, we induced an abundance difference between two groups by applying a swapping procedure to change abundances of two taxa at a time (increase the abundance in one taxon and decrease in the other) without disturbing other taxa. This procedure is illustrated by the diagram in Supplementary Table S1 and is explained below for each parametric models.
Under the Dirichlet model, the proportions of taxa were directly generated from Dirichlet distribution with mean parameters and dispersion parameter of 3. Then, we swapped the mean parameter of the last taxon (i.e. 0.02) with that of another taxon in one group. As the values of the mean parameters are in order, the magnitude of the swapped parameter dictates the degree of perturbation. Under the log-normal model, we set the means of the 7 normal distributions to and variances to 1. The proportions of the 7 taxa are calculated from the relative abundance among the 7 log-normal samples. Similar to the DM model, we swapped the mean parameter of the last taxon with that of another taxon in one group to induce an abundance difference. Under the ZILN model, we set the mean parameters of the normal distributions the same as those in the log-normal model and considered two sets of zero proportion parameters: (a) ; and (b) . We referred to these two settings as ZILN(a) and ZILN(b), respectively. In addition, we considered three patterns of differentiation under the ZILN model. In the ‘Zero’ pattern, was swapped, while was fixed. In the ‘Positive’ pattern, was swapped, while was fixed. In the ‘Zero + Positive’ pattern, both and were swapped simultaneously.
For the type I error evaluation, we considered a situation with the presence of a confounder. Specifically, we generated another group consisting of half of the members of group 1. For those members, we perturbed the abundances of the taxa at level 5. In particular, we swapped the Dirichlet parameter 0.02 with 0.12 in the Dirichlet model; we swapped the normal mean parameter 0 with 1.4 in the log-normal and ZILN models. Clearly, the additional group is correlated with the group of interest and the abundance of the taxa.
For all scenarios in the first set of simulation studies, the number of total counts was simulated from a Poisson distribution with mean 1000. We used 5000 simulated datasets to evaluate type I error and power of the asymptotic test at α = 0.05.
The empirical type I errors are shown in Table 1. The results for and tests are not shown under the log-normal model because the two-part test degenerates to the one-part test when all the counts are positive. When no confounders exist, all tests preserve the type I error. When a confounder exists, our proposed tests that adjust for the confounder control the type I error. The DM test yields inflated type I error due to the inability to accommodate confounders.
n . | Test . | Dirichlet . | Log-Normal . | ZILN(a) . | ZILN(b) . |
---|---|---|---|---|---|
Without confounders | |||||
50 | 0.032 | 0.038 | 0.040 | 0.038 | |
0.044 | – | 0.044 | 0.038 | ||
0.045 | – | 0.044 | 0.039 | ||
0.038 | – | 0.044 | 0.036 | ||
DM | 0.030 | 0.021 | 0.037 | 0.021 | |
100 | 0.040 | 0.043 | 0.041 | 0.048 | |
0.044 | – | 0.044 | 0.046 | ||
0.046 | – | 0.044 | 0.048 | ||
0.038 | – | 0.043 | 0.040 | ||
DM | 0.042 | 0.017 | 0.035 | 0.021 | |
With a confounder | |||||
50 | 0.020 | 0.014 | 0.018 | 0.010 | |
0.021 | – | 0.015 | 0.024 | ||
0.032 | – | 0.020 | 0.028 | ||
0.018 | – | 0.015 | 0.014 | ||
DM | 0.17 | 0.17 | 0.077 | 0.12 | |
100 | 0.039 | 0.035 | 0.053 | 0.028 | |
0.039 | – | 0.039 | 0.032 | ||
0.044 | – | 0.047 | 0.041 | ||
0.041 | – | 0.036 | 0.030 | ||
DM | 0.38 | 0.59 | 0.17 | 0.35 |
n . | Test . | Dirichlet . | Log-Normal . | ZILN(a) . | ZILN(b) . |
---|---|---|---|---|---|
Without confounders | |||||
50 | 0.032 | 0.038 | 0.040 | 0.038 | |
0.044 | – | 0.044 | 0.038 | ||
0.045 | – | 0.044 | 0.039 | ||
0.038 | – | 0.044 | 0.036 | ||
DM | 0.030 | 0.021 | 0.037 | 0.021 | |
100 | 0.040 | 0.043 | 0.041 | 0.048 | |
0.044 | – | 0.044 | 0.046 | ||
0.046 | – | 0.044 | 0.048 | ||
0.038 | – | 0.043 | 0.040 | ||
DM | 0.042 | 0.017 | 0.035 | 0.021 | |
With a confounder | |||||
50 | 0.020 | 0.014 | 0.018 | 0.010 | |
0.021 | – | 0.015 | 0.024 | ||
0.032 | – | 0.020 | 0.028 | ||
0.018 | – | 0.015 | 0.014 | ||
DM | 0.17 | 0.17 | 0.077 | 0.12 | |
100 | 0.039 | 0.035 | 0.053 | 0.028 | |
0.039 | – | 0.039 | 0.032 | ||
0.044 | – | 0.047 | 0.041 | ||
0.041 | – | 0.036 | 0.030 | ||
DM | 0.38 | 0.59 | 0.17 | 0.35 |
n . | Test . | Dirichlet . | Log-Normal . | ZILN(a) . | ZILN(b) . |
---|---|---|---|---|---|
Without confounders | |||||
50 | 0.032 | 0.038 | 0.040 | 0.038 | |
0.044 | – | 0.044 | 0.038 | ||
0.045 | – | 0.044 | 0.039 | ||
0.038 | – | 0.044 | 0.036 | ||
DM | 0.030 | 0.021 | 0.037 | 0.021 | |
100 | 0.040 | 0.043 | 0.041 | 0.048 | |
0.044 | – | 0.044 | 0.046 | ||
0.046 | – | 0.044 | 0.048 | ||
0.038 | – | 0.043 | 0.040 | ||
DM | 0.042 | 0.017 | 0.035 | 0.021 | |
With a confounder | |||||
50 | 0.020 | 0.014 | 0.018 | 0.010 | |
0.021 | – | 0.015 | 0.024 | ||
0.032 | – | 0.020 | 0.028 | ||
0.018 | – | 0.015 | 0.014 | ||
DM | 0.17 | 0.17 | 0.077 | 0.12 | |
100 | 0.039 | 0.035 | 0.053 | 0.028 | |
0.039 | – | 0.039 | 0.032 | ||
0.044 | – | 0.047 | 0.041 | ||
0.041 | – | 0.036 | 0.030 | ||
DM | 0.38 | 0.59 | 0.17 | 0.35 |
n . | Test . | Dirichlet . | Log-Normal . | ZILN(a) . | ZILN(b) . |
---|---|---|---|---|---|
Without confounders | |||||
50 | 0.032 | 0.038 | 0.040 | 0.038 | |
0.044 | – | 0.044 | 0.038 | ||
0.045 | – | 0.044 | 0.039 | ||
0.038 | – | 0.044 | 0.036 | ||
DM | 0.030 | 0.021 | 0.037 | 0.021 | |
100 | 0.040 | 0.043 | 0.041 | 0.048 | |
0.044 | – | 0.044 | 0.046 | ||
0.046 | – | 0.044 | 0.048 | ||
0.038 | – | 0.043 | 0.040 | ||
DM | 0.042 | 0.017 | 0.035 | 0.021 | |
With a confounder | |||||
50 | 0.020 | 0.014 | 0.018 | 0.010 | |
0.021 | – | 0.015 | 0.024 | ||
0.032 | – | 0.020 | 0.028 | ||
0.018 | – | 0.015 | 0.014 | ||
DM | 0.17 | 0.17 | 0.077 | 0.12 | |
100 | 0.039 | 0.035 | 0.053 | 0.028 | |
0.039 | – | 0.039 | 0.032 | ||
0.044 | – | 0.047 | 0.041 | ||
0.041 | – | 0.036 | 0.030 | ||
DM | 0.38 | 0.59 | 0.17 | 0.35 |
The power results under the Dirichlet and the log-normal models when n = 50 are displayed in Figure 1 (see Supplementary Fig. S2 for results when n = 100). Each plot demonstrates the power curve as a function of the degree of perturbation. Under the Dirichlet model, the one-part test is slightly less powerful than the DM test. The two-part test is dramatically more powerful than the one-part and DM tests, as the differentiation is dominated by the change of presence/absence frequencies. Under the log-normal model, the one-part test is much more powerful than the DM test.

Power as a function of the degree of perturbation under Dirichlet and log-normal models
The power results under the ZILN model when n = 50 are displayed in Figure 2 (see Supplementary Fig. S3 for results when n = 100). In the ‘Zero’ pattern, the two-part test is substantially more powerful than the one-part test and less powerful than the zero-part test. In the ‘Positive’ pattern, the two-part test is less powerful than the positive-part test. The two-part test is more powerful than the one-part test in ZILN(a) but less powerful in ZILN(b) because the zero proportions are generally higher in the former setting. In the ‘Zero + Positive’ pattern, the two-part test is more powerful than both the zero-part and positive-part tests. With the ZILN(a), the perturbation increases/decreases the zero proportion and the mean of the positive count simultaneously, in which case the overall means do not change much and the tests of the overall means (i.e. the one-part test and the DM test) have low power. With ZILN(b), the perturbation makes the zero proportion and the mean of the positive count change in different directions, in which case both the one-part and the two-part tests have good power. In all scenarios under the ZILN, the DM test is always less powerful than the proposed tests.

Power as a function of the degree of perturbation under zero-inflated log-normal models. The left panels pertain to ZILN (a) and the right panels pertain to ZILN (b). The pattern of perturbation is indicated above each graph
Our second series of studies was devoted to comparing the performance of the new methods with existing methods for lineage identification and overall association assessment. In particular, we applied the one-part, two-part and DM tests to each lineage of the tree and combined the lineage-specific P-values to produce the global P-values; we also applied PERMANOVA, MiRKAT and MiSPU to assess the overall association. We used permutation to obtain P-values for all tests.
To closely resemble reality, we generated taxa counts on a taxonomic tree based on a real dataset (Flores et al., 2014). Those investigators collected 638 gut microbial samples from 85 college-age adults over a three-month period and then characterized the microbiome composition using 16S rRNA sequencing. Counts of reads were summarized on a taxonomic tree that had 1050 nodes from kingdom to species (see Supplementary Fig. S4b). Because there is no interference over the three-month study period, we considered these samples as coming from the same null population.
We constructed a matrix with each row corresponding to the relative abundances (proportions) of the 1050 nodes for each sample. We then simulated count data from multinomial distribution with the vector of proportion parameters sampled from the rows of and size parameter Ni sampled from the sequencing depths of the 638 microbial samples. To introduce perturbation, we added count Ei to the taxa under genus Streptococcus in one group, where Ei was simulated from Binomial(Ni, ), and varied from 0 to 0.002%. For each scenario, we repeated the simulation 100 times. For lineage detection on the tree, we controlled the FDR at . For the global test, we rejected the null composite hypothesis at .
Supplementary Figure S4a displays the heatmap of the counts for the 40 taxa under order Lactobacillales when . Clearly, the counts of the taxa under genus Streptococcus are different between the two groups. As a perturbation is made to the counts under genus Streptococcus, the counts on all the parent nodes of Streptococcus also change. Therefore, the differential lineages denoted by their root nodes are kingdom Bacteria, phylum Firmicutes, class Bacilli, order Lactobacillales, family Streptococcaceae and genus Streptococcus (see Supplementary Fig. S4b). The discovery rate at genus Streptococcus is displayed in the panels of Figure 3a. The two-part test becomes more powerful than the one-part test with higher degrees of perturbation because of the remarkable difference in the presence/absence frequencies (see Supplementary Fig. S4a). Discovery rates for lineages above genus are very low (results not shown) because differences in counts on these lineages are diluted by the counts from the non-differential child nodes. The FDR is under controlled for all tests (see Supplementary Table S2).

Simulation results for association tests on the taxonomic tree. (a) Discovery rate at genus Streptococcus. (b) Power of the global test. The total sample size is indicated above each graph
The power of the global test is displayed in the panels in Figure 3b. The Simes-Q1 and Simes-Q2 tests are always more powerful than the Fisher-Q1 test because the perturbation occurs on a small number of lineages. The PERMANOVA, MiRKAT and MiSPU tests have low power because the differentiation is too subtle to be detected by jointly considering all the taxa on the tree. All global tests preserve the type I error. The DM test is underpowered in the detection of differential lineages, as well as the global association assessment.
3.2 Cutaneous microbiome in psoriasis
Psoriasis is a common chronic inflammatory disease of the skin. Alekseyenko et al. investigated the community differentiation of cutaneous microbiota in psoriasis, where a total of 51 patients with psoriasis were studied by swabbing the lesion and a contralateral sample of normal tissue from each subject (Alekseyenko et al., 2013). The sample DNA was analyzed by sequencing the V1-V3 region of the 16S rRNA gene. The microbial operational taxonomic units (OTUs) were constructed through the QIIME pipeline. After removing samples with sequencing depth less than 1000 and discarding taxa with only one read, we ended up with 100 samples for the two groups (lesion and normal tissue) and 13429 OTUs. These OTUs were taxonomically classified up to the genus level. Taxa that appear in less than two samples have been dropped resulting in 477 genera. These genera were then mapped to a taxonomic tree with 304 lineages from family to kingdom.
We compared the microbial communities among the two groups over all lineages on the taxonomic tree at the family, order, class, phylum and kingdom levels. We performed the association analysis using the proposed tests and the DM test. The DM test does not discover any differential lineages after controlling FDR at 0.05 level. Our proposed tests identify seven lineages on the tree, including Streptococcaceae, Thermaceae, Actinomycetaceae, Paenibacillaceae at family level, Bacillales at order level, Deinococci at class level and Acidobacteria at phylum level. Their permutation P-values are listed in Supplementary Table S3, and their locations on the tree are marked in Figure 4. The differentiation of the identified lineage is driven mainly by the differences in the presence/absence frequencies of taxa in those lineages. For instance, Supplementary Figure S5 displays the proportion of the presence and the count at those presence in the families of order Bacillales. The lesion and normal groups differ in the percentages of the presence (P-value = 0.000075) but do not differ in the counts at those presence (P-value = 0.34). The family Paenibacillaceae that shows a prominent difference between groups is also detected by the two-part test at the family level.

Taxonomic tree for the cutaneous microbiome. The root nodes of the discovered lineages are annotated by the names of the corresponding taxa. Four phyla that dominate the skin microbial community are also annotated
The four lineages identified at the family level were completely missed by the investigators’ initial analysis (Alekseyenko et al., 2013). The association of family Streptococcaceae with psoriasis was recently confirmed by another study (Drago et al., 2016), and the presence of species Streptococcus was previously linked to the pathogenesis of psoriasis (Weisenseel et al., 2002). This suggests the advantage of our methods, especially in the detection of associations at lower taxonomic levels.
For the global association analysis, our Fisher-Q1 test produced significant results with P-values of 0.0016. The Simes’ tests are less significant than the Fisher’s test (Simes-Q1P-value = 0.073, Simes-Q2P-value = 0.016) because the differential signals spread over many lineages in the taxonomic tree. We also applied global tests PERMANOVA, MiRKAT and MiSPU to the same data. These tests produced significant results with P-values of 0.00081, 0.00068 and 0.0088, respectively. The P-values of the DM global tests are not significant (Fisher-DM P-value = 0.06, Simes-DM P-value = 0.22).
3.3 Gut microbiome and body mass index
Gut microbiota play an important role in obesity by contributing to nutrient digestion and absorption in humans. Wu et al. examined the relationship between micronutrients and gut microbiome composition, by collecting fecal samples from 98 healthy volunteers, along with their demographic data and diet information, and then sequencing the V1–V2 region of the 16S rRNA genes in the fecal samples (Wu et al., 2011). A total of 3068 OTUs were obtained from the QIIME pipeline. These OTUs were taxonomically classified up to the genus level. Taxa that appear in less than two samples have been dropped resulting in 80 genera. These genera were then mapped to a taxonomic tree with 74 lineages from family to kingdom.
Because dysbiosis of the gut microbiome has been shown to be associated with obesity (Sanderson et al., 2006), we were interested in identifying the bacterial lineages that are associated with body mass index (BMI) after adjusting for total fat and caloric intakes. We performed the proposed tests over all lineages on the taxonomic tree at the level of family, order, class, phylum and kingdom. The DM test could not be performed because it is unable to handle continuous covariates or adjust for confounders. Our tests identify two BMI-associated lineages: family Ruminococcaceae (Q1P-value = 0.00011, Q2P-value = 0.025, P-value = 0.38, P-value = 0.0070) and family Veillonellaceae (Q1P-value = 0.0036, Q2P-value = 0.0018, P-value = 0.0011, P-value = 0.20). As indicated by the results of the zero-part and the positive-part tests, the association of Ruminococcaceae is driven mainly by the changes in the positive abundances and the association of Veillonellaceae is driven mainly by the changes of the presence/absence frequencies. For family Ruminococcaceae, the signal concentrates in its direct child genus Oscillibacter (Q1P-value becomes 0.13 after leaving Oscillibacter out). Of relevance to our findings, Oscillibacter was previously identified to be negatively associated with BMI and potentially regulates the maintenance of gut barrier integrity (Lam et al., 2012; Wu et al., 2011). In addition, family Veillonellaceae was perviously identified to be positively associated with adiposity and the deterioration of metabolic factors (Krajmalnik-Brown et al., 2012; Wu et al., 2011).
For the global association analysis, all of our global tests produce significant results with Fisher-Q1P-value of 0.010, Simes-Q1P-value of 0.0028, and Simes-Q2P-value of 0.042. We also applied global tests PERMANOVA, MiRKAT and MiSPU to the same data. The results are not significant because the association signals are subtle and sparse on the tree (PERMANOVA P-value = 0.39, MiRKAT P-value = 0.57, MiSPU P-value = 0.24).
4 Discussion
In this paper, we develop a general framework for association analysis between microbial communities on a taxonomic tree and covariates (e.g. clinical outcomes, disease status). This framework exploits the hierarchical structure of the microbial taxonomy and enables us to identify the covariate-associated lineages, as well as to assess the overall association of the microbial community with covariates. Our methods accommodate the compositional and sparse nature of microbiome data. The proposed tests are distribution-free to allow for complex dependency among taxa. Moreover, the ability to adjust for confounding variables facilitates robust discoveries and reduces false positive findings. We demonstrate that our proposed methods outperform existing methods under a wide range of scenarios. We apply our framework to real microbiome datasets to discover associations of the microbial communities with psoriasis and obesity and to localize the relevant microbial lineages on the taxonomic tree.
The performance of different proposed tests depends on the underlying distributions of the microbiome data and the association patterns. If the taxa counts have many zeros, then the two-part test tends to be more desirable than the one-part test. If most of the taxa counts are positive, then the one-part test tends to be more attractive than the two-part test. In terms of the global tests, if the association signals are very sparse on the tree, Simes’ method would be more desirable than Fisher’s method to combine P-values across lineages. If the association signals spread over many lineages on the tree, Fisher’s method would be more desirable.
Many distance-based methods have been developed to test the overall association of the whole microbial community with covariates, but they are unable to pinpoint taxa that drive the associations. On the other hand, association testing with a single taxon (Love et al., 2014; Paulson et al., 2013; Robinson et al., 2010) can be underpower because the signal in one taxon is often too weak to be powerfully detected, and there are a large number of tests to be adjusted for. Our framework leverages taxonomic structure to define the unit of the tests in a biologically meaningful way. Our multivariate tests concerning multiple taxa in a lineage can enrich association signals and reduce the penalty of multiple testing. For example, the 80 genera were reduced to 34 families in the analysis of the gut microbiome data. The tree-guided discoveries from our framework are highly interpretable and are expected to facilitate the precise delineation of molecular mechanisms and functions of the relevant bacteria that interact with the human host.
Besides the 16S rRNA sequencing, the shotgun metagenomic sequencing is a popular high-throughput techniques commonly used in microbiome studies. The shotgun sequencing platform surveys all microbial genomes instead of one marker gene. Because of the ambiguity of sequence alignment among multiple bacterial genomes, the metagenomic data are often summarized into proportions rather than counts. Even though we presented our methods using count data, the framework can be modified easily to handle proportion data from metagenomic sequencing.
Funding
This work was supported by the National Institutes of Health, Building Interdisciplinary Research Careers in Women’s Health (BIRCWH) award 5K12HD043483-14.
Conflict of Interest: none declared.
References