D-MANOVA: fast distance-based multivariate analysis of variance for large-scale microbiome association studies

Chen, Jun; Zhang, Xianyang

doi:10.1093/bioinformatics/btab498

Abstract

Summary

PERMANOVA (permutational multivariate analysis of variance based on distances) has been widely used for testing the association between the microbiome and a covariate of interest. Statistical significance is established by permutation, which is computationally intensive for large sample sizes. As large-scale microbiome studies, such as the American Gut Project (AGP), become increasingly popular, a computationally efficient version of PERMANOVA is much needed. To this end, we derive the asymptotic distribution of the PERMANOVA pseudo-F statistic and provide analytical P-value calculation based on a chi-square approximation. We show that the asymptotic P-value is close to the PERMANOVA P-value even under a moderate sample size. Moreover, it is more accurate and an order of magnitude faster than the permutation-free method MDMR. We demonstrate the use of our procedure, D-MANOVA, on the AGP dataset.

Availability and implementation

D-MANOVA is implemented by the dmanova function in the CRAN package GUniFrac.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Over the past decade, numerous microbiome studies have been conducted to elucidate the role of the human microbiome in health and disease, generating an enormous amount of microbiome sequencing data (Kashyap et al., 2017). Microbiome data have complex structures including zero-inflation, skewed abundance distribution and phylogenetic relatedness among features. To address these statistical challenges, one popular approach summarizes the microbiome data in the form of pairwise distances and statistical analyses are then performed based on the distance matrices (Chen et al., 2012). One widely used distance-based method is PERMANOVA (permutational multivariate analysis of variance based on distances), which aims to identify covariates that could significantly explain the inter-subject variability captured by the pairwise distances (McArdle and Anderson, 2001). As a key component in microbiome data analysis, PERMANOVA has been routinely used in establishing an overall association between the microbiome and a covariate of interest. PERMANOVA uses permutation to assess the statistical significance and could be extremely slow at a large sample size. For example, running a single-threaded instance with 1000 permutations on a sample size of 5000 takes ∼1 h on a desktop computer. In practice, many hypotheses may be tested and more permutations are needed to assess a lower Type I error level, further exacerbating the computational burden. Although methods exist for estimating the tail probability of permutation tests (Knijnenburg et al., 2009), an analytical method, which accurately approximates the PERMANOVA P-value without permutation, is highly desirable. Recently, McArtor et al. (2017) proposed the MDMR method for analytical P-value calculation based on the asymptotic distribution of the PERMANOVA pseudo-F statistic. However, no rigorous proof was given. In addition, we found that MDMR could be conservative under many settings.
Here, we rigorously derive the asymptotic distribution of the pseudo-F statistic, which is different from the one used in MDMR, and provide an accurate chi-square approximation. We show that our approach, D-MANOVA, provides a more accurate approximation than MDMR and is also an order of magnitude faster.

2 Materials and methods

Suppose we have n subjects, p₁ variables of interest and p₂ covariates we want to adjust. Let $X \in \mathbb{R}^{n \times p_{1}}$ and $Z \in \mathbb{R}^{n \times p_{2}}$ be the design matrices for the variables of interest and the covariates, respectively. Define $H^{X, Z}$ and $H^{Z}$ as the projection matrices onto the corresponding column spaces. Further let $H^{X | Z} = H^{X, Z} - H^{Z}$ and $H^{I | X, Z} = I_{n} - H^{X, Z}$, with $I_{n} \in \mathbb{R}^{n \times n}$ being the n × n identity matrix, $\mathrm{rank}(H^{X | Z}) = m_{1}$ and $\mathrm{rank}(H^{I | X, Z}) = n - m_{2}$. Let $\{Y_{i}\}_{i = 1}^{n}$ be the responses, which belong to a metric space denoted by $(\mathcal{Y}, d)$, and $d_{i j} = d(Y_{i}, Y_{j})$ be the pairwise distance. Denote $A = (- d_{i j}^{2} / 2) \in \mathbb{R}^{n \times n}$. We define G as the Gower’s centered matrix

$$G = \left(I_{n} - \frac{1 1^{\top}}{n}\right) A \left(I_{n} - \frac{1 1^{\top}}{n}\right) = DAD,$$

where $1 \in \mathbb{R}^{n \times 1}$ is the vector of all 1s and $D = I_{n} - 1 1^{\top} / n$. The distance-based pseudo-F statistic is defined as

$$T = \frac{\mathrm{tr}(H^{X | Z} G H^{X | Z}) / m_{1}}{\mathrm{tr}(H^{I | X, Z} G H^{I | X, Z}) / (n - m_{2})}, \qquad (1)$$

where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix. The pseudo-F statistic is the basis for distance-based multivariate analysis of variance and quantifies the association between the multivariate Y, whose variability is encoded in the distance matrix, and the covariate of interest X while adjusting other covariates Z. Compared to the classic F-statistic for linear models, the distribution of the distance-based pseudo-F statistic is unknown and permutation, as implemented in PERMANOVA, is usually employed to obtain the P-value. To obtain an analytical P-value without permutation, McArtor et al. (2017) proposed an asymptotic null distribution for the pseudo-F statistic. However, no rigorous theoretical proof for their asymptotic null distribution was given. Here, we fill this gap and derive a more accurate asymptotic null distribution. Let $\mathcal{H}$ be a Hilbert space equipped with the inner product $\langle \cdot, \cdot \rangle$ and the inner product induced norm $\| \cdot \|$. Assume that

$$d_{i j}^{2} = \| \phi(Y_{i}) - \phi(Y_{j}) \|^{2}, \qquad (2)$$

where $\phi(\cdot) : \mathcal{Y} \to \mathcal{H}$ is an embedding from $\mathcal{Y}$ to $\mathcal{H}$. Define $\Phi = (\phi(Y_{1}), \dots, \phi(Y_{n}))^{\top} \in \mathcal{H}^{\otimes n}$ with $\mathcal{H}^{\otimes n}$ being the n-ary Cartesian power of $\mathcal{H}$. Then, the distance-based multivariate analysis of variance can be re-formulated in the linear model

$$\Phi = X B + Z A + E,$$

where $B \in \mathcal{H}^{\otimes p_{1}}$, $A \in \mathcal{H}^{\otimes p_{2}}$ and $E = (e_{1}, \dots, e_{n})^{\top} \in \mathcal{H}^{\otimes n}$. Here, $e_{1}, \dots, e_{n}$ are independent mean-zero random variables in $\mathcal{H}$, which are independent of X and Z. Let

$$K(e_{j}, e_{k}) = \langle e_{j}, e_{k} \rangle.$$

By Mercer’s theorem, K is semi-positive definite and thus admits the spectral decomposition of the form $K(e_{j}, e_{k}) = \sum_{l = 1}^{+\infty} \lambda_{l} \psi_{l}(e_{j}) \psi_{l}(e_{k})$, where $\mathbb{E}[\psi_{s}(e_{i}) \psi_{l}(e_{i})] = 1\{s = l\}$ and $\mathbb{E}[\psi_{l}(e_{i})] = 0$. Based on this setup, we have the following theorem, whose proof is given in Supplementary Note S1.

Theorem 2.1

Assume that $\mathbb{E} \| e_{1} \|^{4} < \infty$ and

$$\| H^{X | Z} \|_{2, 4} = \sup_{a : \| a \|_{2} = 1} \| H^{X | Z} a \|_{4} \to 0. \qquad (3)$$

Then under the null,

$$\frac{\mathrm{tr}(H^{X | Z} G H^{X | Z}) / m_{1}}{\mathrm{tr}(H^{I | X, Z} G H^{I | X, Z}) / (n - m_{2})} \to^{d} T_{0} = \frac{\sum_{l = 1}^{+\infty} \lambda_{l} \chi_{m_{1}, l}^{2} / m_{1}}{\sum_{l = 1}^{+\infty} \lambda_{l}},$$

where $\{\chi_{m_{1}, l}^{2}\}_{l = 1}^{+\infty}$ are independent chi-square random variables with m₁ degrees of freedom.
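To make the limiting law in Theorem 2.1 concrete, the following minimal sketch draws from T₀ by Monte Carlo. It assumes the weights are known and truncates them to a finite list; the eigenvalue sequence, the function name `sample_t0` and the number of draws are illustrative choices, not quantities from the paper.

```python
import random


def sample_t0(weights, m1, n_draws, seed=0):
    """Draw from T0 = (sum_l w_l * chi2_{m1,l} / m1) / sum_l w_l."""
    rng = random.Random(seed)
    total = sum(weights)
    draws = []
    for _ in range(n_draws):
        # A chi-square with m1 df is a Gamma(shape=m1/2, scale=2) variable.
        numerator = sum(w * rng.gammavariate(m1 / 2.0, 2.0) for w in weights) / m1
        draws.append(numerator / total)
    return draws


# Hypothetical, truncated eigenvalue sequence (not taken from the paper).
weights = [0.5 ** l for l in range(20)]
m1 = 2
draws = sample_t0(weights, m1, n_draws=20000)

mean = sum(draws) / len(draws)
variance = sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)
# Exact moments of T0: mean 1, variance 2 * sum(w^2) / (m1 * sum(w)^2).
exact_variance = 2 * sum(w * w for w in weights) / (m1 * sum(weights) ** 2)
print(round(mean, 3), round(variance, 3), round(exact_variance, 3))
```

The empirical mean and variance should land close to the exact values, which depend on the weights only through their sum and their sum of squares.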

Theorem 2.1 shows that as $n \to +\infty$, the distance-based pseudo-F statistic converges in distribution to a weighted sum of independent chi-squared random variables. As the weights are unknown, the limiting distribution is non-pivotal. Here, we develop a chi-square approximation, which has a computational complexity of $O(n^{2})$ and is sufficiently accurate. The idea is to match the first two moments of the chi-square distribution with those of T₀. Suppose $p = (\mathbb{E} K(e_{1}, e_{1}))^{2} / \mathbb{E} K(e_{1}, e_{2})^{2}$ and $\tilde{G} = (\tilde{g}_{i j}) = H^{I | X, Z} G H^{I | X, Z}$ with $H^{I | X, Z} = (h_{i j})$. Based on the derivation detailed in Supplementary Note S2, the distribution of T₀ can be approximated by $\frac{1}{\hat{p} m_{1}} \chi_{\hat{p} m_{1}}^{2}$, where

$$\hat{p} = \frac{\hat{\mu}_{1}^{2}}{\hat{\mu}_{2}}, \quad \hat{\mu}_{1} = \frac{1}{n - m_{2}} \mathrm{tr}(\tilde{G}), \quad \hat{\mu}_{2} = \frac{\sum_{i \neq k} \tilde{g}_{i k}^{2}}{(n - m_{2})^{2} + \sum_{i, j} h_{i j}^{4} - 2 \sum_{i} h_{i i}^{2}}.$$
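The moment matching behind this approximation can be written out briefly. The lines below are a sketch that follows from Theorem 2.1 and the orthonormality of the eigenfunctions, treating the weights as known; they are not a reproduction of Supplementary Note S2, which additionally covers the estimation of the moments.

```latex
\begin{align*}
\mathbb{E} K(e_1, e_1) &= \sum_{l} \lambda_l, \qquad
\mathbb{E} K(e_1, e_2)^2 = \sum_{l} \lambda_l^2, \qquad
p = \frac{\left(\sum_l \lambda_l\right)^2}{\sum_l \lambda_l^2}, \\
\mathbb{E}[T_0] &= \frac{\sum_l \lambda_l \, m_1 / m_1}{\sum_l \lambda_l} = 1, \\
\mathrm{Var}(T_0) &= \frac{\sum_l \lambda_l^2 \cdot 2 m_1 / m_1^2}{\left(\sum_l \lambda_l\right)^2}
  = \frac{2}{p \, m_1}, \\
\mathbb{E}\!\left[\frac{\chi^2_{p m_1}}{p m_1}\right] &= 1, \qquad
\mathrm{Var}\!\left(\frac{\chi^2_{p m_1}}{p m_1}\right)
  = \frac{2 p m_1}{(p m_1)^2} = \frac{2}{p \, m_1}.
\end{align*}
```

The scaled chi-square with $p m_1$ degrees of freedom therefore shares its mean and variance with T₀, and replacing p by its estimate gives the approximating distribution stated above.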

We implemented D-MANOVA by the dmanova function in our GUniFrac package (Chen et al., 2012). To facilitate its use, the interface and the output are similar to those of the adonis function in the CRAN vegan package.
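The computations of this section can be sketched end to end as follows. This is a minimal illustration, not the dmanova implementation: the function names (`dmanova_sketch`, `projection`, `chi2_sf` and so on) are hypothetical, the intercept is assumed to belong to Z, the distance matrix is assumed symmetric, the matrix products are naive rather than $O(n^{2})$, and the chi-square tail is computed with a simple series that is not suited to extreme tails.

```python
import math


def projection(columns, n):
    """Projection matrix onto span(columns) and its rank (Gram-Schmidt)."""
    basis = []
    for col in columns:
        v = [float(x) for x in col]
        for q in basis:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-10:  # skip columns already in the span
            basis.append([a / norm for a in v])
    h = [[sum(q[i] * q[j] for q in basis) for j in range(n)] for i in range(n)]
    return h, len(basis)


def matmul(a, b):
    """Naive matrix product of two lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def trace_of_product(a, b):
    """tr(a b) for square matrices, without forming the product."""
    n = len(a)
    return sum(a[i][j] * b[j][i] for i in range(n) for j in range(n))


def chi2_sf(x, df):
    """P(chi-square_df >= x) for real df > 0, via the lower-gamma series."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = total = 1.0 / a
    k = 0
    while abs(term) > 1e-16 * abs(total):
        k += 1
        term *= z / (a + k)
        total += term
    lower = total * math.exp(a * math.log(z) - z - math.lgamma(a))
    return min(1.0, max(0.0, 1.0 - lower))


def dmanova_sketch(dist, x_cols, z_cols=()):
    """Return (T, p_hat, p_value) for a distance matrix and design columns."""
    n = len(dist)
    # A = (-d_ij^2 / 2), then Gower centering G = D A D.
    a = [[-0.5 * d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in a]
    grand = sum(row_mean) / n
    g = [[a[i][j] - row_mean[i] - row_mean[j] + grand for j in range(n)]
         for i in range(n)]
    # Projections; the intercept is placed in Z (an assumption of this sketch).
    z_full = [[1.0] * n] + [list(c) for c in z_cols]
    h_z, rank_z = projection(z_full, n)
    h_xz, m2 = projection(z_full + [list(c) for c in x_cols], n)
    m1 = m2 - rank_z
    h_x = [[h_xz[i][j] - h_z[i][j] for j in range(n)] for i in range(n)]
    h_r = [[(1.0 if i == j else 0.0) - h_xz[i][j] for j in range(n)]
           for i in range(n)]
    # Pseudo-F of Eq. (1); tr(H G H) = tr(H G) because H is a projection.
    t_stat = (trace_of_product(h_x, g) / m1) / (trace_of_product(h_r, g) / (n - m2))
    # Moment estimates mu_1, mu_2 and p_hat.
    g_tilde = matmul(matmul(h_r, g), h_r)
    mu1 = sum(g_tilde[i][i] for i in range(n)) / (n - m2)
    off_diag = sum(g_tilde[i][k] ** 2 for i in range(n) for k in range(n) if i != k)
    denom = ((n - m2) ** 2 + sum(h ** 4 for row in h_r for h in row)
             - 2 * sum(h_r[i][i] ** 2 for i in range(n)))
    p_hat = mu1 ** 2 / (off_diag / denom)
    df = p_hat * m1
    return t_stat, p_hat, chi2_sf(t_stat * df, df)


# Toy data: univariate responses, Euclidean distance, two groups of three.
y = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]
dist = [[abs(u - v) for v in y] for u in y]
group = [0, 0, 0, 1, 1, 1]
t_stat, p_hat, p_value = dmanova_sketch(dist, [group])
print(t_stat, p_hat, p_value)
```

For this toy input the distance is Euclidean on a single variable, so the pseudo-F coincides with the classical one-way ANOVA F statistic, which is 24 here, and the estimate of p equals 1 up to floating-point error. With real microbiome distances such as BC, a library routine for the chi-square tail would be preferable to the series used above.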

3 Results

We conduct simulations (Supplementary Note S3) to study the performance of D-MANOVA in comparison with PERMANOVA and MDMR. Figure 1a compares the P-values of D-MANOVA and PERMANOVA on the log scale [n = 100, Bray–Curtis (BC) distance, Scenario 3 in Supplementary Note S3] based on 1000 simulation runs under the null (H₀, left) and the alternative (H₁, right). The D-MANOVA and PERMANOVA P-values are highly correlated under both H₀ and H₁. Since the lowest attainable P-value is 0.001 for PERMANOVA with 999 permutations, we see a large number of P-values equal to 0.001 under H₁, while D-MANOVA has no such restriction. Figure 1b compares the performance of the three competing methods under sample sizes of 100, 200 and 500 based on the BC distance. Under H₀ (first point of the power curve), all the methods control the Type I error under the nominal level, with MDMR being more conservative. In terms of statistical power, D-MANOVA achieves almost the same power as PERMANOVA, while MDMR is less powerful under n = 100 and 200. The conservativeness has also been noted by the MDMR authors, who do not recommend running MDMR on sample sizes <200. However, even under n = 500, we still observe some power loss for MDMR, indicating that the approximation of D-MANOVA is more accurate. It is also of interest to study the performance of D-MANOVA under small sample sizes. We thus repeat the simulations at n = 25 and 50. Supplementary Figure S1 shows that the Type I error of D-MANOVA is well controlled at different α levels and the size is closer to the nominal level as the sample size increases. Supplementary Figure S2 shows that the power of D-MANOVA is close to that of PERMANOVA even at n = 25. MDMR, on the other hand, is substantially less powerful under small sample sizes. We also compare the average computation time of the three methods at different sample sizes (Fig. 1c). At n = 12 800, PERMANOVA could not complete the analysis in hours while MDMR takes around 20 min. In contrast, D-MANOVA uses less than one minute.
Therefore, D-MANOVA significantly improves over MDMR in terms of both accuracy and computational efficiency.
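The 0.001 floor on the PERMANOVA P-value noted above follows from how permutation P-values are commonly computed. The sketch below assumes the usual (b + 1)/(B + 1) convention, in which the observed statistic is counted among the permutations; the function name and the numbers are made up for illustration.

```python
def permutation_p_value(observed, permuted):
    """P-value as (b + 1) / (B + 1), counting the observed statistic itself."""
    b = sum(1 for s in permuted if s >= observed)
    return (b + 1) / (len(permuted) + 1)


# 999 permuted statistics, none reaching the observed value (made-up numbers).
smallest = permutation_p_value(observed=10.0, permuted=[1.0] * 999)
print(smallest)
```

Under this convention, resolving P-values below 0.001 requires more than 999 permutations, whereas an analytical tail probability has no such lower bound.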

Fig. 1.

Performance comparison of D-MANOVA, MDMR and PERMANOVA (999 permutations) based on simulations. Bray–Curtis distance was used. (a) Scatter plots comparing the P-values of D-MANOVA and PERMANOVA on the log scale under the null (H₀) and alternative (H₁). (b) Power comparison at sample sizes 100, 200 and 500. Simulation was averaged over 1000 runs. (c) Runtime comparison at varying sample sizes ($n = 50, 100, \dots, 6400, 12800$). Runtimes were averaged over three repetitions. The computation was performed under R v3.3.2 on an iMac (3.2 GHz Intel Core i5, 32 GB 1600 MHz DDR3, El Capitan v10.11.5).

We finally demonstrate the use of D-MANOVA using the publicly available dataset (figshare doi:10.6084/m9.figshare.6137198) from the American Gut Project (AGP) (McDonald et al., 2018). We aim to test the association of the demographic and lifestyle variables with the gut microbiome composition based on the BC distance. We focus the analysis on the American and European populations with an age range between 18 and 80. A total of 7730 subjects were included in the analysis. Country of residence was adjusted for when testing the associations. Supplementary Table S1 shows the D-MANOVA, MDMR and PERMANOVA association P-values for these demographic/lifestyle variables, ordered by effect size as measured by the distance-based R². Due to the large sample size, all the variables except ‘handness’ are found to be significantly associated with the gut microbiome composition. For those significant variables, PERMANOVA P-values are all <0.001, so more permutations are needed to produce accurate P-values. For the ‘handness’ variable, D-MANOVA achieves a P-value similar to that of PERMANOVA. In contrast, MDMR tends to produce larger P-values, consistent with the conservativeness noted in the simulations. In terms of computational speed, D-MANOVA is about 13 times faster than MDMR and 567 times faster than PERMANOVA.

Simulations demonstrated that D-MANOVA had good Type I error control at the 0.005 level, which should suffice for most community-level analyses since the number of tests is usually limited. However, when an extremely small Type I error rate is needed to account for testing thousands or even millions of hypotheses, we recommend using our procedure to filter out most insignificant hypotheses; the remaining hypotheses with extremely small P-values can then be further validated by permutation. As the sample size increases, the detectable effect sizes become much smaller and statistical significance from community-level analyses may have limited practical utility. In such cases, lower-level analyses (e.g. species- or genus-level) may be more meaningful. D-MANOVA could possibly be applied to those lower-level analyses by defining a relevant distance metric on the lower-level units.
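The filter-then-validate strategy recommended above can be sketched as a two-stage screen. The cutoff, the feature names and the stand-in permutation routine below are all hypothetical; in practice the second stage would call a permutation test with enough permutations to resolve the required significance level.

```python
def two_stage_screen(analytic_p, cutoff, run_permutation):
    """Keep analytic P-values above `cutoff`; re-test the rest by permutation."""
    final = {}
    for name, p in analytic_p.items():
        if p > cutoff:
            final[name] = ("analytic", p)
        else:
            final[name] = ("permutation", run_permutation(name))
    return final


# Made-up analytic P-values and a stand-in permutation routine.
analytic_p = {"feature_a": 0.40, "feature_b": 2e-9, "feature_c": 0.03}
checked = two_stage_screen(analytic_p, cutoff=1e-4,
                           run_permutation=lambda name: 1e-6)
print(checked)
```

Only the hypotheses that pass the analytic filter incur the cost of permutation, which keeps the overall computation manageable when the number of tests is large.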

Funding

This work was supported by the Center for Individualized Medicine at Mayo Clinic (J.C.), National Science Foundation [DMS-1830392 and DMS-1811747, X.Z.] and National Institutes of Health [R21 HG011662, J.C. and X.Z.].

Conflict of Interest: none declared.

References

Chen, J. et al. (2012) Associating microbiome composition with environmental covariates using generalized UniFrac distances. Bioinformatics, 28, 2106–2113.

Kashyap, P.C. et al. (2017) Microbiome at the frontier of personalized medicine. Mayo Clin. Proc., 92, 1855–1864.

Knijnenburg, T.A. et al. (2009) Fewer permutations, more accurate P-values. Bioinformatics, 25, i161–i168.

McArdle, B.H., Anderson, M.J. (2001) Fitting multivariate models to community data: a comment on distance-based redundancy analysis. Ecology, 82, 290–297.

McArtor, D.B. et al. (2017) Extending multivariate distance matrix regression with an effect size measure and the asymptotic null distribution of the test statistic. Psychometrika, 82, 1052–1077.

McDonald, D. et al.; The American Gut Consortium (2018) American gut: an open platform for citizen science microbiome research. mSystems, 3, e00031-18.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
