benchdamic: benchmarking of differential abundance methods for microbiome data

Abstract
Summary: Recently, an increasing number of methodological approaches have been proposed to tackle the complexity of metagenomics and microbiome data. In this scenario, reproducibility and replicability have become two critical issues, and the development of computational frameworks for the comparative evaluation of such methods is of utmost importance. Here, we present benchdamic, a Bioconductor package to benchmark methods for the identification of differentially abundant taxa.
Availability and implementation: benchdamic is available as an open-source R package through the Bioconductor project at https://bioconductor.org/packages/benchdamic/.
Supplementary information: Supplementary data are available at Bioinformatics online.


Introduction
This document is a static version of the vignette for the R/Bioconductor package benchdamic. It provides an introductory example of how to work with the analysis framework first proposed in Calgaro et al. (2020). Some methods for differential abundance (DA) analysis are tested on microbiome datasets. The performance of each method is evaluated with respect to: i) suitability of distributional assumptions (GOF), ii) ability to control false positives (TIEC), iii) concordance of the findings, and iv) enrichment of DA microbial species in specific conditions.

Data loading
All datasets used in benchdamic are downloaded using the HMP16SData Bioconductor package (Schiffer et al., 2019).
For the GOF and TIEC analyses, a homogeneous group of samples (e.g., only samples from a specific experimental condition, phenotype, treatment, body site, etc.) is used: the ps_stool_16S dataset (use help("ps_stool_16S") for details). It contains 16S data from:
• 32 stool samples from participants of the Human Microbiome Project;
• 71 taxa; all features with the same genus-level taxonomic classification were collapsed together (a total of 71 taxa corresponding to 71 genera).
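The commands below sketch how this dataset can be loaded; ps_stool_16S ships with benchdamic, so only the package itself is needed (the HMP16SData download step has already been performed for the stored object).

```r
# Load benchdamic and the example stool dataset (a phyloseq object)
library(benchdamic)
data("ps_stool_16S", package = "benchdamic")
ps_stool_16S
# A phyloseq object with 71 taxa and 32 samples is expected
```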

Goodness of Fit
Assumption: Many DA detection methods are based on parametric distributions.
Research question: Which parametric distributions can fit both the proportion of zeros and the counts in your data?

GOF structure
As different methods rely on different statistical distributions to perform DA analysis, the goodness of fit (GOF) of the statistical models underlying some of the DA methods on a 16S dataset is assessed. For each model, its ability to correctly estimate the average counts and the proportion of zeroes by taxon is evaluated.
The relationships between the functions used in this section are explained by the diagram in Figure 1. To help with the reading: green boxes represent the inputs or the outputs, red boxes are the methods, and blue boxes are the main parameters of those methods.

Negative Binomial and Zero-Inflated Negative Binomial Models
For any µ ≥ 0 and θ > 0, let f_NB(·; µ, θ) denote the probability mass function (PMF) of the negative binomial (NB) distribution with mean µ and inverse dispersion parameter θ, namely:

f_NB(y; µ, θ) = [Γ(y + θ) / (Γ(y + 1)Γ(θ))] · (θ / (µ + θ))^θ · (µ / (µ + θ))^y, ∀y ∈ ℕ

Note that another parametrization of the NB PMF is in terms of the dispersion parameter ψ = θ^(−1) (although θ is also sometimes called the dispersion parameter in the literature). In both cases, the mean of the NB distribution is µ and its variance is:

σ² = µ + ψµ² = µ + µ²/θ

In particular, the NB distribution boils down to a Poisson distribution when ψ = 0 ⟺ θ = +∞.
To fit these distributions on real count data the edgeR (Robinson et al., 2010) and zinbwave (Risso et al., 2018) packages are used. In benchdamic they are implemented in the fitNB() and fitZINB() functions.
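A minimal sketch of how these fitting functions can be called on the example dataset (argument names follow the package manual; check ?fitNB and ?fitZINB for the exact signatures of your installed version):

```r
library(benchdamic)
data("ps_stool_16S", package = "benchdamic")
# Estimate, for each taxon, the mean count and the zero probability
# under the NB and ZINB models
fittedNB <- fitNB(object = ps_stool_16S, verbose = FALSE)
fittedZINB <- fitZINB(object = ps_stool_16S, verbose = FALSE)
head(fittedNB) # per-taxon estimated values (mean and zero probability)
```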

Zero-Inflated Gaussian Model
The raw count for sample j and feature i is denoted by c_ij. The zero-inflated Gaussian (ZIG) model is defined for the continuity-corrected logarithm of the raw count data, y_ij = log2(c_ij + 1), as a mixture of a point mass at zero, I_{0}(y), and a count distribution f_count(y; µ, σ²) ∼ N(µ, σ²). Given mixture parameters π_j, the density of the ZIG distribution for feature i in sample j, with s_j total counts, is:

f_ZIG(y_ij; s_j, β, µ_i, σ_i²) = π_j(s_j) · I_{0}(y_ij) + (1 − π_j(s_j)) · f_count(y_ij; µ_i, σ_i²)

The mean model is specified as:

E[y_ij] = b_i0 + η_i · log2(sl_j / N + 1)

In this case, the parameter b_i0 is the intercept of the model, while the term including the logged normalization factor, log2(sl_j / N + 1), captures feature-specific normalization factors through the parameter η_i. In detail, sl_j is the median scaling factor resulting from the Cumulative Sum Scaling (CSS) normalization procedure. N is a constant fixed by default at 1000, but it should be a number close to the scaling factors to be used as a reference; for this reason, a good choice could be the median of the scaling factors (which is used instead of 1000). The mixture parameters π_j(s_j) are modeled as a binomial process:

log(π_j / (1 − π_j)) = β_0 + β_1 · log(s_j)

To fit this distribution on real count data the metagenomeSeq package (Paulson et al., 2013) is used. In benchdamic it is implemented in the fitZIG() function.

Truncated Gaussian Hurdle Model
This method was originally developed for single-cell RNA-seq data, where the expression matrix y = log2(TPM + 1) was modeled as a two-part generalized regression model (Finak et al., 2015).
The taxon presence rate is modeled using logistic regression and, conditioning on a sample with the taxon, the transformed abundance level is modeled as Gaussian.
Given the normalized, possibly thresholded, abundance y_ij, the rate of presence and the abundance level for the samples where the taxon is present are modeled conditionally independently for each taxon i. Define the indicator z_ij, indicating whether taxon i is present in sample j (i.e., z_ij = 0 if y_ij = 0 and z_ij = 1 if y_ij > 0). A logistic regression model for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1) are fitted independently, as follows:

logit(P(Z_ij = 1)) = X_j β_i^D
P(Y_ij = y | Z_ij = 1) ∼ N(X_j β_i^C, σ_i²)

To estimate this distribution on real count data the MAST package (Finak et al., 2015) is used. In benchdamic it is implemented in the fitHURDLE() function.

Dirichlet-Multinomial Mixture Model
The probability mass function of an n-dimensional multinomial (MN) sample y = (y_1, ..., y_n)^T with library size libSize = Σ_{i=1}^{n} y_i and parameter p = (p_1, ..., p_n) is:

f_MN(y; libSize, p) = [libSize! / (y_1! ··· y_n!)] · Π_{i=1}^{n} p_i^{y_i}

The mean-variance structure of the MN model does not allow over-dispersion, which is common in real data. The Dirichlet-multinomial (DM) distribution models the probability parameter p of the MN model with a Dirichlet distribution. The probability mass of an n-category count vector y over libSize trials under DM with parameter α = (α_1, ..., α_n), α_i > 0, and proportion vector p ∈ Δ_n = {(p_1, ..., p_n) : p_i ≥ 0, Σ_i p_i = 1} is:

f_DM(y; α) = [libSize! / (y_1! ··· y_n!)] · [Γ(Σ_i α_i) / Γ(libSize + Σ_i α_i)] · Π_{i=1}^{n} [Γ(y_i + α_i) / Γ(α_i)]

The mean value for the i-th taxon and j-th sample of the count matrix is given by libSize_j · α_ij / Σ_i α_ij.
To estimate this distribution on real count data the MGLM package (Zhang et al., 2017;Zhang and Zhou, 2022) is used. In benchdamic it is implemented in the fitDM() function.

Comparing estimated and observed values
The goodness of fit for the previously described distributions is assessed by comparing estimated and observed values. For each taxon, the following measures are compared:
• the Mean Difference (MD), i.e., the difference between the estimated mean and the observed mean abundance (log scale);
• the Zero Probability Difference (ZPD), i.e., the difference between the estimated probability of observing a zero and the observed proportion of samples with zero counts.
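All the previous models can be fitted at once with the fitModels() wrapper, which collects the MD and ZPD values for each distribution; a sketch based on the package vignette (the scale_HURDLE argument controls the two HURDLE scalings discussed below):

```r
library(benchdamic)
data("ps_stool_16S", package = "benchdamic")
# Fit NB, ZINB, DM, ZIG, and two HURDLE variants on the stool dataset
GOF_stool_16S <- fitModels(
    object = ps_stool_16S,
    models = c("NB", "ZINB", "DM", "ZIG", "HURDLE"),
    scale_HURDLE = c("median", "default"),
    verbose = FALSE)
```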
To easily compare estimated and observed mean values, the natural logarithm transformation with continuity correction (log(counts + 1)) is well suited; indeed, it reduces the count range, making the differences more stable.
Except for the fitHURDLE() function, which performs a CPM transformation on the counts (or a transformation using the median library size), and the fitZIG() function, which models log2(counts + 1), the other methods, fitNB(), fitZINB(), and fitDM(), model the counts directly. For these reasons, fitHURDLE()'s output should not be compared directly to the observed log(counts + 1) mean values as for the other methods. Instead, the logarithm of the observed CPM (or of the median-library-size-scaled counts) should be used.

Mean Differences
To plot estimated and observed values, the plotMD() function can be used (Figure 2). No systematic trends are expected; moreover, the closer the values are to the dotted line (representing equality between observed and estimated values), the better the goodness of fit of the model.
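A sketch of the call producing Figure 2, assuming GOF_stool_16S is the output of fitModels() on the stool dataset:

```r
# Mean Differences (MD), one panel per fitted model
plotMD(data = GOF_stool_16S, difference = "MD", split = TRUE)
```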
If some warning messages are shown with this graph, they are likely due to sparse taxa. To address this, the number of NA values generated by each model can be investigated (there are 24 for each HURDLE model):

plyr::ldply(GOF_stool_16S, function(model)
    c("Number of NAs" = sum(is.na(model))),
    .id = "Distribution")

For the HURDLE_default model, the NA values arise because of the excessive default scaling proposed (1 million). It is also possible to plot only a subset of the estimated models (Figure 3). From Figure 3, the DM distribution slightly overestimates the logarithm of the average counts for low values, while the HURDLE_median distribution presents an overestimation that increases as the observed values increase. ZIG, but especially the NB and ZINB distributions, produce very similar estimated and observed values. Similarly, to plot the differences for the zero probability/proportion, the plotMD() function is used (Figure 4).
plotMD(
    data = GOF_stool_16S[1:5],
    difference = "ZPD",
    split = TRUE)

From Figure 4, the ZIG and NB models underestimate the probability of observing a zero for sparse features, while the HURDLE_median model presents a perfect fit, as the estimated probability of observing a zero is, by construction, the zero rate itself. The DM and ZINB models produce estimated values very similar to the observed ones. MDs and ZPDs are also available in Figure 5 with a different output layout.

RMSE
As already mentioned, to summarize the goodness of fit, the Root Mean Squared Error (RMSE) metric is used. The summary statistics for the overall performance are visible in Figure 6.
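A sketch of the calls producing the RMSE summaries, assuming GOF_stool_16S is the output of fitModels() (see ?plotRMSE for the exact signature of your installed version):

```r
# Summarize goodness of fit as RMSE, for mean values and zero probabilities
plotRMSE(GOF_stool_16S, difference = "MD")
plotRMSE(GOF_stool_16S, difference = "ZPD")
```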

Discussion about GOF
The Goodness of Fit chapter focused on some existing parametric models: NB, ZINB, HURDLE, ZIG, and DM. The assumption of this analysis is that if a model fits the data well, then a method based on that model may be a good choice for studying differential abundance. Other distributions could also be investigated (Poisson, zero-inflated Poisson, ...), but what about DA methods based on non-parametric models, such as ANCOM? The GOF framework cannot be used to compare parametric models to non-parametric ones. However, non-parametric methods may work well in real scenarios thanks to their added robustness, and other evaluations are necessary in order not to favor one group of methods over another.

DA methods
Differential abundance analysis is the core of benchdamic. DA analysis steps can be performed both directly, using the DA_<name_of_the_method>() methods, or indirectly, using the set_<name_of_the_method>() functions.
set_<name_of_the_method>() functions allow users to create lists of instructions for DA methods, which can be used by the runDA(), runMocks(), and runSplits() functions (more details in each chapter).
This framework grants higher flexibility, allowing users to set up the instructions for many DA methods once, at the beginning of the analysis. If some modifications are needed, users can re-set the methods or modify the list of instructions directly.
A list of the available methods is presented below (Table 1). They are native to different application fields, such as RNA-Seq, single-cell RNA-Seq, or microbiome data analysis. Some basic information is reported for each DA method; for more details, please refer to each function's manual. Please remember that data pre-processing, including QC analysis, filtering steps, and normalization, is not a topic treated in benchdamic. In real-life situations those steps precede the DA analysis and are of extreme importance to obtain reliable results.
Some exceptions are present for the normalization step. In benchdamic, the norm_edgeR(), norm_DESeq2(), norm_CSS(), and norm_TSS() functions are implemented to add the normalization/scaling factors, needed by DA methods, to the phyloseq or TreeSummarizedExperiment objects. As for DA methods, a list of normalization instructions, including the previous functions, can be set using the set_<normalization_name>() functions or setNormalizations(). To run the normalization instructions, the runNormalizations() function can be used (more examples will follow).
Many DA methods already contain options to normalize or transform counts. If more complex normalizations/transformations are needed, all the DA methods support the use of TreeSummarizedExperiment objects. In practice, users can put the modified count matrix in a named assay (the counts assay is the default one which contains the raw counts) and run the DA method on that assay using the parameter assay_name = "assay_to_use".
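A sketch of this workflow on toy data (the clr assay name and the transformation are illustrative; any custom matrix can be stored the same way):

```r
library(TreeSummarizedExperiment)

# Toy counts: 5 taxa x 4 samples
counts <- matrix(rpois(20, lambda = 10), nrow = 5,
                 dimnames = list(paste0("taxon", 1:5), paste0("sample", 1:4)))
# Hypothetical transformed matrix (here, a CLR-like transformation)
clr_counts <- apply(counts + 1, 2, function(x) log(x) - mean(log(x)))

# Store raw counts in the default "counts" assay and the transformed
# matrix in a named assay
tse <- TreeSummarizedExperiment(
    assays = list(counts = counts, clr = clr_counts),
    colData = S4Vectors::DataFrame(group = c("grp1", "grp1", "grp2", "grp2")))

# A DA method can then be pointed at the transformed assay, e.g.:
# DA_<name_of_the_method>(tse, assay_name = "clr", ...)
```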

Add a custom DA method
To add a custom method to the benchmark, it must:
• include a verbose = TRUE (or FALSE) parameter to let the user know what the method is doing;
• return a pValMat matrix which contains the raw p-values and the adjusted p-values in the rawP and adjP columns respectively;
• return a statInfo matrix which contains the summary statistics for each feature, such as the logFC, standard errors, test statistics, and so on;
• return a name which contains the complete name of the used method.
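The requirements above can be sketched as a minimal custom method; DA_myWilcox and its internals are illustrative, not part of benchdamic:

```r
DA_myWilcox <- function(object, group, verbose = TRUE) {
    # object: a taxa x samples count matrix; group: a two-level factor
    if (verbose)
        message("Performing myWilcox DA analysis")
    counts <- as.matrix(object)
    # Raw p-values from a per-taxon Wilcoxon test
    rawP <- apply(counts, 1, function(x)
        stats::wilcox.test(x ~ group, exact = FALSE)$p.value)
    # pValMat: raw and BH-adjusted p-values, as required by benchdamic
    pValMat <- data.frame(rawP = rawP,
                          adjP = stats::p.adjust(rawP, method = "BH"))
    # statInfo: any per-feature summary statistic, here a naive logFC
    g2 <- group == levels(factor(group))[2]
    statInfo <- data.frame(
        logFC = log2(rowMeans(counts[, g2, drop = FALSE]) + 1) -
                log2(rowMeans(counts[, !g2, drop = FALSE]) + 1))
    rownames(pValMat) <- rownames(statInfo) <- rownames(counts)
    list(pValMat = pValMat, statInfo = statInfo, name = "myWilcox")
}
```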

Type I Error Control
Assumption: Many DA methods do not control the number of false discoveries.
Research question: Which DA methods can control the number of false positives in your data?

TIEC structure
The Type I Error is the probability for a statistical test to call a feature DA when it is not, under the null hypothesis. To evaluate the Type I Error rate Control (TIEC) for each differential abundance detection method:
1. using the createMocks() function, homogeneous samples (e.g., only the samples from one experimental group) are randomly assigned to a group ("grp1" or "grp2");
2. DA methods are run to find differences between the two mock groups using runMocks();
3. the number of DA features for each method is counted; these are False Positives (FP) by construction;
4. steps 1-3 are repeated many times (N = 10 here, but at least 1000 is suggested) and the results are averaged using the createTIEC() function.
In this setting, the p-values of a perfect test should be uniformly distributed between 0 and 1 and the false positive rate (FPR or observed α), which is the observed proportion of significant tests, should match the nominal value (e.g., α = 0.05).
The relationships between the functions used in this section are explained by the diagram in Figure 7.

Create mock comparisons
Using the createMocks() function, samples are randomly grouped, N = 10 times. A higher N (at least 1000) is suggested, but in that case a longer running time is required.
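A sketch of the corresponding call (the set.seed() line is an assumption added for reproducibility):

```r
library(benchdamic)
data("ps_stool_16S", package = "benchdamic")
set.seed(123) # for reproducible mock assignments
# Randomly assign the 32 homogeneous samples to grp1/grp2, N times
my_mocks <- createMocks(
    nsamples = phyloseq::nsamples(ps_stool_16S),
    N = 10) # At least 1000 mock comparisons are suggested
```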
For each row of the mock_df data frame, a set of DA methods is run. In this demonstrative example the following DA methods are used:
• basic t and wilcox tests;
• edgeR with TMM scaling factors (Robinson et al., 2010), with and without ZINB weights (Risso et al., 2018);
• limma-voom with TMM scaling factors (Ritchie et al., 2015; Law et al., 2014; Phipson et al., 2016), with and without ZINB weights (Risso et al., 2018; Van den Berge et al., 2018);
• ALDEx2 with all and iqlr data transformations (denom parameter), performing the wilcox test (Fernandes et al., 2014);
• metagenomeSeq with CSS normalization factors, using both the fitFeatureModel (for a zero-inflated log-normal mixture model, as suggested in the package vignette) and the fitZig (for a zero-inflated Gaussian mixture model) algorithms (Paulson et al., 2013);
• corncob with a focus on average differences (not dispersion, regulated by the phi.formula and phi.formula_null parameters), using both Wald and LRT tests (Martin et al., 2020);
• MAST with both rescalings, default (i.e., 10^6, for CPMs) and median (Finak et al., 2015);
• Seurat with LogNormalize and CLR normalizations/transformations, t and wilcox tests, and 10^5 as scaling factor (Butler et al., 2018);
• ANCOM based on the ANCOM-II algorithm with sampling fraction bias correction (BC parameter) (Lin and Peddada, 2020; Kaul et al., 2017);
• dearseq with permutation and asymptotic tests (Gauthier et al., 2020).
Among the available methods, NOISeq (Tarazona et al., 2015) has not been used since it does not return raw p-values but only adjusted ones. Many other combinations of parameters are still possible for all the methods.
The structure of the output in this example is the following:
• Comparison1 to Comparison10 on the first level, each of which contains:
  - Method1 to Method23 output lists on the second level, each with:
    * pValMat, the matrix of raw p-values and adjusted p-values in the rawP and adjP columns respectively;
    * statInfo, the matrix of summary statistics for each feature, such as the logFC, standard errors, test statistics, and so on;
    * dispEsts, the dispersion estimates for methods like edgeR and DESeq2;
    * name, the complete name of the used method.
The list of methods can be run in parallel leveraging the BiocParallel package. In detail, parallelization is supported in Linux/macOS through the MulticoreParam() function (using the FORK implementation), as long as ANCOM functions are not included in the list of methods, due to the different parallelization management of those functions. ANCOM-based methods are usually the most time consuming. Parallel computing is still possible as long as it is directly managed by those methods (n_cl parameter). In the following example, each mock dataset is analyzed in serial mode, but ANCOM is run on more than one core.

# Modify the n_cl parameter
my_ANCOM_parallel <- set_ANCOM(
    pseudo_count = FALSE,
    fix_formula = "group",
    contrast = c("group", "grp2", "grp1"),
    BC = TRUE,
    n_cl = 2, # Set this number according to your machine
    expand = TRUE)

bpparam <- BiocParallel::SerialParam()

Stool_16S_mockDA_ANCOM <- runMocks(
    mocks = my_mocks,
    method_list = my_ANCOM_parallel, # Only ANCOM
    object = ps_stool_16S,
    weights = zinbweights,
    verbose = FALSE,
    BPPARAM = bpparam)

Add a new DA method later in the analysis
It may happen that, at a later time, the user wants to add the results of another group of methods to the results already obtained. For example, a new version of limma can be set, which returns a new set of limma instructions and a warning for using CSS normalization factors instead of those native to edgeR.
First of all, the same mocks and the same object must be used to obtain the new results. To run the new instructions, the runMocks() function is used. To put everything together, a mapply() call can be used to exploit the output structures:

Counting the False Positives
The createTIEC() function counts the FPs and evaluates the p-value distributions. A list of 5 data.frames is produced: 1. df_pval, a data frame with 5 columns and number_of_features × methods × comparisons rows.

False Positive Rate
The false positive rate (FPR, or observed α), which is the observed proportion of significant tests, should match the nominal value because all the findings are false positives by construction. In this example, edgeR.TMM, edgeR.TMM.weighted, limma.TMM.weighted, and metagenomeSeq.CSS.fitZig appear to be well above all the thresholds (liberal behavior); differently, the ALDEx2.all.wilcox.unpaired and basic_t methods are below (conservative behavior) or in line with the thresholds (Figure 8).
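A sketch of the call producing Figure 8, assuming TIEC_summary is the output of createTIEC() and cols a named vector of method colors:

```r
# Observed FPR for each method at the nominal thresholds
plotFPR(df_FPR = TIEC_summary$df_FPR, cols = cols)
```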

False Discovery Rate
The false discovery rate, FDR = E[FP / (FP + TP)], is the expected value of the ratio between the false positives and all the positives. By construction, mock comparisons should not contain any TPs, and when all the hypotheses are null, FDR and FWER (Family Wise Error Rate) coincide. For each pair of method and comparison, the FDR is set equal to 1 (if at least 1 DA feature is found) or 0 (if no DA features are found). Hence, the estimated FDR is computed by averaging these values across all the mock comparisons. As the number of mock comparisons increases, the estimated FDR becomes more precise. Just as α is set as a threshold for the p-value to control the FPR, a threshold can be set for the adjusted p-value, which is the FDR analog of the p-value. FDR values should match the nominal values represented by the red dashed lines. In this example, the number of mock comparisons is set to 10, so the estimates are imprecise. That said, the ALDEx2-based methods, basic_t, and Seurat.CLR.SF1e+05.t are able to control the FDR for all the considered thresholds (Figure 9).

QQ-Plot
The p-value distribution under the null hypothesis should be uniform. This is qualitatively summarized in the QQ-plot in Figure 10, where the bisector represents a perfect correspondence between observed and theoretical quantiles of the p-values. For each theoretical quantile, the corresponding observed quantile is obtained by averaging the observed p-value quantiles from all 10 mock datasets. The plotting area is zoomed in to clearly show the area between 0 and 0.1.
Methods over the bisector show a conservative behavior, while methods below the bisector show a liberal one.
The starting point is determined by the total number of features. In our example the starting point for the theoretical p-values is computed as 1 divided by the number of taxa, rounded to the second digit. In real experiments, where the number of taxa is higher, the starting point is closer to zero.
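A sketch of the call producing Figure 10, assuming TIEC_summary is the output of createTIEC() and cols a named vector of method colors:

```r
# QQ-plot of observed vs theoretical p-value quantiles,
# zoomed in on the [0, 0.1] interval
plotQQ(df_QQ = TIEC_summary$df_QQ, zoom = c(0, 0.1), cols = cols)
```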

Kolmogorov-Smirnov test
Departure from uniformity is quantitatively evaluated through the Kolmogorov-Smirnov (KS) test, which is reported for each method across all mock datasets using the plotKS() function (Figure 12). High KS values indicate departure from uniformity, while low values indicate closeness to it. All the clues obtained in the previous figures 10 and 11 are confirmed by the KS statistics: metagenomeSeq.CSS.fitZig, which was very liberal, has the p-value distribution farthest from uniformity among the tested methods. ALDEx2-based methods also show high KS values; indeed, they showed a very conservative behavior.
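A sketch of the call producing Figure 12, assuming TIEC_summary is the output of createTIEC() and cols a named vector of method colors:

```r
# KS statistics for the uniformity of each method's p-value distribution
plotKS(df_KS = TIEC_summary$df_KS, cols = cols)
```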

Log distribution of p-values
Looking at the p-values on a log scale can also be informative. This is because the behavior in the tail may be poor even when the overall p-value distribution is uniform, with a few unusually small p-values in an otherwise uniform distribution. Figure 13 displays the distributions of all the p-values (on a negative log scale) generated by each DA method across all the mock comparisons.
plotLogP(df_pval = TIEC_summary$df_pval, cols = cols)

Similarly, Figure 14 exploits the structure of the df_QQ data.frame generated by the createTIEC() function to display the distribution of the p-values (on a negative log scale) generated by each DA method, averaged among mock comparisons (only ten in this example). As this second graphical representation is based on only 1 averaged p-value for each quantile, it is also less influenced by anomalously large values.
plotLogP(df_QQ = TIEC_summary$df_QQ, cols = cols)

In Figures 13 and 14, the IDEAL −log10(p-value) distribution is reported in red as the first method. To highlight tail behaviors, 3 percentiles (0.9, 0.95, 0.99) are reported using red-shaded vertical segments for each method. If a method's distribution of negative log-transformed p-values (or average p-values) is still uniform in the 3 selected quantiles of the tail, the 3 red vertical segments align with the respective dotted lines. Methods are ordered using the distances between the observed quantiles and the ideal ones. Usually, a method whose red segments lie to the left of the IDEAL's ones is conservative (e.g., ALDEx2.iqlr.wilcox.unpaired and MAST.default); indeed, for those methods, small p-values are fewer than expected. On the contrary, methods with red segments to the right of the IDEAL's ones are liberal (e.g., edgeR.TMM). Mixed results are also possible: a method may have a lower quantile for one threshold and higher quantiles for the others (e.g., limma.TMM).

Discussion about TIEC
Putting all the previous graphical representations together gives a general overview of the methods' ability to control FPs and of the p-value distribution under the null hypothesis (i.e., no differential abundance). It is clear that only methods that produce p-values can be included in this analysis. While Figures 10 and 11 have a mainly exploratory scope regarding the p-value distribution, based on quantile-quantile comparisons, Figures 8, 12, and 14 are able to rank methods according to the False Positive Rate, the uniformity of the p-value distribution, and the departure from uniformity in the tail. The latter graphical representations can be used as a first tool to establish which DA methods to consider for further analyses and which DA methods to exclude. Finally, Figure 9 can be used to assess FDR control under the null scenario. This exposes the problem of a few extremely small p-values among a collection that looks roughly uniform: if that is the case, the Type I error would be under control but the FDR would be inflated.

Concordance
Assumption: Applying different methods to the same data may produce different results.
Research questions: How much do the methods agree with each other? How much does a method agree with itself?

Concordance structure
To measure the ability of each method to produce replicable results from a dataset with two or more groups:
1. samples are divided to obtain the Subset1 and Subset2 datasets using the createSplits() function;
2. DA methods are run on both subsets using the runSplits() function;
3. the Concordance At the Top (CAT) metric between the lists of p-values is computed to obtain the Between Methods Concordance (BMC) and the Within Method Concordance (WMC);
4. steps 1-3 are repeated many times (N = 10 here, but at least 100 are suggested) and the results are averaged using the createConcordance() function.
The relationships between the functions used in this section are explained by the diagram in Figure 15.

Split datasets
Using the createSplits() function, the ps_plaque_16S dataset is randomly divided in half. In this dataset samples are paired: 1 sample of supragingival plaque and 1 sample of subgingival plaque are considered for each subject. The paired parameter, which contains the name of the variable describing the subject IDs, is passed to the method, so that paired samples end up in the same split. In this specific case, the two groups of samples are balanced between conditions, reflecting the starting dataset. However, if the starting dataset had been unbalanced, the balanced option would have allowed users to choose whether to keep the two splits unbalanced or not.
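A sketch of the corresponding call, based on the package vignette; HMP_BODY_SUBSITE and RSID are assumed to be the condition and subject-ID variables of the HMP16SData metadata (the set.seed() line is an addition for reproducibility):

```r
library(benchdamic)
data("ps_plaque_16S", package = "benchdamic")
set.seed(123)
my_splits <- createSplits(
    object = ps_plaque_16S,
    varName = "HMP_BODY_SUBSITE", # variable with the two conditions
    paired = "RSID",              # subject IDs: paired samples stay together
    balanced = TRUE,
    N = 10) # At least 100 splits are suggested
```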

Set up normalizations and DA methods
For some of the methods implemented in this package it is possible to perform differential abundance testing for repeated measures experimental designs (e.g., by adding the subject ID to the model formula of DESeq2).

Add a new DA method later in the analysis
Again, it may happen that, at a later time, the user wants to add the results of another group of methods to the results already obtained. First of all, the same splits and the same object must be used to obtain the new results:

my_basic <- set_basic(
    pseudo_count = FALSE,
    contrast = c("HMP_BODY_SUBSITE", "Supragingival Plaque", "Subgingival Plaque"),
    test = "wilcox",
    paired = TRUE)

Plaque_16S_splitsDA_basic <- runSplits(
    split_list = my_splits,
    method_list = my_basic,
    normalization_list = NULL,
    object = ps_plaque_16S,
    min_counts = 0,
    min_samples = 2,
    verbose = FALSE)

To put everything together, two nested mapply() calls can be used to exploit the output structures:

Comparing the concordances
For each pair of methods the concordance is computed by the createConcordance() function. It produces a long-format data frame with several columns:
• comparison, which indicates the comparison number;
• n_features, which indicates the total number of taxa in the comparison dataset;
• name of method1;
• name of method2;
• rank;
• concordance, defined as the cardinality of the intersection of the top rank elements of each list, divided by rank, i.e., |L_{1:rank} ∩ M_{1:rank}| / rank, where L and M represent the lists of p-values of method1 and method2 respectively.
A noise value (< 10^(−10)) is added to each p-value (or statistic) in order to avoid duplicated values, which could not be ordered.
concordance <- createConcordance(
    object = Plaque_16S_splitsDA_all,
    slot = "pValMat",
    colName = "rawP",
    type = "pvalue")

The createConcordance() method is very flexible. In the example below, the concordances are built using the log fold changes or other statistics instead of the p-values. To do so, it is necessary to know the column names generated by each differential abundance method in the statInfo matrix.

Visualization
Starting from the table of concordances, the plotConcordance() function can produce 2 graphical results, visible in Figure 16:
• the dendrogram of methods, clustered by the area over the concordance bisector, in the concordanceDendrogram slot;
• the heatmap of the between- and within-method concordances, in the concordanceHeatmap slot. For each tile of the symmetric heatmap, which corresponds to a pair of methods, the concordance from rank 1 to a threshold rank is drawn.
The area between the curve and the bisector is colored to highlight concordant methods (blue) and nonconcordant ones (red). The two graphical results should be drawn together for the best experience.
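A sketch of the corresponding calls, based on the package vignette (the plot_grid() arrangement is one possible way to draw the two slots together):

```r
# Concordance heatmap and dendrogram up to a chosen rank threshold
pC <- plotConcordance(concordance = concordance, threshold = 30)
# Draw the two slots side by side, e.g. with cowplot
cowplot::plot_grid(plotlist = pC, ncol = 2)
```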
It is common for WMC values (in red rectangles) to be lower than BMC ones. Indeed, BMC is computed between different methods on the same data, while WMC is computed for a single method run on different datasets (some taxa are dataset-specific). One method may show particularly low WMC values, maybe because it has not been implemented properly for repeated measures designs. Other methods have comparable WMC values.

Discussion about Concordance
Random splits allow evaluating the concordance between methods and within each method. These analyses do not assess the correctness of the discoveries: even the method with the highest WMC could nonetheless consistently identify false positive DA taxa. For this reason, the concordance analysis framework should be used as a tool to detect groups of similar methods.

Enrichment analysis
Assumption: Previous analyses did not assess the correctness of the discoveries.

Research question: If some prior knowledge about the experiment is available, would the findings be coherent with that knowledge?

Enrichment structure
While the lack of ground truth makes it challenging to assess the validity of DA results in real data, enrichment analysis can provide an alternative solution for ranking methods in terms of their ability to identify as significant those taxa that are known to be differentially abundant between two groups. To run the methods, the runDA() function is used. Leveraging the prior knowledge (if present), the correctness of the findings is checked using the createEnrichment() and createPositives() functions. Many graphical outputs are available through the plotContingency(), plotEnrichment(), plotMutualFindings(), and plotPositives() functions.
The relationships between the functions used in this section are explained by the diagram in Figure 17.

A priori knowledge
Here, we leveraged the peculiar environment of the gingival site (Thurnheer et al., 2016):
• the supragingival biofilm is directly exposed to the open atmosphere of the oral cavity, favoring the growth of aerobic species;
• in the subgingival biofilm, the atmospheric conditions gradually become strictly anaerobic, favoring the growth of anaerobic species.
From the comparison of the two sites, an abundance of aerobic microbes in the supragingival plaque and of anaerobic bacteria in the subgingival plaque is expected. DA analysis should reflect this difference by finding an enrichment of aerobic (anaerobic) bacteria among the DA taxa with a positive (negative) log-fold-change.

Set up normalizations and DA methods
Both the normalization/scaling factors and the DA methods' instructions are already available, since the dataset is the same as the one used in the previous section.
In the concordance analysis, the normalization factors were added inside the runSplits() function, so the original ps_plaque_16S object does not contain them. The normalization/scaling factors are therefore added to the object:
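The chunk that adds them may look like the following sketch, assuming benchdamic's setNormalizations()/runNormalizations() interface (the fun/method names below are an assumption based on the normalizations used earlier in the vignette; adapt as needed):

```r
library(benchdamic)

# Instructions for the normalization/scaling factors (hypothetical selection:
# edgeR TMM, DESeq2 poscounts, and metagenomeSeq CSS).
my_normalizations <- setNormalizations(
    fun = c("norm_edgeR", "norm_DESeq2", "norm_CSS"),
    method = c("TMM", "poscounts", "CSS"))

# Add the normalization/scaling factors to the phyloseq object.
ps_plaque_16S <- runNormalizations(
    normalization_list = my_normalizations,
    object = ps_plaque_16S)
```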

Printing the updated ps_plaque_16S object shows, among the other slots:

Phylogenetic Tree: [ 58 tips and 57 internal nodes ]
Differently from the Type I Error Control and Concordance analyses, the enrichment analysis relies on a single phyloseq or TreeSummarizedExperiment object (no mocks, no splits, no comparisons). For this reason, many methods can be assessed without computational trade-offs (e.g., ANCOM without sampling fraction bias correction, and methods which use ZINB weights).

Testing the enrichment
The Plaque_16S_DA object contains the results for 10 methods. To extract p-values, the optional direction of DA (DA vs non-DA, or UP Abundant vs DOWN Abundant), and to add the a priori information, the createEnrichment() function can be used.
The direction argument, which is set to NULL by default, can be supplied with the column name of each method's statInfo matrix that contains the direction (e.g., logfc, logFC, logFoldChange...).
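Concretely, a call with directions may look like the sketch below (the slot/column names mirror those used later in this section; the direction vector is hypothetical and must list, for each of the 10 methods, the statInfo column holding its effect size; a single p-value threshold is used for brevity):

```r
# Hypothetical example: all 10 methods store their effect size in a
# statInfo column named "logFC" (replace with the real column names).
enrichment <- createEnrichment(
    object = Plaque_16S_DA,
    priorKnowledge = priorInfo,
    enrichmentCol = "Type",
    namesCol = "newNames",
    slot = "pValMat",
    colName = "adjP",
    type = "pvalue",
    direction = rep("logFC", 10),
    threshold_pvalue = 0.1,
    threshold_logfc = 0,
    top = NULL,
    alternative = "greater",
    verbose = FALSE)
```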

Enrichment plot
To summarize enrichment analysis for all the methods simultaneously, the plotEnrichment() function can be used. Only Aerobic and Anaerobic levels are plotted in Figure 19.
plotEnrichment(enrichment = enrichment, enrichmentCol = "Type", levels_to_plot = c("Aerobic", "Anaerobic"))

Since Subgingival Plaque is the reference level for each method, the coefficients extracted from the methods refer to the Supragingival Plaque class. Six out of eight methods identify, as expected, a statistically significant (0.001 < p ≤ 0.05) amount of DOWN Abundant Anaerobic features in Supragingival Plaque (Figure 19). Moreover, all of them find an enriched amount of UP Abundant Aerobic genera in Supragingival Plaque. Unexpectedly, both DESeq2.poscounts and DESeq2.poscounts.weighted find many Anaerobic genera as UP Abundant; these could be false positives.

Mutual Findings
To investigate the DA features, the plotMutualFindings() function can be used (Figure 20). While the levels_to_plot argument allows choosing which levels of the enrichment variable to plot, the n_methods argument allows extracting only the features mutually found as DA by more than 1 method.

plotMutualFindings(enrichment, enrichmentCol = "Type", levels_to_plot = c("Aerobic", "Anaerobic"), n_methods = 1)

Figure 20: Mutual Findings plot. Number of differentially abundant features mutually found by 1 or more methods, colored by the differential abundance direction and separated by aerobic and anaerobic metabolism.
In this example (Figure 20), many Anaerobic genera and 6 Aerobic genera are found as DA by more than 1 method simultaneously. Among them, all methods find the Prevotella, Treponema, Fusobacterium, and Dialister genera DOWN Abundant in Supragingival Plaque, while the Actinomyces genus is found UP Abundant, even though it has an anaerobic metabolism. Similarly, all methods find the Corynebacterium, Lautropia, and Neisseria aerobic genera UP Abundant in Supragingival Plaque.

plotPositives(positives)
The MAST.median and ALDEx2.all.wilcox.paired methods are very conservative and are located in the lower part of Figure 21. The best performances belong to limma.TMM.weighted and edgeR.TMM.weighted, meaning that their findings are in line with the a priori knowledge supplied by the user.

Enrichment without direction
When the user has a custom method that does not return the direction of the differential abundance (e.g., NOISeq), or when the direction of DA is not of interest, the sole information about DA and non-DA features can be used. The createEnrichment() function is used without the direction parameter for all methods:

enrichment_nodir <- createEnrichment(
    object = Plaque_16S_DA,
    priorKnowledge = priorInfo,
    enrichmentCol = "Type",
    namesCol = "newNames",
    slot = "pValMat",
    colName = "adjP",
    type = "pvalue",
    threshold_pvalue = c(
        0.1, 0.1, 0.1, 0.1, 0.1, 0.1, # thresholds for the first methods
        0.4, # ANCOM threshold on 1 - W/(ntaxa - 1); 0.4 = liberal
        0.1, 0.1, 0.1), # thresholds for the remaining methods
    threshold_logfc = 0,
    top = NULL,
    alternative = "greater",
    verbose = FALSE)

To summarize the enrichment analysis for all the methods simultaneously, the plotEnrichment() function is used. All levels are plotted in Figure 22. The highest amount of DA features belongs to the Anaerobic metabolism, followed by F_Anaerobic and Aerobic. The method that finds the most DA features is DESeq2.poscounts, while ANCOM is the most conservative (even though the threshold based on 1 - W/(ntaxa - 1) is set at the liberal value of 0.4). As for the enrichment analysis with DA direction, the plotMutualFindings() function can be used here too (Figure 23). While the levels_to_plot argument allows choosing which levels of the enrichment variable to plot, the n_methods argument allows extracting only the features mutually found as DA by more than 1 method.

plotMutualFindings(enrichment_nodir, enrichmentCol = "Type", levels_to_plot = c("Aerobic", "Anaerobic"), n_methods = 1)

In this example (Figure 23), only a Prevotella OTU is found as DA in Supragingival Plaque by all methods (probably because the ANCOM threshold is not interpretable in the same way as the p-value thresholds of the other methods).
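A note on the ANCOM threshold: ANCOM returns a W statistic rather than a p-value, so the comparison is made on the 1 − W/(ntaxa − 1) scale, where the 0.4 value above acts as a liberal cutoff. A small base-R illustration with made-up W statistics:

```r
# Hypothetical ANCOM output for 3 taxa in a dataset with 60 taxa.
ntaxa <- 60
W <- c(55, 40, 10) # W close to ntaxa - 1 means strong evidence of DA

# Map W onto a p-value-like scale: small values = strong evidence.
score <- 1 - W / (ntaxa - 1)

# With the liberal 0.4 threshold, the first two taxa are called DA.
score <= 0.4
```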

Enrichment analysis for simulated data
To enlarge the scope of the enrichment analysis, simulations can be used: e.g., the user's dataset can serve as a template to generate simulated data in which the DA features are known, so that this information can be provided as prior knowledge.
As an example, the SPsimSeq package is used (the choice of tool is up to the user) to simulate a single dataset (n.sim = 1) from the ps_plaque_16S dataset, where two body subsites are available (without considering the paired design). The data are simulated with the following properties: • 100 features (n.genes = 100); • 50 samples (tot.samples = 50); • the samples are equally divided into 2 groups, each with 25 samples (group.config = c(0.5, 0.5)); • all samples are from a single batch (batch.config = 1); • 20% DA features (pDE = 0.2); • the DA features have a log-fold-change of at least 0.5.
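Under these settings, the simulation call may look like the following sketch (assuming the SPsimSeq() interface; the lfc.thrld argument name, the result.format value, and the group variable name HMP_BODY_SUBSITE are assumptions based on the SPsimSeq and HMP16SData documentation and should be double-checked):

```r
library(SPsimSeq)

# Template counts and group labels from the plaque dataset (variable name
# HMP_BODY_SUBSITE is an assumption for HMP16SData-derived objects).
counts <- as(phyloseq::otu_table(ps_plaque_16S), "matrix")
group <- phyloseq::sample_data(ps_plaque_16S)[["HMP_BODY_SUBSITE"]]

sim <- SPsimSeq(
    n.sim = 1,                  # a single simulated dataset
    s.data = counts,            # template data
    group = group,              # group labels of the template samples
    n.genes = 100,              # number of simulated features
    tot.samples = 50,           # total number of simulated samples
    group.config = c(0.5, 0.5), # two balanced groups (25 + 25 samples)
    batch.config = 1,           # a single batch
    pDE = 0.2,                  # 20% DA features
    lfc.thrld = 0.5,            # minimum |log-fold-change| of DA features
    result.format = "list")
```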
To further assess the methods' power, the createPositives() function can be used, specifying as TPs the features simulated as truly DA and as FPs the features simulated as non-DA (Figure 25). We use a threshold_pvalue = 0.1 (0.4 for ANCOM) to call a feature DA based on its adjusted p-value, and we compute the difference between TPs and FPs for several top thresholds (from 5 to 30, by 5) in order to observe a trend:

positives_nodir <- createPositives(
    object = sim_DA,
    priorKnowledge = priorInfo,
    enrichmentCol = "Reality",
    namesCol = NULL,
    slot = "pValMat",
    colName = "adjP",
    type = "pvalue",
    threshold_pvalue = c(
        rep(0.1, 19), # adjP thresholds
        0.4, # adjP threshold for ANCOM on 1 - W/(ntaxa - 1)
        rep(0.1, 5)), # adjP thresholds for the other methods
    threshold_logfc = 0,
    top = seq(5, 30, by = 5),
    alternative = "greater",
    verbose = FALSE,
    TP = list(c("DA", "is DA")),
    FP = list(c("DA", "is not DA")))

Since the number of simulated DA features is 20, the maximum number of TPs is 20, and it is added as a horizontal line to the figure:

plotPositives(positives = positives_nodir) +
    facet_wrap(~method) +
    theme(legend.position = "none") +
    geom_hline(aes(yintercept = 20), linetype = "dotted", color = "red") +
    geom_hline(aes(yintercept = 0), color = "black") +
    ylim(NA, 21)

Figure 25 clearly shows that dearseq.counts.permutation.1000 and dearseq.counts.asymptotic reach the highest values of the difference, followed by corncob.counts.Wald and DESeq2.counts.poscounts. As already mentioned, the desired level of power which a method should be able to reach is represented by the red dotted line, i.e., the total number of simulated DA features (20 in our case). These methods, in this specific example, have the highest power. Conversely, methods characterized by flat lines have a fixed number of features with an adjusted p-value lower than the threshold.
If their lines are above the zero line, the number of TPs is greater than the number of FPs. On the contrary, if their lines are below the zero line, the number of FPs is greater (e.g., edgeR.counts.TMM.weighted has negative values, maybe due to poor weight estimates, since all methods with observational weights perform poorly).
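The TP − FP difference is straightforward to compute by hand. The toy base-R sketch below (made-up truth labels, not the createPositives() implementation) mirrors the logic for the top thresholds used above; taxa are assumed already ranked by adjusted p-value, and the p-value threshold that the real function also applies is omitted for simplicity:

```r
# Toy sketch: 30 taxa ranked by adjusted p-value, 10 of which are truly DA
# (the truth labels are shuffled at random over the ranks).
set.seed(42)
reality <- sample(rep(c("is DA", "is not DA"), times = c(10, 20)))

tops <- seq(5, 30, by = 5)
diff_tp_fp <- sapply(tops, function(top) {
  called <- seq_len(top)                     # top-ranked taxa called DA
  tp <- sum(reality[called] == "is DA")      # correctly recovered DA taxa
  fp <- sum(reality[called] == "is not DA")  # wrongly called taxa
  tp - fp
})
names(diff_tp_fp) <- tops
diff_tp_fp # at top = 30 the difference is always 10 - 20 = -10
```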

Discussion about Enrichment
The enrichment analysis toolbox provides many methods to study DA in a dataset.
Firstly, when some prior knowledge is available, it allows the evaluation of methods' power. Among the possible uses, it is especially helpful for investigating conservative methods: are they calling only the most obvious taxa (also found by the other methods), or are they finding something new? The main drawback is that the availability of prior knowledge is limited, especially for new datasets. For this reason, enrichment analysis can also be used together with simulation tools: through parametric, semi-parametric, or non-parametric assumptions it is possible to obtain an emulation of the prior knowledge.
Secondly, thanks to functions like plotMutualFindings() and plotEnrichment(), which produce graphical results like Figure 22 and Figure 23, the enrichment analysis can also be used to study the distribution of the findings across classes of taxa (e.g., by using the phylum of the features as prior knowledge, it would be possible to study whether a phylum is characterized by an increased number of DA taxa compared to another phylum), or, more simply, to draw biological conclusions based only on the taxa found as DA by the majority of the methods.