Comprehensive evaluation of methods for differential expression analysis of metatranscriptomics data

Abstract Understanding the function of the human microbiome is important but the development of statistical methods specifically for the microbial gene expression (i.e. metatranscriptomics) is in its infancy. Many currently employed differential expression analysis methods have been designed for different data types and have not been evaluated in metatranscriptomics settings. To address this gap, we undertook a comprehensive evaluation and benchmarking of 10 differential analysis methods for metatranscriptomics data. We used a combination of real and simulated data to evaluate performance (i.e. type I error, false discovery rate and sensitivity) of the following methods: log-normal (LN), logistic-beta (LB), MAST, DESeq2, metagenomeSeq, ANCOM-BC, LEfSe, ALDEx2, Kruskal–Wallis and two-part Kruskal–Wallis. The simulation was informed by supragingival biofilm microbiome data from 300 preschool-age children enrolled in a study of childhood dental disease (early childhood caries, ECC), whereas validations were sought in two additional datasets from the ECC study and an inflammatory bowel disease study. The LB test showed the highest sensitivity in both small and large samples and reasonably controlled type I error. Contrarily, MAST was hampered by inflated type I error. Upon application of the LN and LB tests in the ECC study, we found that genes C8PHV7 and C8PEV7, harbored by the lactate-producing Campylobacter gracilis, had the strongest association with childhood dental disease. This comprehensive model evaluation offers practical guidance for selection of appropriate methods for rigorous analyses of differential expression in metatranscriptomics. Selection of an optimal method increases the possibility of detecting true signals while minimizing the chance of claiming false ones.

Details of differential expression analysis methods and data generative models are provided in Supplementary Sections S1-S4, respectively. Supplementary Section S5 entails supplementary information about the parametric simulation set-ups, and Supplementary Section S6 includes the goodness of fit results for the ZINB model. Extensive simulation results are provided in Supplementary Section S7. Additional results of methods' application to the ZOE2.0 data and the IBD data are contained in Supplementary Sections S8 and S9, respectively.

S1 Differential expression analysis methods
In this section, we present each of the DE analysis methods in detail. We first introduce notational conventions. Throughout this material, Y i,g denotes the expression level for the gth gene in the ith cell, X i ≡ (1, X D i , X B i ) denotes the ith row of the design matrix, or a vector containing the intercept term, a binary disease status, and a binary batch indicator. Different models abusively use the same notation for parameters, as long as there is no confusion; e.g. regression coefficients β g are commonly used either in the LN model or the LB model, but they are shorthand for β LN g and β LB g , respectively.

S1.1 Log-normal test
The Log-normal (LN) test relies on the assumption that the log-transformed expression is normally distributed as in (1). A small positive constant (c) is added to the gene expression to ensure that the log-transformed values are within a feasible range. In this simulation study, 1 is uniformly added to expression levels (c = 1).
The test statistic for the gth gene is T LN and follows a χ 2 1 distribution under the null hypothesis asymptotically. The test rejects the null hypothesis if the test statistic is larger than χ 2 1 (1 − α), or the (1 − α)th quantile of the χ 2 distribution with one degree of freedom, where α is the significance level. Alternatively, the individual p-values are obtained as p g = 1 − F χ 2 1 (T LN g ), where F d (t) is the distribution function of d evaluated at t. The genes with p-values less than α are declared to have a statistically significant association with disease. This test is simply an analysis of covariance (ANCOVA) with an appropriately log-transformed dependent variable, and is easily implemented in most statistical software packages. The testing procedure, after obtaining a test statistic and the corresponding null distribution (e.g., p-values and rejection regions), is identical for all other tested methods, hence will be omitted unless needed.

S1.2 Logistic Beta test
The Logistic Beta model (LB) models relative expressions, R i,g ≡ Y i,g / G h=1 Y i,h , instead of absolute expressions, Y i,g . Because of the sum-to-one constraint of relative expressions, these tests are structurally dependent. However, in microbiome data analyses, the number of tested genes is usually large enough and thus the dependence induced by the compositional structure is negligible.
The LB model is formulated [1] as: where . Note that this model can be decomposed into two orthogonal models: where 1(·) is the indicator function, µ i,g is the mean of the Beta random variable and θ g is the dispersion parameter. Orthogonality means that the estimate of π i,g and those of µ i,g and θ g are independent. Consequently, the test statistic can be obtained from these two separately estimated models. The maximum likelihood estimators (MLE) are used for estimation and the R package gamlss [2] was used to carry out simulations in this study. The null and the alternative hypotheses for the gth gene are . Either a Wald-type or a likelihood test statistic can be used to test these hypotheses. Because they are asymptotically equivalent, here we only present a Wald test statistic: which follows a χ 2 2 distribution under the null hypothesis asymptotically. If only one of the two parts of LB is estimable, the test statistic is constructed based on the estimable component only and the reference distribution is χ 2 1 ; e.g., when only the logistic model is estimable, The same approach was followed for the other two-part tests, including MAST and the two-part Kruskal-Wallis test.

S1.3 MAST
The "Model-based Analysis of Single-cell Transcriptomics" (MAST) [3] was proposed specifically for differential expression analysis of scRNAseq data. This model, composed of a logistic regression model and a conditional log-normal model, regularizes parameter estimation and utilizes estimated cellular detection rates (CDR) as covariates as defined below. The model was designed to deal with zero-inflation which is driven by both technical and biological variabilities in scRNAseq data. Though zeros in microbiome sequencing data are believed to be generated mostly by biological reasons, the proportion of zeros is usually greater than that of conventional single-part parametric models such as Poisson, negative binomial, and log-normal. Thus, it is feasible to interrogate the performance of MAST in the context of microbial transcriptomics analysis.
The models in MAST can be summarized as where is the CDR of the ith subject for background expression level k. In this simulation we set k = 0.
The parameters are estimated using a Bayesian framework, where γ g is regularized under a weak informative prior and 1/σ g is regularized using an empirical Gamma prior. The R package mast is available [4].
The null and the alternative hypotheses for the gth gene are . Either a Wald-type or a likelihood test statistic can be used to test these hypotheses. The Wald statistic is with χ 2 2 as its asymptotic null distribution. The testing procedure is exactly the same as that of the LB test once the coefficients and their standard errors are estimated.

S1.4 DESeq2
The DESeq2 [5] method is widely used for differential expression of RNAseq data. The underlying model of DESeq2 is a negative binomial distribution and it uses empirical Bayes for regularization.
The DESeq2 model can be summarized as where µ i,g = s i,g ν i,g is the mean parameter, θ g is the dispersion parameter, s i,g is the size factor, The size factor is the parameter with which we adjust the sequencing depth. For our simulation work, we use the median-of-ratios method [6].
The parameters are estimated using maximum likelihood estimation and then θ g and β D g are regularized using an empirical Bayes approach. The R package DESeq2 is available.
The null and the alternative hypotheses for the gth gene are A Wald test is used to test these hypotheses. The Wald statistic is given as with χ 2 1 as its asymptotic null distribution. The testing procedure is exactly the same as that of LB test, once the coefficients and their standard errors are estimated. Because DESeq2 cannot accomodate high zero proportions, an extension was recently developed to enable the modeling of a greater number of zeros in the scRNAseq context [7]. In this modified DESeq2 method, namely DESeq2-ZINBWaVE, first the zero-inflation parameter is estimated using the model, and each observation is assigned a weight of the posterior probability of non-zeroinflation, where f ZIN B is the corresponding density of the ZINB distribution. For the size factor estimation in DESeq2-ZINBWaVE, we use the positive counts method. Then, the conventional DESeq2 method is applied, as described earlier, including the weights. Whenever there is no ambiguity, "DESeq2" refers to the original method and "DESeq2-ZINBWaVE" to its extension.

S1.5 metagenomeSeq
MetagenomeSeq (MGS) is a differential abundance analysis method for metagenomics data [8] and the corresponding bioconductor package, metagenomeSeq is available. MGS assumes zero-inflated log normal distribution. Furthermore, MGS uses an empirical Bayes shrinkage method for parameter estimation. Hence, MGS shares common modeling approaches with MAST; however, the two approaches are different in a few aspects. First, MAST uses CDR as a controlling factor in the model while MGS does not. Second, MAST provides tests on two parts of the model; i.e., two p-values are obtained from the zero-inflation part and the log-normal part in MAST. However, in MGS, after estimating the two-part model parameters, only the log-difference of the marginal mean is tested and a single p-value is given [9]. In our simulations, the fitFeatureModel function in the R package metagenomeSeq is used for implementation [10]. Although the MGS test can account for batch effects mathematically, the current metagenomeSeq software does not allow batch variables in the model. Thus, only results without batch effects will be reported in the simulation study in Section 4.

S1.6 ANCOM-BC
ANCOM-BC2 is used as a variation of ANCOM [11] that inherits the philosophy of differential ranking (DR) methods and ANCOM but also provides p-values with computationally efficiency. Comparing to an earlier version ANCOM-BC1, this version omits the structural zero detection procedure to avoid known inflated Type I errors.
ANCOM-BC is a differential abundance analysis method for metagenomics data [12]. It shares the philosophy of its predecessor, ANCOM [11], in that it models the ratio of abundances between taxa. However, unlike ANCOM which is a rank-based approach, ANCOM-BC specifies the test statistic and its associated p-value for a large sample. In ANCOM-BC, the observed abundance Y i,g is assumed to be a realization of the unknown abundance U i,g of the whole ecosystem from where the sample is taken with possibly different sampling fraction η i for each sample. In other words, where U i,g is a random variable with mean θ D g or θ H g , depending on the membership of the sample i to the disease (D) or health (H) group. Of note, ANCOM-BC is not limited to two-group problems and can accommodate multi-group problems. Then it formulates log Y i,g = logη i + log θ i,g + ϵ i,g , whereη i is a slightlyredefined sampling fraction parameter due to the log-transformation, and E[ϵ i,g ] = 0.
The hypotheses of ANCOM-BC are • H 1 : log θ D g ̸ = log θ H g , which are tested by the test statistic, where log θ A g is the estimates of log θ A g , {σ A g } 2 is the mean squared error for each group A = D, H, and logη is the estimate of the bias, logη ≡ E[log θ D g − log θ H g ]. The statistic follows the standard normal distribution per the large sample theory, whereas the authors defined a small sample version without defining the test statistic distribution.
ANCOM-BC does a further procedure of detecting "the structural zero" which is defined to be the absence of a certain taxon in a specific group that is present in another group. Once the structural zero is detected, ANCOM-BC declares that the taxon is differentially abundant, giving T AN COM −BC g = ∞ and p-value = 0. However, since such a procedure often inflates the type-I error significantly in this simulation study, we add another version of ANCOM-BC that declares those sturctural zeros inconclusive (i.e., p-value = NA). We report the simulations results for second version as "ANCOM-BC2".

S1.7 LEfSe
LEfSe (Linear discriminant analysis Effect Size) is commonly used for differential analysis of metagenomic biomarkers [13]. It assumes that the samples are labeled with a certain group that represents the main biological comparison class of interest. They may also include one or more subgroup labels that indicate within-group classifications where batch effects can be accounted for in simulations. By combining standard tests for statistical significance, with extra tests encoding biological consistency and effect relevance, the method determines the features most likely to explain variations across groups.
There are three steps performed in order: the KW rank-sum test on groups, the pairwise Wilcoxon test between subgroups of different groups, and the LDA on the relevant features. In the first step, the factorial KW rank-sum test is applied to each feature according to the group label; the subgroup label is utilized for further stratifying when that information exists. As a result, only the features that reject the null hypothesis of identical value distribution among groups could be analyzed further. In the second step, the pairwise Wilcoxon test is applied to the extracted features belonging to subgroups of different groups, for testing whether all pairwise comparisons between subgroups in different groups significantly agree with the group level trend. If at least one comparison between subgroups has a p-value greater than the specified threshold, or if the sign of variation is not equal across all comparisons, the pairwise Wilcoxon test is not satisfied for that feature. The first two steps employ nonparametric tests because they are distribution-free approaches and much more robust to the underlying distribution of the data: the only assumption of the Wilcoxon and KW tests is that the distributions in each group are identically shaped with possible differences in the medians. Finally, in the third step, an LDA model is generated, by assigning the remaining features and subgroup labels as the independent variable and the group label as the dependent variables. This step is used to estimate their effect sizes, which are obtained by averaging the differences between group means with the differences between group means along the first linear discriminant axis, which equally weights features' variability and discriminatory power. The LDA score for each biomarker is obtained by computing the logarithm of this value after being scaled and induces the ranking of biomarker relevance, regardless of the absolute values of the LDA score. For robustness, LDA is additionally supported by bootstrapping and subsequent averaging. For implementation of this test, an R function lefser::lefser() is available.

S1.8 ALDEx2
ALDEx2 is a differential abundance analysis method for RNA-seq, 16S rRNA gene sequencing and differential growth datasets [14]. Instead of proposing new model, this method improved on the data preprocessing, and it uses multiple instances to generate p-values.
If Y i,g denotes the read counts for the gth gene in the ith cell, suppose the intention is to use K instances to generate the p-value. For each sample i, to generate the kth instance, ALDEx2 uses posterior Dirichlet distribution with an uninformative prior of 1 2 to model the frequency of features with zero counts.
Then different models can be applied to the certralized data. Since we have two covariates in the model, we can use the linear model which is formulated as: Where The null and the alternative hypotheses for the gth gene with kth instance is The test statistic for the gth gene with kth instances T and follows a χ 2 1 distribution under the null hypothesis asymptotically. The test rejects the null hypothesis if the test statistic is larger than χ 2 1 (1 − α), or the (1 − α)th quantile of the χ 2 distribution with one degree of freedom, where α is the significance level. Alternatively, the individual p-values with kth instance are obtained as p g . And the genes with p-values less than α are declared to have a statistically significant association with disease.

S1.9 Kruskal-Wallis test
The Kruskal-Wallis (KW) test is equivalent to one-way analysis of variance (ANOVA) on ranks. KW is equivalent to Wilcoxon's rank sum (WRS) test, or Wilcoxon-Mann-Whitney test, for two sample problems and can accomodate comparisons of more than two samples [15]. Although the prototypical KW test was designed without consideration of covariates, it can be modified to account for possible batch effects [16]: where The exact and approximate distributions of the statistic under the null hypothesis can be obtained through analysis or resampling [16]. However, when disease and batch strata are large, the statistic converges to χ 2 1 . Based on the null distribution, p-values are obtained for each gene.
For implementation of this test, the R function coin::kruskal_wallis() [17] is available. The coin package function allows only a single batch variable.

S1.10 two-part Kruskal-Wallis test, KWII
Nonparametric tests such as KW and Wilcoxon's rank-sum (WRS) test have minimal distributional assumptions. The lack of model-induced information often results in lack of power. While zero-inflation is a well-known characteristic of microbiome sequencing data, explicitly modeling the proportion of zeros can enhance the power in detecting differential expression. This additional assumption can be integrated into the nonparametric models using a two-part model framework [18]. Of note, the LB test and the MAST are also two-part models but they are fully parametric. Nonparametric two-part models have been used in other 'omics applications [19] and microbiome [20] data analyses. The binary part of these nonparametric models has been modeled using a conventional proportion test where no covariates are allowed. To incorporate covariate information in the binary model, a logistic regression model can be used. The KW or the WRS test can be used as the nonzero model-to allow for the inclusion of covariates in the model, the modified KW test can be used. In this paper we combine a logistic regression model and a KW test and name it two-part KW test. The binary component of the model is the same as that of LB model, i.e. Equation (3). The nonzero component's test statistic is derived based only on subjects with non-zero gene expressions and has the same formula given in KW test, i.e. Equation (13).
The p-values can be obtained by combining two χ 2 1 statistics derived from each component: where W a g is either a Wald test statistic ( ) or a likelihood ratio statistic (two times the difference of log-likelihood of logistic regression models), and W b g has the same form as Equation 13. The test statistic, T KW II g follows a χ 2 1 distribution under the null hypothesis asymptotically.

S1.11 Data scaling and transformation
TPM is defined as RPK divided by the sample sum of RPKs times a constant (T P M i,g = c RP Ki,g ∑n genes g=1 RP Ki,g ) where the constant, c, was chosen to reflect the actual scale of the RPKs, or c = 5(20) million in the ZOE-pilot and ZOE2.0 studies. Arcsinetransformation is defined as c 2 π arcsin( T P M/c). Note that the Beta distribution is applied to compositional data and thus RPK and TPM data are equivalent in this context.

S2 Full description of the three metatranscriptomics datasets
To understand and characterize the distributional features of metatranscriptomics data, we leverage three datasets generated in two recent studies involving the human microbiome. We follow a systematic data harmonization approach to obtain gene-level expression data in RPKs as the summed (or marginalized) RPK over all species per gene before DE analysis. To expand the relevance of this comparative model evaluation, we consider additional aspects of microbial activity-i.e., expression at the gene-species level and the species level ("gene-species data") as aggregation over all genes in individual species ("species marginal data"). The first two datasets (namely, ZOE2.0 including 297 participants and ZOE-pilot including 116 participants) were generated in a molecular epidemiologic study of early childhood caries (ECC; defined as dental cavities in children under the age of 6) [21] called Zero-Out Early Childhood Tooth Decay (ZOE) [22,23]. In that study, investigators tested the associations between features of the supragingival oral microbiome (i.e., "dental plaque") and the prevalence of clinically-determined ECC. Of note, ECC prevalence was similar in the two ZOE waves, i.e., ZOE2.0: 49(147/297) and ZOEpilot: 50The study included known batch effects due to the process of clinical sample collection and microbiome sequencing at different dates. For details, estimates of taxonomic composition, gene family, path abundance, and path coverage were produced from the obtained and filtered reads using HUMAnN2 [24]. The resulting reads were scaled into reads-per-kilobase (RPK). We considered additional pre-processing methods including transcript-per-kilobase-million (TPM) and arcsine in Section 5. The total number of gene-species combinations (in that counts are at the per-geneper-species level, also called joint "gene-species") in ZOE2.0 metatranscriptomics is 535,299; there are 204 distinct species, and 402,937 distinct genes. In the ZOE-pilot dataset, there are 439,872 gene-species, 185 distinct species, and 342,004 distinct genes.  Tab. S1: Percentage of zeros at the entry level The third dataset (namely, the IBD data) was generated in the context of a  recent study of the gut microbial ecosystem and its association with Inflammatory Bowel Diseases (IBD) [25], including metatranscriptomes obtained from fecal samples of 132 subjects during a one-year period, with repeated measurements of the same participants. The generated taxonomic and functional profiles are publicly available (http://huttenhower.sph.harvard.edu/biobakery). The dataset includes a total of 1,595 metagenomic and 818 metatranscriptomic samples. We focus on the crosssectional features of the metatranscriptomics data distribution, and thus we only examined the baseline information, or the first visit data, including a sample of 104 participants. We further dichotomized participants' disease status as IBD (i.e., 50 Crohn's disease and 26 ulcerative colitis cases) versus non-IBD (i.e., 28 'control' participants). Clinic location was considered as a batch effect in that study, and thus was employed in our analyses after dichotomization (the pediatric versus the adult cohorts). The average proportion of zeros per gene in the IBD metatranscriptomics data is 96.3%, while that in the metagenomics data is 87.8% consistent with the trend of higher zero proportions in metatranscriptomics comapred to metagenomics data. Web Figure 1 illustrates the higher zero proportion in the metatranscriptomes over the metagenomes: 91% of genes have zero proportion ≥ 90% in the metatranscriptomics data, compared to 69% in the metagenomics data. These data are available in a compositional format, wherein gene expression data sum up to one over genes and species for each participant.
The overall proportion of zeros in these three datasets at the matrix level is summarized in Table S1. Metatranscriptomics data at the gene level has an overall higher proportion of zeros compared to metatranscriptomics data at the species level and metagenomics data.
Regarding the overdispersion of data, Web Figure S2 illustrates the relationship between the estimated overdispersion of the metagenomics data and that of the metatranscriptomics data, at the species and the gene levels, for each of the three datasets. Overdispersion in metatranscriptomics data is on average lower than in metagenomics data in (1) ZOE-pilot species level, (2) ZOE-pilot gene level, and (3) IBD gene level data. However, overdispersion in metatranscriptomics data is higher than in metagenomics data in the other three combinations, which are (1) ZOE2.0 species level (2) ZOE2.0 gene level, and (3) IBD species level. Web Figure S3 illustrates the relationship between the overdispersion and the proportion of zeros for a set of randomly selected genes.

S3 Technical details in semi-parametric simulation
In the semi-parametric simulation, gene effect sizes were introduced to participants in the disease groups only, as follows: Let (δ µ,g , δ π,g ) be the effect size assigned to a signal gene g. Then δ π,g is first applied to flip a certain number of zeros to non-zeros or non-zeros to zeros, and then δ µ,g is applied to rescale the non-zero values. Letπ g,A denote the sample zero proportion of the disease group A =D, H. Also, letπ g,D (δ π ) = (expit(logitπ g,D + δ π ) denote the differentiated zero proportion of gene g in the disease group D, where expit(x) = exp(x) exp(x)+1 and logit(x) = log( x 1−x ). Let Z g,D be a random realization from the binomial distribution with n D trials and the success probabilitiesπ g,D (δ π ), where n D is the sample size of the disease group D. The health group counterparts are defined similarly: π g,H (δ π ) = (expit(logitπ g,H − δ π ) is the differentiated zero proportion, and Z g,H is a binomial random draw with n H trials. If Z g,D >= n Aπg,A , Z g,D − n Aπg,A many nonzero-expressed subjects are randomly chosen and are replaced with zeros. Contrary, if Z g,D < n Aπg,A , each of the randomly chosen n Aπg,A − Z g,D many zero-expressed subjects are given the expression values of one of the randomly chosen non-zero subjects. Once the π-disease effects are manipulated, the non-zero expression values, Y g,i > 0, are transformed into exp(log(Y g,i ) ± δ µ ) with + (−) for i in group D (H). Note that the signs of (δ µ , δ π ) are preserved, so that for some genes δ µ (δ π ) is positive while for other genes it can be negative.

S4 Data generative models
The zero-inflated log-normal model is a mixture of log-normal distribution with a point mass at zero. The density is given as where φ(x, µ, σ 2 ) is the log-normal density at x with mean µ and variance σ 2 , π is the zero-inflation parameter, or π ≡ Pr(Y = 0), µ is the non-zero mean parameter (i.e. µ ≡ E[Y |Y > 0]), and θ is the over-dispersion parameter so that var[Y |Y > 0] = µ 2 θ ZINB is an extension of the negative binomial distribution and is widely used to model count data with excess zeros. In many real world applications, if more zeros are observed than the negative binomial distribution assumes, the zero-inflated negative binomial distribution is suitable and has in fact become one of the most commonly used methods in count data analysis [26], including omics data analyses involving scRNAseq [27,7] and microbiome [28,29].
The zero-inflated Gamma model is a mixture of a Gamma distribution and a point mass at zero. The density is given as where π is the zero-inflation parameter or π ≡ Pr(Y = 0), µ is the non-zero mean parameter (i.e. µ ≡ E[Y |Y > 0]), and θ is the over-dispersion parameter so that var[Y |Y > 0] = µ 2 θ.

S5.1 Overview
To comprehensively evaluate the performance of all available methods, we considered two simulation approaches, including fully parametric (Simulation I)and semiparametric (Simulation II). In Simulation I, three generative models (zero-inflated log normal or ZILN and two more) were used with a comprehensive set of parameters that ranges over the most of the parameter estimates of the example data. An extensive range of parameter sets is used to mitigate the potential model-misspecification, while we provide goodness of fit measures for all generative models. In Simulation II, genes and their expression levels are sampled from the example data followed by artificial insertion (i.e., "spiking") of disease effects. The second simulation serves a validation of Simulation I, and, at the same time, helps evaluate the robustness of the tested DE methods across different datasets. In what follows, we briefly describe how the two simulations were done. These analyses were done using

S5.2 Baseline parameters for ZILN-based simulation from ZOE2.0 at the gene level
The baseline parameters for ZILN models are chosen as shown in (Web Table S2).

S5.3 Baseline parameters for ZIG-based simulation from ZOE2.0 at the gene level
The parameter estimates for the ZIG model are identical to those of the ZILN model and for this reason are not presented. The baseline parameters for ZIG models are chosen as shown in (Web Table S3).

S5.4 Parameter estimation and baseline parameters for ZINBbased simulation from ZOE2.0 at the gene level
The estimated parameters of the ZINB from ZOE2.0 genes are in (Web Table S4). The baseline parameters, disease effects, and batch effects for ZINB models are chosen (Web Table S5) based on the parameter estimates from the ZOE2.0 data (Web Figure S4).

S5.5 Estimated parameters at the gene level from the two validation datasets
To add to our understanding of metatranscriptomics data distributions generated under realistic conditions, we followed the same procedures to estimate the parameters using the ZOE-pilot and the IBD data. Because the ZILN model is the main focus of our simulation study and the expression of the IBD data was only provided in a compositional form, we use these validation data to estimate the ZILN model parameters only. In the ZILN model, µ is the only parameter that is affected by scale transformations and most test results are thus invariant to scale transformations except for the NB-and the ZINB-based tests. The estimated parameters of the ZILN (and ZIG) models from the validation data-the ZOE-pilot and the IBD data-are presented here.

S5.5.1 Estimated parameters from the ZOE-pilot data.
Web Figure S5 and Web Table S6 illustrate the distribution of the ZILN parameter estimates in the ZOE-pilot data.

S5.5.2 Estimated parameters from the IBD data
Web Figure S6 and Web Table S7 illustrate the distribution of the ZILN parameter estimates in the IBD data.

S5.6.1 Parameters for the gene expression of each gene-species combination as the joint gene-species level
The distribution of the ZILN parameter estimates of the gene-species combinations in the ZOE2.0 data are shown (Web Figure S7 and Web Table S8). The distribution of the ZINB parameter estimates of the gene-species combinations in the ZOE2.0 data are shown (Web Figure S8 and Web Table S9).

S5.6.2 Estimated parameters for the total gene expression of each species, at the species marginal level
Web Figure S9 and Web Table S10 illustrate the distribution of the ZILN parameter estimates of species in the ZOE2.0 data. Web Figure S10 and Web Table S11 illustrate the distribution of the ZINB parameter estimates of species in the ZOE2.0 data.
Estimated parameters (Web Figure S9) of gene expression distribution for species are well covered by the parameter sets in Web Table S2.
Estimated parameters (Web Figure S10) of gene expression distribution for species can be mostly covered by the parameter sets in Web Table S5 except when π is large and the overdispersion parameter θ is very irregular.    Web Table S8.  Web Table S9.  Tab. S11: The ZINB parameter estimate distribution of species in the ZOE2.0 data Fig. S10: ZINB parameter estimates for species Column A: parameter estimates of baseline ZINB distributions in the ZOE2.0 data are presented with a 3-dimensional scatter plot on the top row and each of the subsequent rows represents π estimates being within 0.03 from 0.9, 0.6, and 0.3. Column B: disease effect estimates based on ZINB models from the ZOE2.0 data in absolute values (|δ µ |, |δ θ |, |δ π |) Column C: batch effect estimates based on ZINB models from the ZOE2.0 data in absolute values (|κ µ |, |κ θ |, |κ π |). The quartiles of the estimated parameters are provided in Web Table S11.

S6.1 Goodness of fit for the ZINB models
Web Figure S11 presents the goodness of fit results of the ZINB model for two randomly chosen genes in the ZOE2.0 data. The ZINB distribution provides a decent approximation to the RPK and TPM transformations, although it does not fit well to the arcsin transformed data.

S7.1 Full results under ZILN models
These results are presented in the main text.

S7.2.1 Full results under ZINB models-Sensitivity
Results are presented in Figures S12 and S13.

S7.2.2 Type I error and FDR under ZINB model for mean shift (alternative hypothesis D2)
Results are presented in Figure S14.  X X X X X X X X X X X X X X X X X X X X X

S7.4 Type I error and FDR results under ZIG model for mean shift (alternative hypothesis D2)
Results are presented in Figure S17.

S8 Application to the ZOE2.0 data
This section includes the analysis results from the ZOE2.0 data in terms of species and gene-species combinations. The same normalization and screening procedures are applied to these analyses except for the screening thresholds. As for the abundancebased screening, gene-species with TPM less than 0.2 and species with TPM less than 2 were filtered and not tested. Out of 535,299 gene-species combinations, after screening, 188,957 gene-species remained for testing. Out of 209 species, after screening, 97 species remained for testing. The ten most significant gene-species combinations according to the LN model are C8PHV7, C8PEV7, C8PKG9, C8PI10, C8PH26, C8PIH7, C8PHR6, C8PHV8, C8PFD0, and C8PG15, all harbored by Campylobacter gracilis.
The species and proteins mapped to and the functions of those genes can be found in Web Table S12 Fig. S19: Venn diagram of gene-species with p-values less than 10 −5 for each model.

S8.3 Application to the ZOE2.0 data -gene profiles
Tab. S12: Profiles of the top significant genes and gene-species based on the UniProt database [30]. * genes from the list of top significant genes, † genes from the list of top significant gene-species. Catalyzes the NADPH-dependent formation of L-aspartatesemialdehyde (L-ASA) by the reductive dephosphorylation of L-aspartyl-4-phosphate. C8PJD1 ilvC Campylobacter gracilis RM3268 * Ketol-acid reductoisomerase (NADP(+)). Involved in the biosynthesis of branched-chain amino acids (BCAA).
Catalyzes an alkyl-migration followed by a ketol-acid reduction of (S)-2-acetolactate (S2AL) to yield (R)-2,3-dihydroxy-isovalerate. In the isomerase reaction, S2AL is rearranged via a Mg-dependent methyl migration to produce 3-hydroxy-3-methyl-2-ketobutyrate (HMKB The species most strongly associated with childhood dental caries in this analysis was Campylobacter gracilis, a gram-negative anaerobic bacillus, traditionally isolated from gingival crevices and dental biofilms accumulated close to the gingival margin [31]. Oral campylobacters are enriched in genes encoding for lactate metabolism, which plays an important role in the development and maintenance of acidic conditions in cariogenic biofilms as the predominant glucose-derived product, which is considered to be the main acid involved in dental caries formation [32]. The capacity of Campylobacter species to produce lactate may be contributing to the development and establishment of ECC, as other microorganisms directly associated to caries disease like Streptococcus sp and Leptotrichia sp, benefit from this lactate-rich environment [32,33]. Chalmers et al. (2015) showed that Campylobacters gracilis is associated with severe ECC at a frequency detection rate of 87.5% [34]. Campylobacters gracilis' active genes shown significant associated with ECC in the essential steps for: (1) bacterial growth (C8PIH7, encodes for an enzyme that catalyzes the first committed step in fatty acid synthesis) [35]; (2) protein biosynthesis and transport (C8PI10, encodes for an enzyme that catalyzes the attachment of serine to its cognate transfer RNA molecule; C8PKZ2, encodes for the enzyme from biosynthesis of diverse amino acids leading to L-lysine, L-threonine, L-methionine and L-isoleucine; C8PJD1, encodes for an amino acid biosynthesis pathway) [36,37]; (3) protein transport (C8PG93 encodes for twin-arginine translocation (Tat) pathway, which catalyzes the export of proteins from the cytoplasm across the inner/cytoplasmic membrane.) [38]; (4) DNA replication and transcription (C8PJY1, encodes for key enzymes in the synthesis of nucleoside triphosphates molecular precursors of both DNA and RNA) [39]; (5) biofilm formation or adhesion through gene C8PKG9 (encodes for NFACT-R 1 domain-containing protein) [40] and; (6) energy conservation (C8PHR6 encodes for methylenetetrahydrofolate reductase (MTHFR) of acetogenic bacteria during reduction of carbon dioxide with molecular hydrogen to acetate) [41]. Other genes associated with ECC were A3CQN5 (encodes for Ribosomal RNA small subunit methyltransferase A, which plays the role of switch protein in the ribosome assembly in Streptococcus sanguinis) and C7NCB2,

S9.1 Application to the IBD data -gene profiles
Tab. S13: Profiles of the top significant genes based on the UniProt database [30]. * genes from the list of top significant genes. RHOM_03530 Roseburia hominis * Stage 0 sporulation protein A homolog May play the central regulatory role in sporulation. It may be an element of the effector pathway responsible for the activation of sporulation genes in response to nutritional stress. Spo0A may act in concert with spo0H (a sigma factor) to control the expression of some genes that are critical to the sporulation process. S3BF40 HMPREF1476_00737 Sutterella wadsworthensis HGA0223 * CN hydrolase domain-containing protein nitrogen compound metabolic process Fig. S24: P-values of the likelihood-ratio and Fisher's exact tests for genes with one of the subgroup prevalence rate at the boundary in the IBD data as powerful as the likelihood-ratio test in this context because it can only be applied in the case of a two-by-two table, thus not being able to account for any covariates. Figure S24 confirms that the p-values from the Exact test are overall larger than those of the likelihood-ratio test. None of the two tests identified any genes at the p < 10 −5 significance level.

S9.3 Summary for the IBD results
The 10 most statistically significant genes in the LN model are S3BI82, R5Q3H7, R5PRG3, R5Q1H1, R5QAG2, R5QE55, S3CE88, R5QEQ4, R5PLJ0, and G2T243 while the top ten genes for the LB model are D4WIY6, Q0TKG5, R6W6W2, D1PDG3, Q17UW4, I9USK4, E2ZM16, R7NP61, B0NN15, and U2ZZD9 with species and proteins associated with those genes and their functions presented in Web Table S13 in Section S9. Most of the top 20 genes in LN tests corresponded to Sutterella wadswothensis, a Gram-negative, non-spore-forming rod of the Betaproteobacteria class that grows in a microaerophilic atmosphere or under anaerobic conditions, and was previously identified as significantly differentially abundant (prior to the correction for multiple hypothesis testing) in the original IBD study report [25]. The role of S. wadworthensis in IBD pathogenesis remains unresolved with recent studies supporting [43,44] and some earlier studies not supporting this association [45,46]. The most significant gene family S3BI82, corresponded to the hydrogenase maturation factor HypA.
Hydrogenases are important factors that contribute to the metabolic versatility of Enterobacteriaceae, specifically their ability to utilize a large repertoire of terminal electron acceptors, which allows them to thrive in the inflamed gut [47,48,49]. Gene G2T243 is a stage 0 sporulation protein A homolog harbored by Roseburia hominis. R. hominis, a prevalent butyrate producer, has been found consistently depleted in CD ( [50,51] and was also identified as significant prior to the multiple testing correction in the original IRB study [25]). Alistipes, the other other genus found in top list, is member of the Bacteroidetes phylum. A. finegoldii is considered a protective species against colitis The 10 significant gene families identified in LB are quite different to what are in LN, and correspond to Bacteroides ovatus, Escherichia coli, Prevotella copri, Faecalibacterium prausnitzii, Bacteroides faecis, Bacteroides xylanisolvens, and Bacteroides stercoris; all identified as significant in the original study [25] prior to multiple hypothesis correction. Although the species Zygosaccharomyces rouxii harboring Q17UW4 was not reported in the original study [25], the gene family itself corresponded to a cytochrome c oxidase subunit 2, which is increased in UC [52]. Finally, the relevance of Faecalibacterium praunitzii as an important beneficial bacterium in IBD has been demonstrated in a number of studies [53]; alongside the butyrate-producing R. hominis, they were identified in the original IBD study report [25] and account for some of the strongest identified associations.