A general framework for powerful confounder adjustment in omics association studies

Abstract

Motivation: Genomic data are subject to various sources of confounding, such as demographic variables, biological heterogeneity, and batch effects. To identify genomic features associated with a variable of interest in the presence of confounders, the traditional approach involves fitting a confounder-adjusted regression model to each genomic feature, followed by multiplicity correction.

Results: This study shows that the traditional approach is suboptimal and proposes a new two-dimensional false discovery rate control framework (2dFDR+) that provides significant power improvement over the conventional method and applies to a wide range of settings. 2dFDR+ uses marginal independence test statistics as auxiliary information to filter out less promising features, and FDR control is performed based on conditional independence test statistics in the remaining features. 2dFDR+ provides (asymptotically) valid inference in settings where the conditional distribution of the genomic variables given the covariate of interest and the confounders is arbitrary and completely unknown. Promising finite-sample performance is demonstrated via extensive simulations and real data applications.

Availability and implementation: R code and vignettes are available at https://github.com/asmita112358/tdfdr.np.


Introduction
One central theme of genomic data analysis is identifying genomic features associated with a variable of interest, such as disease status. The associated features are subject to further replication and validation. The validated features could then be followed up for a more in-depth mechanistic study or be used as biomarkers for disease prevention, diagnosis, and prognosis if they have sufficient predictive power (Majewski and Bernards, 2011; Ziegler et al., 2012). Due to the constraints of clinical sample collection, the variable of interest is often correlated with other variables, which may confound the associations of interest. One example is the identification of microbiome biomarkers for endometrial cancer based on a comparison between benign and malignant tumor samples (Walsh et al., 2019). Patients with benign tumors are usually much younger than those with malignant tumors since the progression to malignancy requires multiple genomic events. Age has also been known to be associated with the female reproductive tract microbiome. Therefore, age is a confounding factor, and we need to control for it if the aim is to reliably identify cancer-related microbiome biomarkers. Other common sources of confounding in genomic data analysis include environmental changes (Fare et al., 2003; Gasch et al., 2000), cell mixtures (Liang and Cookson, 2014), technical variation or batch effects (Leek et al., 2010; Lazar et al., 2013) and surgical manipulation (Lin et al., 2006). Controlling for the confounders could significantly increase the rate of successful validation, reduce the overall cost and shorten the time from discovery to clinical tests. However, due to a substantial multiple testing burden, confounder adjustment exacerbates the already low statistical power of genome-scale association tests. If no confounder adjustment is performed, we are faced with a severely inflated type I error, with the extent of inflation depending on the number and strength of associations with the confounder. Increasing the statistical power of a confounded association study while controlling the false positives is a statistical topic of critical importance.
Surprisingly, few statistical efforts have been made on this important topic.
The traditional way of confounder adjustment for high-dimensional association tests is to adjust for confounders for each genomic feature and correct the individual association p-values for multiple testing using false discovery rate (FDR) control (Benjamini and Hochberg, 1995a;Storey, 2002). This procedure has been a standard statistical practice for genomic association analysis to maintain the correct type I error rate level. However, in practice, confounders may affect only a subset of omics features (Lu et al., 2004;Glass et al., 2013;Gershoni and Pietrokovski, 2017), and adjusting confounders for every omics feature will be an over-adjustment, leading to substantial power loss.
To rescue the power, one naive idea is to first test the dependence between the confounder and each omics feature and, if the dependence is not statistically significant, exclude the confounder from the model for that feature. Although this strategy substantially improves the power, it suffers from the so-called selection bias (Efron, 2011; Fithian et al., 2014), which inflates the type I error if the significance cutoff is not properly chosen to reflect the selection effect from the first step.
In a recent work, Yi et al. (2021) made a significant step toward solving this problem. The authors proposed a two-dimensional false discovery rate control procedure (2dFDR) based on linear models with the measurement of the omics feature as the outcome. The 2dFDR procedure proceeds in two dimensions. The first dimension uses the unadjusted statistic (from fitting the unadjusted linear models to each omics feature) to screen out a large number of irrelevant features (noises) that are not likely to be associated with the covariate of interest or the confounder. In the second dimension, the procedure uses the adjusted statistic (from fitting the confounder-adjusted linear models to each omics feature) to identify the true signals within the remaining features and control the FDR at the desired level. Although the unadjusted statistic is biased and captures the effects from both the covariate of interest and the confounder, it can be leveraged to increase the signal density and reduce the multiple testing burden in the second dimension. At a high level, the idea of using the unadjusted statistics is similar to the use of marginal utilities in variable screening (Fan and Lv, 2008). However, the 2dFDR procedure takes into account the selection effect from the first dimension and thus provides asymptotically valid inference.
In this work, we propose a general framework 2dFDR+ for integrating confounder adjustment into multiple testing. The 2dFDR+ framework significantly extends the scope and applicability of the original 2dFDR in the following aspects: 1. The new framework relaxes the linear model assumption in Yi et al. (2021). Indeed, the conditional distribution of the omics variables given the variable of interest and the confounders can be arbitrary and completely unknown. Thus, the new framework can be applied to different types of outcomes such as continuous/binary/count outcomes.
2. We allow the joint use of any conditional and marginal independence test statistics in 2dFDR+.
The marginal independence test statistic serves as an auxiliary statistic to screen out noise and improve power.
3. 2dFDR+ does not require a case-by-case study to derive the joint (asymptotic) distribution of the conditional and marginal independence test statistics. It provides a unified approach to approximate their joint distribution under the null by modeling the conditional distribution of the covariate given the confounders.
4. Yi et al. (2021) focuses on the case of a univariate covariate, while the covariate of interest can be multivariate within our framework. 5. Due to explicit modeling of the relationship between the variable of interest and the confounders, the new method provides more reliable FDR control, especially when the confounding effect is strong.
In theory, we prove that 2dFDR+ provides asymptotic FDR control (see Theorem 2). By design, 2dFDR+ (using both the conditional and marginal independence test statistics) is guaranteed to deliver at least as many rejections as the corresponding 1d procedure (using only the conditional independence test statistics). The reason is that 2dFDR+ searches over a larger rejection region (2d versus 1d). By setting the cutoff for the marginal independence test statistics to be zero, our method would reduce to the 1d procedure. This observation suggests that the optimal cutoff for the 1d procedure is in the feasible set of cutoffs (that control the FDR estimate at the desired level) for the 2d procedure. The flexibility of choosing an additional cutoff leads to more rejections. See the detailed discussions in Section 4.
A unique feature of 2dFDR+ is that it lets the data decide the usefulness/informativeness of the auxiliary statistic. If the auxiliary statistic provides helpful information, 2dFDR+ has significant power gain. When the auxiliary statistic is not informative, it typically reduces to the corresponding 1d procedure (without using the auxiliary statistic) in terms of performance. Extensive numerical studies show no harm in including the auxiliary statistic except for some very special cases (see the discussions in Remark 2). Interestingly, when the FDR is controlled at level q, we can show that in the worst-case scenario, the asymptotic power loss for 2dFDR+ compared to the 1d procedure is at most q, see Section 4.
Finally, it is worth mentioning another line of research on estimating latent confounding factors. Principal component analysis was first suggested by Alter et al. (2000) to estimate the latent confounding factors. More recently, a variety of methods have been proposed for confounder adjustment in similar statistical settings, see, e.g., Friguet et al. (2009); Gagnon-Bartsch and Speed (2012); Leek and Storey (2007, 2008). The theoretical properties of some of these approaches were recently studied by Wang et al. (2017). Although we assume that the confounders are fully observed in our framework, conceptually, our method can also be coupled with the above techniques for confounder adjustment when the confounders are unobserved.
The rest of the article is organized as follows. Section 2 describes the problem setups and a two-dimensional (2d) rejection region based on a primary statistic for testing the conditional independence between the omics feature and the covariate of interest given the confounders and an auxiliary statistic for testing the marginal independence between the omics feature and covariate.
Section 3 introduces an oracle FDR-controlling procedure, where the conditional distribution of the covariate given the confounders is assumed to be known. We prove asymptotic FDR control for the oracle procedure in Section 4. We discuss several ways of estimating the conditional distribution in Section 5. We review some nonparametric conditional independence tests and discuss their use in our framework in Section 6. Sections 7 and 8 are devoted to numerical studies and real data analysis respectively. Section 9 concludes.

Problem Statement and 2d Rejection Region
We formulate the feature selection problem by allowing the omics variables to depend on the covariate of interest and the confounders arbitrarily. To state the problem and the procedure carefully, we first introduce some notation. Here Y = (Y_1, . . . , Y_m) represents a vector of omics features, X is the covariate of interest, and Z denotes the set of confounders. We aim to discover as many omics features Y_j as possible that are dependent on X conditionally on the confounders Z. We formulate this as the problem of simultaneously testing

H_{0,j}: Y_j ⊥⊥ X | Z,   j = 1, . . . , m.

To tackle this problem, one must adjust for the confounders and the multiplicity in testing. The burden from both adjustments could lead to potential power loss, especially when the confounding effect is strong.
Our idea to resolve this issue is to use two statistics jointly, namely a primary statistic for testing the conditional independence specified in H 0,j and an auxiliary statistic for testing the marginal independence Y j ⊥ ⊥ X, for deciding whether or not to reject H 0,j . The purpose of using the auxiliary statistic is to enrich signals, reduce the multiple testing burden and thus enhance the multiple testing power. As marginal dependence does not necessarily imply conditional dependence (e.g., Y j and X are both functions of Z), the use of auxiliary statistics could lead to selection bias and requires proper adjustment in the selection of cut-off values. One of our goals is to carefully design a way to simultaneously select the cut-off values for the primary statistic and the auxiliary statistic to control the FDR at the desired level.
As a motivation, we consider m independent generalized linear models

g{E(Y_{i,j} | X_i, Z_i)} = α_j X_i + β_j^⊤ Z_i,   j = 1, . . . , m,   (1)

where g is a link function. Based on the values of (α_j, β_j), the features fall into four categories: A. Associated with both the covariate of interest and the confounders: α_j ≠ 0, β_j ≠ 0; B. Solely associated with the covariate of interest: α_j ≠ 0, β_j = 0; C. Solely associated with the confounders: α_j = 0, β_j ≠ 0; D. Not associated with either the covariate of interest or the confounders: α_j = 0, β_j = 0.
We note that (i) α j = 0 if and only if Y j ⊥ ⊥ X|Z; (ii) when β j = 0 (Categories B and D), testing the conditional independence boils down to testing the marginal independence Y j ⊥ ⊥ X.
In a model-free setting, these four categories can be described analogously: (A) Y_j is conditionally dependent on both X (given Z) and Z (given X); (B) Y_j is conditionally dependent on X but not on Z; (C) Y_j is conditionally dependent on Z but not on X; (D) Y_j is conditionally independent of both. Let T^C_j and T^M_j denote test statistics for testing the conditional independence Y_j ⊥⊥ X | Z and the marginal independence Y_j ⊥⊥ X, respectively. Throughout the discussions below, we assume that a large positive value of T^M_j (T^C_j) provides evidence against marginal (conditional) independence. The readers are referred to Section 6 for some examples of conditional and unconditional independence tests. Given the thresholds t_1, t_2 ≥ 0, the two-dimensional (2d) procedure can be described as follows.
Dimension 1. Use the marginal independence test statistics to determine a preliminary set of candidate features D_1 = {j : T^M_j ≥ t_1}.

Dimension 2. Reject H_{0,j} for T^C_j ≥ t_2 and j ∈ D_1. As a result, the final set of discoveries is D = {j : T^M_j ≥ t_1, T^C_j ≥ t_2}.

Although marginal dependence does not imply conditional dependence, it can be leveraged to increase the signal density and reduce the multiple testing burden in the second dimension. More precisely, the usefulness of the marginal dependence test is due to observation (ii) above: for features not associated with the confounders, marginal and conditional independence coincide. We illustrate the rationale behind 2dFDR+ through the following example. A detailed description of the method is provided in the next section.
Example 1. Consider binary outcomes generated from the logistic model (1), where X = (ρZ + ε)/√(ρ² + 1) with Z and ε being independently generated from N(0, 1). Here ρ ∈ {0.1, 0.5, 1} represents the weak (+), medium (++) and strong (+++) confounding levels. We generate α_j and β_j independently from a mixture of a continuous signal distribution and a point mass at 0, denoted by δ_0. We let T^C_j be the t-statistic for testing H_{0,j}: α_j = 0 under the logistic model (1), and let T^M_j be the t-statistic for testing H_{0,j} under the reduced model obtained by forcing β_j = 0 in model (1). In Figure 1, we plot the marginal (T^M) statistic against the conditional (T^C) statistic for various confounding scenarios. The standard approach performs (one-dimensional) FDR control based on the conditional statistic (T^C) only (we refer to it as 1dFDR).
When the correlation between the variable of interest and the confounder, cor(X, Z) = ρ/√(ρ² + 1) ∈ {0.1, 0.45, 0.71}, is high, the signals (green) and noises (red) overlap substantially on T^C. To achieve the desired FDR level, 1dFDR requires a high cutoff (black line). For 2dFDR+, it first uses T^M to exclude a large number of irrelevant features (horizontal blue line). Next, a lower cutoff (vertical blue line) is used to achieve the same FDR level. As a result, it achieves significant power improvement, and the improvement increases with the correlation between the variable of interest and the confounder.

Figure 1: Illustration of 2dFDR+ using simulated datasets. The three panels in the first row show the decision boundaries for 1dFDR (black line) and 2dFDR+ (blue lines) at the 5% FDR level for three degrees of confounding. 1dFDR relies on the conditional statistic (T^C) only (one dimension), while 2dFDR+ is based on both the marginal and conditional statistics (T^M and T^C), i.e., it uses two dimensions. T^M is used to screen out a large number of irrelevant features (blue horizontal line), followed by a less stringent cutoff on T^C that achieves higher power while keeping the FDR controlled. The figures in the second row illustrate the power difference between the two methods. When the correlation is low (ρ = 0.1), using 2dFDR+ provides little improvement over 1dFDR. When the correlation is higher ("++", "+++", ρ = 0.5, 1), the signals (green) and noises (red) are more difficult to separate on T^C. By using T^M, 2dFDR+ excludes a large number of noises without losing many signals. The signal density on T^C is enriched, leading to significant power gain.
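For readers who wish to reproduce a toy version of this example, the following R sketch generates data of the above form; the sample size, number of features, mixing proportion and effect-size distribution are illustrative choices, not the exact values used to produce Figure 1.

```r
set.seed(1)
n <- 100; m <- 1000; rho <- 1            # rho = 1: strong (+++) confounding
Z   <- rnorm(n)
eps <- rnorm(n)
X   <- (rho * Z + eps) / sqrt(rho^2 + 1)

## alpha_j and beta_j from a mixture of a point mass at 0 and a continuous component
pi1   <- 0.1
alpha <- ifelse(runif(m) < pi1, rnorm(m, sd = 2), 0)
beta  <- ifelse(runif(m) < pi1, rnorm(m, sd = 2), 0)

## binary outcomes from the logistic model (1)
Y <- sapply(seq_len(m), function(j) rbinom(n, 1, plogis(alpha[j] * X + beta[j] * Z)))
```

The marginal and conditional statistics plotted in Figure 1 can then be obtained from unadjusted and confounder-adjusted logistic fits, as described in Section 6.1.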

Oracle Procedure
We introduce an oracle FDR-controlling procedure, where we assume that the conditional distribution of X given Z, denoted by P X|Z below, is known. Section 5 introduces several ways of estimating this conditional distribution from the observations.

Estimating the false discovery proportion
Our goal here is to develop a principled way of finding the cutoff values (t_1, t_2) such that the FDR is controlled at a desired level while the number of rejections is as large as possible. Let M_0 = {j : H_{0,j} is true} and m_0 = |M_0| be the set and the number of true null hypotheses respectively. Write X = (X_1, . . . , X_n)^⊤, Y_j = (Y_{1,j}, . . . , Y_{n,j})^⊤ and Z = (Z_1, . . . , Z_n)^⊤. Based on the 2d rejection region, the false discovery proportion (FDP) is given by

FDP(t_1, t_2) = #{j ∈ M_0 : T^M_j ≥ t_1, T^C_j ≥ t_2} / (#{j : T^M_j ≥ t_1, T^C_j ≥ t_2} ∨ 1),

where a ∨ b = max{a, b} for a, b ∈ R. Note that the FDP is zero when no rejection is made. We replace the numerator in the definition of FDP(t_1, t_2) by its conditional expectation with respect to X given Y_j and Z, which leads to the following approximate upper bound on the FDP:

FDP_oracle(t_1, t_2) = Σ_{j=1}^m P_0(T^M_j ≥ t_1, T^C_j ≥ t_2 | Y_j, Z) / (#{j : T^M_j ≥ t_1, T^C_j ≥ t_2} ∨ 1),

where P_0(· | Y_j, Z) denotes the conditional probability under the null hypothesis H_{0,j}. The upper bound FDP_oracle(t_1, t_2) relies on the conditional distribution P_{X|Z}. To find a feasible conservative estimator of the FDP, it remains to estimate the conditional probabilities in the numerator of FDP_oracle(t_1, t_2). To this end, we write T^M_j = T^M_j(X, Y_j) and T^C_j = T^C_j(X, Y_j, Z) to emphasize their dependence on the samples. As P(X | Y_j, Z) = P(X | Z) under H_{0,j} and P(X | Z) = Π_{i=1}^n P_{X|Z}(X_i | Z_i), we have under H_{0,j} that

P_0(T^M_j ≥ t_1, T^C_j ≥ t_2 | Y_j, Z) = ∫ 1{T^M_j(x, Y_j) ≥ t_1, T^C_j(x, Y_j, Z) ≥ t_2} Π_{i=1}^n P_{X|Z}(x_i | Z_i) dx,

where x = (x_1, . . . , x_n) with x_i ∈ R^p, which can be calculated once we know the conditional distribution P_{X|Z}. One way to approximate these conditional probabilities is through Monte Carlo simulation. Specifically, we generate X_b = (X_{1,b}, . . . , X_{n,b}) for b = 1, . . . , B, where X_{i,b} is drawn independently from P_{X|Z}(· | Z_i). Denote by T^M_{j,b} and T^C_{j,b} the marginal and conditional independence test statistics computed based on (X_b, Y_j, Z). Hence a conservative estimate for the FDP is given by

F̂DP(t_1, t_2) = Σ_{j=1}^m B^{-1} Σ_{b=1}^B 1{T^M_{j,b} ≥ t_1, T^C_{j,b} ≥ t_2} / (#{j : T^M_j ≥ t_1, T^C_j ≥ t_2} ∨ 1).
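In code, the conservative FDP estimate amounts to averaging rejection indicators over the resampled statistics. The minimal R sketch below assumes that the observed statistics (TM, TC) and their resampled null counterparts (TM_null, TC_null, stored as m × B matrices) have already been computed; the function name estimate_fdp is ours.

```r
## Conservative FDP estimate at cut-offs (t1, t2).
## TM, TC:           observed marginal/conditional statistics (length m)
## TM_null, TC_null:  statistics recomputed on resampled X_b ~ P_{X|Z} (m x B matrices)
estimate_fdp <- function(t1, t2, TM, TC, TM_null, TC_null) {
  est_false <- sum(rowMeans(TM_null >= t1 & TC_null >= t2))  # Monte Carlo numerator
  n_reject  <- sum(TM >= t1 & TC >= t2)                      # observed rejections
  est_false / max(n_reject, 1)
}
```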

Finding the optimal cut-off
We now introduce a greedy approach to select the cut-offs. For a desired FDR level q, we first define

F_q = {(t_1, t_2) : F̂DP(t_1, t_2) ≤ q}

as the feasible set that contains all the cut-off values controlling the FDP estimate at the level q.
We then select the optimal cut-off as the one delivering the largest number of rejections from the feasible set:

(t̂_1, t̂_2) = argmax_{(t_1, t_2) ∈ F_q} #{j : T^M_j ≥ t_1, T^C_j ≥ t_2}.

Finally, we reject all the hypotheses H_{0,j} such that T^M_j ≥ t̂_1 and T^C_j ≥ t̂_2.

Remark 1. In the supplement, we describe a variant of the 2d procedure (2dFWER+) to control the family-wise error rate (FWER). Simulation studies suggest that 2dFWER+ provides reliable FWER control in finite samples.
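Returning to the cut-off selection, the feasible set can be searched over a grid of candidate cut-offs. The R sketch below reuses estimate_fdp from the previous sketch; the grid (quantiles of the observed statistics, always including t_1 = 0 so that the 1d solution remains feasible) is one reasonable choice, not a prescription of the paper.

```r
## Greedy search for the cut-offs maximizing the number of rejections
## subject to the estimated FDP being at most q.
select_cutoffs <- function(TM, TC, TM_null, TC_null, q = 0.05,
                           grid1 = c(0, quantile(TM, seq(0.5, 0.99, 0.01))),
                           grid2 = quantile(TC, seq(0.5, 0.99, 0.01))) {
  best <- list(t1 = 0, t2 = Inf, n_reject = 0)
  for (t1 in grid1) {
    for (t2 in grid2) {
      if (estimate_fdp(t1, t2, TM, TC, TM_null, TC_null) <= q) {
        nr <- sum(TM >= t1 & TC >= t2)
        if (nr > best$n_reject) best <- list(t1 = t1, t2 = t2, n_reject = nr)
      }
    }
  }
  best  # reject H_{0,j} whenever TM[j] >= best$t1 and TC[j] >= best$t2
}
```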

Estimating the null proportion
Following the idea in Storey (2002), we can further improve the power of our method by estimating the proportion of null hypotheses. As a motivation, we suppose T^C_j follows the mixture

π_0 P_0 + (1 − π_0) P_1,

where π_0 represents the null proportion, and P_0 and P_1 denote the distributions under the null and the alternative, respectively. Under this two-group mixture model, the proportion of conditional statistics not exceeding a given threshold is, by the law of large numbers, approximately the corresponding mixture probability. Therefore, we propose to estimate the null proportion π_0 by comparing the observed proportion of small conditional statistics with the corresponding null probability, which can be approximated using the resampled statistics. We can then implement 2dFDR+ based on the FDP estimate rescaled by π̂_0, which can be regarded as John Storey's version of the 2dFDR+ procedure.
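As an illustration, one Storey-type implementation consistent with the description above approximates the null probability P_0(T^C_j ≤ λ) by the resampled statistics; the choice of λ and the function name below are illustrative assumptions, not the paper's prescription.

```r
## Storey-type estimate of the null proportion pi_0 (a sketch):
## compare the observed fraction of small conditional statistics with the
## corresponding probability under the resampled null distribution.
estimate_pi0 <- function(TC, TC_null, lambda = median(TC)) {
  p_obs  <- mean(TC <= lambda)        # observed proportion below lambda
  p_null <- mean(TC_null <= lambda)   # null probability via resampling
  min(1, p_obs / p_null)
}
```

The resulting estimate can then multiply the FDP estimate used in the cut-off search.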

FDR Control and Power Analysis
We first show that under the global null, a version of the 2dFDR+ procedure provides finite sample FDR control (or equivalently FWER control). The key to the proof relies on the symmetry between the observed and the resampled statistics (T^M_{j,b}, T^C_{j,b}), b = 0, 1, . . . , B, under the global null, where b = 0 corresponds to the observed data. Then we reject any hypothesis such that T^M_{j,0} ≥ t_1(s*) and T^C_{j,0} ≥ t_2(s*), where s* is a data-dependent stopping time indexing a path of candidate cut-offs (t_1(s), t_2(s)); see the proof of Theorem 1 in Section A.3.

Theorem 1.
Under the global null, the above 2dFDR+ procedure provides finite sample FDR control or equivalently FWER control.
In the general setting, the symmetry among the observed and resampled statistics (T^M_{j,b}, T^C_{j,b}), b = 0, 1, . . . , B, no longer holds and finite sample FDR control is not guaranteed. Fortunately, we manage to show that 2dFDR+ provides asymptotic FDR control as n and m both diverge to infinity. To achieve this goal, we impose the following assumptions.
Assumption 1 requires that the marginal models of Y_j conditional on X and Z are independent across j. As an illustration, consider the model

Y_j = u_j(X) + v_j(Z) + ε_j,   (4)

where u_j(·) and v_j(·) are some functions defined on R^p and R^d respectively. In this case, Assumption 1 is fulfilled provided that the ε_j's are independent across j.
Assumption 2 is a high-level condition; we justify it under model (4) in Section A.1. Our next assumption is similar to the requirement in Theorem 4 of Storey et al. (2004), which ensures the existence of cut-off values controlling the FDR at level q. It reduces the search region for the optimal cut-offs to a rectangle of the form [0, t_{0,1}] × [0, t_{0,2}].

Assumption 3. Assume that there exist t_{0,1} and t_{0,2} such that the limiting FDP at (t_{0,1}, t_{0,2}) is strictly below the target level q.

The theorem below establishes the asymptotic FDR control of the 2dFDR+ procedure.
We now turn to the power analysis of the oracle 2dFDR+ procedure. We argue that 2dFDR+ is, in general, more powerful than the corresponding 1d procedure based on the conditional independence statistics alone. Assume without loss of generality that T M j takes non-negative values.
The intuition is that for t_1 = 0, the first dimension does not screen out any hypothesis and only the second dimension is effective in identifying signals. In this case, 2dFDR+ reduces to the corresponding 1d procedure, where we reject H_{0,j} if T^C_j ≥ t*, with t* chosen to maximize the number of rejections subject to F̂DP(0, t*) ≤ q. Clearly, (0, t*) is in the feasible set F_q of the optimization problem in Section 3.2. Therefore, the optimal 2d cut-offs deliver at least as many rejections as (0, t*) does. In other words, 2dFDR+ is guaranteed to deliver at least as many rejections as the corresponding 1d procedure does.
Define FP_2d and FP_1d as the numbers of false positives for 2dFDR+ and the associated 1d procedure respectively. Similarly, let TP_2d and TP_1d be the numbers of true positives. Suppose Assumptions 1-3 hold and both procedures make rejections (i.e., FP + TP > 0). In addition, assume that the realized false discovery proportions satisfy FP_1d = q_1(FP_1d + TP_1d) and FP_2d = q_2(FP_2d + TP_2d) for some 0 ≤ q_1, q_2 ≤ 1. As 2dFDR+ makes more rejections, i.e., FP_1d + TP_1d ≤ FP_2d + TP_2d, we must have

TP_2d = (1 − q_2)(FP_2d + TP_2d) ≥ (1 − q_2)(FP_1d + TP_1d) = {(1 − q_2)/(1 − q_1)} TP_1d.

When q_2 ≤ q_1, TP_2d ≥ TP_1d, i.e., 2dFDR+ makes more true rejections. In general, we have the following lower bound on the number of true positives for 2dFDR+.

Corollary 1.
Under Assumptions 1-3 and as B → +∞, we have for any ε > 0,

TP_2d ≥ (1 − q − ε) TP_1d   with probability tending to one.   (6)

As ε can be arbitrarily small, (6) suggests that with the FDR controlled at level q, 2dFDR+ asymptotically achieves at least 100(1 − q)% of the true rejections of the 1d procedure in the worst-case scenario. For instance, with q = 5%, the power loss compared to the 1d procedure is at most 5%.
We refer the readers to Section A.4 of the supplement for more discussions on the asymptotic power of 2dFDR+.
Remark 2. Since 2dFDR+ depends on the marginal independence statistic to filter features, when the confounder and variable of interest have opposite effects on the feature with similar magnitude, they will cancel out each other's effect, and the feature could be excluded erroneously in the first dimension. The optimal cutoff of T M j is thus determined based on the tradeoff between power reduction due to erroneously excluding these relevant features in the first dimension and power increase due to reducing the multiple testing burden and increasing the signal density in the second dimension. If the true signals can only be revealed after adjusting for the confounder, for example, when the true and confounding signals co-locate with opposite directions, the marginal independence test statistics will not be informative. In this case, the best cutoff on T M j should be 0 and 2dFDR+ is then reduced to the 1d procedure. In finite samples, it may not always be possible to reduce 2dFDR+ to the 1d procedure exactly. Nevertheless, as argued above, the power loss is relatively moderate even in the worst-case scenario.

Estimating the Conditional Distributions
As the conditional distribution P X|Z is seldom known, we need to estimate it from the data.
There are indeed several ways of generating samples from P X|Z . Examples include classical methods such as residual permutation (Winkler et al., 2014) and parametric bootstrap (Davison and Hinkley, 1997) as well as modern approaches such as conditional generative adversarial network (conditional GAN) (Mirza and Osindero, 2014;Zhou et al., 2022). In the following subsections, we shall describe the residual permutation, residual bootstrap, and parametric bootstrap in more detail. Compared to the conditional GAN, these procedures are more suitable for omics applications, given the limited sample sizes in many omics association studies.

Residual permutation and residual bootstrap
When X is a continuous random vector, we can model the relationship between X ∈ R^{n×p} and Z ∈ R^{n×d} through the multivariate linear regression model

X = Z B + E,   (7)

where B ∈ R^{d×p} is the matrix of coefficients and E ∈ R^{n×p} is the error matrix. Consider the following strategy to generate samples from P_{X|Z}.
Step 1: Fitting the regression model. Fit the multivariate linear regression model in (7). Let Ê = X − Z B̂ be the matrix of residuals from the fitted model, where B̂ is the least squares estimate of B.
Step 2: Residual permutation. Permute the rows of the residual matrix Ê and denote the resulting matrix by Ê*. Set X_b = Z B̂ + Ê*.

Step 2′: Residual bootstrap. Let Ê** be an n × p matrix whose rows are sampled with replacement from the rows of Ê. Set X_b = Z B̂ + Ê**.

Remark 3. To allow nonlinearity, we can replace Z_i by (g_1(Z_i), . . . , g_d(Z_i)) ∈ R^d for some transformations (g_1, . . . , g_d) in the multivariate regression model.
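The two resampling schemes in Steps 2 and 2′ can be implemented in a few lines of R; the sketch below assumes X and Z are supplied as numeric vectors or matrices and uses ordinary least squares for the fit in Step 1. The function name resample_X is ours.

```r
## Residual permutation / residual bootstrap for a continuous covariate of interest.
resample_X <- function(X, Z, type = c("permutation", "bootstrap")) {
  type <- match.arg(type)
  X <- as.matrix(X); Z <- as.matrix(Z)
  fit <- lm(X ~ Z)                                    # multivariate linear model (7)
  idx <- if (type == "permutation") sample(nrow(X)) else sample(nrow(X), replace = TRUE)
  as.matrix(fitted(fit)) + as.matrix(residuals(fit))[idx, , drop = FALSE]
}

## Example: one resampled copy of X given Z
## X_b <- resample_X(X, Z, "permutation")
```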

Parametric bootstrap
Suppose the conditional distribution of X given Z takes the parametric form P_{X|Z}(x | z; θ), where θ ∈ Θ ⊆ R^r is an unknown parameter. It is natural to estimate the parameter by maximizing the conditional log-likelihood

θ̂ = argmax_{θ ∈ Θ} Σ_{i=1}^n log P_{X|Z}(X_i | Z_i; θ).

Then we can generate X_{i,b} from the estimated conditional distribution P_{X|Z}(· | Z_i; θ̂). For example, suppose X is a Bernoulli random variable with the conditional success probability given by {1 + exp(−Z_i^⊤ θ)}^{-1}.
Then we can sample X_{i,b} from the Bernoulli distribution with success probability {1 + exp(−Z_i^⊤ θ̂)}^{-1}, where θ̂ is an estimate of θ obtained by fitting a logistic model to the data with X being the binary response and Z being the covariates.
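For a binary covariate of interest, the parametric bootstrap thus reduces to refitting a logistic regression and redrawing Bernoulli variables. A minimal sketch, with hypothetical object names and X assumed to be coded 0/1, is given below.

```r
## Parametric bootstrap for binary X given confounders Z.
resample_X_binary <- function(X, Z, B = 100) {
  dat  <- data.frame(X = X, Z = Z)
  fit  <- glm(X ~ ., data = dat, family = binomial)       # logistic model of X on Z
  phat <- fitted(fit)                                      # estimated P(X_i = 1 | Z_i)
  replicate(B, rbinom(length(X), size = 1, prob = phat))   # n x B matrix of draws X_{i,b}
}
```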

Independence Tests
We review some parametric and nonparametric unconditional/conditional independence tests and discuss their use within our framework. In Section 6.1, we focus on the model-based (parametric) independence tests. In Sections 6.2.1-6.2.2, we consider two types of nonparametric independence tests targeting linear and nonlinear dependence respectively. These three types of independence tests will all be implemented in our numerical studies.

Model-based statistics
Suppose the conditional likelihood of Y_j given X and Z has the form

P_{Y_j|X,Z}(Y_j | X^⊤ α_j + Z^⊤ β_j).   (8)

The log-likelihood function based on the observations is given by

ℓ_j(α_j, β_j) = Σ_{i=1}^n log P_{Y_j|X,Z}(Y_{i,j} | X_i^⊤ α_j + Z_i^⊤ β_j).

In this case, testing H_{0,j} is equivalent to testing whether α_j is zero. Thus we let T^C_j be a statistic for testing α_j = 0 under the model (8). Examples include the Wald test and the likelihood-ratio test.
To test the marginal independence, we consider the reduced model P_{Y_j|X,Z}(Y_j | X^⊤ α_j) obtained by forcing β_j = 0 in (8). Under the reduced model, we let T^M_j be a statistic for testing α_j = 0, which can be viewed as testing the marginal independence Y_j ⊥⊥ X. When P_{Y_j|X,Z} is the likelihood function associated with a linear model with Gaussian errors, we can let T^C_j and T^M_j be the adjusted and unadjusted z-statistics considered in Yi et al. (2021). In this sense, the statistics in Yi et al. (2021) fall into our framework.
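As a concrete illustration, the conditional and marginal statistics of this section can be obtained from a full and a reduced GLM fit for each feature. In the sketch below, the Wald statistic and the binomial family are illustrative choices, and the function name model_stats is ours.

```r
## Model-based statistics for one feature: T^C_j from the full model (8),
## T^M_j from the reduced model obtained by forcing beta_j = 0.
model_stats <- function(y, X, Z, family = binomial) {
  full    <- glm(y ~ X + Z, family = family)   # confounder-adjusted fit
  reduced <- glm(y ~ X,     family = family)   # unadjusted (marginal) fit
  c(TC = abs(summary(full)$coefficients["X", 3]),      # Wald z/t statistic for alpha_j = 0
    TM = abs(summary(reduced)$coefficients["X", 3]))   # Wald z/t statistic in reduced model
}
```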

Nonparametric dependence metrics
Nonparametric dependence testing, which aims to determine whether two random vectors are dependent without specifying the exact parametric forms of their distributions, is one of the fundamental problems in statistics. Classical metrics or test statistics for dependence testing include the RV coefficient, rank correlation coefficients, and nonparametric Cramér-von Mises-type statistics.
Modern approaches are built on distance and kernel embeddings, which can detect non-linear and non-monotone dependence. Notable examples include the distance covariance (Székely et al., 2007), the Hilbert-Schmidt independence criterion (HSIC) (Gretton et al., 2005, 2007) and the sign covariance (Bergsma and Dassios, 2014). Below we review the RV coefficient and the HSIC and discuss their conditional versions for testing conditional independence.

RV coefficients
Pearson correlation and partial correlation coefficients are perhaps the most commonly used metrics for measuring marginal and conditional dependence. Here we describe the RV coefficient and its conditional version as multivariate generalizations of the squared Pearson correlation coefficient and the squared partial correlation coefficient for detecting linear and conditional linear dependence.
For two random vectors U and V, we let Σ_{U,V} be the covariance matrix between U and V. The RV coefficient between X and Y_j is defined as

RV(X, Y_j) = tr(Σ_{X,Y_j} Σ_{Y_j,X}) / √(tr(Σ_{X,X}²) tr(Σ_{Y_j,Y_j}²)).
To estimate the RV coefficient, we simply replace the covariance matrices in the definition above with the sample covariance matrices.
To introduce the conditional version of the RV coefficient, we let e_X and e_{Y_j} be the residuals obtained by regressing X and Y_j on Z respectively. The conditional RV coefficient is defined as cRV(X, Y_j | Z) = RV(e_X, e_{Y_j}).
Similar to Remark 3, to account for the nonlinear dependence of X and Y_j on Z, we can replace Z by a basis function transformation of it, e.g., a spline transformation.
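A compact R implementation of the RV coefficient and its conditional version is sketched below; the spline degrees of freedom are an illustrative choice, Z is assumed univariate for the spline basis, and the function names are ours.

```r
## RV coefficient: tr(S_uv S_vu) / sqrt(tr(S_uu^2) tr(S_vv^2)).
rv_coef <- function(U, V) {
  U <- scale(as.matrix(U), scale = FALSE)
  V <- scale(as.matrix(V), scale = FALSE)
  n   <- nrow(U)
  Suv <- crossprod(U, V) / (n - 1)
  Suu <- crossprod(U)    / (n - 1)
  Svv <- crossprod(V)    / (n - 1)
  sum(Suv^2) / sqrt(sum(Suu^2) * sum(Svv^2))
}

## Conditional RV: RV of the residuals after regressing X and Y_j on a spline basis of Z.
crv_coef <- function(X, Y, Z, df = 3) {
  B <- splines::ns(Z, df = df)          # natural cubic spline basis of the confounder
  rv_coef(residuals(lm(X ~ B)), residuals(lm(Y ~ B)))
}
```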

Hilbert-Schmidt independence criterion
The Hilbert-Schmidt Independence Criterion (HSIC) was introduced as a kernel-based independence measure by Gretton et al. (2005, 2007). Let k_p(·, ·) be a reproducing kernel Hilbert space (RKHS) kernel defined on R^p × R^p, and define k_1(·, ·) on R × R analogously. Commonly used kernels in this context include the Gaussian kernel and the Laplacian kernel. The HSIC for quantifying the strength of dependence between X and Y_j can be defined as

HSIC(X, Y_j) = E[k_p(X, X′) k_1(Y_j, Y_j′)] + E[k_p(X, X′)] E[k_1(Y_j, Y_j′)] − 2 E[k_p(X, X′) k_1(Y_j, Y_j″)],

where (X′, Y_j′) and (X″, Y_j″) are independent copies of (X, Y_j). When k_p and k_1 are characteristic kernels (Sriperumbudur et al., 2011), HSIC completely characterizes the dependence in the sense that X and Y_j are independent if and only if HSIC(X, Y_j) = 0. To estimate the HSIC, define the kernel matrices K_X = (k_p(X_a, X_b))_{a,b=1}^n and K_{Y_j} = (k_1(Y_{a,j}, Y_{b,j}))_{a,b=1}^n. Let H = I − n^{-1} 1 1^⊤ with 1 being the n-dimensional vector of all ones. Set K̃_X = H K_X H and K̃_{Y_j} = H K_{Y_j} H. The sample HSIC is defined as

HSIC_n(X, Y_j) = n^{-2} tr(K̃_X K̃_{Y_j}),

which has been shown to be a consistent estimator, see Gretton et al. (2005).
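The sample HSIC with a Gaussian kernel and the median-heuristic bandwidth (as used in our numerical studies) can be coded directly from the definition above. The sketch below forms n × n kernel matrices and is therefore intended for moderate sample sizes; the function names are ours.

```r
## Gaussian kernel matrix with the median-heuristic bandwidth.
gauss_kernel <- function(A) {
  D2     <- as.matrix(dist(as.matrix(A)))^2    # squared Euclidean distances
  sigma2 <- median(D2[D2 > 0])                 # median heuristic
  exp(-D2 / sigma2)
}

## Sample HSIC: n^{-2} tr(K~_X K~_Y) with doubly centered kernel matrices.
hsic <- function(X, Y) {
  n  <- NROW(X)
  H  <- diag(n) - matrix(1 / n, n, n)          # centering matrix H = I - n^{-1} 1 1'
  KX <- H %*% gauss_kernel(X) %*% H
  KY <- H %*% gauss_kernel(Y) %*% H
  sum(KX * KY) / n^2                           # trace of the product (both symmetric)
}
```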
A conditional version of the HSIC (cHSIC) for measuring and testing conditional dependence was proposed in Zhang et al. (2012). Here we describe the construction of their statistic. Let K X,Z = (k X,Z,ab ) n a,b=1 with k X,Z,ab = k p+d ((X a , Z a ), (X b , Z b )) and define K Y j ,Z in a similar way.
Denote by K̃_{X,Z} = H K_{X,Z} H and K̃_{Y_j,Z} = H K_{Y_j,Z} H the centered versions of K_{X,Z} and K_{Y_j,Z} respectively. Further define, for some small positive constant ε, regularized versions of these centered kernel matrices that remove the effect of Z through the centered kernel matrix of Z. The sample cHSIC is then given by a normalized trace of the product of the resulting conditionally centered kernel matrices. We refer the readers to Zhang et al. (2012) for the precise construction and more detailed properties of the cHSIC.
Numerical Studies

Simulation setting
We conduct comprehensive simulations to evaluate the performance of 2dFDR+ and compare it to competing methods. Throughout the simulations, we control the following three factors, namely the degree of confounding (ρ, which determines the strength of association between X and Z), the signal strength (the distributions of α_j and β_j) and the signal density (the proportion of non-zero elements in {α_j} and {β_j}). Specifically, we generate α_j and β_j independently over j from a mixture of a continuous signal distribution and a point mass at 0, where the mixing proportion π ∈ (0, 1) controls the signal density and δ_0 denotes the point mass at 0. For each factor, we consider three different scenarios: 1. Degree of confounding: ρ ∈ {0.1, 1, 1.5} roughly corresponds to weak (+), medium (++) and strong (+++) confounding respectively. See Section A.5 for the role of ρ in each simulated model. 2. Signal strength: weak, moderate and strong, determined by the magnitude of the non-zero α_j and β_j (see Section A.5 for the specific distributions). 3. Signal density: the proportion of non-zero coefficients is varied through 1%, 5% and 10%, corresponding to weak, moderate and strong signal density respectively.
We report the empirical FDR and power averaged over 100 simulation runs for all possible combinations of the three factors.

Competing methods
We compare the finite sample performance of the following seven methods.
1. MS-1dFDR: The 1d procedure based on the t-statistics for testing α_j = 0 under the full model (see the detailed descriptions of each data generating model in Section A.5). The 1d procedure is essentially the same as the 2dFDR+ procedure, except that instead of a two-dimensional rejection region, we search for a cutoff along a single dimension, namely that of the conditional statistic. The FDP is estimated using the resampled X_{i,b} (from the conditional distribution of X given Z), but only the conditional statistic is used for estimating the number of false rejections, as opposed to both in the oracle procedure. The statistic used in this 1d procedure is the model-based statistic, i.e., the z-statistic (or t-statistic, depending on the model) corresponding to the coefficient of X in the full model fit.
2. RV-1dFDR: The 1d procedure based on the conditional RV coefficient. To account for the potential non-linearity in the underlying relationship between X and Z (and similarly, Y and Z), the residuals obtained from a cubic spline regression of X on Z (and similarly, Y on Z) have been used in the calculation of the conditional RV coefficient.
3. HSIC-1dFDR: The 1d procedure based on the conditional HSIC (cHSIC) statistic, constructed analogously to RV-1dFDR.

4. 2dFDR: The 2dFDR procedure proposed in Yi et al. (2021), which is based on linear models with the measurement of the omics feature as the outcome and the covariate of interest and confounders as the predictors.

5. MS-2dFDR+: The proposed 2dFDR+ procedure with T^C_j and T^M_j being the t-statistics for testing α_j = 0 under the full model and the reduced model, respectively, as described in Section 6.1.
6. RV-2dFDR+: The proposed 2dFDR+ procedure with T C j = cRV(X, Y j |Z) and T M j = RV(X, Y j ) which denote the sample estimates of the conditional and the unconditional RV coefficients respectively. As before, to account for the potential non-linearity in the underlying relationship between X and Z (and similarly, Y and Z), the residuals obtained from a cubic spline regression of X on Z (and similarly, Y on Z) have been used in the calculation of the conditional RV coefficient.
7. HSIC-2dFDR+: The proposed 2dFDR+ procedure with T^C_j = cHSIC(X, Y_j | Z) and T^M_j = HSIC(X, Y_j), where we set ε = 0.001 and used the Gaussian kernel with the bandwidth parameter chosen using the median heuristic (Garreau et al., 2017).
The 1d procedure can be viewed as a special case of the corresponding 2d procedure by forcing the cutoff of the auxiliary statistic to be zero. As the 2d procedure is searching over a larger rejection region (by allowing the cutoff of the auxiliary statistic to be greater than zero and meanwhile lowering the cutoff for the primary statistic), the proposed 2d procedure is guaranteed to make more rejections in finite sample.

Data generating processes
To examine the performance of the above methods under different settings, we consider the following data generation scenarios. As X and Z are univariate in all cases we denote them by X and Z. The detailed models are provided in Section A.5. Throughout the simulations, we set the sample size n to be 100, and the number of hypotheses m (i.e., the number of features) to be 1000.
A. Linear/nonlinear models with continuous X and Z. Consider the additive model

Y_j = α_j u(X) + β_j v(Z) + ε_j,   (9)

where u(·) and v(·) are possibly nonlinear functions and ε_j denotes the error term. Here X and Z are associated with each other through X ∼ N(ρ h(Z), 1) with Z ∼ N(0, 1), where ρ controls the degree of confounding and h : R → R is a possibly nonlinear function.
We rescale X to dissociate any possible entanglement between the signal strength and the degree of confounding. This type of simulation setup has been used in Models 1-4 to explore the effect of the relations among X, Y_j, and Z on the FDR and power. The empirical FDR and power of RV-1dFDR, HSIC-1dFDR, 2dFDR, RV-2dFDR+ and HSIC-2dFDR+ are summarized in Figures 2 and 3.

B. Linear/nonlinear models with discrete X and continuous Z. In particular, we consider the functional form in (9) and generate X ∼ Bernoulli(e^{ρZ}/(1 + e^{ρZ})), where Z ∼ N(0, 1). Models 5-7 explore this setup. In this case, we generate X_{i,b} through a fitted logistic regression model using Z as the predictor. We report the FDR and power for MS-1dFDR, RV-1dFDR, 2dFDR, MS-2dFDR+ and RV-2dFDR+ as described in Section 7.2 in Figures 10-12 in the supplementary material. HSIC-2dFDR+ and HSIC-1dFDR are not used in this data generating setup because, for binary variables, HSIC is not efficient and the bandwidth parameter is not well-defined.
C. Linear models with discrete X and Z. We consider the linear model Y_j = α_j X + β_j Z + ε_j, where X ∼ Bernoulli(e^{ρZ}/(1 + e^{ρZ})) and Z ∼ Bernoulli(0.7).
D. Binary response. The following logistic regression model has been considered:

logit{P(Y_{i,j} = 1 | X_i, Z_i)} = α_j X_i + β_j Z_i,   (11)

with X ∼ N(ρZ², 1) and Z ∼ N(0, 1). We implement MS-1dFDR, RV-1dFDR, MS-2dFDR+ and RV-2dFDR+, and report the results in Figure 4. In MS-2dFDR+, T^C_j is the statistic for testing α_j = 0 under the full model (11) and T^M_j is the statistic for testing α_j = 0 obtained by forcing β_j = 0 in (11).
In Section A.6, we report some additional numerical results under the following scenarios: (1) FWER control, (2) global null, (3) dependent errors, and (4) separating the effects of the densities of the signal of interest and the confounder signal.

Simulation results
We now discuss the major simulation findings under each scenario. Under Scenario A with linear models (see, e.g., Figure 2), all the methods provide tight FDR control except for 2dFDR, which has slight FDR inflation in some instances when the confounding effect is strong. In contrast, the proposed RV-2dFDR+, which is equivalent to MS-2dFDR+ here, controls the FDR at the target level across all cases, indicating that the proposed method is more robust than the original 2dFDR. In terms of power, we observe that the power decreases as the confounding effect becomes stronger for all procedures. The 2d procedure is comparable to the 1d counterpart when the confounding effect is weak but is substantially more powerful when the confounding effect is strong. We also observe that RV-2dFDR+ is comparable to 2dFDR and is more powerful than HSIC-2dFDR+. When the underlying model is nonlinear, 2dFDR suffers from severe FDR inflation (see, e.g., Figure 3a). In contrast, 2dFDR+ controls the FDR at the target level across different cases. Among the 2dFDR+ variants, RV-2dFDR+ delivers the highest power in most cases.
Under Scenario B, where X is discrete while Z is continuous (see, e.g., Figure 10), the empirical FDR is well controlled for 2dFDR+ even when the confounding effect is strong. 2dFDR suffers from moderate FDR inflation in some instances, e.g., in the case of strong confounding (see Figure 11a).
Not surprisingly, the 2d procedure is significantly more powerful than the corresponding 1d version.
Moreover, the RV-based methods generally make more true rejections compared to the HSIC-based methods.
Under Scenario C, where X and Z are both Bernoulli, as seen from Figure 13, all the approaches have empirical FDR under control. When the degree of confounding is high, 2dFDR+ delivers higher power than 1dFDR does.
Under Scenarios D and E, the original 2dFDR is not applicable, and hence only 1dFDR and 2dFDR+ have been compared in the simulations. As seen from Figures 4, 14 and 15, for Scenarios D-E (binary and count response), all the methods provide reliable FDR control. 2dFDR+ produces significant power improvement over the 1dFDR methods.
To sum up, the proposed 2dFDR+ provides reliable FDR control for all the simulation settings even when the degree of confounding is strong because 2dFDR+ explicitly models the relationship between X and Z. The 2d procedure delivers more rejections compared to the 1d counterpart, and the larger number of rejections typically translates into a higher detection power for the 2d methods. We also see that RV-2dFDR+ provides the best power in many simulation settings. As the (conditional) RV coefficients are calculated based on spline transformed covariates and confounding factors, RV-2dFDR+ can capture the nonlinearity between Y and (X, Z) and X and Z in many cases.

Microbiome data
In the first example, we analyze a microbiome dataset in the R package GUniFrac, which is part of a microbiome data set for studying the smoking effect on the upper respiratory tract microbiome (Charlson et al., 2010). The original data set contains samples from the right and left nasopharynx and oropharynx. Here we use the data from the left oropharynx of 32 nonsmokers and 28 smokers (n = 60). The microbiome composition was profiled using 16S rRNA gene-targeted sequencing and processed using the QIIME bioinformatics pipeline (D'Argenio et al., 2014), resulting in a count table recording the frequencies of 856 detected OTUs (operational taxonomic units). Sex is a confounding factor in this data set, with a higher proportion of smokers among males (odds ratio of 2.3). The aim here is to identify smoking-associated OTUs while adjusting for sex.
For illustration purposes, the OTU abundances were treated as both continuous and binary outcomes. The results for the binary outcomes are given in the supplement. We first filtered out the OTUs occurring in less than 10% of the subjects, which resulted in a total of 174 OTUs. The OTU abundance data were then transformed using a centered log-ratio transformation after adding a pseudocount of 0.5. The numbers of rejections for varying levels of FDR (ranging from 0 to 0.2) were calculated for the following methods: the Benjamini-Hochberg (BH, Benjamini and Hochberg (1995b)) procedure, 2dFDR, MS-2dFDR+, RV-2dFDR+, MS-1dFDR and RV-1dFDR. The BH procedure was applied to the p-values for the significance of the smoking coefficient in a linear regression model with smoking status and sex as the predictors. The numbers of rejections at different FDR levels are shown in Figure 6a. The trend is consistent with the simulations, where we have observed that the 2dFDR+ procedure is more powerful than the corresponding 1dFDR procedure and RV-2dFDR+ makes the highest number of true rejections in most simulation setups. In addition, we produced a Venn diagram (Figure 23) of the rejected features for each method at the FDR level 0.10 to visualize the degree to which the rejected features of the various methods overlap. We find that at level 0.1, MS-2dFDR+ successfully identifies all seven features identified by the 2dFDR procedure and five additional features.
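For completeness, the prevalence filtering and centered log-ratio transformation described above can be carried out as in the sketch below; `otu` is a hypothetical samples-by-OTUs count matrix, and the 10% prevalence cut-off and 0.5 pseudocount follow the text.

```r
## Keep OTUs present in at least 10% of subjects, then CLR-transform the counts.
otu <- otu[, colMeans(otu > 0) >= 0.10]

clr <- function(counts, pseudo = 0.5) {
  logc <- log(counts + pseudo)
  sweep(logc, 1, rowMeans(logc))   # subtract each sample's mean log-abundance
}
otu_clr <- clr(otu)
```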

Metabolomics data
Next, we consider an insulin resistance dataset (Pedersen et al., 2016, 2018), where the goal is to identify serum metabolites associated with insulin resistance (IR) while controlling for the effect of the Body Mass Index (BMI) of the individual. A total of 289 non-diabetic Danish adults were recruited for the study, and their IR was estimated by the homeostatic model assessment (HOMA-IR) (Matthews et al., 1985). Untargeted metabolome profiles were generated on fasting serum samples, producing measurements on 325 polar metabolites and 876 molecular lipids (collectively called serum metabolites, m = 1201). The BMI of a subject is a confounding factor as the IR of a subject is largely influenced by the BMI (correlation coefficient = 0.57). In this example, 2dFDR discovers the largest number of metabolites (481 at 5% FDR), followed by RV-2dFDR+ (432 metabolites at 5% FDR). Both are a significant improvement over RV-1dFDR (333), HSIC-1dFDR (323), and the BH procedure (377).
The comparison of the number of rejections versus the FDR level for all methods is displayed in Figure 6b. Again, the result generally agrees with the findings from the simulation studies. While 2dFDR is the most powerful in this example, its inflated type I error rate observed in many non-linear simulation setups raises some concern about the reliability of the rejections found by it alone. Figure 24 shows the Venn diagram of the serum metabolites detected by the different methods and their degree of overlap at FDR = 0.05. It is interesting to note that while RV-2dFDR+ and 2dFDR have detected 403 metabolites in common, the BH procedure has significantly fewer overlapping metabolites with either of these methods.
An additional challenge we faced while analyzing the metabolomics data was generating samples from the conditional distribution P X|Z . As observed in Figure 5, there is distinct heteroscedasticity in the conditional distribution of IR given BMI. Traditional resampling methods such as residual permutation (Winkler et al., 2014) may fail as homoscedasticity is one of the underlying assumptions.
To combat this, the data set was binned into two parts, namely BMI ≤ 26 and BMI > 26; two separate regressions were fitted to these two subsamples, and the residuals were permuted within each segment. The right panel of Figure 5, which plots the resampled IR versus BMI using binned residual permutation, shows that the heteroscedastic structure has been preserved in the resampled data. The middle panel shows resampling using the traditional residual permutation, and we can see that the original shape of the data has not been maintained in this case.
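A minimal sketch of the binned residual permutation is given below; `ir` and `bmi` denote hypothetical vectors of HOMA-IR and BMI values, and the cut point of 26 follows the text.

```r
## Binned residual permutation: fit and permute residuals separately within BMI segments,
## which preserves the heteroscedasticity visible in Figure 5.
resample_ir <- function(ir, bmi, cut = 26) {
  bin <- ifelse(bmi <= cut, 1, 2)
  out <- numeric(length(ir))
  for (g in unique(bin)) {
    idx      <- which(bin == g)
    fit      <- lm(ir[idx] ~ bmi[idx])
    out[idx] <- fitted(fit) + sample(residuals(fit))   # permute within the segment
  }
  out
}
```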

Gene expression data
Finally, we consider a Pouchitis dataset (Morgan et al., 2015), where the goal is to identify genes whose expression is associated with patient outcomes in a cohort that underwent ileal pouch-anal anastomosis (IPAA) surgery in the past year, adjusting for potential confounders such as antibiotic use and sex. This dataset considers a large population of patients having undergone IPAA at Mount Sinai Hospital, Canada. The expression levels of 19,908 genes were measured at two sites, the J-pouch and the pre-pouch ileum (PPI), using the procedure described in Morgan et al. (2015). We considered the biopsies collected only from the pouch (n = 74) in this example. The conditioning variables were sex, smoking status, and antibiotic use in the previous month. The variable of interest is the disease outcome, including FAP (Familial Adenomatous Polyposis), No Pouchitis, Acute Pouchitis, Chronic Pouchitis, and Crohn's Disease-like Inflammation. As the variable of interest is nominal, we did not use the RV coefficients in this case. Figure 6c shows the number of genes identified as associated with the disease outcome conditioning on sex, smoking status, and antibiotic usage. At the FDR level of 0.05, 2dFDR+ identifies the largest number of genes (2345), followed by MS-1dFDR (1811) and the BH procedure (1640).

Conclusion
We have proposed a general framework (2dFDR+) for performing multiple hypothesis testing while adjusting for confounding effects. Within this new framework, the conditional distribution of the omics features given the variable of interest and confounders can be arbitrary and completely unknown. The framework is flexible in allowing the joint use of any conditional and marginal independence tests, continuous/binary/count/multivariate responses, and various ways of modeling the conditional distribution of the variable of interest given the confounders. As a general methodology, 2dFDR+ can be applied to multiple types of omics data. In view of the numerical results, we recommend using RV-2dFDR+ (based on the spline-transformed variables) under most scenarios due to its robustness and efficiency. In cases where the RV-based statistics are not applicable, for instance, when any of X, Y or Z is categorical, or when Y is discrete (e.g., originating from a Poisson or negative binomial distribution), the model-based statistics are recommended. The statistics will differ depending on the types of the variable of interest and the confounding variable. For example, when all three variables (X, Y, Z) are categorical, as in the binary outcome case for the smoking microbiome data, the Pearson's chi-square statistic and the Cochran-Mantel-Haenszel (CMH) statistic are recommended for testing the marginal and the conditional dependence, respectively. Table 1 summarizes the statistics that we recommend using under different scenarios.

A.1 Justification of Assumption 2

Consider model (4) and let B_X = (B_{1,X}(X), . . . , B_{J_1,X}(X))^⊤ and B_Z = (B_{1,Z}(Z), . . . , B_{J_2,Z}(Z))^⊤ denote vectors of basis functions of X and Z, and consider the projection onto the column space spanned by the basis functions of Z. We consider model-based statistics in which σ̂²_j is a consistent variance estimator of σ²_j, i.e., σ̂²_j →_p σ²_j. Define Σ_X = cov(B_X), Σ_{XZ} = cov(B_X, B_Z) and Σ_{X|Z} = Σ_X − Σ_{XZ} Σ_Z^{-1} Σ_{ZX}; by the law of large numbers, the sample counterparts of these quantities converge to their population versions. If (√n α_j/σ_j, √n β_j/σ_j) follows some distribution F independently across j and, conditional on α_j = 0, √n β_j/σ_j follows the distribution F_0 independently for j ∈ M_0, then we have

V(t_1, t_2) = ∫ F(t_1, t_2, (0, β)) dF_0(β),   S(t_1, t_2) = ∫ F(t_1, t_2, (α, β)) dF(α, β).

A.2 Family-wise error rate control
Family-wise error rate (FWER), referring to the probability of making at least one false discovery, provides a more stringent type I error rate control. It is preferable to the FDR if the overall conclusion from various individual inferences is likely to be erroneous when at least one of them is, or if the existence of a single false claim would cause significant loss. It is natural to ask whether our method can be modified to control other error measures such as the FWER. Here we describe such a procedure to control the FWER. Given the rejection rule 1{T^M_j ≥ t_1, T^C_j ≥ t_2}, we let

F̂WER(t_1, t_2) := Σ_{j=1}^m F̂_{j,B}(t_1, t_2)

be an estimate of the FWER. We choose the optimal cut-off value as the one that maximizes the number of rejections while controlling the FWER estimate at a prespecified level q:

(t̃_1, t̃_2) = argmax #{j : T^M_j ≥ t_1, T^C_j ≥ t_2}   subject to   F̂WER(t_1, t_2) ≤ q.

Then we reject H_{0,j} whenever T^M_j ≥ t̃_1 and T^C_j ≥ t̃_2. We name the above procedure 2dFWER+. In Section A.6, we investigate its finite sample performance and report the empirical FWER and power for 2dFWER+ and its corresponding 1d version (1dFWER) in Figures 21 and 22.

A.3 Technical details
In this section, we prove the main theoretical results in the paper. We first present the following lemma which was recently proved in Naaman (2021).
Lemma 1 (Dvoretzky-Kiefer-Wolfowitz inequality). Let ξ_1, · · · , ξ_n be independent d-dimensional random vectors with the distribution function F(t) = P(ξ_i ≤ t), where ξ_i ≤ t means that ξ_{ij} ≤ t_j for ξ_i = (ξ_{i1}, . . . , ξ_{id}) and 1 ≤ j ≤ d. Denote the standard empirical distribution function by F_n(t) = n^{-1} Σ_{i=1}^n 1{ξ_i ≤ t}. Then for any ε > 0,

P(sup_t |F_n(t) − F(t)| > ε) ≤ d(n + 1) exp(−2nε²).

We now present the proof of the main theoretical results.
Proof of Theorem 1. Define the filtration 1 ≤ a ≤ s) with s < t are the same across all b = 0, 1, . . . , B. By the symmetry, we must have for Thus U (t) − 1/(B + 1) is a martingale difference sequence. Also, we have {s * ≤ t} ∈ F t . Therefore, s * is a stopping time. By the optional stopping time theorem, Recall from the definition of s * that Using (12) and (13), we obtain Proof of Theorem 2. For (t 1 , t 2 ) ∈ R + × R + , define the following processes We divide the proof into two steps. In Step 1, we obtain some uniform convergence results while in Step 2, we apply these results to show the FDR control.

A.4 Asymptotic Power Analysis
We perform an asymptotic power analysis by comparing the asymptotic power of 2dFDR+ with that of the associated 1d procedure. For (t_1, t_2) ∈ R_+ × R_+, define K(t_1, t_2) as the limiting proportion of true rejections at the cut-offs (t_1, t_2), which can be considered as the limiting power process. Assume that, uniformly over t_1 ≤ t_{0,1} and t_2 ≤ t_{0,2}, the FDP process FDP(t_1, t_2) converges to a limiting process FDP_∞(t_1, t_2), in view of the derivations in Section A.3. To understand the power behavior of 2dFDR+ and the associated 1d procedure, we consider the following (infeasible) procedures based on the above limiting processes:

Limiting 2dFDR+: (t*_{1,2d}, t*_{2,2d}) = argmax_{(t_1, t_2) ∈ R_+ × R_+} S(t_1, t_2) subject to FDP_∞(t_1, t_2) ≤ q;

Limiting 1dFDR: t*_{1d} = argmax_{t ∈ R_+} S(0, t) subject to FDP_∞(0, t) ≤ q.

As (0, t*_{1d}) is a feasible point of the optimization problem in the limiting 2dFDR+, we must have S(t*_{1,2d}, t*_{2,2d}) ≥ S(0, t*_{1d}). Assume that FDP_∞(t_1, t_2) is a continuous function of (t_1, t_2). Then we have FDP_∞(t*_{1,2d}, t*_{2,2d}) = q, as otherwise one could lower the values of (t_1, t_2) to increase the value of the objective function S.

Some algebra, combined with (25) and a comparison with the result in Corollary 1, identifies two terms, one of which is (1 − π_0) U(0, t*_{1d})/S(0, t*_{1d}), that determine the power improvement. In the worst-case scenario, the ratio of the limiting powers is bounded below by 1 − q, which again suggests that the power loss is at most q.

A.6 Additional simulation results
• FWER control: We investigate the finite sample performance of 2dFWER+ and its corresponding 1d version. In Figures 21-22, we report the empirical FWER and power of 2dFWER+ and 1dFWER for both the linear and nonlinear models. In either case, the empirical FWER is well controlled for both methods. The 2d procedure again produces higher power than the 1d version, especially for stronger confounders.
• Global null: We examine the performance of 2dFDR, RV-2dFDR+, HSIC-2dFDR+, RV-1dFDR and HSIC-1dFDR under the global null. Specifically, we consider the model Y_j = β_j Z + ε_j, where X ∼ N(ρZ, 1) and Z ∼ N(0, 1). None of the methods produced any rejections at any degree of confounding.
• Dependent errors: To evaluate the impact of dependence on the methods' performance, we consider the model Y_j = α_j e^X + β_j e^Z + ε_j, where ε_j = 0.7 ε_{j−1} + e_j and X ∼ N(ρ(Z + Z²), 1) with Z ∼ N(0, 1) and {e_j}_{j=1}^m being a white noise process. The results are summarized in Figure 16. Overall, 2dFDR+ is robust to the AR(1)-type dependence, with reliable FDR control and reasonable power.
• Separating the effects of the densities of the signal of interest and the confounder signal: In all preceding simulations, the density of the signal of interest and that of the confounding signal had been kept at the same level (weak, moderate or strong). In this simulation setup, we attempt to tease apart the effects of the two types of signals.
1. First, we fix the density of the signal of interest at the 10% level and vary the density of the confounding signal through weak, moderate, and strong. The associated plots are given in Figure 17 and Figure 19, corresponding to linear and non-linear DGPs respectively.
2. Next, we fix the density of the confounding signal to 10% and vary the density of the signal of interest through weak, moderate, and strong. The associated plots are in Figure 18 and Figure 20, corresponding to linear and non-linear DGPs respectively.
In both the linear and the non-linear DGPs, we find that varying the density of the signal of interest while keeping the density of the confounding signal constant displays a starker increase in power as the density is increased.

A.7 Microbiome data: Binary Outcomes
We consider the microbiome data analyzed in Section 8 of the main paper. The abundance data of the 174 OTUs were converted into presence/absence data after rarefaction to the minimal sequencing depth (since presence/absence depends strongly on the sequencing depth, rarefaction removes the confounding effect due to sequencing depth). Because X, Y, and Z are all categorical (specifically, binary) in this case, the Cochran-Mantel-Haenszel statistic was used for the conditional statistic T^C, and Pearson's chi-square statistic was used for the marginal statistic T^M. Note that the original 2dFDR in Yi et al. (2021) is not applicable in this case as the outcomes are binary.
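In R, the two statistics can be computed with the base stats functions; the sketch below assumes `y` is the presence/absence vector of one OTU, `x` the smoking status and `z` the sex of each subject (all hypothetical object names).

```r
## Marginal (Pearson chi-square) and conditional (Cochran-Mantel-Haenszel) statistics
## for a single binary OTU outcome.
binary_stats <- function(y, x, z) {
  TM <- suppressWarnings(chisq.test(table(x, y), correct = FALSE)$statistic)
  TC <- mantelhaen.test(factor(x), factor(y), factor(z), correct = FALSE)$statistic
  c(TM = unname(TM), TC = unname(TC))
}
```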
As shown in Figure 7, for all levels of FDR, the BH procedure makes no rejections, and overall, the 2dFDR+ procedure makes a higher number of rejections compared to the corresponding 1dFDR procedure.

Figure 17: Empirical FDR and power for HSIC-1dFDR, RV-1dFDR, 2dFDR, HSIC-2dFDR+, RV-2dFDR+ under the model Y_j = α_j X + β_j Z + ε_j, where X ∼ N(ρZ, 1) and Z ∼ N(0, 1). The signal density of α_j has been fixed at 10% while the signal density of β_j has been varied through 1%, 5% and 10%. Error bars represent the 95% CIs and the horizontal line in (a) indicates the target FDR level of 0.05.

Figure 18: Empirical FDR and power for HSIC-1dFDR, RV-1dFDR, 2dFDR, HSIC-2dFDR+, RV-2dFDR+ under the model Y_j = α_j X + β_j Z + ε_j, where X ∼ N(ρZ, 1) and Z ∼ N(0, 1). The signal density of β_j has been fixed at 10% while the signal density of α_j has been varied through 1%, 5% and 10%. Error bars represent the 95% CIs and the horizontal line in (a) indicates the target FDR level of 0.05.

Figure 19: Empirical FDR and power for HSIC-1dFDR, RV-1dFDR, 2dFDR, HSIC-2dFDR+, RV-2dFDR+ under the model Y_j = α_j e^X + β_j Z² + ε_j, where X ∼ N(ρZ², 1) and Z ∼ N(0, 1). The signal density of α_j has been fixed at 10% while the signal density of β_j has been varied through 1%, 5% and 10%. Error bars represent the 95% CIs and the horizontal line in (a) indicates the target FDR level of 0.05.

Figure 20: Empirical FDR and power for HSIC-1dFDR, RV-1dFDR, 2dFDR, HSIC-2dFDR+, RV-2dFDR+ under the model Y_j = α_j e^X + β_j Z² + ε_j, where X ∼ N(ρZ², 1) and Z ∼ N(0, 1). The signal density of β_j has been fixed at 10% while the signal density of α_j has been varied through 1%, 5% and 10%. Error bars represent the 95% CIs and the horizontal line in (a) indicates the target FDR level of 0.05.