False discovery rates: a new deal

Summary We introduce a new Empirical Bayes approach for large-scale hypothesis testing, including estimation of false discovery rates (FDRs) and effect sizes. This approach has two key differences from existing approaches to FDR analysis. First, it assumes that the distribution of the actual (unobserved) effects is unimodal, with a mode at 0. This "unimodal assumption" (UA), although natural in many contexts, is not usually incorporated into standard FDR analysis, and we demonstrate how incorporating it brings many benefits. Specifically, the UA facilitates efficient and robust computation (estimating the unimodal distribution involves solving a simple convex optimization problem) and enables more accurate inferences provided that it holds. Second, the method takes as its input two numbers for each test (an effect size estimate and corresponding standard error), rather than the one number usually used ($p$ value or $z$ score). When available, using two numbers instead of one helps account for variation in measurement precision across tests. It also facilitates estimation of effects, and unlike standard FDR methods, our approach provides interval estimates (credible regions) for each effect in addition to measures of significance. To provide a bridge between interval estimates and significance measures, we introduce the term "local false sign rate" to refer to the probability of getting the sign of an effect wrong, and we argue that it is superior to the local FDR as a measure of significance because it is both more generally applicable and can be more robustly estimated. Our methods are implemented in an R package ashr available from http://github.com/stephens999/ashr.


Replace normal likelihood with t likelihood
We generalize the normal likelihood (2.4) by replacing it with a $t$ likelihood:
$$
\hat\beta_j \mid \beta_j, \hat s_j \sim T_\nu(\beta_j, \hat s_j), \tag{S.1.2}
$$
where $T_\nu(\beta_j, \hat s_j)$ denotes the distribution of $\beta_j + \hat s_j T_\nu$, where $T_\nu$ has a standard $t$ distribution on $\nu$ degrees of freedom, and $\nu$ denotes the degrees of freedom used to estimate $\hat s_j$ (assumed known, and for simplicity assumed to be the same for each $j$). The normal approximation (2.4) corresponds to the limit $\nu \to \infty$. This generalization does not complicate inference when the mixture components $f_k$ in (S.1.1) are uniforms; when the $f_k$ are normal, the computations with a $t$ likelihood are considerably more difficult and not implemented. Equation (S.1.2) is, of course, motivated by the standard asymptotic result
$$
(\hat\beta_j - \beta_j)/\hat s_j \sim t_\nu. \tag{S.1.3}
$$
However, (S.1.3) does not imply (S.1.2), because in (S.1.3) $\hat s_j$ is random whereas in (S.1.2) it is conditioned on. In principle it would be preferable, for a number of reasons, to model the randomness in $\hat s_j$; we are currently pursuing this improved approach in joint work with M. Lu.
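As a concrete illustration, the density of $T_\nu(\beta_j, \hat s_j)$ at an observed value follows from a change of variables (a minimal R sketch; the function name is ours, not part of ashr):

```r
# Density of the t likelihood (S.1.2) at an observed beta_hat:
# if beta_hat = beta + s_hat * T_nu, then its density is
# dt((beta_hat - beta) / s_hat, nu) / s_hat.
dt_scaled <- function(beta_hat, beta, s_hat, nu) {
  dt((beta_hat - beta) / s_hat, df = nu) / s_hat
}
```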

Non-zero mode
An addition to our software implementation, due to C. Dai, allows the mode to be estimated from the data by maximum likelihood. This involves a simple grid search.
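A minimal sketch of such a grid search, assuming a helper `loglik_given_mode` that evaluates the fitted log-likelihood with $g$ re-centred at a candidate mode $\mu$ (a hypothetical function, not the ashr API):

```r
# Grid search for the mode by maximum likelihood (illustrative only).
# loglik_given_mode(mu, beta_hat, s_hat) is assumed to fit pi with the
# mixture g centred at mu and return the resulting log-likelihood.
estimate_mode <- function(beta_hat, s_hat, loglik_given_mode, n_grid = 100) {
  grid <- seq(min(beta_hat), max(beta_hat), length.out = n_grid)
  ll <- vapply(grid, function(mu) loglik_given_mode(mu, beta_hat, s_hat),
               numeric(1))
  grid[which.max(ll)]  # grid point maximizing the likelihood
}
```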

S.2 Implementation Details
Likelihood for π
We define the likelihood for $\pi$ to be the probability of the observed data $\hat\beta$ conditional on $\hat s$:
$$
L(\pi) := p(\hat\beta \mid \hat s, \pi) = \prod_{j=1}^J p(\hat\beta_j \mid \hat s_j, \pi), \tag{S.2.1}
$$
where the right-hand side comes from our conditional independence assumptions. [One might prefer to define the likelihood as $p(\hat\beta, \hat s \mid \pi) = p(\hat\beta \mid \hat s, \pi)\,p(\hat s \mid \pi)$, in which case our definition comes down to assuming that the term $p(\hat s \mid \pi)$ does not depend on $\pi$.] Using the prior $\beta_j \sim \sum_{k=0}^K \pi_k f_k(\beta_j)$ given by (S.1.1), and the normal likelihood (2.4), integrating over $\beta_j$ yields
$$
p(\hat\beta_j \mid \hat s_j, \pi) = \sum_{k=0}^K \pi_k \tilde f_k(\hat\beta_j), \tag{S.2.2}
$$
where $\tilde f_k := f_k * N(0, \hat s_j^2)$ denotes the convolution of $f_k$ with a normal density. These convolutions are straightforward to evaluate whether $f_k$ is a normal or uniform density. Specifically,
$$
f_k = N(\mu_k, \sigma_k^2) \implies \tilde f_k(\hat\beta_j) = N(\hat\beta_j;\, \mu_k,\, \sigma_k^2 + \hat s_j^2), \tag{S.2.3}
$$
$$
f_k = U[a_k, b_k] \implies \tilde f_k(\hat\beta_j) = \frac{\Psi\big((\hat\beta_j - a_k)/\hat s_j\big) - \Psi\big((\hat\beta_j - b_k)/\hat s_j\big)}{b_k - a_k}, \tag{S.2.4}
$$
where $\Psi$ denotes the cumulative distribution function (c.d.f.) of the standard normal distribution. If we replace the normal likelihood with the $t_\nu$ likelihood (S.1.2), then for $f_k$ uniform the convolution is still given by (S.2.4) but with $\Psi$ the c.d.f. of the $t_\nu$ distribution. (The convolution for $f_k$ normal is tricky and we have not implemented it.)
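For concreteness, here is the uniform-component convolution (S.2.4) in R (a minimal sketch; the function name is ours, not part of ashr):

```r
# Convolution (S.2.4) of U[a, b] with N(0, s_hat^2), evaluated at beta_hat.
# For the t-likelihood version, replace pnorm(x) with pt(x, df = nu).
conv_unif <- function(beta_hat, s_hat, a, b) {
  (pnorm((beta_hat - a) / s_hat) - pnorm((beta_hat - b) / s_hat)) / (b - a)
}
```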

Penalty term on π
To make lfdr and lfsr estimates from our method "conservative" we add a penalty term $\log h(\pi; \lambda)$ to the log-likelihood $\log L(\pi)$ to encourage over-estimation of $\pi_0$:
$$
h(\pi; \lambda) \propto \prod_{k=0}^K \pi_k^{\lambda_k - 1}, \tag{S.2.5}
$$
where $\lambda_k \ge 1$ for all $k$. The default is $\lambda_0 = 10$ and $\lambda_k = 1$ for $k = 1, \dots, K$, which yielded consistently conservative estimation of $\pi_0$ in our simulations (Figure 2b). Although this penalty is based on a Dirichlet density, we do not interpret it as a "prior distribution" for $\pi$: we chose it to provide conservative estimates of $\pi_0$ rather than to represent prior belief.
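On the log scale this penalty is a single weighted sum (a minimal R sketch with the defaults described above; the function name is ours):

```r
# log h(pi; lambda) up to a constant, per (S.2.5), with defaults
# lambda_0 = 10 and lambda_k = 1 for the non-null components.
# Terms with lambda_k = 1 are dropped to avoid 0 * log(0) = NaN.
log_penalty <- function(pi, lambda = c(10, rep(1, length(pi) - 1))) {
  keep <- lambda > 1
  sum((lambda[keep] - 1) * log(pi[keep]))
}
```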

Problems with removing the penalty term in the half-uniform case
It is straightforward to remove the penalty term by setting λ_k = 1 in (S.2.5). We note here an unanticipated problem we came across when using no penalty term in the half-uniform case (i.e. $f_k(\cdot) = U[\cdot; -a_k, 0]$ and/or $U[\cdot; 0, a_k]$ in (S.1.1)): when the data are nearly null, the estimated $g$ converges, as expected and desired, to a distribution with almost all its mass near 0, but sometimes this mass is concentrated almost entirely on one side (left or right) of 0. This can have a profound effect on the local false sign rate: for example, if all the mass lies just to the right of 0, then every observation is assigned a very high probability of being positive (albeit very small in magnitude), and hence a misleadingly low local false sign rate. For this reason we do not recommend use of the half-uniform components with no penalty.

Optimization
With this in place, the penalized log-likelihood for $\pi$ is given by:
$$
\log L(\pi) + \log h(\pi; \lambda) = \sum_{j=1}^J \log\Big(\sum_{k=0}^K \pi_k l_{kj}\Big) + \sum_{k=0}^K (\lambda_k - 1)\log \pi_k + \text{const}, \tag{S.2.6}
$$
where the $l_{kj} := \tilde f_k(\hat\beta_j)$ are known. This is a convex optimization problem, which can be solved very quickly and reliably using interior point (IP) methods. We used the KWdual function from the R package REBayes (Koenker, 2015), which uses Rmosek (MOSEK ApS, 2016). We also found a simple EM algorithm (Dempster et al., 1977), accelerated using the elegant R package SQUAREM (Varadhan and Roland, 2008), to provide adequate performance. In our EM implementation we initialized $\pi_k = 1/n$ for $k = 1, \dots, K$, with $\pi_0 = 1 - \pi_1 - \dots - \pi_K$, and the one-step updates are:
$$
w_{kj} = \frac{\pi_k l_{kj}}{\sum_{k'=0}^{K} \pi_{k'} l_{k'j}}, \tag{S.2.7}
$$
$$
n_k = \lambda_k - 1 + \sum_{j=1}^J w_{kj}, \tag{S.2.8}
$$
$$
\pi_k^{\text{new}} = \frac{n_k}{\sum_{k'=0}^K n_{k'}}. \tag{S.2.9}
$$
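As a concrete sketch of the update (S.2.7)-(S.2.9) in R (our own minimal illustration, not the ashr implementation; `lik` is assumed to be the $J \times (K+1)$ matrix with entries $l_{kj}$, rows indexed by $j$):

```r
# One penalized EM step for pi; lik[j, k] holds l_kj for k = 0, ..., K.
em_step <- function(pi, lik, lambda) {
  w <- sweep(lik, 2, pi, "*")     # pi_k * l_kj
  w <- w / rowSums(w)             # posterior weights w_kj (S.2.7)
  n_k <- lambda - 1 + colSums(w)  # penalized expected counts (S.2.8)
  n_k / sum(n_k)                  # updated pi (S.2.9)
}
```

Iterating this map to a fixed point maximizes (S.2.6); the iteration can be accelerated with SQUAREM, as noted above.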

Conditional distributions
Given $\hat\pi$, we compute the conditional distributions
$$
p(\beta_j \mid \hat\pi, \hat\beta, \hat s) \propto g(\beta_j; \hat\pi)\, L(\beta_j; \hat\beta_j, \hat s_j). \tag{S.2.10}
$$
Each posterior is a mixture of $K + 1$ components:
$$
p(\beta_j \mid \hat\pi, \hat\beta_j, \hat s_j) = \sum_{k=0}^K w_{kj}\, p_k(\beta_j \mid \hat\beta_j, \hat s_j), \tag{S.2.11}
$$
where the posterior weights $w_{kj}$ are computed as in (S.2.7) with $\pi = \hat\pi$, and the posterior mixture component $p_k$ is the posterior on $\beta_j$ that would be obtained using prior $f_k(\beta_j)$ and likelihood $L(\beta_j; \hat\beta_j, \hat s_j)$. All these posterior distributions are easily available. For example, if $f_k$ is uniform and $L$ is $t_\nu$, then this is a truncated $t$ distribution. If $f_k$ is normal and $L$ is normal, then this is a normal distribution.
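For example, in the normal prior/normal likelihood case the component posterior is available in closed form via the standard conjugate update (a minimal R sketch; the function name is ours):

```r
# Posterior for one observation under prior beta ~ N(mu_k, sigma_k^2)
# and likelihood beta_hat ~ N(beta, s_hat^2): a normal distribution.
post_normal_normal <- function(beta_hat, s_hat, mu_k, sigma_k) {
  post_var  <- 1 / (1 / sigma_k^2 + 1 / s_hat^2)
  post_mean <- post_var * (mu_k / sigma_k^2 + beta_hat / s_hat^2)
  c(mean = post_mean, sd = sqrt(post_var))
}
```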
Choice of grid for σ_k, a_k
When $f_k$ is $N(0, \sigma_k^2)$ we specify our grid by specifying: (i) a maximum and minimum value $(\sigma_{\min}, \sigma_{\max})$; (ii) a multiplicative factor $m$ to be used in going from one grid point to the next, so that $\sigma_k = m\sigma_{k-1}$. The multiplicative factor affects the density of the grid; we used $m = \sqrt{2}$ as a default. We chose $\sigma_{\min}$ to be small compared with the measurement precision ($\sigma_{\min} = \min_j(\hat s_j)/10$) and $\sigma_{\max} = 2\sqrt{\max_j(\hat\beta_j^2 - \hat s_j^2)}$, based on the idea that $\sigma_{\max}$ should be big enough that $\sigma_{\max}^2 + \hat s_j^2$ exceeds $\hat\beta_j^2$. (In rare cases where $\max_j(\hat\beta_j^2 - \hat s_j^2)$ is negative we set $\sigma_{\max} = 8\sigma_{\min}$.) When the mixture components $f_k$ are uniform, we use the same grid for the parameters $a_k$ as for $\sigma_k$ described above.
Our goal in specifying a grid was to make the limits sufficiently large and small, and the grid sufficiently dense, that results would not change appreciably with a larger or denser grid. For a specific data set one can of course check this by experimenting with the grid, but these defaults usually work well in our experience.
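For illustration, the default grid construction described above can be sketched in a few lines of R (names are ours, not the ashr internals):

```r
# Default multiplicative grid for sigma_k (or a_k), per the rules above.
make_grid <- function(beta_hat, s_hat, mult = sqrt(2)) {
  sigma_min <- min(s_hat) / 10                  # small vs. measurement precision
  d <- max(beta_hat^2 - s_hat^2)
  sigma_max <- if (d > 0) 2 * sqrt(d) else 8 * sigma_min
  n <- ceiling(log(sigma_max / sigma_min, base = mult))
  sigma_min * mult^(0:n)                        # sigma_k = m * sigma_{k-1}
}
```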

Dependence of effects on standard errors
The model (3.11) for general $\alpha$ can be fitted using the algorithm for $\alpha = 0$. To see this, define $b_j := \beta_j/\hat s_j^\alpha$ and $\hat b_j := \hat\beta_j/\hat s_j^\alpha$. Then $\hat b_j$ is an estimate of $b_j$ with standard error $\tilde s_j := \hat s_j^{1-\alpha}$. Applying the algorithm for $\alpha = 0$ to effect estimates $\hat b_1, \dots, \hat b_J$ with standard errors $\tilde s_1, \dots, \tilde s_J$ yields a posterior distribution $p(b_j \mid \hat s_j, \hat b_j, \hat\pi, \alpha)$, which induces a posterior distribution on $\beta_j = b_j \hat s_j^\alpha$.
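A minimal R sketch of this reduction, with `fit_alpha0` standing in for any implementation of the $\alpha = 0$ algorithm (a hypothetical name, not the ashr API):

```r
# Fit the general-alpha model (3.11) by transforming to the alpha = 0 case.
fit_alpha <- function(beta_hat, s_hat, alpha, fit_alpha0) {
  b_hat <- beta_hat / s_hat^alpha   # transformed estimates b_hat_j
  s_b   <- s_hat^(1 - alpha)        # their standard errors
  fit_alpha0(b_hat, s_b)            # posterior on b_j; beta_j = b_j * s_hat^alpha
}
```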

S.3 Supplementary Figures and Tables
[Supplementary figure: average estimated cdfs across ∼10 data sets compared with truth (legend: ash.hu.s, ash.n.s, ash.u.s, truth). Methods here do not use the penalty (S.2.5), so $\pi_0$ is not systematically overestimated; systematic differences from the truth in the "skew" and "bimodal" scenarios highlight the effects of model mis-specification.]

[Supplementary figure: comparison of true and estimated lfsr; the black line is y = x and the red line is y = 2x. (a) Data simulated with no point mass at zero ($\pi_0 = 0$) and analyzed by ash with no point mass at 0 (and a mixture of normal components for $g$); the results illustrate how estimates of lfsr can be more accurate in this case, i.e. assuming there is no point mass can be beneficial if that is indeed true. (b) Data simulated with a point mass at zero ($\pi_0$ drawn uniformly from [0, 1] in each simulation) but analyzed by ash with no point mass at 0 (and a mixture of normal components for $g$); the results illustrate how estimates of lfsr can be anti-conservative if we assume there is no point mass when in truth there is one.]