In Search of Lost (Mixing) Time: Adaptive Markov chain Monte Carlo schemes for Bayesian variable selection with very large p

The availability of data sets with large numbers of variables is rapidly increasing. The effective application of Bayesian variable selection methods for regression with these data sets has proved difficult since available Markov chain Monte Carlo methods do not perform well in typical problem sizes of interest. The current paper proposes new adaptive Markov chain Monte Carlo algorithms to address this shortcoming. The adaptive design of these algorithms exploits the observation that in large $p$ small $n$ settings, the majority of the $p$ variables will be approximately uncorrelated a posteriori. The algorithms adaptively build suitable non-local proposals that result in moves with squared jumping distance significantly larger than standard methods. Their performance is studied empirically in high-dimension problems (with both simulated and actual data) and speedups of up to 4 orders of magnitude are observed. The proposed algorithms are easily implementable on multi-core architectures and are well suited for parallel tempering or sequential Monte Carlo implementations.


Introduction
The availability of large data sets has led to an increasing interest in variable selection methods applied to regression models with many potential variables but few observations (large p, small n problems). Frequentist approaches have mainly concentrated on providing point estimates under assumptions of sparsity using penalized maximum likelihood procedures (Hastie et al., 2015). However, some recent work has considered constructing confidence intervals and taking into account model uncertainty (Shah and Samworth, 2013;Dezeure et al., 2015). Bayesian approaches to variable selection are an attractive and natural alternative and lead to a posterior distribution on all possible models which can be used to address model uncertainty for variable selection and prediction. A growing literature provides a theoretical basis for good properties of the posterior distribution in large p problems (see e.g. Castillo et al., 2015;Johnson and Rossell, 2012).
The posterior probabilities of all models can usually only be calculated or approximated if p is smaller than 30. If p is larger, Markov chain Monte Carlo methods are typically used to sample from the posterior distribution (George and McCulloch, 1997;Dellaportas et al., 2002;O'Hara and Sillanpää, 2009;Bottolo and Richardson, 2010;Clyde et al., 2011). García-Donato and Martínez-Beneito (2013) provide an interesting discussion of the benefits of a Markov chain Monte Carlo approach. The most widely used Markov chain Monte Carlo algorithm in this context is the Metropolis-Hastings sampler where new models are proposed using Add-Remove-Swap samplers (Brown et al., 1998). For example, this method is used by Nikooienejad et al. (2016) in a binary regression model with a non-local prior for the regression coefficients on a data set with 7129 genes. Some supporting theoretical understanding of convergence is available for the Add-Remove-Swap samplers, e.g. conditions for rapid mixing in linear regression model have been derived by Yang et al. (2016). However, the mixing of these methods is often very poor for practical run times, when applied to data sets with thousands of potential covariates. As an alternative, several authors have considered using more general shrinkage priors and developed MCMC algorithms that can work in high-dimensional problems (see e.g. Johndrow and Orenstein, 2017;Bhattacharya et al., 2016).
The challenge of performing Markov chain Monte Carlo for Bayesian variable selection in high dimensions has lead to several developments sacrificing exact posterior exploration. For example, van den Boom et al. (2015) consider using compressed sensing to define an approximation to the posterior distribution on model space. Liang et al. (2013) used the stochastic approximation Monte Carlo algorithm (Liang et al., 2007) to develop algorithms which can efficiently explore model space. In another direction, variable selection can be performed as a post-processing step after fitting a model including all variables (see e.g. Bondell and Reich, 2012;Hahn and Carvalho, 2015). Several authors develop algorithms that focus on high probability a posteriori models. In particular Rockova and George (2014) propose a deterministic expectation-maximisation based algorithm for identifying posterior modes, while Papaspiliopoulos and Rossell (2017) develop an exact deterministic algorithm that returns the most probable model of any given size in the important special case of blockdiagonal design models.
Alternatively, Markov chain Monte Carlo methods for variable selection can be tailored to the data to allow faster convergence and mixing using the adaptive Markov chain Monte Carlo framework (see e.g. Green et al., 2015, Section 2.4, and references therein). Several strategies have been developed in literature for both the Metropolis-type algorithms (Lamnisos et al., 2013;Ji and Schmidler, 2013) and Gibbs samplers (Nott and Kohn, 2005;Richardson et al., 2010). Our proposal is a Metropolis-Hastings kernel that learns the relative importance of the variables, unlike Ji and Schmidler (2013) who use an independent proposal, and unlike Lamnisos et al. (2013) who only tune the number of variables proposed to be changed. This leads to substantially more efficient algorithms than commonly-used methods in high-dimensional settings and for which the computational cost of one step scales linearly with p. The design of these algorithms utilize the observation that in large p, small n settings posterior correlations will be negligible for the vast majority of the p inclusion indicators. The algorithms adaptively build suitable non-local Metropolis-Hastings type proposals that result in moves with expected squared jumping distance (Gelman et al., 1996) significantly larger than standard methods. In idealized examples the limiting versions of our adaptive algorithms converge in O(1) and result in super-efficient sampling. They outperform independent sampling in terms of the expected squared jump distance and also in the sense of the central limit theorem asymptotic variance. This is in contrast to the behaviour of optimal local random walk Metropolis algorithms that on analogous idealized targets need at least O(p) samples to converge (Roberts et al., 1997). The performance of our algorithms is studied empirically in realistic high-dimension problems for both synthetic and real data. In particular, in Section 4.1, for a well studied synthetic data example, speedups of up to 4 orders of magnitude are observed compared to standard algorithms. Moreover, in Section 4.2, we show the efficiency of the method in the presence of multicollinearity on a real data example with p = 100 variables, and in Section 4.3, we present a real data gene expression example with p = 22 576, for which a fully Bayesian analysis has never been presented before, and reliably estimate the posterior inclusion probabilities for all variables.
2 Design of the Adaptive Samplers

The Setting and the Transition Kernel
Our approach is applicable to general regression settings but we now focus the discussion on the family of normal linear regression models. This will allow for clean efficiency comparisons independent of model specific sampling details, such as e.g. the details of a reversible jump implementation. Hence, consider the model specification where y is an (n × 1)-dimensional vector of responses, X = (x 1 , . . . , x p ) is an (n × p)dimensional data matrix and γ = (γ 1 , . . . , γ p ) ∈ Γ = {0, 1} p is a vector of indicator variables in which γ i denotes whether the i-th variable is included in the model (when γ i = 1).
The matrix X γ is formed by those columns of X corresponding to the included variables. Bayesian variable selection involves placing a prior on the parameters of the regression model in (1), (α, β γ , σ 2 ), as well as on the model γ. For clarity of exposition and validity of comparisons we will assume the commonly used prior structure with β γ | σ 2 , γ ∼ N(0, σ 2 V γ ), and p(γ) = h pγ (1 − h) p−pγ where p γ = p j=1 γ j . The hyperparameter 0 < h < 1 is the prior probability that a particular variable is included in the model and V γ is often chosen as proportional to (X T γ X γ ) −1 (a g-prior) or to the identity matrix (implying conditional prior independence between the regression coefficients). For both priors, the marginal likelihood p(y | γ) can be calculated analytically. The prior can be further extended with hyperpriors, for example, assuming that h ∼ Be(a, b).
Our choice of the transition kernel to sample from the posterior distribution π(γ | y) ∝ p(y | γ) p(γ) is motivated by diffusion limits and associated mixing time results in wellunderstood settings related to model selection, and discussed below in Section 2.2. We will consider using a non-symmetric Metropolis-Hastings kernel with proposal parametrisation suitable for optimising the expected squared jumping distance on the model space. Let the probability of proposing to move from model γ to γ be given in a product form where η = (A, D) = (A 1 , . . . , A p , D 1 , . . . , D p ), q η,j (γ j = 0, γ j = 1) = A j and q η,j (γ j = 1, γ j = 0) = D j . The proposed model γ is sampled using p independent Bernoulli trials which allows fast sampling and multiple variables to be added or deleted from the model in one iteration, in contrast to the standard Add-Remove-Swap proposal. The proposed model is accepted using the standard Metropolis-Hastings acceptance probability a η (γ, γ ) = min 1, π(γ | y)q η (γ , γ) π(γ | y)q η (γ, γ ) . (4)

In Search of Lost Mixing Time: Optimising the Sampler
In adaptive Markov chain Monte Carlo algorithms, the Markovian kernels used by the sampler are tuned on the fly to improve mixing using some computationally accessible performance criterion. Optimal scaling (Roberts et al., 1997) is a commonly used and mathematically well justified criterion for the random walk Metropolis algorithm. Suppose that µ p is a p-dimensional probability density function which has the form for some smooth enough f . The optimal scale of a random walk proposal for µ p should be adjusted so that the mean acceptance rate is approximately 0.234. The underlying analysis also implies that the optimised random walk Metropolis will converge to stationarity in O(p) steps (Roberts and Rosenthal, 2016). In practical settings, targeting the 0.234 acceptance rate turns out to be a useful guide even in moderate dimensions and well beyond the restrictive independent, identically distributed product form assumption of Roberts et al. (1997). It is therefore very tempting to follow this commonly used optimisation strategy in case of the Bayesian variable selection problems, where the state space is Γ = {0, 1} p . It would amount to designing a symmetric random walk Metropolis kernel on the model space and tuning it to achieve the 0.234 acceptance rate in the hope that as p increases, the chain will mix in O(p) steps. However, there are several reasons to think the variable selection problem is a special case, and this approach might not work as expected.
Firstly, consider optimal scaling of the random walk Metropolis for the product measures analysed by Roberts (1998). The results support the 0.234 optimal acceptance rate and O(p) mixing with fixed s close to 1/2 as p tends to infinity. However, the picture is more complex if s tends to zero as p tends to infinity, which is relevant to most variable selection problems. The numerical results of Section 3 in Roberts (1998) indicate that in such a setting the optimally tuned random walk Metropolis proposes to change two γ j 's at a time but the acceptance rate deteriorates to zero resulting in the chain not moving. This suggests the actual mixing in this regime is slower than the O(p) observed for smooth continuous densities.
Secondly, Neal et al. (2012) established, under additional regularity conditions, that the optimally tuned random walk Metropolis for target measures of the form (5) with discontinuous target densities f mixes in O(p 2 ) steps, an order of magnitude slower than with smooth target densities f , and that the optimal acceptance rate is e −2 ≈ 0.1353. Motivated by this result, and by practical difficulties of making the random walk Metropolis efficient for discontinuous densities in epidemic examples, Neal and Lee (2017) consider optimal scaling of the independence sampler. Rather surprisingly, the optimally tuned independence sampler recovers the O(p) mixing and acceptance rate of 0.234 without any additional smoothness conditions, Informed by these findings, we consider target densities on the vertices of the hypercube {0, 1} p that have product form, but are not identically distributed, where 0 < π j < 1 for j = 1, . . . , p. In the algorithmic design, we shall exploit the simplicity of product measures on Γ = {0, 1} p compared to product measures on more general spaces. Statistically, and in algorithmic performance, we shall benefit from the observation that in typical large p, small n variable selection problems, a posteriori correlations between γ j 's will be negligible for j ∈ I, while strong correlations and multimodality will occur only for j ∈ C, with I ∪ C = {1, . . . , p} and |I| |C|. Therefore, while formally optimising the performance of the sampler for a distribution given by (6), informally we can think of the target distribution as where ν(γ C ) encodes the joint distribution of the γ j 's for the small subset C. Consider now the non-symmetric Metropolis-Hastings algorithm with the product form proposal q η (γ, γ ) given by (3) targeting the posterior distribution given by (6). The acceptance probability (4) becomes Note that α η (·, ·) ≡ 1 for any choice of η = (A, D) satisfying To discuss optimal choices of η, we consider the several commonly used criteria for Markov chains with stationary distribution π and transition kernel P on a finite discrete state space Γ.
Let γ (0) and γ (1) be two consecutive values in a Markov chain trajectory and define the squared jumping distance as In this setting, the expected squared jumping distance is equivalent to the average number of variables changed in one iteration.
• Peskun ordering (Peskun, 1973). Suppose that the Markov chain is ergodic, then, for where the constant σ 2 P,f depends on the transition kernel P and function f. Consider transition kernels P 1 and P 2 . If σ 2 P 1 ,f ≤ σ 2 P 2 ,f for every f, then P 1 dominates P 2 in Peskun ordering. If P 1 dominates all other kernels from a given class, then P 1 is optimal in this class with respect to Peskun ordering. Apart from toy examples, Peskun ordering can be rarely established without further restrictions. Hence, for the model selection problem, where inclusion probabilities are often of interest, we consider Peskun ordering for the class L(Γ) of linear combinations of univariate functions: Proposition 1. Consider the class of Metropolis-Hastings algorithms with target distribution given by (6) and proposal q η (γ, γ ) given by (3) and satisfying (9). Then (i) setting A j = 1 − D j = π j , results in (a) independent sampling and optimal mixing time ρ = 1; (b) the following asymptotic variance: where V ar π f is the stationary variance of f under π p (γ) in (6) and V ar π (j) f j is the stationary variance of f j under the marginal π (j) := {1 − π j , π j } on {0, 1}; (c) the following expected squared jumping distance (and equivalently the average number of variables being changed in one iteration): (ii) setting A j = min{1, π j 1−π j } and D j = min{1, 1−π j π j }, results in (a) maximal expected squared jumping distance (and equivalently maximal average number of variables being changed in one iteration), which becomes (b) optimality with respect to the Peskun ordering for the class of linear functions L(Γ) defined in (10); the asymptotic variance becomes The proof is given in Appendix A. Remark 1. The advantage in terms of expected squared jumping distance and central limit theorem asymptotic variance of the setting (ii) over independent sampling (i), is particularly substantial for important variables for which π j is close to 1/2. Remark 2. In discrete spaces, moves which do not change the model can be proposed.
Such moves have acceptance probability one but do not help mixing. Therefore, the average acceptance rate is inappropriate and Schäfer and Chopin (2013) suggest using the mutation rate:ā Under setting (i), the mutation rate is (1 − π j ) 2 + π 2 j and, under setting (ii), the mutation rate is Therefore, setting (ii) always leads to a higher mutation rate. This suggests that the proposal in setting (ii) should be preferred over that in (i) when designing a Metropolis-Hastings sampler for idealized posteriors of the form in (6). In practice, this choice may be too greedy since the correlated set of variables C contributing ν(γ C ) in (7) may considerably decrease acceptance rate for such proposals, and a scaled proposal of the form Figure 1: The solid black segment presents A j 's and D j 's corresponding to different choices of ζ j ∈ [0, 1] in (16). Any point in the segment results in acceptance probability = 1 for the idealized target (6). The iid sampling (i), marked with a triangle, is a shrunk version of the superefficient sampling (ii), marked with a bullet. may be preferred. Note that the independent sampling (i) corresponds to a particular choice of ζ j 's above, and the family of these proposals for all ζ j ∈ [0, 1] is illustrated as the solid black segment in Figure 1.
In the next section, we devise adaptive MCMC algorithms that aim to tune proposals of the form (3) so that A j 's and D j 's lie approximately on the solid black segment in Figure 1. Their placement along the segment will be guided by how much the correlated set of variables C contributes to (7), and balance between attaining high acceptance rates and proposing large jumps.

Carlo Design
Adaptive Markov chain Monte Carlo aims to use past samples of the Markov chain to optimise the chosen performance criterion and consequently the transition kernel on the fly, as simulation progresses (see e.g. Andrieu and Thoms, 2008;Roberts and Rosenthal, 2009;Green et al., 2015). As discussed in the introduction, developing effective Markov chain Monte Carlo methods for sampling from the posterior model distribution has proved challenging for large p. Adaptive methods offer the potential to improve mixing on model spaces and the Monte Carlo estimation of posterior inclusion probabilities.
Based on Proposition 1, we use the adaptive Markov chain Monte Carlo approach to optimise the non-symmetric Metropolis-Hastings type algorithm with product form proposal q η (γ, γ ) given by (3). Let γ (i) be the value of γ at the start of the i-th iteration, γ be the subsequently proposed value and η (i) = (A (i) , D (i) ) be the value of the tuning parameters used at the i-th iteration. We define for j = 1, . . . , p, where 0 ≤ ≤ 1/2. This is the usual logit transform if = 0.
We define two strategies for adapting η. The first adaptive strategy is a general purpose method that we term Exploratory Individual Adaptation (IA). It aims to find pairs (A j , D j ) on the segment illustrated in Figure 1, balancing between large values of (A j , D j ), and their possible negative effect on average acceptance probability. There are three types of updates for A (i) and D (i) which move towards the correct ratio A j /D j and then along the segment (note that the slope of the segment is not known in practice, as it depends on π j ): 1. Expansion step, which aims to increase the average jumping distance while maintaining the same average acceptance rate. If the acceptance rate for the proposed model a η (i) (γ (i) , γ ) is above threshold τ U , coordinates of A (i) and D (i) indicated by γ A(i) and γ D(i) , i.e corresponding to all the variables proposed to change, increase approximately on the log scale, so that A j . 2. Shrinkage step, which aims to increase the acceptance rate by proposing less ambitious moves. If a η (i) (γ (i) , γ ) is below a critical threshold τ L , the proposal may be too ambitious and the strategy is to decrease coordinates of A (i) that correspond to γ A(i) and decrease the coordinates of D (i) that correspond to γ D(i) .
3. Correction step, which aims to increase the acceptance rate by correcting the ratio between A's and D's. If a η (i) (γ (i) , γ ) is below τ U , but above the critical threshold τ L , the strategy is to improve acceptance rate (c.f. (8)) by increasing coordinates of A (i) that correspond to γ D(i) and decreasing those corresponding to γ A(i) , and analogously, by increasing the coordinates of D (i) that correspond to γ A(i) and decreasing those corresponding to γ D(i) .
The gradient fields of these updates are shown in Figure 2. These three moves can be combined into the following adaptation of A (i) and D (i) : for some constant 1/2 < λ ≤ 1 and a η (i) γ (i) , γ represents the acceptance probability at the i-th iteration. The transformation implies that < A (i) j < 1 − and < D (i) j < 1 − and we assume that 0 < < 1/2. Based on several simulation studies, we suggest to take τ U = 0.1 and τ L = 0.01. As discussed in Section 2.2 targeting a low acceptance rate is often beneficial in irregular cases, so we expect this choice to be robust in real data applications. In all our simulations with this parameter setting, the resulting mean acceptance rate was between 0.15 and 0.35, i.e. in the high efficiency region identified in Roberts et al. (1997). We also suggest the initial choice of parameters such that A  (17)  j for each variable which can be timeconsuming if p is large. Alternatively, we could learn π 1 , . . . , π p to approximate the slope of the segment in Figure 1, and use the proposal (16) motivated by part (ii) of Proposition 1 and the fact that the posterior distribution will not generally have a product form. However, to accelerate adaptation, we shall use the same scale parameter for all variables. We term this approach the Adaptively Scaled Individual Adaptation (ASI) proposal. In particular, we use for j = 1, . . . , p where 0 < ζ (i) < 1 is a tuning parameter and π (i) j is a Rao-Blackwellised estimate of the posterior inclusion probability of variable j at the start of the i-th iteration. The value of ζ (i) is updated using where τ is a targeted acceptance rate. To avoid proposing to change no variable with high probability, we set ζ . This ensures that the algorithm will propose to change at least one variable with high probability. The idea of using Rao-Blackwellised estimates of posterior inclusion probabilities has been considered before. We follow Ghosh and Clyde (2011) who work with the Rao-Blackwellised estimate conditional on the model, whilst integrating over the regression coefficients, rather than Guan and Stephens (2011) who condition on the regression coefficients for the included variables. This can be achieved using an O(p) algorithm and the formulae are provided in Appendix B. The adaptively scaled individual adaptation algorithm is described in Algorithm 2. Craiu et al. (2009) showed empirically that running multiple independent Markov chains with the same adaptive parameters improves the rate of convergence of adaptive algorithms towards their target acceptance rate in the context of the classical adaptive Metropolis algorithm of Haario et al. (2001) (see also Bornn et al. 2013) . Therefore, we will compare algorithms with different numbers of independent parallel chains and refer to this as multiple chain acceleration.

Ergodicity of the Algorithms
Since adaptive Markov chain Monte Carlo algorithms violate the Markov condition, the standard and well developed Markov chain theory can not be used to establish ergodicity and we need to derive appropriate results for our algorithms. We verify validity of our algorithms by establishing conditions introduced in Roberts and Rosenthal (2007), namely simultaneous uniform ergodicity and diminishing adaptation.

Algorithm 2: Adaptively Scaled Individual Adaptation (ASI)
Recall that π(γ | y) ∝ p(y|γ)p(γ) is the target posterior on the model space Γ and the vector of adaptive parameters at time i is with specific update strategies being defined either by exploratory individual adaptation or adaptively scaled individual adaptation. By P η (γ, ·) denote the non-adaptive Markov chain kernel corresponding to the fixed choice of η. Under the dynamics of either algorithm, for S ⊆ Γ we have In the case of multiple chain acceleration, where r copies of the chain are run, the respective model state space is the product space and thus the current state of the algorithm at time i is γ ⊗r, (i) ∈ Γ r and the stationary distribution is the product density π ⊗r on Γ r . The single chain version corresponds to r = 1 and so all results apply to that case.
We show that all the considered algorithms are ergodic, and satisfy a weak law of large numbers i.e. for any starting point γ ⊗r ∈ Γ r and any initial parameter value η ∈ ∆ ε , we have where π(f ) = γ ⊗r ∈Γ r f (γ ⊗r )π ⊗r (γ ⊗r ).
To this end we first establish the following result.
This result along with diminishing adaptation directly leads to the following Theorem 1. Assume that p(y|γ) is available analytically for all γ ∈ Γ and ε > 0 in (17), (18), or (19). Then each of the algorithms (exploratory individual adaptation, adaptively scaled individual adaptation and their multiple chain acceleration versions) are ergodic and satisfy a weak law of large numbers.
Proofs can be found in Appendix A. A comprehensive analysis of the algorithms for other generalised linear models or for linear models using non-conjugate prior distributions requires a case-by-case treatment, and is beyond the scope of this paper. However, we note that if the prior distributions of additional parameters are continuous, supported on a compact set and everywhere positive, establishing ergodicity for a specific model will typically be possible with some technical care.

Simulated Data
We consider the simulated data example of Yang et al. (2016). They assume that there are n observations and p regressors and the data is generated from the model where ∼ N(0, σ 2 I) for σ 2 = 1. The first 10 regression coefficients are non-zero and we use The i-th vector of regressors is generated as In our examples, we use the value ρ = 0.6 which represents a relative large correlation between the regressors. We are interested in the performance of the two adaptive algorithms (exploratory individual adaptation and adaptively scaled individual adaptation) relative to a simple Add-Delete-Swap algorithm. We define the ratio of the relative time-standardized effective sample size of algorithm A versus algorithm B to be where ESS A is the effective sample size for algorithm A. This is estimated bŷ where t A and t B are the times taken for one run of algorithm A and algorithm B respectively, and s 2 A and s 2 B are the sample variances of the posterior inclusion probabilities for algorithm A and algorithm B respectively over different runs of the algorithms.
The posterior distribution changes substantially with the SNR and the size of the data set. All ten true non-zero coefficients are given posterior inclusion probabilities greater than 0.9 in the two high SNR scenarios (SNR=2 and SNR=3) for each value of n and p and no true non-zero coefficients are given posterior inclusion probabilities greater than 0.2 in the low SNR scenario (SNR=0.5) for each value of n and p. In the intermediate SNR scenario (SNR=1), the number of true non-zero coefficients given posterior inclusion probabilities greater than 0.9 are 4 (n = 500, p = 500), 8 (n = 1000, p = 500), 3 (n = 500, p = 5000) and 0 (n = 1000, p = 5000). Important variables are hard to identify with low SNR when the posterior is very dispersed. As SNR increases, the important variables becomer easier to identify and can be fairly well identified when SNR=3. Generally, the results are consistent with our intuition that true non-zero regression coefficients should be detected with greater posterior probability for larger SNR, larger value of n and smaller value of p. Table 1 shows the median relative time-standardized effective sample sizes for the exploratory individual adaptation and adaptively scaled individual adaptation algorithms with 5 or 25 multiple chains for different combinations of n, p and SNR. The median is taken across the estimated relative time-standardized effective sample sizes for all posterior inclusion probabilities. Clearly, the adaptively scaled individual adaptation algorithm outper- forms the exploratory individual adaptation algorithm for most settings with either 5 or 25 multiple chains. The performance of the exploratory individual adaptation and, especially, the adaptive scaled individual adaptation algorithm with 25 chains is better than the corresponding performance with 5 chains for most cases. Concentrating on results with the adaptively scaled individual adaptation algorithm, the largest increase in performance compared to a simple Metropolis-Hastings algorithm occurs with SNR=2. In this case, there are three or four orders of magnitude improvements when p = 5000 and several orders of magnitude improvements for other SNR with p = 5000. In smaller problems with p = 500, there are still substantial improvements in efficiency over the simpler Metropolis-Hastings sampler. The superior performance of the adaptively scaled individual adaptation algorithm (which has one tuneable parameter) over the exploratory individual adaptation algorithm (which has 2p tuneable parameters) is due to the substantially faster convergence of the tuning parameters of the adaptively scaled individual adaptation algorithm to optimal values. Plotting posterior inclusion probabilities against A and D at the end of a run shows that, in most cases, the values of A for a variable are close to the corresponding posterior inclusion probabilities for both algorithms. However, the value of D are mostly close to 1 for adaptively scaled individual adaptation but not for exploratory individual adaptation. If D j is close to 1, then variable j is highly likely to be proposed to be removed if already included in the model. This leads to improved mixing rates if the posterior inclusion probabilities for that variable is close to zero since it allows that variable to be included more often in a fixed run length. This is hard to learn through individual adaptation (since variables with low posterior inclusion probabilities will be rarely included in the model and so the algorithm learns the D j slowly for those variables) whereas the Rao-Blackwellised estimates can often quickly determine which variables have low posterior inclusion probabilities.

Tecator Data
The tecator data contains 172 observations and 100 variables. They have been previously analysed using Bayesian linear regression techniques by Griffin and Brown (2010), who give a description of the data, and Lamnisos et al. (2013). The regressors show a high degree of multi-collinearity and so this is a challenging example for Bayesian variable selection algorithms. The prior used was (2) with V γ = 100I and h = 5/100. Even short simulations of the exploratory individual adaptation algorithm for this data, such as 5 multiple chains with 3000 burn in and 3000 recorded iterations afterwards, taking about 5 seconds on a laptop, show consistent convergence across runs.
Our purpose was to study the adaptive behaviour of the exploratory individual adaptation algorithm on this real data example, in particular to compare the idealized values of the A j 's and D j 's with the values attained by the algorithm.
We use multiple chain acceleration with 50 multiple chains over the total of 6000 iterations. The algorithm parameters were set to τ L = 0.01 and τ U = 0.1. The resulting mean acceptance rate was approximately 0.2 indicating close to optimal efficiency. The average number of variables proposed to be changed in a single accepted proposal was 23, approximately twice the average model size, meaning that in a typical move all of the current variables were deleted from the model, and a set of completely fresh variables was proposed. Figure 3(a) shows how the exploratory individual adaptation algorithm approximates setting (ii) of Proposition 1, namely the super-efficient sampling from the idealized posterior (6). Figure 3(b) illustrates how the attained values of A j 's somewhat overestimate the idealized values min{1, π j /(1 − π j )} of setting (ii) in Proposition 1. This indicates that the chosen parameter values τ L = 0.01 and τ U = 0.1 of the algorithm overcompensate the cor-related part of the posterior ν(γ C ) in (7), which is not very pronounced for this dataset. To quantify the performance, we ran both algorithms with adaptation in the burn-in only and calculated the effective sample size. With a burn-in of 10 000 iterations and 30 000 draws, the effective sample per multiple chain was 2878 with exploratory individual adaptation and 6949 with adaptively scaled individual adaptation. This is an impressive performance for both algorithms given the multicollinearity in the regressors. The difference in performance can be explained by the time taken for the two algorithms to converge to optimal values for the proposal. To illustrate the effect, we re-ran the algorithms with the burn-in extended to 30 000 iterations, the effective sample per multiple chain was now 3851 with exploratory individual adaptation but 7044 with adaptively scaled individual adaptation.
This example shows that the simplified posterior (7) is a good fit with many datasets and can indeed be used to guide and design algorithms.

PCR Data
Bondell and Reich (2012) described a variable selection problem with 22 576 variables and 60 observations on two inbred mouse populations. The covariates are gender and gene expression measurements for 22 575 genes. Using quantitative real-time polymerase chain reaction (PCR) several physiological phenotypes are recorded. We consider one of these phenotypes, phosphoenopruvate carboxykinase (PEPCK) as the response variable. Bondell and Reich (2012) apply their method to both a subset of 2 000 variables (selected on the basis of marginal correlations with the response) and the full data set. We use our adaptively scaled individual adaptation algorithm on the full data set of 22 576 variables. In prior (2) we adopt V γ = 100I and a hierarchical prior was used for γ by assuming that h ∼ Be(1, (p − 5)/5) which implies that the prior mean number of included variables is 5. The algorithm was run with τ = 0.234, 25 multiple chains, a burn-in of 3 000 iterations and 8 000 subsequent iterations and no thinning. Rao-blackwellised updates of π (i) were only used in the burn-in and posterior inclusion probability for the j-th variable was estimated by the mean of the posterior sample of γ j . This took about 2 hours and 30 minutes to run using MATLAB and an Intel i7 @ 3.60 GHz processor. Three independent runs of the algorithms were executed to compare convergence. Figure 4 shows the estimated posterior inclusion probabilities for each run and the pairwise comparisons between the different runs. The estimates from each independent chain are very similar and indicate that the sampler is able to accurately represent the posterior distribution. To put these results in perspective, we ran three other algorithms on the same data. The algorithms were the Add-Delete-Swap algorithm and two recently proposed methods for high-dimensional variable selection: the Hamming Ball sampler (Titsias and Yau, 2017) and the Ji-Schmidler adaptive sampler (Ji and Schmidler, 2013). Each method was run in MATLAB for 2 hours and 30 minutes to make a direct comparison with our proposed method. Some results are shown in figure 5 for two independent runs of each method. Clearly, the Hamming Ball and Ji-Schmidler samplers are not able to adequately characterise the posterior model distribution with the two runs leading to dramatically different results. The Add-Delete-Swap sampler performs better but provides substantially more variable estimates of the posterior inclusion probabilities than the adaptively scaled individual adaptation method.

Conclusion
This paper introduces two adaptive Markov chain Monte Carlo algorithms for variable selection problems with very large p and small n. We recommend the adaptively scaled individual adaptation proposal, which is able to quickly find good proposals. This method uses a Rao-Blackwellised estimate of the posterior inclusion probability for each variable in an independent proposal. On simulated data this algorithm shows orders of magnitude improvements in effective sample size compared to the standard Metropolis-Hastings algorithm. The method is also applied to PCR data with 22 576 variables and shows excellent agreement in the posterior inclusion probabilities across independent runs of the algorithm, unlike the existing methods we have tried. Tuning the algorithm is key to defining an effective sampling scheme. We find that multiple independent chains with a shared proposal lead to better convergence to the optimal parameter values. Code to run both algorithms is available from https://www.kent.ac.uk/smsas/personal/jeg28/Version3.0.zip There are a number of possible directions for future research. We have only considered serial implementations of our algorithms in this paper. However, the algorithms are naturally parallelizable across the multiple chains but work is needed on efficient updating of the shared adaptive parameters. The proposals developed in this paper are well-suited to work within standard computational techniques for highly correlated or multi-modal distributions such as parallel tempering (Geyer, 1991) or sequential Monte Carlo samplers (see Del Moral et al. (2006) or Schäfer and Chopin (2013), with application to variable selection) which use powered versions of the posterior distribution. In such implementations, further computational gains may be obtained for parallel tempering versions where multimodality and correlations lessen in high temperatures and the heated posteriors become increasingly more suitable for our adaptively designed chains to pursue super-efficient mixing. Finally, it will be interesting to apply these algorithms to more complicated data which may have a non-Gaussian likelihood or a more complicated prior distribution (for example, a linear model with interactions). j } t=0,1,... evolves independently of other coordinates, and is a Markov chain on {0, 1} governed by, say, transition kernel P j , with stationary distribution π (j) = {1 − π j , π j }. Part (i): (a) and formula (11) of (b) are immediate because the proposal samples from the stationary distribution and is accepted with probability 1. To verify formula (12) of (b), use that individual coordinates are Markovian, and for f ∈ L(Γ) compute: Now recall that P j in (i) is independent sampling from π (j) , i.e. P j = Π j := 1−π j π j 1−π j π j , hence σ 2 P j ,f j = V ar π (j) f j . To verify (i), formula (13) in (c), note that and for the independent sampling Markov chain E π |γ (0) For maximality in (a), recall (25), and it is enough to check (by simple algebra) that the transition kernel P j := 1−A j A j D j 1−D j , resulting from this choice of (A, D), maximises E 1−π j ,π j |γ j − γ j | over all possible Markov chains on {0, 1} with stationary distribution {1 − π j , π j }.
For formula (14) of (ii)(a), recall (25), and note that E π |γ (0) j − γ (1) j | = 2 max{1 − π j , π j }. For Peskun optimality in (b), recall formula (24) and consider σ P j ,f j in this setting. It is enough to verify that for each j, the kernel P j = 1−A j A j D j 1−D j is optimal with respect to Peskun ordering among all Markov chains on {0, 1} with stationary distribution π (j) . Indeed, by simple algebra, P j maximises off-diagonal elements among all stochastic matrices with stationary distribution π (j) and by Theorem 2.1.1 of Peskun (1973), is optimal. To recover formula (15) of (ii)(b), recall (24), and consider asymptotic variance terms of individual coordinates σ 2 P j ,f j for this case. These can be computed directly, but we take a shortcut noting that

B Rao-Blackwellisation
For notational simplicity, in the expressions below we use X to denote X γ (i) . Furthermore, we choose V γ = n 0 I and define F = (X T X + n 0 I) −1 . The sequential update of the posterior inclusion probability is, if γ j = 0, , where Ψ = n/2 log y T y − y T XF X T y − log(y T y − B)) B = y T XF X T y + y T x j x T j XF X T y + y T XF X T x j x T j y + y T x j S j x T j y If γ j = 1 and the columns of X are permuted so that X = [X − x j ], , where Ψ = n/2 log(y T y − B) − log y T y − y T XF X T y B = (y T T X) 1:(p−1) F 1:(p−1),1:(p−1) − F 1:(p−1),p F T 1:(p−1),p × 1/F p,p (X T y) 1:(p−1) These results allow the Rao-Blackwellised estimates to be calculated. In particular, if the value of F and y T y − y T XF X T y (which are needed to calculate the marginal likelihood) are stored, these operations can be completed in O(p) steps.
(a) Gradient field of the expansion step for γ A(i) and γ D(i) (b) Gradient field of the shrinkage step for γ A(i) and γ D(i) (c) Gradient field of the correction step for γ D(i) (d) Gradient field of the correction step for γ A(i)