Automated calibration for stability selection in penalised regression and graphical models

Abstract Stability selection is an attractive approach to identify sparse sets of features jointly associated with an outcome in high-dimensional contexts. We introduce an automated calibration procedure via maximisation of an in-house stability score, which accommodates data with an a priori-known block structure (e.g. multi-OMICs data). It applies to Least Absolute Shrinkage and Selection Operator (LASSO) penalised regression and graphical models. Simulations show that our approach outperforms both non-stability-based approaches and stability selection using the original calibration. Application of the multi-block graphical LASSO to real (epigenetic and transcriptomic) data from the Norwegian Women and Cancer study reveals a central, credible and novel cross-OMICs role of LRRN3 in the biological response to smoking. The proposed approaches are implemented in the R package sharp.


Introduction
Tobacco smoking has long been established as a dangerous exposure causally linked to several severe chronic conditions. It has been estimated that one in five deaths in the United States was due to smoking (National Center for Chronic Disease et al., 2014). Nevertheless, the molecular mechanisms triggered and dysregulated by the exposure to tobacco smoking remain poorly understood. Over the past two decades, OMICs technologies have developed as valuable tools to explore molecular alterations due to external stressors or exposures (Niedzwiecki et al., 2019). Statistical analyses of OMICs data have enabled the identification of molecular markers of exposure at a single molecular level (Joehanes et al., 2016; Huan et al., 2016) and are progressively moving towards the integration of data arising from different platforms (Guida et al., 2015; Noor et al., 2019). There is an increasing need for efficient multivariate approaches accommodating high-dimensional and heterogeneous data typically exhibiting block-correlation structures. In particular, variable selection models can identify sparse sets of predictors and have proved useful for signal prioritisation in this context (Chadeau-Hyam et al., 2013; Vermeulen et al., 2018). Of these, the Least Absolute Shrinkage and Selection Operator (LASSO) uses the ℓ1-penalisation of the coefficients to achieve variable selection (Tibshirani, 1996). Extensions of these penalised regression models have been proposed for the estimation of Gaussian graphical models (Meinshausen and Bühlmann, 2006; Friedman et al., 2007). By applying an ℓ1-penalisation to the precision matrix (defined as the inverse of the covariance matrix), the graphical LASSO identifies non-zero entries of the partial correlation matrix. The evaluation (and subsequent selection) of pairwise relationships between molecular features in graphical models can guide biological interpretation of the results, under the assumption that statistical correlations reflect molecular interactions (Barabási and Oltvai, 2004; Valcárcel et al., 2011).
We focus in the present paper on the calibration of feature selection models, where feature denotes interchangeably a variable (in the context of regression) or an edge (in a graphical model). We illustrate our approach with regularised models, in which the model size (number of selected features) is controlled by the penalty parameter. The choice of this parameter has strong implications for the generated results. Calibration procedures using cross-validation (Friedman et al., 2010; Leng et al., 2006) or maximisation of information theory metrics, including the Bayesian (BIC) or Akaike (AIC) Information Criterion (Akaike, 1998; Schwarz, 1978; Foygel and Drton, 2010; Giraud, 2008), have been proposed.
These models can be complemented by stability approaches to enhance the reliability of the findings (Meinshausen and Bühlmann, 2010; Shah and Samworth, 2013; Liu et al., 2010). In stability selection, the selection algorithm is combined with resampling techniques to identify the most stable signals. The model relies on the introduction of a second parameter: a threshold in selection proportion above which the corresponding feature is considered stable. A formula providing the upper-bound of the expected number of falsely selected features, or Per-Family Error Rate (PFER), as a function of the two parameters has been derived and is currently used to guide calibration (Meinshausen and Bühlmann, 2010; Shah and Samworth, 2013). However, this calibration relies on the arbitrary choice of one of the two parameters, which can sometimes be difficult to justify.
We introduce a score measuring the overall stability of the set of selected features, and use it to propose a new calibration strategy for stability selection. Our intuition is that all features would have the same probability of being selected in an unstable model. Our calibration procedure does not rely on the arbitrary choice of any parameter. Optionally, the problem can be constrained on the expected number of falsely selected features to generate sparser results with error control.
We also extend our calibration procedure to accommodate multiple blocks of data. This extension was motivated by the practical example of the integration of data from different OMICs platforms. In this setting, block patterns arise, typically with higher (partial) correlations within a platform than between platforms (Canzler et al., 2020). We propose here an extension of stability selection combined with the graphical LASSO accommodating data with a known block structure. In this approach, each block is tuned using a block-specific pair of parameters (penalty and selection proportion threshold) (Ambroise et al., 2009).
We conduct an extensive simulation study to evaluate the performances of our calibrated stability selection models and compare them to state-of-the-art approaches. Our multi-OMICs stability-enhanced graphical models are applied to targeted methylation and gene expression data from an existing cohort. These datasets are integrated in order to characterise the molecular response to tobacco smoking at multiple molecular levels. The transcript of the LRRN3 gene and its closest CpG site were found to play a central role in the generated graph. These two variables have the largest numbers of cross-OMICs edges and appear to be linking two largely uni-OMICs modules. LRRN3 methylation and gene expression therefore appear as pivotal molecular signals driving the biological response to tobacco smoking.

Data overview
We used DNA methylation and gene expression data in plasma samples from 251 women from the Norwegian Women and Cancer (NOWAC) cohort study (Sandanger et al., 2018). Our study population includes 125 future cases (mean time-to-diagnosis of 4 years) and 126 healthy controls. The data was pre-processed as described elsewhere (Guida et al., 2015). DNA methylation at each CpG site is originally expressed as a proportion of methylated sequences across all copies (β-values) and was subsequently logit2-transformed (M-values). The gene expression data was log-transformed. Features missing in more than 30% of the samples were excluded, and the remaining data was imputed using the k-nearest neighbours algorithm. To remove technical confounding, the data was de-noised by extracting the residuals from linear mixed models with the OMIC feature as the outcome and modelling technical covariates (chip and position) as random intercepts (Sandanger et al., 2018).

Motivating research questions
Our overarching research question is to identify the role of smoking-related CpG sites in lung carcinogenesis and to better understand the molecular response to the exposure to tobacco smoke.
We therefore identified a subset of 160 CpG sites found differentially methylated in never vs. former smokers at a 0.05 Bonferroni-corrected significance level in a large meta-analysis of 15,907 participants from 16 different cohorts (Joehanes et al., 2016). Similarly, we selected a set of 156 transcripts found differentially expressed in never vs. current smokers in a meta-analysis including 10,233 participants from 6 cohorts (Huan et al., 2016). Of these, 159 CpG sites and 142 transcripts were assayed in our dataset.
Using a logistic-LASSO, we first sought a sparse subset of the (N = 159) assayed smoking-related CpG sites that were jointly associated with the risk of future lung cancer. Second, to characterise the multi-OMICs response to exposure to tobacco smoking, we estimated the conditional independence structure between smoking-related CpG sites (N = 159) and transcripts (N = 142) using the graphical LASSO.
To improve the reliability of our findings, both regularised regression and graphical models are used in a stability selection framework. These analyses raised two statistical challenges regarding the calibration of hyper-parameters in stability selection, and the integration of heterogeneous groups of variables in a graphical model. We detail below our approaches to accommodate these challenges.

Variable selection with the LASSO
In LASSO regression, the ℓ1-penalisation is used to shrink the coefficients of variables that are not relevant in association with the outcome to zero (Tibshirani, 1996). Let p denote the number of variables and n the number of observations. Let Y be the vector of outcomes of length n, and X be the matrix of predictors of size (n × p). The objective is to estimate the vector β̂λ containing the p regression coefficients. The optimisation problem of the LASSO can be written as:

$\hat{\beta}^{\lambda} = \underset{\beta}{\arg\min} \left\{ \| Y - X\beta \|_2^2 + \lambda \| \beta \|_1 \right\},$

where λ is a penalty parameter controlling the amount of shrinkage.
Penalised extensions of models including logistic, Poisson and Cox regressions have been proposed (Simon et al., 2011).In this paper, the use of our method is illustrated with LASSO-regularised linear regression.We use its implementation in the glmnet package in R (Gaussian family of models) (Friedman et al., 2010).
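To make the optimisation concrete, here is a minimal coordinate-descent sketch of the LASSO objective above; this is our own illustrative implementation (not the glmnet solver used in the paper), and the simulated data and penalty value are arbitrary:

```python
import numpy as np

def soft_threshold(z, lam):
    # Soft-thresholding operator: the closed-form solution of the
    # one-dimensional LASSO sub-problem
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for 0.5 * ||y - X beta||_2^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual excluding the contribution of predictor j
            r = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r, lam) / (X[:, j] @ X[:, j])
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.1 * rng.standard_normal(100)
beta_hat = lasso_cd(X, y, lam=10.0)
print(np.flatnonzero(np.abs(beta_hat) > 1e-8))  # indices of selected variables
```

Larger values of lam shrink more coefficients exactly to zero, which is how the penalty performs variable selection.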

Graphical model estimation with the graphical LASSO
A graph is characterised by a set of nodes (variables) and edges (pairwise links between them).As our data is cross-sectional, we focus here on undirected graphs without self-loops.As a result, the adjacency matrix encoding the network structure will be symmetric with zeros on the diagonal.
We assume that the data follows a multivariate Normal distribution:

$X \sim \mathcal{N}(\mu, \Sigma),$

where µ is the mean vector and Σ is the covariance matrix.
The conditional independence structure is encoded in the support of the precision matrix Ω = Σ⁻¹. Various extensions of the LASSO have been proposed for the estimation of a sparse precision matrix (Meinshausen and Bühlmann, 2006; Banerjee et al., 2008). We use here the graphical LASSO (Friedman et al., 2007) as implemented in the glassoFast package in R (Witten et al., 2011; Friedman et al., 2018; Sustik M.A., 2012). For a given value of the penalty parameter λ, the optimisation problem can be written as:

$\hat{\Omega}^{\lambda} = \underset{\Omega \succ 0}{\arg\max} \left\{ \log\det(\Omega) - \text{tr}(S\Omega) - \lambda \| \Omega \|_1 \right\},$

where S is the empirical covariance matrix. Alternatively, a penalty matrix Λ can be used instead of the scalar λ for more flexible penalisation:

$\hat{\Omega}^{\Lambda} = \underset{\Omega \succ 0}{\arg\max} \left\{ \log\det(\Omega) - \text{tr}(S\Omega) - \| \Lambda \circ \Omega \|_1 \right\},$

where ∘ denotes the element-wise matrix product.
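As a quick numerical illustration of why the support of Ω encodes conditional independence (a sketch of ours with simulated data, not the glassoFast estimator itself): in a chain X1 – X2 – X3, the (1, 3) entry of the precision matrix is zero because X1 and X3 are independent given X2.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x1 = rng.standard_normal(n)
x2 = 0.8 * x1 + 0.6 * rng.standard_normal(n)  # depends on x1 directly
x3 = 0.8 * x2 + 0.6 * rng.standard_normal(n)  # depends on x1 only through x2
S = np.cov(np.stack([x1, x2, x3]))            # empirical covariance matrix
omega = np.linalg.inv(S)                      # (unpenalised) precision matrix

# omega[0, 2] is close to zero (conditional independence of x1 and x3 given
# x2), while omega[0, 1] is clearly non-zero
print(np.round(omega, 2))
```

The graphical LASSO adds the ℓ1-penalty so that such entries are estimated as exactly zero in finite samples, yielding a sparse graph.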

Stability selection
Stability-enhanced procedures for feature selection proposed in the literature include stability selection (Meinshausen and Bühlmann, 2010; Shah and Samworth, 2013) and the Stability Approach to Regularization Selection (StARS) (Liu et al., 2010). Both use an existing selection algorithm and complement it with resampling techniques to estimate the probability of selection of each feature using its selection proportion over the resampling iterations. Stability selection ensures reliability of the findings through error control.
The feature selection algorithms we use are (a) the LASSO in a regression framework (Tibshirani, 1996; Friedman et al., 2010), and (b) the graphical LASSO for the estimation of Gaussian graphical models (Meinshausen and Bühlmann, 2006; Banerjee et al., 2008; Sustik M.A., 2012) (see Supplementary Methods, sections 1.1 and 1.2 for more details on the algorithms). The latter aims at the construction of a conditional independence graph. In a graph with p nodes, for each pair of variables X, Y and Gaussian vector Z compiling the (p − 2) other variables, an edge is included if the conditional covariance cov(X, Y | Z) is different from zero (see Supplementary Methods, section 1.3 for more details on model calibration).
Under the assumption that the selection of feature j is independent of the selection of any other feature i ≠ j, the binary selection status of feature j follows a Bernoulli distribution with parameter p_λ(j), the selection probability of feature j. The stability selection model is then defined as the set V_{λ,π} of features with selection probability above a threshold π:

$V_{\lambda, \pi} = \left\{ j : p_{\lambda}(j) \geq \pi \right\}.$

For each feature j, the selection probability is estimated as the selection proportion across models with penalty parameter λ applied on K subsamples of the data.
The stability selection model has two parameters (λ, π) that need to be calibrated. In the original paper, Meinshausen and Bühlmann use random subsamples of 50% of the observations. They introduce q_Λ, the average number of features that are selected at least once by the underlying algorithm (e.g. LASSO) for a range of values λ ∈ Λ, across the K subsamples. Under the assumptions of (a) exchangeability between selected features, and (b) that the selection algorithm is not performing worse than random guessing, they derived an upper-bound of the PFER, denoted by PFER_MB, as a function of the number of selected features q_Λ and threshold in selection proportion π:

$\text{PFER}_{MB} = \frac{1}{2\pi - 1} \frac{q_{\Lambda}^2}{N}.$

With simultaneous selection in complementary pairs (CPSS), the selection proportions are obtained by counting the number of times the feature is selected in both the model fitted on a subsample of 50% of the observations and the model fitted on its complementary subsample made of the remaining 50% of observations (Shah and Samworth, 2013). Using this subsampling procedure, the exchangeability assumption is not required for the upper-bound PFER_MB to be valid. Under the assumption of unimodality of the distribution of selection proportions obtained with CPSS, Shah and Samworth also proposed a stricter upper-bound on the expected number of variables with low selection probabilities, denoted here by PFER_SS (Shah and Samworth, 2013). For simplicity, we consider here point-wise control (Λ reduces to a single value λ) with no effect on the validity of the formulas. Both approaches provide a relationship between λ (via q_λ), π and the upper-bound of the PFER, such that if two of them are fixed, the third one can be calculated. The authors of both papers proposed to guide calibration based on the arbitrary choice of two of these three quantities. For example, the penalty parameter λ can be calibrated for a combination of fixed values of the selection proportion π and threshold in PFER.
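For intuition, the Meinshausen–Bühlmann bound can be computed directly from q, π and N; a small sketch with illustrative values:

```python
def pfer_mb(q, pi, n_features):
    """Upper-bound on the Per-Family Error Rate from Meinshausen and
    Buhlmann (2010): PFER_MB = q^2 / ((2 * pi - 1) * N),
    valid for thresholds pi in ]0.5, 1[."""
    assert 0.5 < pi < 1.0, "the bound requires pi in ]0.5, 1["
    return q ** 2 / ((2.0 * pi - 1.0) * n_features)

# With N = 1000 features, q = 30 features selected on average per subsample
# and a threshold pi = 0.9, at most ~1.1 false positives are expected
print(pfer_mb(q=30, pi=0.9, n_features=1000))
```

Fixing any two of (q, π, PFER) determines the third, which is the relationship used by the original calibration strategies.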
To avoid the arbitrary choice of the selection proportion π or penalty λ, we introduce here a score measuring the overall stability of the model and use it to jointly calibrate these two parameters.We also consider the use of a user-defined threshold in PFER to limit the set of parameter values for λ and π to explore.

Stability score
Our calibration procedure aims at identifying the pair of hyper-parameters (λ, π) that maximises model stability (Yu, 2013). Let H_λ(j) ∈ {0, . . . , K} denote the selection count of feature j ∈ {1, . . . , N}, calculated over the K models fitted with parameter λ on different subsamples. To quantify model stability, we first define three categories of features based on their selection counts. For a given penalty parameter λ and threshold in selection proportion π ∈ ]0.5, 1[, each feature j is either (a) stably selected if H_λ(j) ≥ Kπ, (b) stably excluded if H_λ(j) ≤ K(1 − π), or (c) unstably selected otherwise. Unstably selected features are those that are ambiguously selected across subsamples.
The partitioning of the features into these three categories provides information about model stability, whereby a stable model would include a large number of stably selected and/or stably excluded features and a small number of unstably selected features.
We hypothesise that under the most unstable selection procedure, all features would have the same probability γ_λ = q_λ/N of being selected, where

$q_{\lambda} = \frac{1}{K} \sum_{j=1}^{N} H_{\lambda}(j)$

is the average number of selected features across the K models fitted with penalty λ on the different subsamples of the data. Further assuming that the subsamples are independent, the selection count H_λ(j) of feature j ∈ {1, . . . , N} would then follow a binomial distribution:

$H_{\lambda}(j) \sim \mathcal{B}(K, \gamma_{\lambda}).$

By considering the N selection counts as independent observations, we can derive the likelihood of observing this classification under the hypothesis of instability, given λ and π:

$L_{\lambda, \pi} = \prod_{j=1}^{N} \left[ 1 - F_{K, \gamma_{\lambda}}(K\pi) \right]^{\mathbb{1}(H_{\lambda}(j) \geq K\pi)} \left[ F_{K, \gamma_{\lambda}}(K(1-\pi)) \right]^{\mathbb{1}(H_{\lambda}(j) \leq K(1-\pi))} \left[ F_{K, \gamma_{\lambda}}(K\pi) - F_{K, \gamma_{\lambda}}(K(1-\pi)) \right]^{\mathbb{1}(K(1-\pi) < H_{\lambda}(j) < K\pi)},$

where F_{K,γ_λ} is the cumulative probability function of the binomial distribution with parameters K and γ_λ.
Our stability score S_{λ,π} is defined as the negative log-likelihood under the hypothesis of equi-probability of selection:

$S_{\lambda, \pi} = - \log(L_{\lambda, \pi}).$

The score S_{λ,π} measures how unlikely a given model is to arise from the null hypothesis, for a given pair of λ and π. As such, the higher the score, the more stable the set of selected features. By construction, this formula accounts for (a) the total number of features N, (b) the number of iterations K, (c) the density of the sets selected by the original procedure via λ, and (d) the level of stringency as measured by the threshold π. The calibration approach we develop aims at identifying the pair of parameters (λ, π) maximising our score:

$(\hat{\lambda}, \hat{\pi}) = \underset{\lambda, \pi}{\arg\max} \; S_{\lambda, \pi}. \quad (6)$

Furthermore, this calibration technique can be extended to incorporate error control via a constraint ensuring that the expected number of false positives is below an a priori fixed threshold η in PFER:

$(\hat{\lambda}, \hat{\pi}) = \underset{\lambda, \pi}{\arg\max} \; S_{\lambda, \pi} \;\; \text{subject to} \;\; U_{\lambda, \pi} \leq \eta, \quad (7)$

where U_{λ,π} is the upper-bound used for error control in existing strategies (i.e. PFER_MB or PFER_SS) (Meinshausen and Bühlmann, 2010; Shah and Samworth, 2013).
In the following sections, the use of Equation (6) is referred to as unconstrained calibration, and that of Equation (7) as constrained calibration.
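The stability score can be sketched numerically as follows; this is a toy implementation of ours with illustrative inputs (the sharp package provides the actual one), which classifies features from their selection counts and evaluates the negative log-likelihood under the equi-probability null:

```python
from math import comb, log, ceil, floor

def binom_pmf(i, n, p):
    # Probability mass function of the Binomial(n, p) distribution
    return comb(n, i) * p ** i * (1 - p) ** (n - i)

def stability_score(counts, K, pi):
    """Negative log-likelihood of the observed three-way classification
    (stably selected / unstable / stably excluded) under the null hypothesis
    that every feature is selected with the same probability gamma = q / N."""
    N = len(counts)
    gamma = sum(counts) / (K * N)  # q_lambda / N
    p_exc = sum(binom_pmf(i, K, gamma) for i in range(0, floor(K * (1 - pi)) + 1))
    p_sel = sum(binom_pmf(i, K, gamma) for i in range(ceil(K * pi), K + 1))
    p_uns = 1.0 - p_sel - p_exc
    score = 0.0
    for h in counts:
        if h >= K * pi:          # stably selected
            score -= log(p_sel)
        elif h <= K * (1 - pi):  # stably excluded
            score -= log(p_exc)
        else:                    # unstably selected
            score -= log(p_uns)
    return score

# A polarised model (counts near 0 or K) is far more stable than an
# ambiguous one where every feature is selected about half of the time
polarised = [100] * 50 + [0] * 50
ambiguous = [50] * 100
print(stability_score(polarised, K=100, pi=0.9) >
      stability_score(ambiguous, K=100, pi=0.9))  # True
```

In practice the score is evaluated over a grid of (λ, π) values, and the maximising pair is retained.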

Multi-block graphical models
The combination of heterogeneous groups of variables can create technically-induced patterns in the estimated (partial) correlation matrix, subsequently inducing bias in the generated graphical models. This can be observed, for example, when integrating the measured levels of features from different OMICs platforms. The between-platform (partial) correlations are overall weaker than those within platforms (Supplementary Figure S1). This makes the detection of bipartite edges more difficult. This structure is known a priori and does not need to be inferred from the data. Indeed, the integration of data arising from G homogeneous groups of variables generates B = G(G + 1)/2 two-dimensional blocks in the (partial) correlation matrix where variables are ordered by group (Ambroise et al., 2009).
To tackle this scaling issue, we propose to use and calibrate block-specific pairs of parameters, (λ_b, π_b), controlling the level of sparsity in block b. Let E_b, b ∈ {1, . . . , B}, denote the sets of edges belonging to each of the blocks, such that each edge j ∈ {1, . . . , N} belongs to exactly one set E_b. The stability selection model can then be defined more generally as:

$V_{\lambda_1, \ldots, \lambda_B, \pi_1, \ldots, \pi_B} = \bigcup_{b=1}^{B} \left\{ j \in E_b : p_{\lambda_1, \ldots, \lambda_B}(j) \geq \pi_b \right\}.$

The probabilities p_{λ1,...,λB}(j), j ∈ {1, . . . , N}, are estimated as selection proportions of the edges obtained from graphical LASSO models fitted on K subsamples of the data with a block penalty matrix such that edge j ∈ E_b is penalised with λ_b.
Our stability score is then defined, by block, as:

$S_{\lambda_1, \ldots, \lambda_B, \pi_1, \ldots, \pi_B} = - \log \left( \prod_{b=1}^{B} \prod_{j \in E_b} L_{\lambda_b, \pi_b}(j) \right), \quad (8)$

where L_{λ_b,π_b}(j) denotes the likelihood contribution of edge j, defined as in the single-block score but with block-specific parameters (λ_b, π_b). Alternatively, we propose a block-wise decomposition, as described in Equation (9). To ensure that pairwise partial correlations in each block are estimated conditionally on all other (p − 2) nodes, we propose to estimate them from graphical LASSO models in which the other blocks are weakly penalised (i.e. with a small penalty λ_0). We introduce p^b_{λ_b,λ_0}(j) and H^b_{λ_b,λ_0}(j), the selection probability and selection count of edge j ∈ E_b, as obtained from graphical LASSO models fitted with a block penalty matrix such that edges j ∈ E_b are penalised with λ_b and edges i ∈ E_ℓ, ℓ ≠ b, are penalised with λ_0. We define the multi-block stability selection graphical model as the union of the sets of block-specific stable edges:

$V = \bigcup_{b=1}^{B} \left\{ j \in E_b : p^{b}_{\lambda_b, \lambda_0}(j) \geq \pi_b \right\}.$

The pair of parameters is calibrated for each of the blocks separately using a block-specific stability score defined by:

$S^{b}_{\lambda_b, \lambda_0, \pi_b} = - \log \left( \prod_{j \in E_b} L_{\lambda_b, \pi_b}(j) \right), \quad (9)$

where γ_{λ_b,λ_0} is calculated based on the selection counts in H^b_{λ_b,λ_0}. The implications of these assumptions are evaluated by comparing the two approaches described in Equations (9) and (8) in a simulation study.
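The block penalty matrix used in this calibration can be sketched as follows; `block_penalty_matrix` is a hypothetical helper of ours, and the group sizes and penalty values are arbitrary:

```python
import numpy as np

def block_penalty_matrix(group_sizes, target_block, lam_b, lam_0):
    """Build a penalty matrix Lambda for the graphical LASSO: edges in the
    block defined by the pair of groups `target_block` are penalised with
    lam_b, while all other edges are weakly penalised with lam_0."""
    p = sum(group_sizes)
    groups = np.repeat(np.arange(len(group_sizes)), group_sizes)
    Lambda = np.full((p, p), lam_0)
    g, h = target_block
    in_g, in_h = groups == g, groups == h
    Lambda[np.ix_(in_g, in_h)] = lam_b
    Lambda[np.ix_(in_h, in_g)] = lam_b  # keep Lambda symmetric
    np.fill_diagonal(Lambda, 0.0)       # no penalty on diagonal entries here
    return Lambda

# Two platforms with 3 and 2 variables; tune the cross-OMICs block (0, 1)
# with lam_b = 0.5 while weakly penalising the within-platform blocks
L = block_penalty_matrix([3, 2], target_block=(0, 1), lam_b=0.5, lam_0=0.1)
print(L)
```

Calibrating each block then amounts to looping over blocks, rebuilding Λ with the candidate λ_b, and recomputing the block-specific stability score.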

Implementation
The stability selection procedure is applied for different values of λ and π, and the stability score is computed for all visited pairs of parameters. The grid of λ values is chosen so that the underlying selection algorithm visits a range of models from empty to dense (up to 50% of edges selected by the graphical LASSO) (Friedman et al., 2010; Müller et al., 2016). Values of the threshold π vary between 0.6 and 0.9, as proposed previously (Meinshausen and Bühlmann, 2010).

Simulation models
In order to evaluate the performances of our approach and compare it to other established calibration procedures, we simulated several datasets according to the models described below, which we implemented in the R package fake (version 1.3.0).

Graphical models
We build upon previously proposed models to simulate multivariate Normal data with an underlying graph structure (Zhao et al., 2012). Our contributions include (a) a procedure for the automated choice of the parameter u ensuring that the generated correlation matrix has contrast, and (b) the simulation of block-structured data.
First, we simulate the binary adjacency matrix Θ of size (p × p) of a random graph with density ν using the Erdös-Rényi model (Erdös and Rényi, 1959) or a scale-free graph using the Barabási-Albert preferential attachment algorithm (Albert and Barabási, 2002; Zhao et al., 2012). To introduce a block structure in the generated data, the non-diagonal entries of the precision matrix Ω are simulated such that entry ω_{ij} (i ≠ j) is non-zero only if Θ_{ij} = 1, and is scaled by a factor v_b if i and j belong to different platforms. We ensure that the generated precision matrix is positive definite via diagonal dominance:

$\omega_{ii} = u + \sum_{j \neq i} | \omega_{ij} |.$

We propose to choose u so that the generated correlation matrix has a high contrast, as defined by the number of unique truncated correlation coefficients with three digits (Supplementary Figure S2). The parameter v_b ∈ [0, 1] is set to 1 (no block structure) for single-block simulations and chosen to generate data with a visible block structure for multi-block simulations (v_b = 0.2). These models generate realistic correlation matrices (Supplementary Figure S1).
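A sketch of this simulation model follows; the parameter values and the `simulate_precision` helper are illustrative (the fake package implements the full procedure, including the contrast-based choice of u):

```python
import numpy as np

def simulate_precision(theta, groups, v_between=0.2, u=0.1):
    """Fill the off-diagonal of Omega from the adjacency matrix theta,
    down-weight between-platform entries by v_between, and enforce positive
    definiteness by diagonal dominance."""
    omega = theta.astype(float).copy()
    between = groups[:, None] != groups[None, :]
    omega[between] *= v_between
    # Diagonal dominance: each diagonal entry exceeds the sum of the absolute
    # off-diagonal entries in its row, which guarantees positive definiteness
    np.fill_diagonal(omega, np.abs(omega).sum(axis=1) + u)
    return omega

rng = np.random.default_rng(2)
upper = np.triu((rng.random((6, 6)) < 0.3).astype(int), k=1)
theta = upper + upper.T                 # symmetric adjacency, empty diagonal
groups = np.array([0, 0, 0, 1, 1, 1])   # two platforms of 3 variables each
omega = simulate_precision(theta, groups)
print((np.linalg.eigvalsh(omega) > 0).all())  # True: positive definite
```

Inverting the resulting Ω then yields a valid covariance matrix from which multivariate Normal samples with the desired conditional independence structure can be drawn.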

Linear regression
For linear regression, the data simulation is done in two steps with (i) the simulation of n observations for the p predictors, and (ii) the simulation of the outcome for each of the n observations, conditionally on the predictors data. The first step is done using the simulation model introduced in the previous section for graphical models. This allows for some flexibility over the (conditional) independence patterns between predictors. For the second step, we sample β-coefficients from a uniform distribution over {−1, 1} (for homogeneous effects in absolute value) or over [−1, −0.5] ∪ [0.5, 1] (to introduce variability in the strength of association with the outcome). The outcome Y_i, i ∈ {1, . . . , n}, is then sampled from a Normal distribution (Friedman et al., 2010):

$Y_i \sim \mathcal{N}(X_i \beta, \sigma^2).$

The parameter σ controls the proportion of variance in the outcome that can be explained by its predictors. The value of σ is chosen to reach the expected proportion of explained variance R² used as a simulation parameter:

$R^2 = \frac{s^2}{s^2 + \sigma^2},$

where s² is the variance of Xβ.
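The choice of σ can be sketched as follows with illustrative values; solving R² = s²/(s² + σ²) for σ gives σ = s·sqrt((1 − R²)/R²):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, r2 = 10_000, 5, 0.7
X = rng.standard_normal((n, p))
beta = rng.choice([-1.0, 1.0], size=p)  # homogeneous effects in absolute value

# Set the noise level so that the expected proportion of explained variance
# matches the target: sigma^2 = s^2 * (1 - R^2) / R^2
s2 = np.var(X @ beta)
sigma = np.sqrt(s2 * (1.0 - r2) / r2)
y = X @ beta + sigma * rng.standard_normal(n)

# The empirical explained variance is close to the target r2 = 0.7
ss_res = np.sum((y - X @ beta) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
print(round(1.0 - ss_res / ss_tot, 2))
```

For simplicity this sketch draws independent predictors; the simulation model above instead draws X from the graphical model of the previous section.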

Performance metrics
Selection performances of the investigated models are measured in terms of precision p and recall r:

$p = \frac{TP}{TP + FP} \quad \text{and} \quad r = \frac{TP}{TP + FN},$

where TP and FP are the numbers of true and false positives, respectively, and FN is the number of false negatives.
The F1-score quantifies the overall selection performance based on a single metric:

$F_1 = \frac{2 \, p \, r}{p + r}.$
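These metrics can be computed directly from the selected and true feature sets; a small sketch with illustrative counts:

```python
def selection_metrics(selected, truth):
    """Precision, recall and F1-score of a selected feature set against the
    true support, both given as sets of feature indices."""
    tp = len(selected & truth)       # true positives
    precision = tp / len(selected)   # TP / (TP + FP)
    recall = tp / len(truth)         # TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative example: 10 true features, 9 selected of which 8 are correct
precision, recall, f1 = selection_metrics(selected=set(range(8)) | {99},
                                          truth=set(range(10)))
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.89 0.8 0.84
```

The F1-score is the harmonic mean of precision and recall, so it penalises models that are strong on only one of the two.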

Simulation study
We use a simulation study to demonstrate the relevance of stability selection calibrated with our approach: (a) in a linear regression context for the LASSO model, (b) for graphical models using the graphical LASSO, and (c) for multi-block graphical models.
From these we evaluate the relevance of our stability score for calibration purposes, and compare our score to a range of existing calibration approaches including information theory criteria, StARS, and stability selection models using the previously proposed error control for different values of the threshold in selection proportion π.As sensitivity analyses, we evaluate the performances of stability selection for graphical models using different resampling approaches, different numbers of iterations K, and compare the two proposed approaches for multi-block calibration.

Simulation parameters
All simulation parameters were chosen in an attempt to generate realistic data with many strong signals and some more difficult to detect (weaker partial correlation).
For graphical models, we used p = 100 nodes with an underlying random graph structure of density ν = 0.02 (99 edges on average, as would be obtained in a scale-free graph with the same number of nodes). For multi-block graphical models, we considered two homogeneous groups of 50 nodes each. Reported distributions of selection metrics were computed over 1,000 simulated datasets.
Unless otherwise stated, stability selection models were applied on grids of 50 dataset-specific penalty parameter values and 31 values for the threshold in selection proportion between 0.6 and 0.9. The stability-enhanced models were based on K = 100 (complementary) subsamples of 50% of the observations.

Applications to simulated data
Our stability selection approach is first applied to the LASSO for the selection of variables jointly associated with a continuous outcome in simulated data (Figure 1).
The penalty parameter λ and threshold in selection proportion π are jointly calibrated to maximise the stability score (Figure 1-A). Stably selected variables are then identified as those with selection proportions greater than the calibrated parameter π = 0.86 (dark red line) in LASSO models fitted on 50% of the data with calibrated penalty parameter λ = 0.34 (Figure 1-B). The resulting set of stably selected variables includes 8 of the 10 'true' variables used to simulate the outcome and 1 'wrongly selected' variable we did not use in our simulation.
We observe a marginal increase in prediction performance across unpenalised models sequentially adding the 9 stably selected predictors in order of decreasing selection proportion (Figure 1-C). Further including the two False Negatives generates a limited increase in R², and so does the inclusion of any subsequent variable. This suggests that our stability selection model captures most of the explanatory information and was therefore well calibrated.
To limit the number of 'wrongly selected' features, we can restrict the values of λ and π visited so that they ensure a control of the PFER (Supplementary Figure S3). In this constrained optimisation, the values of λ and π yielding a PFER exceeding the specified threshold are discarded and the corresponding models are not evaluated (Supplementary Figure S3-A). The maximum stability score can be obtained for different pairs (λ, π) depending on the constraint, but our simulation shows that differences in the maximal stability score (Supplementary Figure S3-B) and resulting selected variables are small (Supplementary Figure S3-C) if the constraint is not too stringent.
Our stability score is also used to calibrate the graphical LASSO for the estimation of a conditional independence graph, while controlling the expected number of falsely selected edges below 20 (Figure 1-C). The calibrated graph (Figure 1-D) included 56 stably selected edges (47 rightly, in plain dark blue, and 9 wrongly, in dashed red lines), i.e. with selection proportions ≥ π = 0.90, based on graphical LASSO models fitted on 50% of the data with penalty parameter λ = 0.52. The 9 wrongly selected edges tend to be between nodes that are otherwise connected in this example (marginal links). The 2 missed edges are connected to the central hub and thus correspond to smaller partial correlations, which are more difficult to detect.

Evaluation of model performance and comparison with existing approaches
Our simulations show that models with a higher stability score yield higher selection performances (as measured by the F1-score), making it a relevant metric for calibration (Figure 2-A). We also find that, irrespective of the values of λ and π, stability selection models outperform the original implementation of the graphical LASSO (Figure 2-B). Our stability score yields sparser models, resulting in slightly lower recall (0.90) but a far better precision (0.81, compared to 0.41), as the selected sets do not include many irrelevant edges. Our simulation also shows that the constraint controlling the PFER further improves the precision (0.83) through the generation of a sparser model. Figure 2-B reports the precision and recall of the visited stability selection models (grey) and corresponding graphical LASSO models (dark blue), as well as the models calibrated using the BIC (beige) or EBIC (brown).
Our calibrated stability selection graphical LASSO models are compared with state-of-the-art graphical model estimation approaches on 1,000 simulated datasets in low, intermediate and high dimensions (Figure 3, Supplementary Table S1). Non-stability-enhanced graphical LASSO models, calibrated using information theory criteria, are generally the worst performing models (median F1-score < 0.6 across dimensionality settings). StARS models, applied with the same number of subsampling iterations and using default values for other parameters, have the highest median numbers of True Positives. However, they include more False Positives than stability selection models, making them less competitive in terms of F1-score (best performing in high dimension with a median F1-score of 0.66). For stability selection models calibrated using error control (MB (Meinshausen and Bühlmann, 2010), SS (Shah and Samworth, 2013)), the optimal choice of π seems to depend on many parameters, including the dimensionality and structure of the graph (Supplementary Figure S4). By jointly calibrating the two parameters, our models show generally better performances than models calibrated solely using error control on these simulations (median F1-score ranging from 0.69 to 0.72 using PFER_SS < 20 only in high dimension, compared to 0.74 using constrained calibration maximising the stability score). Results were consistent when using different thresholds in PFER (Supplementary Figure S5). For LASSO models, we observe a steep increase in precision with all stability selection models compared to models calibrated by cross-validation (Supplementary Figure S6). Unconstrained calibration using our stability score yielded the highest F1-scores in the presence of independent or correlated predictors. Computation times of the reported stability selection models are comparable and acceptable in practice (less than 3 minutes in these settings) but rapidly increase with the number of nodes for graphical models, reaching 8 hours for 500 observations and 1,000 nodes (Supplementary Table S2).

Sensitivity to the choice of resampling procedure
Stability selection can be implemented with different numbers of iterations K and resampling techniques (subsampling, bootstrap or CPSS approaches, and subsample size).
We show in a simulation study with p = 100 nodes that (a) the effect of the number of iterations K reaches a plateau after 50 iterations, and (b) the best performances were obtained for bootstrap samples or subsamples of 50% of the observations (Supplementary Figures S7 and S8).

Multi-block extension for graphical models
Our single and multi-block calibration procedures are applied on simulated datasets with a block structure in different dimensionality settings. Block-specific selection performances of both approaches can be visualised in precision-recall plots (Figure 4, Supplementary Table S3). Irrespective of the dimensionality, accounting for the block structure as proposed in Equation (9) with λ0 = 0.1 generates an increase in selection performance both within and between blocks (up to 7% in overall median F1-score in low dimension). This gain in performance comes at the price of an increased computation time (from 2 to 6 minutes in low dimension). Additionally, we show in Supplementary Table S4 that the choice of λ0 has limited effects on the selection performances, as long as it is relatively small (λ0 ≤ 0.1). We choose λ0 = 0.1 for a good balance between performance and computation time. We also show that the use of Equation (9) gives better selection performances than that of Equation (8) (median F1-score ≥ 0.71 compared to 0.57). In particular, it drastically reduces the number of False Positives in the off-diagonal block.

Epigenetic markers of lung cancer
To identify smoking-related markers that contribute to the risk of developing lung cancer, we use stability selection logistic-LASSO with the 159 CpG sites as predictors and the future lung cancer status as outcome (Figure 5-A, B). The calibrated model includes 21 CpG sites with selection proportions above 0.66. The unpenalised logistic models with stably selected predictors reach a median AUC of 0.69, which is close to that of pack years (median AUC of 0.74) and implies that these 21 CpG sites capture most of the information on smoking history relevant to lung cancer prediction. The limited increase in AUC beyond the calibrated number of predictors suggests that the stability selection model achieves a good balance between sparsity and prediction performance.
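The reported AUCs can be obtained from predicted scores via the Mann-Whitney formulation of the area under the ROC curve; a short self-contained sketch (ours, not the authors' code):

```python
import numpy as np

def auc(y_true, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic: the
    probability that a randomly drawn case scores higher than a
    randomly drawn control (ties counted as 1/2)."""
    y_true = np.asarray(y_true, dtype=bool)
    pos = np.asarray(scores)[y_true]
    neg = np.asarray(scores)[~y_true]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

With two cases and two controls, three of the four case-control pairs are correctly ranked, giving an AUC of 0.75.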

Multi-OMICs graph
We first estimate the conditional independence structure between smoking-related CpG sites with single-block stability selection (Supplementary Figure S9). A total of 320 edges involving 100 of the 159 CpG sites are obtained. Most CpG sites are in the same connected component, but we also observe 6 small modules made of 2 or 3 nodes.
In order to get a more comprehensive understanding of the biological response to smoking, we integrate methylation data, known to reflect long-term exposure to tobacco smoking, with gene expression data, which is functionally well characterised, and seek correlation patterns across these smoking-related signals via the estimation of a multi-OMICs graph.
We accommodate the heterogeneous data structure (Supplementary Figure S10) by calibrating three pairs of block-specific parameters (λ, π) using our multi-block strategy (Figure 5-A). We found a total of 601 edges, including 150 in the within-methylation block, 425 in the within-gene expression block, and 26 cross-OMICs edges (Figure 5-B). The detected links reflect potential participation in common regulatory processes of both transcripts and CpG sites. As our analysis was limited to smoking-related markers, connected nodes can be hypothesised to jointly contribute to the biological response to tobacco smoking.
For comparison, we estimate the graphical LASSO model calibrated using the BIC on the same data (Supplementary Figure S11). Of the 601 edges included in the stability selection graph, 583 were also in the BIC-calibrated graph. The BIC-calibrated graph is denser (N=6,744 edges), which makes it difficult to interpret. As this procedure does not account for the block structure in the data, two modules corresponding to the two platforms are clearly visible. DNA methylation nodes are annotated with the symbol of their closest gene on the genome (Joehanes et al., 2016). Most sets of CpG sites annotated with the same gene symbol are interconnected in the graph (e.g. AHRR, GNG12-AS1, and ALPPL2 on chromosomes 5, 3, and 2, respectively). For two genes, the data includes both a CpG site and a transcript with the same annotation, but a cross-OMICs link was only found for LRRN3 (Guida et al., 2015). The LRRN3 transcript, which is linked to 4 CpG sites, including CpG sites annotated as AHRR and ALPPL2 and one annotated as LRRN3 (cg09837977), has a central position among methylation markers (Figure 5-B).
Strong correlations involving features that are closely located on the genome, or cis-effects, have been reported previously (Robinson et al., 2014). Our approach also detects cross-chromosome edges (Supplementary Figure S12), suggesting that complex long-range mechanisms could be at play (Jones, 2012).
We incorporate functional information in the visualisation using Reactome pathways (Figure 5-B) (Langfelder and Horvath, 2008;Jassal et al., 2020). As previously reported, the immune system and signal transduction (red) pathways were largely represented in the targeted set (Huan et al., 2016;Sandanger et al., 2018). Interestingly, the group of interconnected nodes around RPL4 (green) was involved in a range of pathways including the cellular response to stress, translation, and developmental biology. Similarly, the transcripts involved in the metabolism of lipids (yellow) are closely related in the graph. Altogether, these results confirm the functional proximity of nearby variables in our graph, hence lending biological plausibility to its topology.

Discussion
The stability selection models and proposed calibration procedure have been implemented in the R package sharp (version 1.2.1), available on CRAN. The selection performances of our variable selection and (multi-block) graphical models were evaluated in a simulation study. We showed that stability selection models yield higher F1-scores, at the cost of a (limited) increase in computation time. The computational efficiency of the proposed approaches can easily be improved using warm start and parallelisation, both readily implemented in the R package sharp. We also demonstrated that the proposed calibration procedure generally identifies the optimal threshold in selection proportion, which leads to overall equivalent or better performances than previously proposed approaches based solely on error control. Our multi-block extension was successful in removing some of the technical bias through a more flexible modelling, but generated a ten-fold increase in computation time compared to single-block models on these simulations.
The proposed approaches also generated promising results on real OMICs data (Petrovic et al., 2022). The development of the proposed stability-enhanced models accommodating data with a known block structure was motivated by the multi-OMICs application for the characterisation of the molecular signature of smoking. Their application to methylation and gene expression data gave further insights into the long-range correlations previously reported (Guida et al., 2015), and revealed a credible pivotal cross-OMICs role of the LRRN3 transcript (Huan et al., 2016). Annotation of the networks using biological information from the Reactome database identified modules mostly composed of nodes belonging to the same pathways, suggesting that statistical correlations can reflect functional roles in shared biological pathways.
The stability selection approach and calibration procedure introduced here could also be used in combination with other variable selection algorithms, including penalised unsupervised models that cannot rely on the minimisation of an error term in cross-validation (Zou et al., 2006), or extensions modelling longitudinal (Charbonnier et al., 2010) or count data (Chiquet et al., 2018). The method and its implementation in the R package sharp come with some flexibility and user-controlled choices. Depending on the application and its requirements, the models can be tailored to generate more or less conservative results by (a) adjusting the threshold in PFER controlling the sparsity of the selected sets, and (b) considering features with intermediate selection proportions (between 1 − π and π). The calculation of our stability score can alternatively be based on two categories: (a) stably selected features with H_λ(j) ≥ Kπ, and (b) non-stably selected features with H_λ(j) < Kπ. As this definition would ignore stably excluded features, which also contribute to the overall model stability, it may hamper selection performances.
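The default three-category definition can be made concrete with a small sketch of the count categorisation (thresholding only; this is not the score formula itself):

```python
import numpy as np

def categorise(H, K, pi):
    """Split features into the three categories underlying the stability
    score, from their selection counts H over K resampling iterations:
    stably selected (H >= K * pi), stably excluded (H <= K * (1 - pi)),
    and unstable (intermediate counts)."""
    H = np.asarray(H)
    selected = H >= K * pi
    excluded = H <= K * (1 - pi)
    unstable = ~(selected | excluded)
    return selected, excluded, unstable

# Counts over K = 100 iterations with pi = 0.9: thresholds at 90 and 10.
sel, exc, uns = categorise([95, 60, 3], K=100, pi=0.9)
```

The two-category alternative discussed above simply merges the excluded and unstable groups.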
Nevertheless, the results of stability selection models should always be interpreted with care. Our simulation studies indicate that, even when the assumptions of the model are verified (including the multivariate Normal distribution), the estimated graphical models are not perfect. In particular, some of the selected edges may correspond to marginal relationships (and not true conditional links). On the other hand, the absence of an edge does not necessarily indicate that there is no conditional association between the two nodes (especially for cross-group edges, for which the signal is diluted). Reassuringly, the overall topology of the graph seems relevant, as observed when applied on data with a scale-free graphical structure.
As with all penalised approaches, the stability selection models we propose rely on a sparsity assumption. In regression, this assumption implies that some of the predictors do not contribute to the prediction of the outcome, or provide information that is redundant with that from other predictors. As the stability score S_λ,π we propose is equal to zero when the stability selection model is empty (no stably selected features) or saturated (all features are stably selected), our calibration procedure is only informative for models where the number of stably selected features is between 1 and (N − 1). The validity of this sparsity assumption could be investigated post-hoc using unpenalised models sequentially adding the selected features in decreasing order of selection proportion.
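The post-hoc check suggested above — refitting unpenalised models while adding features by decreasing selection proportion — can be sketched as follows (an illustration using ordinary least squares; function names are ours):

```python
import numpy as np

def incremental_r2(X, y, order):
    """R-squared of unpenalised OLS models sequentially adding the
    predictors in the given order (e.g. by decreasing selection
    proportion); a plateau suggests the remaining features are
    redundant or uninformative."""
    n = len(y)
    tss = (y - y.mean()) @ (y - y.mean())
    r2 = []
    for k in range(1, len(order) + 1):
        Z = np.column_stack([np.ones(n), X[:, order[:k]]])  # add intercept
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        r2.append(1 - resid @ resid / tss)
    return np.array(r2)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X[:, 0] + 0.5 * X[:, 1]  # only the first two predictors matter
r2 = incremental_r2(X, y, order=[0, 1, 2, 3, 4])
```

Here the R² curve plateaus (at 1, since the toy outcome is noise-free) once both contributing predictors have been added.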
The calculation of the stability score relies on the assumption that the feature selection counts are independent. The link between correlation across features and correlation of their selection counts is not obvious and would warrant further investigation. However, the selection and prediction performances of our calibrated stability selection LASSO models do not seem to be affected by the presence of correlated predictors. While stability selection LASSO has been successfully applied on high-dimensional data with almost 450,000 predictors (Petrovic et al., 2022), the stability selection graphical LASSO has limited scalability. The complexity of graphical models increases rapidly with the number of nodes, and despite recent faster implementations of the graphical LASSO (Sustik M.A., 2012), computation times remain high with more than a few hundred nodes. Beyond their computational burden, large graphical models can become very dense, and more efficient ways of visualising and summarising the results will be needed. Alternatively, as structures of redundant interconnected nodes (cliques) can be observed, summarising these in super-nodes could be valuable. This could be achieved using clustering or dimensionality reduction approaches, or by incorporating a priori biological knowledge in the model.

Data Availability Statement
Data sharing is not applicable to this article as no new data were created or analysed in this study. All code and simulated datasets are available on https://github.com/barbarabodinier/stability_selection. The R packages sharp and fake are available on the Comprehensive R Archive Network (CRAN).

Supplementary Methods: existing calibration strategies
In both LASSO-regularised regression and graphical modelling, the calibration of the hyper-parameter λ is critical, as it regulates the size of the set of selected features. State-of-the-art approaches for the choice of λ are based on M-fold cross-validation minimising some error metric (e.g. the Mean Squared Error in Prediction). For graphical models, information theory metrics are commonly used, including the Akaike, Bayesian, and Extended Bayesian Information Criteria (Akaike, 1998; Schwarz, 1978; Foygel and Drton, 2010; Chiquet et al., 2009):

AIC(λ) = −2 ℓ(Ω̂_λ) + 2 |E_λ|,
BIC(λ) = −2 ℓ(Ω̂_λ) + log(n) |E_λ|,
EBIC_γ(λ) = BIC(λ) + 4 γ |E_λ| log(p),

where ℓ(Ω̂_λ) = (n/2) [log det(Ω̂_λ) − tr(Ω̂_λ S)] is the penalised likelihood, |E_λ| is the degrees of freedom (i.e. the number of edges in the graph), and γ is a hyper-parameter specific to the EBIC.

Supplementary Figure S3: Visualisation of the PFER constraint in the calibration of stability selection models. The calibration heatmap shows the stability score (colour-coded) as a function of λ (or the corresponding average number of selected variables q) and π (A). The white area (left) represents models for which the PFER computed using the Meinshausen and Bühlmann approach would exceed the threshold (PFER_MB > 5). The highest stability score obtained for a given penalty parameter λ is represented for the unconstrained (blue) and constrained (red dotted line) approaches (B). The ordered selection proportions obtained from constrained calibration are reported (C). Stability selection is applied on simulated data with n = 100 observations for p = 50 variables, of which 10 contribute to the definition of the outcome, with effect sizes in {[−1, −0.5] ∪ [0.5, 1]} and an expected proportion of explained variance of 70%.

Supplementary Table S2: Median and inter-quartile range of the computation times (in seconds) of stability selection obtained with different numbers of variables p.
Models are applied on 1,000 simulated datasets with n = 500 observations. For stability selection LASSO models, we use p = 1,000, 2,500, 5,000, 7,500 or 10,000 independent predictors, conditionally on the outcome. For stability selection graphical LASSO models, we use p = 100, 250, 500, 750 or 1,000 variables following a multivariate Normal distribution corresponding to a random graph structure (ν = 0.02). For graphical models, we report computation times with or without warm start, where models are iteratively fitted over a path from larger to smaller penalty values and the estimate from the previous iteration is used as the starting point for the gradient descent algorithm (argument "start" in the R package sharp). For LASSO models, we always use warm start, as implemented in the R package glmnet.
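Given a fitted precision matrix, the BIC of a graphical model can be computed directly from the penalised likelihood; a minimal numpy sketch (assuming the standard BIC form; not the sharp implementation):

```python
import numpy as np

def graphical_bic(omega, S, n):
    """BIC for a Gaussian graphical model: -2 * l(omega) + log(n) * |E|,
    where l(omega) = n/2 * (log det(omega) - tr(omega @ S)) and |E| is
    the number of edges, i.e. non-zero off-diagonal entries of the
    precision matrix omega. S is the sample covariance matrix."""
    loglik = n / 2 * (np.linalg.slogdet(omega)[1] - np.trace(omega @ S))
    n_edges = np.count_nonzero(np.triu(omega, k=1))
    return -2 * loglik + np.log(n) * n_edges

# Empty graph (diagonal precision) with identity covariance:
# log det = 0 and tr = p, so the BIC is n * p with no edge penalty.
print(graphical_bic(np.eye(3), np.eye(3), n=100))  # 300.0
```

The EBIC adds the term 4 γ |E| log(p) to this quantity.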
where u > 0 is a parameter to be tuned. The data is simulated from the centered multivariate Normal distribution with covariance Ω^−1. The simulation model is controlled by five parameters: (a) the number of observations n, (b) the number of nodes p, (c) the density of the underlying graph ν ∈ [0, 1], (d) a scaling factor v_b ∈ [0, 1] controlling the level of heterogeneity between blocks, and (e) a constant u > 0 ensuring positive definiteness.
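This recipe can be sketched end-to-end (an illustration of the general construction, not the exact procedure of the fake package):

```python
import numpy as np

def simulate_precision(p, nu, u, seed=1):
    """Simulate a precision matrix from a random graph: off-diagonal
    entries are drawn for each edge of an Erdos-Renyi graph with
    density nu, and the diagonal is shifted by the magnitude of the
    smallest eigenvalue plus u > 0 to ensure positive definiteness."""
    rng = np.random.default_rng(seed)
    A = np.triu(rng.random((p, p)) < nu, k=1)   # random graph adjacency
    W = rng.uniform(-1, 1, size=(p, p)) * A     # edge weights
    B = W + W.T                                  # symmetric, zero diagonal
    shift = abs(np.linalg.eigvalsh(B).min()) + u
    return B + shift * np.eye(p)

omega = simulate_precision(p=20, nu=0.1, u=0.1)
```

Larger values of u inflate the diagonal and shrink the off-diagonal partial correlations, which is why u is tuned to maximise the contrast of the resulting correlation matrix (Supplementary Figure S2).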

Fig. 1 .
Fig. 1. Stability selection LASSO (A-C) and graphical LASSO (D-E) applied on simulated data. Calibration plots (A, D) show the stability score (colour-coded) for different penalty parameters λ, or numbers of selected features q, and thresholds in selection proportion π. We show the selection proportions (B) and a graph representation of the detected and missed edges (E). We report the median, 5th and 95th quantiles of the R² obtained for 100 unpenalised regression models sequentially adding the predictors in order of decreasing selection proportions (C). These models are trained on 50% of the data and performances are evaluated on the remaining observations. True Positives (dark blue), False Positives (red dashed line) and False Negatives (green dotted line) are highlighted (B, C, E). Calibration of the stability selection graphical LASSO ensures that the expected number of False Positives (PFER) is below 20 (D). The two datasets are simulated with p = 50 variables and n = 100 observations. For the regression model, 10 variables contribute to the definition of the outcome, with effect sizes in {[−1, −0.5] ∪ [0.5, 1]} and an expected proportion of explained variance of 70%. For the graphical model, the simulated graph is scale-free.

Fig. 2 .
Fig. 2. Selection performance in stability selection and relevance of the stability score for calibration. The graphical LASSO and stability selection are applied on simulated data with n = 200 observations for p = 100 variables, where the conditional independence structure is that of a random network with ν = 0.02. The F1-score of stability selection models fitted with a range of λ and π values is represented as a function of the stability score (A). Calibrated stability selection models using the unconstrained (dark red) and constrained (red) approaches are highlighted. The precision and recall of visited stability selection models (grey) and corresponding graphical LASSO models (dark blue) are reported (B). The models calibrated using the BIC (beige) or EBIC (brown) are also shown (B).

Fig. 3 .
Fig. 3. Selection performances of state-of-the-art approaches and proposed calibrated stability selection graphical LASSO models. We show the median, quartiles, minimum and maximum F1-scores of graphical LASSO models calibrated using the BIC, EBIC or StARS, and of stability selection graphical LASSO models calibrated via error control (MB in blue, SS in green) or using the proposed stability score (red). Models are applied on 1,000 simulated datasets with p = 100 variables following a multivariate Normal distribution corresponding to a random graph structure (ν = 0.02). Performances are estimated in low (n = 2p = 200, A), intermediate (n = p = 100, B), and high (n = p/2 = 50, C) dimensions.

Fig. 4 .
Fig. 4. Precision-recall plots showing single and multi-block stability selection graphical models applied on simulated data with a block structure. Models are applied on 1,000 simulated datasets (points) with p = 100 variables following a multivariate Normal distribution corresponding to a random graph (ν = 0.02) and with a known block structure (50 variables per group, using v_b = 0.2). The contour lines indicate estimated 2-dimensional density distributions. Performances are evaluated in low (A, n = 2p = 200), intermediate (B, n = p = 100), and high (C, n = p/2 = 50) dimensions.

Fig. 5 .
Fig. 5. Stability selection on real DNA methylation and gene expression data. The stability selection logistic-LASSO with the future lung cancer status as outcome and epigenetic markers of smoking as predictors is calibrated by maximising the stability score (A). The selection proportions in the calibrated model and the explanatory performances of unpenalised logistic models, where the predictors are sequentially added by decreasing selection proportion, are shown (B). The three blocks of a multi-OMICs graphical model integrating DNA methylation and gene expression markers of tobacco smoking are calibrated separately using models where the other blocks are weakly penalised (λ0 = 0.1), while ensuring that PFER_MB < 150 overall (C). The stability selection model includes edges that are stably selected in each block (D).
Supplementary Figure S2: Choice of the value of the parameter u for the simulation of the precision matrix. The contrast of the simulated correlation matrix for a scale-free graphical model with p = 50 nodes and n = 100 observations is represented as a function of the parameter u on the log-scale. The chosen value of u is the one maximising the contrast (indicated by a red dashed line). The heatmaps of correlation matrices with extreme and calibrated values of the parameter u are shown.
Supplementary Figure S9: Single-block graphical model of DNA methylation markers of exposure to tobacco smoking. Calibration is done by maximising the stability score while ensuring that PFER_MB < 70 (A). CpG sites with at least one edge are represented in the graph (B).
Supplementary Figure S10: Heatmap of Pearson's correlations estimated from measured levels of the 159 DNA methylation markers and 208 gene expression markers.
Supplementary Figure S11: Graphical LASSO model of smoking-related methylation (blue squares) and gene expression (red circles) markers, calibrated using the Bayesian Information Criterion (BIC). The BIC is represented as a function of the penalty parameter λ (A). The graphical model generating the smallest BIC is shown (B).
Supplementary Figure S12: Multi-OMICs graphical model integrating DNA methylation (squares) and gene expression (circles) markers of tobacco smoking, with nodes coloured by chromosome.