Stefano Nembrini, On what to permute in test-based approaches for variable importance measures in Random Forests, Bioinformatics, Volume 35, Issue 15, August 2019, Pages 2701–2705, https://doi.org/10.1093/bioinformatics/bty1025
Abstract
In bioinformatics applications, it is currently customary to permute the outcome variable in order to produce inference on covariates when testing novel methods or statistics whose distributions are poorly known. The seminal publication of Altmann et al. in Bioinformatics uses this permutation scheme to obtain P-values that can be treated as a corrected measure of feature importance, rectifying the bias of the Gini variable importance in Random Forests. Since then, the method has also been used in applied work to draw statistical conclusions on variable importance measures from the resulting P-values.
In this paper, we show that permuting the outcome may produce unexpected results, including P-values with undesirable properties and illustrate how more refined permutation schemes can be appropriate to obtain desirable results, including high power in discovering relevant variables.
Supplementary data are available at Bioinformatics online.
1 Introduction
In computational biology, regression analysis is one of the most used statistical methods, either in the form of generalized linear models (Nelder and Baker, 2004) or time-to-event models (Kalbfleisch and Prentice, 2002). Classical parametric methods require explicit modeling of non-linearities and interactions and are only applicable when the number of covariates P is smaller than the sample size n, unless penalized methods are used, e.g. the lasso (Tibshirani, 1996), ridge regression (Hoerl and Kennard, 1970) or the elastic net (Zou and Hastie, 2005). Random Forests (RF), on the other hand, are a popular machine learning method that can handle high-dimensional problems with relatively small numbers of observations and allow non-linearities, complex interactions and correlations among predictors of mixed type. In addition, they provide an assessment of variable importance, which can be used to rank predictors, such as single nucleotide polymorphisms (SNPs) in genome-wide association studies (Lunetta et al., 2004), and to perform feature selection (Altmann et al., 2010; Díaz-Uriarte and Alvarez de Andrés, 2006; Hapfelmeier and Ulm, 2013; Janitza et al., 2015).
Two of the most popular variable importance measures (VIMs) are the classical permutation importance and the total decrease in node impurity (TDNI) measure. The classical permutation importance (Breiman, 2001) is assessed by comparing the prediction accuracy before and after a random permutation of a predictor, calculated for each tree and then averaged over the forest. The TDNI measure for a variable Xi is computed as the sum of the impurity decreases of all nodes in the forest at which a split on Xi has been conducted, averaged across the number of trees in the forest. For classification and regression forests, the Gini index and the estimated response variance are used as the splitting criteria, and the resulting total heterogeneity reduction is referred to as TDNI. For survival forests, the log-rank test (Ishwaran et al., 2008) and the C-index based splitting rule (Schmid et al., 2016) can be used as splitting criteria. Although some authors (Zhang and Singer, 2013) tried to extend the concept of node impurity to survival trees, e.g. as the weighted sum of both the censoring indicator and the time variable, there is no unanimous agreement, so TDNI measures are usually computed for classification and regression forests only. Despite their popularity, it is common knowledge that TDNI measures are biased toward SNPs with larger minor allele frequency (Boulesteix et al., 2012) and covariates offering more splits (Breiman et al., 1984). In order to address this issue, Altmann et al. (2010) introduced a heuristic scheme called PIMP, which is based on repeated permutations of the outcome variable Y inside a classification framework and outputs P-values for each covariate, which can be treated as an unbiased VIM. In the first step, TDNI VIMs are computed on the original data for all the variables. In the second step, the same VIMs are computed in a setting where the variables are not associated with the response, i.e. by permuting the response.
Step 2 is replicated s times, and the resulting VIMs are considered realizations drawn from the unknown null distribution. Altmann et al. (2010) use Kolmogorov–Smirnov tests to automatically identify the most appropriate distribution among normal, log-normal and gamma, with parameters obtained through their maximum likelihood estimates. The non-parametric version determines the fraction of null VIMs that are larger than the importance computed on the original data. Although Steinskog et al. (2007) showed that, when the parameters are estimated from the data, the Kolmogorov–Smirnov test is much too conservative in the sense that its P-values are strongly biased upward, the parametric approach seems to provide results consistent with its non-parametric counterpart, with the benefit of requiring a smaller number s of permutations [50–100 in Altmann et al. (2010)]. Since its introduction, this method has been used in several studies, e.g. on gene expression (Ji et al., 2014), microbiome data (Ning and Beiko, 2015), genetic and epigenetic data (Polak et al., 2015; Prosperi et al., 2014) and oncological data (Pietrantonio et al., 2018). Nevertheless, we point out that the original aim of the paper was to propose a normalization of feature importance measures that corrects the feature importance bias; but since its output is P-values that can be treated as unbiased VIMs, this implicitly suggested that they could be used for statistical hypothesis testing.
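As a sketch, the non-parametric PIMP procedure described above can be written in a few lines. This is not the authors' original implementation: scikit-learn's impurity-based `feature_importances_` stands in for the TDNI VIM, and the function name and the add-one correction in the P-value are our own choices; a real analysis would use far more permutations and trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pimp_pvalues(X, y, s=100, random_state=0):
    """Non-parametric PIMP sketch: the P-value of each variable is the
    fraction of null importances (from s permutations of y) that reach
    or exceed the importance observed on the original data."""
    rng = np.random.default_rng(random_state)
    rf = RandomForestClassifier(n_estimators=100, random_state=random_state)
    observed = rf.fit(X, y).feature_importances_  # impurity (TDNI-type) VIM

    null_imps = np.empty((s, X.shape[1]))
    for b in range(s):
        y_perm = rng.permutation(y)  # break every X-y association
        null_imps[b] = rf.fit(X, y_perm).feature_importances_

    # add-one correction keeps P-values strictly positive
    return (1 + (null_imps >= observed).sum(axis=0)) / (s + 1)
```

The parametric variant would instead fit a normal, log-normal or gamma distribution to each column of `null_imps` and read the P-value off its upper tail.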
2 Desirable properties of P-values
P-values are a well-recognized tool for decision making in all areas of statistical practice. A frequentist P-value was defined by Robins et al. (2000) as a random variable that takes values in [0, 1] and is uniformly distributed under the null hypothesis. In contrast, the distribution of P-values under the alternative hypothesis is a function of the sample size and the effect size, and is skewed toward zero (Hung et al., 1997). In practice, we reject H0 in favor of H1 if we obtain a P-value smaller than a pre-specified significance level α, e.g. α = 0.05. This is possible only if the distribution of the P-value is known to the analyst; otherwise, we have no way of assessing whether observing, say, P = 0.1 is surprising if the null hypothesis is true. In real-life applications, P-values can be modeled through mixture models, e.g. a mixture of beta distributions (Allison et al., 2002), because some variables are associated with the outcome while others are not. That P-values be frequentist P-values is also particularly important in the light of current common practices in bioinformatics, since multiple testing is often an essential step in the analysis of biological data, with methods based on false discovery rates probably being the most popular (Stephens, 2017). False discovery rate methodologies are currently employed in many settings, e.g. differential expression, spectrometric peak detection and SNP discovery, but they usually rely on the assumption that P-values be uniform in [0, 1] under the null hypothesis (Ruppert et al., 2007; Storey and Tibshirani, 2003; Strimmer, 2008).
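The contrast between the two regimes is easy to demonstrate. The following toy simulation (ours, not from the paper) draws one-sample t-test P-values under a true null and under a shifted alternative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_rep, n = 2000, 30

# Under H0 the mean really is 0: P-values should be uniform on [0, 1].
p_null = np.array([stats.ttest_1samp(rng.normal(0.0, 1.0, n), 0.0).pvalue
                   for _ in range(n_rep)])

# Under H1 the mean is 0.7: P-values pile up near zero.
p_alt = np.array([stats.ttest_1samp(rng.normal(0.7, 1.0, n), 0.0).pvalue
                  for _ in range(n_rep)])

print(np.mean(p_null < 0.05))  # close to the nominal level 0.05
print(np.mean(p_alt < 0.05))   # close to 1, i.e. high power
```

Only because the null distribution is uniform does "P < α" correspond to a type I error rate of α; a statistic whose null P-values are not uniform gives no such guarantee.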
3 Permuting the outcome
The practice of permuting the outcome to assess the association between covariates and the outcome is quite common in bioinformatics applications (Hoh et al., 2001; Pan et al., 2014; Poole et al., 2016), particularly when the distribution of a statistic is unknown. The idea of permuting the outcome can be traced back to Manly (1997), who presented it as an appropriate method for exact tests of hypotheses in simple linear regression, i.e. to test the relationship between two variables (see also Edgington, 1995). According to Manly (1997), permutation tests have a variety of advantages over standard statistical methods: they are valid even without random samples, they can be used with non-standard test statistics, the concept of a reference population is not needed, and they only require the exchangeability of the rows of Y and the covariate matrix X.
In the rest of the paper we focus on the permutation and the TDNI VIMs, with the addition of the actual impurity reduction (AIR) VIM presented in Nembrini et al. (2018), based on RFs with maximally selected rank statistics (Wright et al., 2017), which is unbiased with regard to the number of categories and minor allele frequency, and faster to compute than the classical permutation importance, especially for large datasets with a large P, a large n, or both.
3.1 Simple linear regression
McKechnie et al. (1975) collected data on hexokinase 1.00 mobility genes from electrophoresis of samples of the butterfly Euphydryas editha in colonies in California and Oregon, together with the altitudes of these colonies in thousands of feet. Manly (1997) argues that since gene frequencies (Y) and 1/altitude (X) can be shown to have an approximately linear relationship, they can be modeled through linear regression.
Inside a linear regression framework, the relevance of X is generally assessed through a statistical test of H0: β1 = 0 against the alternative hypothesis H1: β1 ≠ 0. The test is two-sided because the effect can be either positive or negative, and it is usually performed with a standard t-test, which gives a P-value of 0.224%. Since the permutation and the AIR VIMs are centered around zero under the null, the test can take the form H0: VIM = 0 against the one-sided alternative H1: VIM > 0. As for TDNI measures, it is less clear how the test of hypothesis should be posed, but P-values are computed as the fraction of null VIMs, obtained through the permutation of Y, that are larger than the TDNI measure computed on the original data (Altmann et al., 2010) (Fig. 1).
The null distribution under the hypothesis of no association can be generated by permuting the values of Y a specified number of times B. Comparing the observed coefficient with the distribution computed using B = 49 999 randomized values gave us a two-sided P-value of 0.04%. Results obtained with a non-parametric bootstrap test (Efron, 1992) are in agreement with the permutation test, with a P-value of 0.058%. The P-values obtained through the permutation and the AIR VIMs are also in agreement (0.366 and 0.28%), while the P-value obtained through the TDNI measure is very high, at 56.79%. As an exploratory tool, we generate the alternative hypothesis H1 from a semiparametric bootstrap, using the estimates from the linear regression and sampling residuals with replacement. As expected, all the test statistics show a shift to the right, while this is not the case for the TDNI VIM. Results obtained from a wild bootstrap (Wu et al., 1986) are similar (Supplementary Figs S1–S3). Although this might seem obvious, it is worth remembering that, when testing a hypothesis, an appropriate test statistic must be chosen so that it can discriminate between H0 and H1 (Bancroft, 2009). These results show a simple yet striking case in which the TDNI cannot serve as an appropriate test statistic.
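A Manly-style permutation test for the slope can be sketched as follows. The butterfly data are not reproduced here, so the example runs on synthetic data; the function name and the add-one correction are our own choices:

```python
import numpy as np

def slope_permutation_pvalue(x, y, B=4999, random_state=0):
    """Two-sided permutation test for the slope of a simple linear
    regression, generating the null by permuting the outcome y."""
    rng = np.random.default_rng(random_state)
    slope = lambda a, b: np.polyfit(a, b, 1)[0]  # fitted slope coefficient
    observed = slope(x, y)
    null = np.array([slope(x, rng.permutation(y)) for _ in range(B)])
    # add-one corrected two-sided P-value
    return (1 + (np.abs(null) >= np.abs(observed)).sum()) / (B + 1)
```

With B = 49 999 this mirrors the number of randomizations used in the text; the slope here plays the role that a VIM plays in the forest-based tests.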
3.2 Multiple linear regression
In order to investigate the distribution of the three VIMs under the permutation scheme for multiple linear regression, we deploy the data collected in Little et al. (2009). It consists of 195 subjects, of which 23 were diagnosed with Parkinson’s disease. The covariates included in this dataset are 22 biomedical voice measurements. The classification task is to predict the class of each patient, based on the sustained vowel phonations.
Scenario A: null case
The outcome variable Y is permuted so that every association with the predictor space is broken.
Scenario B: power case I
The outcome variable Y is generated from a logistic model in which only five predictors are associated with the outcome; the remaining coefficients are zero.
Scenario C: power case II
The outcome variable Y is generated from a logistic model in which the coefficients are sampled with replacement from the non-negative part of a distribution, so that all predictors are associated with Y (Fig. 2).
In Scenario A, since no predictor is informative, the distribution of the P-values for all three VIMs is uniform, as expected, and it is skewed toward 0 in Scenario B for informative predictors. The P-values obtained by the PIMP are hill-shaped, i.e. more concentrated toward 0.5, for uninformative predictors in Scenario B. This is because their distribution before permuting the outcome has a smaller variance (Supplementary Fig. S4: bottom left). On the contrary, the TDNI VIM is shifted to the right after the permutation (Supplementary Fig. S4: bottom right); this makes the P-values skewed toward 1 for uninformative predictors, because when the outcome is permuted, unassociated predictors are allowed to split more often, providing a larger reduction in impurity. Finally, the AIR importance provides uniform P-values in this scenario, because the AIR importance has the same distribution for unassociated predictors regardless of whether Y is permuted (Supplementary Fig. S4: top left). Under the scheme of Scenario C, the AIR and the permutation VIMs work well, while the TDNI VIM produces U-shaped P-values. This makes the original method by Altmann et al. (2010) unappealing in scenarios with many associated predictors. Hapfelmeier and Ulm (2013) explain that the permutation scheme used in Altmann et al. (2010) leads to low power in detecting relevant variables. They proposed a method which generates the null distribution by permuting the covariate of interest Xi instead of permuting the outcome. Under this permutation scheme, the distribution of the classical permutation importance for unassociated predictors is the same as in the original data (Supplementary Fig. S4: top right). The P-value for the variable is computed non-parametrically as in Altmann et al. (2010).
While the PIMP method requires s + 1 replications, the computation of P-values for all variables as in Hapfelmeier and Ulm (2013) requires computing as many forests as predictor variables multiplied by the number of permutation runs, i.e. on the order of P × s replications. It therefore becomes computationally burdensome when P gets large (Table 1).
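The permute-one-covariate scheme of Hapfelmeier and Ulm (2013) can be sketched as follows, with the simplifying assumption that scikit-learn's impurity-based `feature_importances_` stands in for the VIM (the authors use the classical permutation importance). The refit inside the double loop makes the P × s cost visible:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def hu_pvalues(X, y, s=50, n_trees=100, random_state=0):
    """Per-covariate permutation sketch: for each variable Xj, refit
    the forest s times with only column j permuted, and compare the
    null importances of Xj to the importance observed on the data."""
    rng = np.random.default_rng(random_state)
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=random_state)
    observed = rf.fit(X, y).feature_importances_
    n, P = X.shape
    pvals = np.empty(P)
    for j in range(P):
        null_j = np.empty(s)
        for b in range(s):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # permute only column j
            null_j[b] = rf.fit(Xp, y).feature_importances_[j]
        # add-one corrected non-parametric P-value for variable j
        pvals[j] = (1 + (null_j >= observed[j]).sum()) / (s + 1)
    return pvals
```

Because all other columns stay intact, the forest can still perform its informative splits, so the importance of the permuted variable collapses when (and only when) that variable was genuinely informative.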
Table 1. Test statistics and P-values for the simple linear regression example

|    | Method                   | Value | P-value (in %) |
|----|--------------------------|-------|----------------|
| LR | permutation              | 29.15 | 0.036          |
|    | t-test                   | 4.83  | 0.224          |
|    | non-parametric bootstrap | 29.15 | 0.058          |
| RF | TDNI VIM                 | 15.64 | 56.79          |
|    | permutation VIM          | 0.98  | 0.366          |
|    | AIR VIM                  | 1.11  | 0.28           |
4 Wrapping it up
From Section 3.2, we know that unassociated predictors for which the VIM under the permutation scheme has a larger variance will yield conservative P-values. In this section, using the six datasets described in Table 2, different permutation approaches are evaluated in detail, namely the parametric and non-parametric PIMP, the same approach applied to the permutation VIM and to the AIR VIM, and the method proposed by Hapfelmeier et al. (2014) (HU), all implemented with 100 trees and replicated 100 times. The P-values are computed non-parametrically, except for the parametric PIMP, which uses a normal, log-normal or gamma distribution (Kolmogorov–Smirnov tests automatically identify the most appropriate distribution), and for one method for which they are computed assuming a normal distribution after applying the Gaussianization obtained through the rank-based transformation proposed by Van der Waerden (1952) (Supplementary Fig. S5). Simulations also show that 100 iterations provide P-values as accurate as those estimated non-parametrically with 4999 iterations, with a concordance correlation coefficient (Lawrence and Lin, 1989) close to one (Supplementary Fig. S6). All these methods require 10 000 trees, while HU requires 80 000–1 340 000 trees to be grown. An investigation of the method proposed by Hapfelmeier et al. (2014) leads to an alternative new method: if only a single variable is permuted at each step, the RF is still able to perform informative splits, and uninformative variables achieve a very small importance; however, their method requires re-fitting the forest and re-computing the VIMs after each permutation of a predictor variable Xi, making it computationally demanding.
So instead of refitting a new forest for every permutation, one could extend the original dataset with k row-wise permutations of Xi and compute the AIR VIM and its corresponding null values inside the same forest (xForest). Here k was set to 100, as in the other permutation-based methods. Additional simulations in the Supplementary Material show that a certain number of trees may be needed for the null distributions to be estimated with precision: for small datasets, convergence is obtained after 25–100 trees. For comparison, and to avoid a suboptimal choice, we used 5000 trees for all datasets, which is still half the number of trees required by the permutation-based methods.
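The augmentation idea can be sketched as follows. This illustrates the bookkeeping only: scikit-learn's impurity importance stands in for the AIR VIM used in the paper (and therefore re-introduces the bias that AIR avoids), and the function name and defaults are our own placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def xforest_pvalues(X, y, k=100, n_trees=300, random_state=0):
    """xForest-style sketch: append k permuted 'shadow' copies of every
    variable and fit a single forest; for each variable the k shadow
    importances serve as its null distribution."""
    rng = np.random.default_rng(random_state)
    n, P = X.shape
    # k independent row-wise permutations of each original column
    shadows = np.column_stack(
        [rng.permutation(X[:, j]) for j in range(P) for _ in range(k)]
    )
    X_aug = np.hstack([X, shadows])
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=random_state)
    imp = rf.fit(X_aug, y).feature_importances_
    observed = imp[:P]
    null = imp[P:].reshape(P, k)  # row j holds the k null values for Xj
    # add-one corrected non-parametric P-values, one per variable
    return (1 + (null >= observed[:, None]).sum(axis=1)) / (k + 1)
```

A single (larger) forest thus replaces the P × s refits of the per-covariate scheme, at the price of a wider design matrix.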
Table 2. Datasets used in the evaluation

| Task           | Dataset                            | n   | P   |
|----------------|------------------------------------|-----|-----|
| Classification | Wisconsin Prognostic Breast Cancer | 198 | 32  |
|                | South African Heart Disease        | 462 | 9   |
| Regression     | Blood Brain Barrier                | 208 | 134 |
|                | Prostate Cancer                    | 97  | 9   |
| Survival       | NKI Breast Cancer                  | 144 | 75  |
|                | German Breast Cancer Study Group   | 686 | 8   |
4.1 Simulation setup
We used the design matrix of the original datasets and simulated the outcome with a given effect size. The classification outcome was generated from a logistic distribution, the regression outcome from a normal distribution, and the survival outcome as the minimum of a survival time and a censoring time, both assumed to be exponentially distributed. For datasets with P > 10 we used four effect variables each; together with those chosen for the datasets with P < 10, this totals 20 effect variables. All other variables were not associated with the outcome. Correlation between covariates was removed by permuting their values prior to each iteration. This was replicated 1000 times, and the power and the type I error were computed for each of the methods under comparison.
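The outcome generators can be sketched as follows. The paper's exact coefficients and exponential rates are not given in this section, so the parameters below are placeholders chosen for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_binary_outcome(X, beta, rng):
    """Classification outcome from a logistic model:
    P(y = 1 | x) = 1 / (1 + exp(-x @ beta))."""
    prob = 1.0 / (1.0 + np.exp(-X @ beta))
    return rng.binomial(1, prob)

def simulate_survival_outcome(rate_event, rate_censor, n, rng):
    """Observed time = min(event time, censoring time), both exponential;
    status = 1 when the event is observed (i.e. not censored)."""
    t_event = rng.exponential(1.0 / rate_event, n)
    t_censor = rng.exponential(1.0 / rate_censor, n)
    return np.minimum(t_event, t_censor), (t_event <= t_censor).astype(int)

# Illustrative run: two effect variables, three null variables
X = rng.normal(size=(200, 5))
beta = np.array([1.0, 1.0, 0.0, 0.0, 0.0])  # placeholder effect sizes
y_class = simulate_binary_outcome(X, beta, rng)
time, status = simulate_survival_outcome(1.0, 0.5, 200, rng)
```

With exponential rates λe and λc, the expected fraction of uncensored observations is λe / (λe + λc), which is a convenient knob for controlling the censoring level across replications.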
4.2 Results
The parametric and non-parametric PIMP have globally the lowest performance, consistent with the results previously found in Degenhardt et al. (2017) and Hapfelmeier et al. (2014). This is because, as explained in Section 3.2, non-informative predictors obtain very large P-values, so PIMP does not control the type I error. When P < 10, a similar pattern appears: rather conservative behavior and low power for small effect sizes compared to all other methods, while larger effect sizes are detected with good power. HU controls the type I error and is one of the most powerful methods, with xForest being on par or slightly more powerful. The remaining approaches also control the type I error, with comparable power (Fig. 3).
5 Conclusions
The idea of permuting the outcome variable Y in RFs is straightforward, and mimics how test statistics are generated under the null hypothesis in many applications in the life sciences, most notably generalized linear models. This idea was first presented in Altmann et al. (2010) as a normalization procedure for biased VIMs, in order to obtain P-values, called PIMP, that could be treated as corrected measures of feature importance. We have shown that under this permutation scheme, drawing inference on the VIMs using these P-values may lead to unexpected and undesirable results, particularly if TDNI measures are used. Considering that the original PIMP method is still being used in applied work to draw statistical conclusions from the resulting P-values (Ahmad et al., 2018; Tokheim and Karchin, 2018; Wang et al., 2018; Ye et al., 2018), it is of paramount importance that practitioners be aware of this. In order to obtain P-values with desirable properties, more refined permutation schemes are needed, e.g. repeatedly permuting the covariate of interest as in Hapfelmeier et al. (2014), or injecting the original data with permutations of the covariates, which provides similar results while growing a much smaller number of trees.
Conflict of Interest: none declared.