Knockoff Boosted Tree for Model-Free Variable Selection

In this article, we propose a novel strategy for conducting variable selection without prior model topology knowledge using the knockoff method with boosted tree models. Our method is inspired by the original knockoff method, where the differences between original and knockoff variables are used for variable selection with false discovery rate control. The original method uses Lasso for regression models and assumes there are more samples than variables. We extend this method to both model-free and high-dimensional variable selection. We propose two new sampling methods for generating knockoffs, namely the sparse covariance and principal component knockoff methods. We test these methods and compare them with the original knockoff method in terms of their ability to control type I errors and power. The boosted tree model is a complex system and has more hyperparameters than models with simpler assumptions. In our framework, these hyperparameters are either tuned through Bayesian optimization or fixed at multiple levels for trend detection. In simulation tests, we also compare the properties and performance of importance test statistics of tree models. The results include combinations of different knockoffs and importance test statistics. We also consider scenarios that include main-effect, interaction, exponential, and second-order models while assuming the true model structures are unknown. We apply our algorithm for tumor purity estimation and tumor classification using the Cancer Genome Atlas (TCGA) gene expression data. The proposed algorithm is included in the KOBT package, available at \url{https://cran.r-project.org/web/packages/KOBT/index.html}.


Introduction
In nearly all existing variable selection methods, a set of predictor candidates must be specified before selection (Li et al., 2005). Predictors in the final selected model are a subset of candidate set. However, it can be challenging to determine an appropriate set of predictor candidates, especially considering the order (e.g., quadratic, cubic) of main effects and interactions with a large number of variables. Li et al. (2005) stated the difference between model selection and variable selection is whether to assume a model for comparison. One of the advantages of model-free variable selection is that its performance is not limited by the choice of model formulation. Model-free methods such as Random Forest (Breiman, 2001), AdaBoost (Rätsch et al., 2001), and convolutional neural networks (LeCun et al., 1995) are widely used in research and industry for regression and classification purposes. These methods recognize data patterns without the need to impose specific model structures on regression functions. This independence allows model-free methods more flexibility, thus making them highly appropriate in many cases. For example, Gao et al. (2018) reported that compared with other methods, model-free methods provide more reliable clinical outcomes when forecasting falls of Parkinson's patients.
Among the widely used model-free methods, tree-based models have demonstrated greater effectiveness and interpretability than most other learning algorithms. In a standard regression model, an interaction term is not estimated if it is not specified in the model. In contrast, tree-based methods automatically include interactions during tree growing, without the requirement of an a prioiri formula preconception (Su et al., 2011). Further, tree-based models are more naturally associated with nonlinear interactions (Schiltz et al., 2018), and the branches in a tree model are original interaction variables. Friedman (2001) stated that the number of terminal nodes of a boosted tree model should depend on the highest order of the dominant interactions among the variables. This implies that (1) terminal (leaf) nodes in tree models cannot be interpreted simply as main effects of original variables and (2) growth of a tree model is selective.
Variable selection in tree models is affected by the number of categories of features (Kim and Loh, 2001), with variables with more categories preferred. To eliminate bias in variable selection, Loh (2002) proposed the GUIDE algorithm, which is used to conduct chi-square analysis of residuals and bootstrap calibration of significance probabilities. Gini importance in random forest methods (Breiman, 2001) is used to measure improvement in the splitting criterion produced by each corresponding variable. To generate an unbiased importance index, Breiman and Cutler (2008) proposed permutation importance by statistical permutation. In addition to regression tree and random forest models, attention has also been paid to variable selection in boosted tree models, given their superior performance in data pattern recognition. Miller et al. (2016) extended gradient boosted tree to multivariate tree boosting and performed nonparametric regression to identify important variables. Gain, weight, and cover are three common importance ranking measures in the highly successful XGBoost (Chen and Guestrin, 2016). SHapley Additive exPlanation (SHAP) (Lundberg and Lee, 2017b) has also been proposed to interpret model predictions, and its performance has been highly consistent when applied in tree models (Lundberg et al., 2018).
While most data scientists are interested in model prediction accuracy, inference from analysis is also important. Originally described for Random Forest (Breiman, 2001), the permutation method has been widely used to generate a null distribution of test statistics to detect significance in tree models. Unfortunately, some statistical inference, such as accurate false discovery rate (FDR) control, is not possible with permutation-based methods. For example, in a linear regression problem, a permutation-based construction may underestimate the false discovery proportion (FDP) in cases where the design matrix displays nonvanishing correlations (Barber et al., 2015). Under a global null, p-values from permutation tests are exact. However, under an alternative hypothesis, rejecting the null based on permutation tests does not necessarily imply a valid rejection (Chung et al., 2013). To control the FDR in variable selection, Barber et al. (2015) proposed a novel statistical framework, the knockoff framework, for low-dimensional (n ≥ p) fixed design. The knockoff framework was later extended to high-dimensional (n < p) and random design scenarios (Candes et al., 2018). The method generates knockoff variables without the need to collect new data or consider response variables while retaining the correlation structure of the original variables. These knockoff variables are used as a negative control group, and a variable selection procedure is applied for both the original and knockoff variables. Test statistics measuring the importance of both kinds of variables can be compared to compute symmetrized knockoff statistics. Finally, a threshold for these statistics is determined according to the (desired) target FDR level. The goal of using the knockoff method is discovering relevant variables while controlling the FDR.
In this article, we propose the knockoff boosted tree (KOBT) algorithm, a model-free variable selection algorithm that includes boosted tree (Friedman, 2001;Chen and Guestrin, 2016) and model-X knockoffs (Candes et al., 2018). Model-X knockoff variables are generated with the same correlation structure as the original variables. Test statistics, such as tree SHAP (Lundberg et al., 2018), Cover, Frequency, Gain, and Saabas are generated for each pair of original and knockoff variables, and the consistency and accuracy of the produced test statistics are compared. The FDR is controlled using the difference between original and knockoff test statistics. Further, regularization parameters in boosted tree models are tuned through Bayesian optimization (Snoek et al., 2012).
In the following sections, we describe our analytical framework, including (1) the generation of knockoff variables; (2) fitting and Bayesian optimization in boosted tree regression; and (3) consistency, accuracy, and FDR control in model-free variable selection. We validate our approach with simulation experiments. Finally, we demonstrate our approach on tumor purity estimation and tumor classification using the Cancer Genome Atlas (TCGA) gene expression data.

Parametric Functions of a Single Regression Tree
Denote y n×1 as n responses and X n×p as p independent variables each with n values. In regression trees, y n×1 can be continuous or ordered discrete. The goal is to minimize an objective function, which may contain a loss function component and a regularization component, for a combination of all leaves (terminal nodes) on a tree. Given a tree with m terminal nodes (i.e., the whole sample contains m partitions), if we model the response as a constant in each region, the estimated regression tree function is defined aŝ where I(·) is a function of x, and I {x i ∈ R p k } = 1, if the i-th row vector of X, x i ∈ R p k , the k-th disjoint partition region and otherwise I {x i ∈ R p k } = 0. R p k is a split region or a so-called leaf, k = 1, ..., m. With the squared L2 norm loss function, we have estimator and Solving Equation 3, the k-th estimated parameter of β iŝ where I k {x i ∈ R p k } = 0 for a fixed R p k . This is the sample mean of y on that leaf of the regression tree. Meanwhile, to grow a tree model, the data need to be partitioned or split. For the j-th variable x j , we define a split on x j with some value c as Based on Equation 4, we have two estimateŝ Thus, an optimal split is defined aŝ To avoid overfitting, similar to penalized regression, a penalty term is added to the cost-complexity criterion of a regression tree model where T (X n×p ; Θ) = f (X n×p ) is a regression tree model, Θ = {R, β} contains tree structure parameters, |T | is the cardinality of terminal nodes in T , and γ ≥ 0 is a tuning parameter.

Boosted Tree Methods and Regularization
To overcome the bias and high-variance problem in a single regression tree, ensemble tree models, such as bagging or boosting structures, have been proposed. Defining a boosted tree model as a sum of B trees in Section 2.1, In a forward stagewise algorithm , the B-th tree structure is found by solvinĝ where L(Θ B ; y, f B−1 ) is a loss function based on the B-th tree structure, given response y and previously fitted B − 1 tree models f B−1 .
In boosted tree model fitting, the objective function usually contains a loss function and penalty function for tree complexity (i.e., regularization for tree models). The complexity depends on the number of trees in a sequence, the depth of each tree, and the number of terminal nodes in each tree. One way to control the complexity of tree model fitting is to set a minimum number of instances in a single terminal node. For example, the min_child_weight parameter in XGBoost (Chen and Guestrin, 2016) controls the minimum sum of instance weight (Hessian) in a terminal node, which is equivalent to a minimum number of instances if all instances have a weight of 1. For simplicity, here, we assume all weights are 1. Combining Equation 8 and Equation 10, the B-th objective function is defined as where |T (X, Θ B ))| is the cardinality of terminal nodes in the B-th tree, and γ ≥ 0 is a tuning parameter. Similar to elastic net (Zou and Hastie, 2005), XGBoost introduces L1 and L2-norm penalty terms into objective functions. Therefore, the objective function for the B-th tree can be updated as where β B is a vector of leaf weights in the B-th tree, and γ ≥ 0, λ ≥ 0, and α ≥ 0 are tuning parameters.
Besides setting the minimum number of instances in terminal nodes and applying penalization on node weights, regularization can be achieved by limiting the maximum depth of a tree, applying the maximum number of trees in a boosting sequence, or defining an early stopping criteria for boosting (Zhang et al., 2005). Given the numerous parameters and hyperparameters of regularization and tree structure, it is tedious to find the optimal group of parameters by grid search. The parameters and their optimization are summarized in the following section.

Regularization Parameters and Bayesian Optimization
As reviewed by Nielsen (2016), there are three kinds of regularization parameters: boosting parameters, randomization parameters, and tree parameters. Boosting parameters include the number of trees B (i.e., the number of boosting iterations) and the learning rate η. A small step-size of the learning rate has been found to play an important role in the convergence of boosting procedures (Zhang et al., 2005). In simulation studies, decreasing the learning rate increased the performance of boosted models (Friedman et al., 2000). Using a smaller learning rate and thus a relatively larger number of trees has been suggested, given the relationship between the learning rate and the number of boosted models. A drawback of this procedure is increased running time for model fitting. The randomization parameters refer to the row subsampling parameter and the column subsampling parameter. In the random forest algorithm (Breiman, 2001), only a random subsample of the training data set is used for building model. Friedman (2002) built on this idea and introduced stochastic gradient boosting for the boosting tree algorithm. Basically, the row subsampling parameter controls the ratio for a subset of the data, and the column subsampling parameter determines the ratio for a subgroup of features in each tree fitting. Both boosting parameters and randomization parameters are used for general regularization since they are used in the optimization of all kinds of boosting models. However, tree parameters are for tree models only. Therefore, in this article, the discussion is focused on tree parameters and their optimization.
Note that all the tree parameters implicate a tradeoff between bias and variance. In each individual tree, besides the penalization parameters γ, λ, and α in Equation 12, the tree parameters also include structure parameters, the maximum depth of the tree, and the minimum sum of observation weights required for each leaf. Given the maximum depth of a tree, D, using the decomposition of target function introduced in Friedman et al. (1991), the b-th individual tree function in Equation 9 can be defined as where g (1) (x j ) is the first-order interaction (main effect) of the j-th variable, x j , g (2) (x j , x k ) is the second-order interaction between x j and x k , and so on. From Equation 13, it is clear that the highest order of interaction is the maximum depth of a tree, D. The minimum sum of observation weights in a leaf determines the variance ofβ's, estimated weights of leaves in a tree. If the minimum sum is large, the growth of a tree will be conservative, which means fewer leaves will be grown and thus a smaller variance ofβ's.
Finally, we discuss how Bayesian optimization (Snoek et al., 2012) is applied for tuning γ, λ, and α in Equation 12. We choose Bayesian optimization because brute-force search such as grid search or random search is time-consuming in a scenario where there are three hyperparameters. Here we treat hyperparameter optimization as a black-box process. We define a configuration of tuning hyperparameters, θ = (γ, λ, α), for the function we want to minimize, where CVTE(·) is a K-fold cross-validation error score, x i is a row vector containing p features of an observation, and f B (·) is the fitted boosted tree model with B individual trees. An early stopping criteria states that if adding new trees does not decrease cross-validation error within five trees, the boosting iteration should be stopped before the maximum number of trees is reached. The value of B is equal to the number of trees in the sequence, where the combination of trees provides the lowest cross-validation error score. This error score is the output value from CVTE(·) in Equation 14.
The support sets of γ, λ, and α are located within a tuning region, [0,20]. The optimal combination is used in the final model for comparison. Details are provided in Section 3.1. Barber et al. (2015) proposed the knockoff filter, a new variable selection method that controls the FDR. The knockoff filter generates knockoff variables that mimic the correlation structure of original variables but are not associated with the response. These knockoff variables are used as controls for the original variables, so only the original variables that are highly associated with the response are selected. The knockoff filter has been shown to provide accurate FDR control, which cannot be realistically achieved using permutation methods. Barber et al. (2015)'s filter works under the following conditions, where for a fixed design, X n×p , n > p, and y follows a linear Gaussian model. Candes et al. (2018) introduced model-X knockoffs, which extended the model assumption to high-dimensional and covariate selection from random known distributions. Following the definition of model-X knockoffs in Candes et al. (2018), we restate the definition for boosted tree models here. Definition 2.1. For a row vector of p random variables x 1×p = (x 1 , x 2 , ..., x p ), where each x j is a random variable that represents a feature, the corresponding model-X knockoff variables z 1×p = (z 1 , z 2 , ..., z p ) are constructed such that:

Knockoff Variables in Boosted Tree Models
1. for a combined random vector (x, z) 1×2p , if its j-th random variable is switched with its (j + p)-th random variable (j = 1, ..., p) (i.e, an original variable is switched with its knockoff counterpart), the distribution of the new random vector is invariant to (x, z) 1×2p ; and Based on the definition, to achieve high-dimensional FDR-controlled variable selection, qualified knockoff variables should satisfy two properties: (1) distribution equality with original variables and (2) independence of response. Two sampling methods, namely exact constructions and approximate construction, have been introduced to generate knockoffs (Candes et al., 2018). For exact construction, a new knockoff variable z j is sampled from the conditional distribution L(x j |x −j , z 1:j−1 ), where j = 1, ..., p, and x −j = (x 1:(j−1) , x (j+1):p ). Approximate construction focuses on whether (x, z) 1×2p retains its first two moments of a distribution after swapping. Given this summary of the two methods, it is apparent that the approximate construction method requires less complex computations than the exact construction method.
In this article, we compare three algorithms for knockoff generation. The first algorithm, Approximate Construction (AC), is available in the knockoff R package (Candes et al., 2018), where the covariance matrix of original variables, cov(X n×p ), is estimated directly. The estimated covariance matrix is shrunk to the identity matrix if it is not positive definite (Opgen-Rhein and Strimmer, 2007). In addition, we propose two other algorithms: one that is similar to the AC algorithm described above and one without Gaussian assumption. The second algorithm (i.e., the one that is similar to the AC algorithm) is a Sparse Constructions (SC) algorithm. Instead of estimating the covariance matrix directly, we conduct sparse estimation of the covariance matrix (Bien and Tibshirani, 2011), which generates a sparse matrix with a simpler structure. The rest of the algorithm is the same as the AC algorithm. The third algorithm (i.e., the one without Gaussian assumption) is a Principal Component Constructions (PCC) algorithm, which is inspired by Algorithm A.1 in Shen et al. (2019). In accordance with Definition 2.1, we describe our proposed PCC algorithm in Algorithm 2.1 below. Unlike the AC and SC algorithms, the PCC algorithm does not require data with a Gaussian assumption.
For each column vector of original variable x j , where j = 1, ..., p,

Conduct principal component analysis on matrix
, where X −j is matrix X without the j-th column and Z 1:j−1 is the first (j − 1) columns in matrix Z. When j = 1, Z 1:j−1 is empty.
2. Denote K as the number of principal components chosen for a regression model, K = 1, ..., n − 1. For a fixed K, fit x j on K PCs. There is a tradeoff in that the larger the K, the more akin the knockoff will be to the original variables. This results in a smaller type 1 error but weaker power of the test.
4. Permute ε n randomly. Denote the permuted vector as ε * n . 5. Set z j =x j + ε * n , and combine it with the current knockoff matrix Z 1:j−1 .
This algorithm was designed in accordance with our definition of knockoff. Using linear regression models, the empirical conditional distribution of x j , L(x j |X −j , Z 1:j−1 ) can be estimated. Using permutation, we eliminate the Gaussian assumption. Meanwhile, the generated z j is independent of the response y, since y is ignored in our algorithm. Note that this is a sequential algorithm, so computation could be expensive. In Shen et al. (2019), the residuals are assumed to be approximately independently and identically distributed. We keep this assumption for our algorithm.
To evaluate knockoffs generated by these methods, we propose the Mean Absolute Angle of Columns (MAAC) metric for checking vector independence, and use the Kernel Maximum Mean Discrepancy (KMMD) metric to test distribution similarity (Gretton et al., 2007).

Definition 2.2. Mean Absolute Angle of Columns (MAAC)
For two column vectors with the same length, x and z, we define For two matrices with the same dimensions, X and Z, with p columns, we define MAAC indicates the correlation between corresponding columns in two matrices. For knockoff variable selection, we know that the weaker the correlation, the more powerful the test. The KMMD metric is used to perform a nonparametric distribution test. The null hypothesis test, H 0 , is that the row vectors x i in X and z i in Z come from the same distribution. In summary, MAAC is for type 2 error control, and KMMD is for type 1 error control. In Section 3.2, simulation tests are described, and permutation tests are used for comparison.

Variable Importance Test Statistics in Tree Models
For each original and knockoff variable pair, Barber et al. (2015) suggested the Lasso signed max (LSM) statistic as the test statistic for variable selection. Candes et al. (2018) later proposed the Lasso coefficient difference (LCD) statistic for the same purpose. While both of these test statistics were designed specifically for Lasso, there are also test statistics designed for tree models. Table 1 provides a list of common variable importance test statistics for boosted tree models. Given that different test statistics will provide different results for importance ranking, it is difficult to determine which test statistic is the best choice in a specific scenario.

Test Statistic Definition Cover
The number of times a variable is used, weighted by the amount of training data in the corresponding nodes Gain (Breiman et al., 1984) The total loss reduction gained when using a variable Saabas (Saabas, 2014) The change in the model's expected outputs Tree SHAP (Lundberg et al., 2018) The sum of weighted difference of conditional expectations with and without a variable Weight The number of times a variable is used Among these statistics, SHAP, which is based on game theory and local explanations, has the demonstrated advantage of retaining both consistency and local accuracy (Lundberg and Lee, 2017b). Lundberg and Lee (2017a) introduced feature attribution for tree ensembles using the additive feature attribution methods defined by Lundberg and Lee (2017b). They also proposed the Tree SHAP algorithm, which reduces the complexity of computing exact SHAP values is the maximum number of leaves in any tree, p is the number of variables, and D is the maximum depth of any tree. Further, Lundberg et al. (2018) extended SHAP values to SHAP interaction values. In additive feature attribution methods, f (·) in Equation 9 is the original model, and an explanation model, h(·), is defined for f (·) such that where u ∈ {0, 1} p is a binary p dimensional column vector, u j = 1 if the j-th variable is observed, u j = 0 if it is missing, and φ stands for feature attribution values. The exact value of each φ j can be calculated as where P is the set of all variables, S is a subset of P with non-zero indices in u, and f x (S) = E[f (X)|X S ] is defined as a conditional expectation. This φ j is the SHAP value for the j-th variable.
Following the concept outlined by Barber et al. (2015) and Candes et al. (2018), we move on to define a test statistic for knockoff variables in tree models that can control the FDR. For simplicity, we use the SHAP value as an example, but other test statistics can be applied to the knockoff variable using a similar procedure. We denote a pair of original and knockoff variables as x j and z j , which is the same notation used in Section 2.4. Accordingly, their respective SHAP values are φ x j and φ z j . The hypothesis test in which we are interested is: H 0 : The original variable, x j , and the knockoff variable, z j , are equally associated with response; H a : The original variable, x j , is more closely associated with response than the knockoff variable, z j .
Here, the Shapley explanation absolute difference (SEAD) statistic is defined as where γ, λ, and α are hyperparameters used in Equation 14. A larger positive T j indicates that the response is more likely to be associated with the original variable than its knockoff. Under the null hypothesis, T j is equally likely to be positive or negative. Given that null T j s are symmetric, for any fixed t > 0, the false discovery proportion is This can be estimated byF whereFDP(t) = 1 when there are more T j s below −t than above t. Candes et al. (2018) provided equations for both the modified FDR and the typical FDR. Here, we accept the latter as it is more conservative. A threshold τ is defined such that where δ is the level under which we would like to keep the FDR. A set of selected variables is determined aŝ Then, the FDR is controlled as where S 0 ⊆ {1, 2, ..., p} is a set of noise variables. The performance of the SEAD statistic in various models is evaluated in the following sections, and test statistics constructed using other methods are tested for comparison.

Knockoff Boosted Tree Algorithm
To perform model-free variable selection with FDR control, we propose a robust multiple-stage KnockOff Boosted Tree (KOBT) algorithm. Figure 1 is a flowchart that outlines details of the procedure. The algorithm contains four main steps and one optional step (circled in red): (1) sample knockoff matrix Z according to original matrix X; (2) grow boosted trees using combined predictors (X, Z); (3) calculate the test statistic, φ, for each variable from the fitted boosted tree model, f B (X, Z); and (4) conduct steps (1) to (3) q times and get T j = 1 q q m=1 |φ x j,m | − 1 q q m=1 |φ z j,m | for the j-th original variable. A larger positive T j indicates a variable is more closely associated with response. The optional step is taken in the scenario in which some variables (shown as W in Figure 1) need to be retained in the final model. The first part of the algorithm is optional, and a regression model is added before the boosted tree model. Given that we want to retain some covariate variables in our final model, we conduct a regression, where y 0 is the original responses, W is the covariate matrix, and y stands for residuals from this optional regression. This y is treated as new responses for the subsequent boosted tree fitting.
For the first required step of KOBT, we generate knockoff variables Z conditional on original variables X. With the assumption that the data in X follow a Gaussian distribution, the column mean and covariance matrix of X are estimated. As discussed in Section 2.4, there are two choices for covariance matrix estimation: shrinking the estimation to the identity matrix (Opgen-Rhein and Strimmer, 2007) and sparse estimation (Bien and Tibshirani, 2011). Without a Gaussian assumption, we propose a strategy based on principal components and permutation (see Section 2.4 for details). The properties of knockoff variables are compared in Section 3.2.
During the process of model fitting, hyperparameters γ, λ, and α are tuned through Bayesian optimization. Once a boosted model is fitted, test statistics showing importance are calculated. Multiple statistics are considered, such as gain, cover, weight, SHAP value, and Saabas value. Their values are discussed in Section 3.3. The above steps, from generating knockoff variables to calculating test statistics, are repeated q times to achieve the mean absolute value for each variable, where q should be sufficiently large. According to the strategy of applying knockoff variables (Barber et al., 2015;Candes et al., 2018), for each pair of original and knockoff variables, a test statistic is calculated, which is the difference between the mean absolute test statistic values of each variable pair. It is straightforward that a larger test statistic proves that the original variable is more important than its corresponding knockoff. Finally, the FDR of variable selection is controlled through values of all generated T s, as shown in Equation 22.
We use simulation tests to address the following topics: (a) control of type I and type II errors for knockoffs generated by different methods; (b) the properties of different variable importance statistics in boosted tree models; and (c) power and false discovery control using various combinations of knockoffs and statistics. Further, the performance of boosted tree fitting for different model structures is compared.
3 Simulation Studies

Incorrect but Related Variables in Boosted Tree Models
First, we describe the models and the boosted tree framework used in the following simulation tests. For the design matrix X 0 n×p , we simulate n = 200 samples and p = 1000 variables. Each row vector in design matrix X 0 n×p is generated as x i ∼ N p (0, Σ) with block dependence structure matrix Σ = diag(Σ 1 , Σ 2 , Σ 3 , ...), where each Σ i is a πp × πp matrix with matrix element σ j,k = (ρ |j−k| ). We set π = 0.01 and ρ = 0.1. Because we want to test model-free variable selection, we generate the response as where β = (β πp , 0 p−πp ) , β πp is a vector with equal elements, ε ∼ N p (0, I p ), and X is transformed from initial X 0 with four different structures, namely: 1. Main-effect model: X = X 0 .
3. Exponential effect model: each While we generate the response y using the transformed X, we work under the notion of having no information about the transformed X and fit the models with the initial design matrix X 0 . Thus, we use incorrect but related variables for fitting boosted tree models. While this is challenging, there is no guarantee that the correct variables will be used for real data modeling, so it is possible to see boosted tree modeling performance with incorrect but related variables.
After defining the simulated data values, we move on to our boosted tree models. The booster used for growing our tree are the gradient boosted tree (GBRT in Table 2) and dropouts meet multiple additive regression tree (DART in Table 2) (Rashmi and Gilad-Bachrach, 2015). As suggested in Friedman et al. (2000) and Zhang et al. (2005), a small learning rate is preferred in boosted model fitting. For our simulation, we fix the learning rate γ = 0.01. In this scenario, the trees previously added to the boosted model are more important. This overfits the model with trees added earlier.
In iterations of DART model training, some trees are dropped randomly to avoid overfitting while in traditional GBRT, all trees are kept once they join a model. We compare these two algorithms in the following simulations. The tuning of hyperparameters is tricky, especially for a complex system such as a boosted tree with many parameters. The purpose of our simulation is to compare the performance of boosted trees under a series of conditions and thus we choose 10 initial hyperparameter values of (γ, λ, α) and conduct Bayesian optimization iterations 20 times for each fitting. We use the average 10-fold cross-validation error score for model evaluation. An early stopping criteria is set: if the average 10-fold cross-validation error score has not improved after five training iterations, no new trees are added. The maximum depth of each tree is fixed at 2, 3, 4, 5, and 6. The minimum weight of each single leaf is 10. Note that 1. We can increase the number of initial (γ, λ, α) hyperparameter values and Bayesian optimization iterations to find new (γ, λ, α) that can improve the performance of the final model. However, this is not within the scope of our simulation tests. 2. For real data, we can choose a larger number for the maximum depth of a boosted tree model. Here, we use only 2 to 6 because this is a simple simulation test.
For each combination of the above four models, two boosters, and five maximum tree depths, 100 boosted tree models are fitted. Table 2 lists the means and standard errors of 100 10-fold cross-validation error scores. We use the same design matrices X 0 1 , ..., X 0 100 for generating transformed matrices. For each model structure, X 1 , ..., X 100 are created according to our definitions. Among all four model structures, the main-effect model, which is the correct variable model, has the lowest cross-validation prediction error. The incorrect but related models, ranked from the lowest to highest cross-validation prediction error, are interaction, exponential, and then quadratic. This is reasonable since the structure of tree models is formed naturally by interaction among variables. From the standard errors of means, we conclude that predictions for the main-effect and interaction models are more stable than predictions for the exponential and quadratic models. The results indicate that although prediction ability differs by the various true models, it is realistic to use incorrect but related variables in boosted tree modeling. As for maximum tree depth, since the correlation between two variables is chosen as 0.1 d , where d is the index difference of each pair of variables, the correlation decreases rapidly as we choose two variables that are farther away from each other. This is why, in Table 2, the cross-validation error score is the lowest when depth = 2. This implies that prior knowledge of the strength of variable correlation is important, as we need to choose an appropriate value for tree depth. Finally, under the conditions of this simulation, there is no obvious difference between GBRT and DART boosters. Figure 2 shows examples of boosted tree iteration steps for all models and boosters.

Power and Type I Error Control of Knockoff Variables
As proposed in Section 2.4, there are three strategies to generate knockoff variables: Gaussian assumption with shrunk covariance matrix, Gaussian assumption with sparse covariance matrix, and permuted residuals from principal component regression. We are interested in evaluating the power and type I error performance using different knockoff variables for testing. We apply the MAAC metric, which is defined in Section 2.4, for power evaluation. A larger positive MAAC value shows that two tested matrices have less related column space, so the difference between a real signal and its knockoff is more significant. As for type I errors, we use KMMD to test if row vectors in the original design matrix X and knockoff matrix Z are drawn from the same distribution. A larger positive test statistic implies that it is more likely to reject the null hypothesis, which assumes that these two row vectors are from the same distribution. If knockoff variables are highly different from the original variables, this leads to large type I errors. In other words, we want knockoff variables to be drawn from the same distribution as the original variables, but with sufficiently different values. As always, there is a tradeoff between power and type I error control.  We first assume the row vector in design matrix X is from a normal distribution. We simulate n = 100 samples and p = 500 variables. Each row vector in design matrix X n×p is generated as x i ∼ N p (0, Σ) with block dependence structure matrix Σ = diag(Σ 1 , Σ 2 , Σ 3 , ...), where each Σ i is a πp × πp matrix with matrix element σ j,k = (ρ |j−k| ). We set π = 0.01 and ρ = 0.1. In Figure 3, we simulate 50 original design matrices and generate 1, 000 corresponding knockoff matrices for each original design matrix. Four kinds of knockoffs are created for comparison: shrunk Gaussian, sparse Gaussian, permuted principal component 10, and permuted principal component 30. The y-axis is calculated as the mean value of each 1, 000 knockoff samples from one original matrix. The x-axis is the seed used to sample the original matrix. Therefore, y values of different methods at the same x location are comparable. It is clear from Figure  3 (a) and (c) that all methods are consistent under normal assumption. The MAAC values in Figure 3 (a) show that the sparse method provides the most similar knockoffs, which also means they are the least powerful. On the other hand, the shrunk method has the most different knockoffs. Principal component methods provide knockoffs in between. This makes sense since the shrunk Gaussian method directly estimate the covariance matrix of X and shrinks it to the identity matrix if it is not definitely positive, while the sparse Gaussian method has zeros in the estimated matrix given its sparse assumption. In Figure 3 (c), the KMMD plot has the same order as MAAC. However, the sparse method has the lowest type I error while the shrunk method has the highest.
Since both the shrunk Gaussian and sparse Gaussian methods assume X follows a normal distribution, it is interesting to see what would happen if a design matrix does not follow a normal distribution. Figure 3 (b) and (d) show that the sparse Gaussian method is not as consistent for a Poisson distribution as it is for normal distribution. The principal component method is consistent, however, because it does not depend on the normal assumption. Despite some exceptions, for most design matrices, the y-axis order of these four curves is the same as for normal distribution. In summary, the sparse Gaussian method is the most conservative, the shrunk Gaussian method is the most powerful, and the principal component method lies in between. When we increase the number of principal components, the performance of the generated knockoff is similar to the performance of the shrunk method. Table 3: Mean and standard error of signal percentage and ranking ratio (RR) from 1, 000 simulation tests for ranking test statistics in main-effect models, where n = 100, p = 500.

Comparison of Variable Importance Ranking Statistics
For each variable in a tree model, feature importance can be used to show its importance in the model. As discussed in Section 2.5, a few test statistics can represent this importance. In this section, we conduct simulation tests comparing gain, cover, frequency, SHAP, and Saabas. Since we are only interested in the ranking ability of these test statistics for selected variables, not the performance of variable selection, which is conducted by tree growing itself, we define a score called the ranking ratio (RR) as RR = # Noise variables ranked above the last signal in ranking # Noise variables .
The RR contains two parts: the number of noise variables selected by the boosted tree and ranked above the lowest ranked signal among all selected signals, and the total number of mis-selected noise variables. This score, which ranges within (0, 1), indicates how many false positive decisions must be made, if, for a given set of variables chosen during tree growing, we want to include all signals for variable selection. The score is undefined if no signal or noise are selected in a tree model. It evaluates the ranking ability of each variable importance ranking statistic. A small RR score indicates better ranking ability of variable importance. We keep the model described in Section 3.2 and simulate n = 100 samples and p = 500 variables. Each row vector in design matrix X n×p is generated as x i ∼ N p (0, Σ), with block dependence structure matrix Σ = diag(Σ 1 , Σ 2 , Σ 3 , ...), where each Σ i is a πp × πp matrix with matrix element σ j,k = (ρ |j−k| ). We set π = 0.01 and ρ = 0.1. Following tree model hyperparameters in Section 3.1, the maximum depth of each tree is fixed at 6. We use two boosters, GBRT and DART, four model structures, and two signal strengths. In Table 3, the correct variables are used for modeling. Each mean and standard error are calculated from 1, 000 simulations. In each signal strength-booster combination, means for the test statistic are close, and the standard errors are almost the same. This indicates these variable importance statistics have similar consistency even under different signal strengths and boosters. There is no obvious difference between the two boosters. Among the five variable importance statistics, frequency has the weakest ranking ability given its highest RR score, regardless of strength or booster. In Tables 4, 5, and 6, incorrect but related variables are used for modeling. For these models, cover has the best performance as it returns the lowest RR scores in most cases. Note that we are only comparing five variable importance statistics under each combination of model structure, strength, and booster (i.e., each row in Tables 3 to 6). Different rows have different boosted tree models and are thus incomparable.

SEAD, Other Knockoff Test Statistics, and Their False Discovery Control
In this final subsection describing the simulation, we combine all of the previous steps and perform the whole framework of the knockoff boosted tree (KOBT) algorithm. We are interested in the power and FDR of different combinations of ranking statistics and knockoff types. We simulate n = 100 samples and p = 500 variables. Each row vector in design matrix X n×p is generated as x i ∼ N p (0, Σ) or x i ∼ Poisson p (5, Σ) with block dependence structure matrix  Table 6: Mean and standard error of signal percentage and ranking ratio (RR) from 1, 000 simulation tests for ranking test statistics in second-order models, where n = 100, p = 500.

Strength Booster
where each Σ i is a πp × πp matrix with matrix element σ j,k = (ρ |j−k| ). We set π = 0.01 and ρ = 0.1. The maximum depth of each tree is fixed at 6. For simplicity, we choose GBRT as the booster and the main-effect model as the true model. In Tables 7 to 10, power and FDR for different combinations of knockoff types and ranking statistics are listed for comparison. We choose a condition where signals are sparse and weak so that the performance of each method is considerably distinct.
In Table 7, 50 normally distributed design matrices are simulated, each with 1, 000 shrunk Gaussian knockoffs, 1, 000 sparse Gaussian knockoffs, 1, 000 10-principal component knockoffs, and 1, 000 30-principal component knockoffs. We set the targeted FDR as 0.1. The means and corresponding standard errors of power for each combination are listed. Among these four types of knockoffs, the shrunk Gaussian knockoff has the highest power, which is the same conclusion reached in Figure 3 (a). This is followed by 10and 30-principal component knockoffs. The sparse Gaussian knockoff has the lowest power. Among the five importance statistics we test, frequency is always the most powerful statistic for all knockoffs, while gain is the least powerful statistic. The other three are moderately powerful and fall between frequency and gain. Table 8 shows the means and corresponding standard errors of the FDR for each combination.
Since the targeted FDR is 0.1, it is clear that the sparse Gaussian knockoff is the most conservative method and is the only one that can ensure the FDR stays under the targeted level. At the same order of power, the shrunk Gaussian knockoff has the highest FDR, followed by the two principal component knockoffs. Table 9 shows the power of the statistics when the normality assumption is invalid. Fifty Poisson-distributed design matrices are simulated, each with 1, 000 shrunk Gaussian knockoffs, 1, 000 sparse Gaussian knockoffs, 1, 000 10principal component knockoffs, and 1, 000 30-principal component knockoffs. Again, the targeted FDR is set as 0.1. The power of the knockoffs is the same as for normal distribution. For the importance statistics, the order for Poisson-distributed matrices is different from that for normally distributed matrices, where cover has the highest power, followed by frequency and then the other three. Both the sparse Gaussian and 30-principal component knockoffs can control the FDR to stay close to the targeted level. The shrunk Gaussian and 10-principal component knockoffs have higher FDRs. The order of importance statistics for FDR is the same as their power ranking. In summary, the ranges in tables below show the trends of knockoffs and importance statistics.

More Conservative More Powerful Sparse Gaussian 30-Principal Component 10-Principal Component Shrunk Gaussian
More Conservative More Powerful Normal Gain Cover, Saabas, and SHAP Frequency Non-Normal Cover Frequency Gain, Saabas, and SHAP Table 7: Mean and standard error of power from 50 normal distributed simulation tests for ranking test statistics and knockoff types, where n = 100, p = 500, π = 0.04, signal strength = 1.5, design matrix variance σ 2 = 1, and q = 1000.
Sparse Gaussian Shrunk Gaussian 10-Principal Component 30-Principal Component   Tumor sample purity refers to the percentage of cancer cells in a tumor tissue sample. It is used to discover the roles of cancerous and non-cancerous cells in the tumour microenvironment, which mainly comprises immune cells (Aran et al., 2015;Turley et al., 2015). To estimate tumor purity, Carter et al. (2012) proposed ABSOLUTE, a method used to perform analysis of somatic DNA alterations. In addition, it has been reported that DNA methylation data and expression data from selected stromal genes (Houseman et al., 2012;Yoshihara et al., 2013) have been successfully used for estimation. Recently, Aran et al. (2015) and Li et al. (2019) used RNA-seq gene expression data to determine tumor purity and found several gene signatures of individual cancer types. As it has been shown reasonable to estimate tumor sample purity using gene expression data, we apply our KOBT algorithm so that we can compare our results with those of previous reports. Here, our interest is not the accuracy of estimation but the selection of genes that are important in estimation with no topology assumptions. If a gene's expression level is positively correlated with tumor purity, then it is highly likely that this gene is expressed primarily by cancer cells in tumor samples.
Processed TCGA RNA-seq gene expression data was downloaded from the Pan-Cancer Atlas Publication website (Hoadley et al., 2018). Among all 33 available tumor types, breast invasive carcinoma (BRCA) and skin cutaneous melanoma (SKCM) were chosen for tumor purity estimation. There are 1, 017 BRCA and 459 SKCM samples with 17, 176 expressed genes for each sample. The genes with missing values or zero variance were filtered out. We used the tumor purity estimates in Hoadley et al. (2018) as the responses, which were obtained using ABSOLUTE. The responses are thus bounded between 0 and 1.
To reduce the original dimensionality to a lower value, we apply Lasso (Tibshirani, 1996) on original data sets before performing our KOBT steps. After tuning the penalty terms in Lasso objective functions, 486 and 496 genes of BRCA and SKCM respectively are selected for further analysis. We generated q = 300 knockoffs for each gene. Here, we compare two knockoff types: shrunk Gaussian and 10-principal component knockoffs. We conducted Bayesian optimization and 10-fold cross-validation to choose the optimal parameters for the boosted tree algorithm (see Section 6.1 for details of the parameters). We used the SHAP for importance evaluation. Therefore, our test statistic is the average difference of SHAP values between each gene and its own knockoff, as shown in Equation 26. Finally, FDR control is applied to these test statistics of genes with a targeted FDR level of 10%. Table 11 reports the selected genes whose expression is related to BRCA tumor purity, where the targeted FDR is 0.1. Genes are selected using the Shrunk Gaussian knockoff. Genes selected by the 10-principal component knockoff are listed in Table 16. Among all detected genes, CSF2RB (colony stimulating factor 2 receptor beta common subunit), C1S (complement C1s), CCDC69 (coiled-coil domain containing 69), and FGR are also reported among the top 10-ranked important genes for pan-cancer tumor purity prediction in Li et al. (2019). CSF2RB is an immune-related gene. It has been reported that in BRCA, the high expression of CSF2RB is positively correlated with patient survival (Liu et al., 2019). ESTIMATE (Yoshihara et al., 2013) uses stromal and immune gene expression to predict tumor purity. Our detected immune genes CCDC69, FGR, and IL7R are included in Yoshihara et al. (2013). The box plots of gene expression levels for selected genes are presented in Figure 4. These genes are selected because they are detected by both types of knockoffs. Other genes are plotted in Figure 8. All samples are grouped according to their purity: samples with the top 1/3 purity are labeled as high (blue); and samples with the bottom 1/3 purity are labeled as low (yellow). The non-parametric Wilcoxon signed-rank test is conducted for each low-high pair. Their corresponding p-values are included at the top of the plot. It is shown that genes expression levels of almost all selected genes are highly correlated with tumor purity. Samples are grouped according to their gene expression levels. The top 1/3 samples are grouped into the high-level group, and the bottom 1/3 samples are grouped into the low-level group. Survival analysis has been conducted for these genes. Among them, CCDC69, CXORF65, and FGR have significant p-values for coefficients in Cox regression. Plots of the Kaplan-Meier estimators and p-values are in Figure 10.  Figure 5. Other genes are plotted in Figure 9. Same as it is for BRCA, samples with the top 1/3 purity are labeled as high (blue); and samples with the bottom 1/3 purity are labeled as low (yellow). Related p-values from Wilcoxon signed-rank tests are attached to the plots. We see that genes expression levels of almost all selected genes are highly correlated with tumor purity. All samples are grouped according to their gene expression levels. Plots of their Kaplan-Meier estimators and p-values are in Figure 11 and 12. In summary, for tumor purity estimation, we propose that our KOBT algorithm can detect a few genes that are largely expressed in stromal cells. Our results reproduce previously reported work using different algorithms.

Tumor Type Classification
For application of KOBT on classification, we focus on its performance of assigning tumors to known classes and identifying those genes whose expression levels are related to the classification. In 1999, Golub et al. (1999) used gene expression monitoring by DNA microarrays to conduct cancer classification. Their test on classifying acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) showed that it was realistic to distinguish different cancer subtypes only based on gene expression data. Li et al. (2017) performed a pan-cancer classification with RNA-seq data from TCGA using boosted tree and k-nearest neighbours methods. Since KOBT algorithm can detect signals with false discovery rate control, we would like to see how it works on gene selection for tumor type classification.  The TCGA RNA-seq gene expression data is the same as in Subsection 4.1. Instead of a pan-cancer classification, we focus on two challenging binary classifications for (1) esophageal carcinoma (ESCA) vs stomach adenocarcinoma (STAD) and (2) rectum adenocarcinoma (READ) vs colon adenocarcinoma (COAD). We choose these two pairs of cancers because of their similarities. Esophageal carcinoma (ESCA) and stomach adenocarcinoma (STAD) are both malignant tumors in the digestive tract. In Li et al. (2017), it has been reported that almost all READ samples were mis-assigned to COAD. We have 499 samples for the ESCA vs STAD classification test and 529 samples for the READ vs COAD one. The steps are the same as in Subsection 4.1 except that the responses are binary now: 0 for one cancer and 1 for another cancer.
In Table 13, the selected genes that can distinguish ESCA and STAD are presented. Among them, BARX1 is a stomach mesenchymal transcription factor. Kim et al. (2005) has shown that BARX1 loss in the mesenchyme prevents stomach epithelial differentiation of overlying endoderm and induces intestine-specific genes instead. Their results defined a transcriptional and signaling pathway of inductive cell interactions in vertebrate organogenesis. Kim et al. (2011) proved that BARX1 controls mouse stomach morphogenesis and is required to specify stomach-specific epithelium in adjacent endoderm. HAND2 is an RNA Gene, which is affiliated with the long non-coding RNA class. Tsubokawa et al. (2002) stated that KRT14 is often expressed by tumor cells in the trabecular nests of the primary carcinoma. The box plots of gene expression levels for all detected genes are presented in Figure 6. All samples are grouped according to their cancer types: ESCA samples are in blue, and STAD samples are in yellow. The non-parametric Wilcoxon signed-rank test is conducted for each gene. Their corresponding p-values are included at the top of the plot. It is shown that genes expression levels of all selected genes are highly correlated with the cancer type.
In Table 14, we show the detected genes that can used for READ and COAD classification. HOXC4 and HOXC8 are members of Homeobox genes, which is a large family of transcription factors that direct the formation of many body structures during early embryonic development. It has been observed that HOXC family gene expression is upregulated in most solid tumor types (Bhatlekar et al., 2014). The box plots of gene expression levels for all detected genes are presented in Figure 7. All samples are grouped according to their cancer types: COAD samples are in blue, and READ samples are in yellow. The non-parametric Wilcoxon signed-rank test is conducted for each gene. Their corresponding p-values are included at the top of the plot. It is shown that genes expression levels of all selected genes are highly correlated with the cancer type.   Unlike in tumor purity estimation, cancer genes play a more important role in tumor type classification. We detect some protein coding genes and find previous published literature to support our findings in both classification tests. Given the fact that the mechanism of cancers is complicated, and the cancer types we choose have been reported to be hard to distinguish, we show that our proposed KOBT algorithm to be robust in real data with unknown models.

Conclusions
In this article, we introduce a new model-free variable selection method, called knockoff boosted tree, KOBT. Given the nature of boosted tree models, no prior model topology knowledge is required. We extend the current application of knockoff methods to tree models. Meanwhile, we proposed two new sampling methods to generate knockoffs, the principal component construction knockoff and the sparse Gaussian knockoff. Unlike currently available methods, the principal component construction knockoff does not depend on Gaussian assumption of design matrix. To evaluate the power of our generated knockoffs in simulation tests, we define a new test statistic, mean absolute angle of columns, which stands for the average distance between column vectors in two matrices. We apply kernel maximum mean discrepancy to test whether our proposed knockoffs are drawn from the same distributions as the original vectors. In our boosted tree framework, we fix most hyperparameters at multiple reasonable levels and leave regularization parameters to be tuned by Bayesian optimization. We consider different importance scores in tree models. Combinations of different knockoffs and importance test statistics are listed in our results. We test our strategy on different unknown true models, including main effect, interaction, exponential, and second order models. Finally, we apply our algorithm on tumor purity estimation and tumor type classification using TCGA gene expression data. We see that KOBT performs well and our results match previously published literature. Finally, we provide an R package to implement our proposed approaches, which is available at https://cran.r-project.org/web/packages/KOBT/index.html.
6 Supplementary Information The early stopping rule is stopping adding new trees when average test loss in cross validation does not improve in five additional trees.