Graph convolutional network-based feature selection for high-dimensional and low-sample size data

Abstract

Motivation: Feature selection is a powerful dimension reduction technique which selects a subset of relevant features for model construction. Numerous feature selection methods have been proposed, but most of them fail under the high-dimensional and low-sample size (HDLSS) setting due to the challenge of overfitting.

Results: We present a deep learning-based method—GRAph Convolutional nEtwork feature Selector (GRACES)—to select important features for HDLSS data. GRACES exploits latent relations between samples with various overfitting-reducing techniques to iteratively find a set of optimal features which gives rise to the greatest decreases in the optimization loss. We demonstrate that GRACES significantly outperforms other feature selection methods on both synthetic and real-world datasets.

Availability and implementation: The source code is publicly available at https://github.com/canc1993/graces.


Introduction
Many biological data representations are naturally high-dimensional and low-sample size (HDLSS) [1,2,3,4,5]. RNA sequencing (RNA-Seq) is a next-generation sequencing technique that reveals the presence and quantity of RNA in a biological sample at a given moment [6]. RNA-Seq datasets often contain a huge number of features (e.g., ≥10^5), while the number of samples is very small (e.g., ≤10^3). Analyzing RNA-Seq data is crucial for various disciplines in the biomedical sciences, such as disease diagnosis and drug development [4,5]. However, many machine learning tasks, including feature selection, are likely to fail on such data due to the challenge of overfitting.
A useful technique for dealing with high-dimensional data is feature selection, which aims to select an optimal subset of features. Although the selection of an optimal subset of features is an NP-hard problem [7], various approximate feature selection methods have been proposed. While feature selection methods are often grouped into filter, wrapper, and embedded methods [8], in this work we classify them into five categories based on their learning schemes—statistics-based [9,10], Lasso-based [11,12], decision tree-based [9,13], deep learning-based [14,15], and greedy methods [16]; see details in Section 2. Note that most of these methods address the curse of dimensionality under the blessing of a large sample size [15]. Only a few of them can handle HDLSS data. The state-of-the-art feature selection methods for HDLSS data are Hilbert-Schmidt independence criterion (HSIC) Lasso [12,17] and deep neural pursuit (DNP) [15].
In this paper, we propose a graph neural network-based feature selection method—GRAph Convolutional nEtwork feature Selector (GRACES)—which extracts features for HDLSS data by exploiting the latent relations between samples. Inspired by DNP, GRACES is a deep learning-based method that iteratively finds a set of optimal features. GRACES utilizes various overfitting-reducing techniques, including multiple dropouts, the introduction of Gaussian noise, and F-correction, to ensure the robustness of feature selection. We demonstrate that GRACES outperforms HSIC Lasso and DNP (and other baseline methods) on both synthetic and real-world datasets.
The paper is organized into five sections. We perform a thorough literature review on feature selection (including traditional and HDLSS feature selection methods) in Section 2. The main architecture of GRACES is presented in Section 3. We evaluate the performance of GRACES along with several representative methods on both synthetic and real-world datasets in Section 4. Finally, we discuss the computational costs of the methods and conclude with future research directions in Section 5.

Related Work
Univariate statistical tests have been widely applied for feature selection [9,10]. Their computational advantage allows them to perform feature selection on extremely high-dimensional data. The ANOVA (analysis of variance) F-test [18] is one of the most commonly used statistical methods for feature selection. The value of the F-statistic is used as a ranking score for each feature: the higher the F-statistic, the more important the corresponding feature [9]. Other classical statistical methods, including Student's t-test [19], the Pearson correlation test [20], the Chi-squared test [21], the Kolmogorov-Smirnov test [22], Wilks' lambda test [23], and the Wilcoxon signed-rank test [24], can be applied for feature selection in a similar manner. Empirically, the ANOVA F-test is able to achieve relatively good performance in feature selection on some HDLSS data with very low computational cost.

L1-regularization, also known as the least absolute shrinkage and selection operator (Lasso), has a powerful built-in feature selection capability for HDLSS data [11]. Lasso assumes a linear dependency between input features and outputs, penalizing the ℓ1-norm of the feature weights. Lasso produces a sparse solution in which the weights of irrelevant features are zero. Yet, Lasso fails to capture nonlinear dependency. Therefore, kernel-based Lasso methods such as HSIC Lasso [12,17] have been developed for nonlinear feature selection on HDLSS data. HSIC Lasso utilizes the empirical HSIC [25] to find non-redundant features with strong dependence on the outputs.
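As a minimal sketch of the filter-style ANOVA F-test ranking described above, scikit-learn's `f_classif` computes the one-way ANOVA F-statistic between each feature and the class labels; the toy data below (one planted informative feature) is purely illustrative:

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
n, p = 40, 200                    # HDLSS-like: few samples, many features
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, p))
X[:, 0] += 3.0 * y                # make feature 0 strongly class-dependent

F, pval = f_classif(X, y)         # one F-statistic and p-value per feature
ranking = np.argsort(-F)          # higher F-statistic = more important
top10 = ranking[:10]
```

Because each feature is tested independently, the whole ranking costs only one pass over the data, which is why this filter remains feasible at p ≥ 10^5.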
Decision tree-based methods, which can model nonlinear input-output relations, are also popular for feature selection [9]. As an ensemble of decision trees, random forests (RF) [32] calculate the importance of a feature based on its ability to increase the purity of the leaves in each tree. A higher increase in leaf purity indicates higher importance of the feature. In addition, gradient boosted feature selection (GBFS) selects features by penalizing the usage of features that are not used in the construction of each tree [13]. However, decision tree-based feature selection methods such as RF and GBFS require a large sample size for training. Hence, these methods often do not perform well under the HDLSS setting.
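A hedged sketch of the impurity-based RF ranking described above (the data and the planted informative feature are illustrative, not from the paper):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n, p = 100, 50
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, p))
X[:, 3] += 2.5 * y                # plant one informative feature

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = rf.feature_importances_   # mean decrease in impurity per feature
best = int(np.argmax(importances))
```

Note that the example uses a comfortable sample size (n = 100); with the HDLSS regime discussed above (n in the tens, p in the thousands), these impurity estimates become much noisier.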
Numerous deep learning-based methods have been proposed for feature selection [14,33,34,35,36,37,38,39]. Like decision tree-based methods, deep neural networks also require a large number of samples for training, so these methods often fail on HDLSS data. Nevertheless, there are several deep learning-based feature selection methods designed specifically for HDLSS data [15,40]. DNP learns features by using a multilayer perceptron (MLP) and incrementally adds them in a nonlinear way through the multiple dropout technique [15]. DNP overcomes the overfitting caused by the low sample size and outperforms other methods such as LR Lasso, HSIC Lasso, and GBFS on HDLSS data. An alternative to DNP that replaces the MLP with a recurrent neural network is mentioned in [41]. Yet, DNP only uses an MLP to generate low-dimensional representations, which fails to capture the complex latent relationships between samples. Moreover, deep feature screening incorporates a neural network for learning low-dimensional representations and a multivariate rank distance correlation measure (applied on the low-dimensional representations) for feature screening [40]. However, the effectiveness of this method needs further investigation.
Other frequently used feature selection methods include recursive feature elimination [42] and sequential feature selection [16]. The former recursively considers smaller and smaller sets of features based on the feature importance obtained by training a classifier. The latter is a greedy algorithm that adds (forward selection) or removes (backward selection) features based on the cross-validation score of a classifier. However, both methods are computationally expensive and become infeasible when dealing with HDLSS data.
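Greedy forward selection can be sketched with scikit-learn's `SequentialFeatureSelector` on a small toy problem (all names and data below are illustrative). Selecting k features out of p refits the classifier on the order of p × k × folds times, which makes the cost prohibitive once p reaches the tens of thousands:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n, p = 60, 20                                  # deliberately tiny p
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, p))
X[:, 5] += 2.0 * y                             # plant one informative feature

# Forward selection: greedily add the feature that most improves CV score
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=3,
                                direction="forward", cv=3).fit(X, y)
selected = np.flatnonzero(sfs.get_support())
```

Even this toy run performs dozens of classifier fits; scaling p to 10^4 features makes the same loop computationally infeasible, as noted above.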

Method
GRACES is an iterative algorithm with five major components: feature initialization, graph construction, neural network, multiple dropouts, and gradient computation (Fig. 1). Motivated by DNP, GRACES aims to iteratively find a set of optimal features which gives rise to the greatest decreases in the optimization loss. For feature initialization, given a feature matrix X ∈ R^{n×p} with n ≪ p, we first introduce a bias feature (e.g., an all-one column) into X and index it by zero. The total number of features is now p + 1, and the original features keep the same index numbers as before. We initialize the selected feature set S = {0}, i.e., the bias feature. In other words, the bias feature serves as the initial selected feature to start the feature selection process.
For graph construction, we exploit the cosine similarity measure based on the selected features in S. Given two feature vectors x_i ∈ R^{|S|} and x_j ∈ R^{|S|} for samples i and j, the cosine similarity is defined as the cosine of the angle between them in Euclidean space, i.e.,

cos(x_i, x_j) = ⟨x_i, x_j⟩ / (‖x_i‖_2 ‖x_j‖_2).

Considering each sample as a node, we connect two nodes if their cosine similarity score is larger than a threshold δ (which is a hyperparameter of GRACES). The resulting similarity graph captures the latent interactions between samples and is used in the GCN layer. The similarity graph is different at each iteration, and other similarity measures, such as the Pearson correlation and the Chi-squared distance [43] (for discrete features), can also be used here.
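The graph-construction step above can be sketched as follows: restrict the samples to the currently selected features S, compute pairwise cosine similarities, and threshold at δ to form an unweighted adjacency matrix (function name and toy data are illustrative):

```python
import numpy as np

def build_similarity_graph(X, S, delta):
    """X: (n, p) feature matrix; S: indices of currently selected features."""
    Z = X[:, S]                                    # restrict to selected features
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    Z = Z / np.clip(norms, 1e-12, None)            # row-normalize (guard zeros)
    cos = Z @ Z.T                                  # pairwise cosine similarities
    A = (cos > delta).astype(int)                  # threshold at delta
    np.fill_diagonal(A, 0)                         # no self-loops
    return A

X = np.array([[1.0, 0.0, 5.0],
              [2.0, 0.1, -3.0],
              [-1.0, 0.0, 2.0]])
A = build_similarity_graph(X, S=[0, 1], delta=0.9)
```

Because S grows at every iteration, the graph is rebuilt each time, which is how the sample-sample topology adapts to the features selected so far.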
We build the neural network with three layers: an input linear layer, a GCN layer, and an output linear layer. In order to select features iteratively, we only need to consider weights along the dimensions corresponding to the selected features in the input weight matrix (in other words, for the non-selected features, the corresponding entries in the weight matrix must be zero) without a bias vector, i.e.,

h_j^(1) = W_input x_j,

where x_j ∈ R^{p+1} is the feature vector for sample j and W_input ∈ R^{h_1×(p+1)} is the learnable weight matrix (h_1 denotes the first hidden dimension) such that the (i+1)th column is a zero vector for i ∉ S. Subsequently, we utilize a classical GCN, GraphSAGE [44], to refine the embeddings based on the similarity graph constructed in the second step, i.e.,

h_j^(2) = ReLU( W_1 h_j^(1) + W_2 · mean_{k∈N(j)} h_k^(1) ),

where W_1 ∈ R^{h_2×h_1} and W_2 ∈ R^{h_2×h_1} are two learnable weight matrices (h_2 denotes the second hidden dimension), and N(j) denotes the neighborhood set of node j. GraphSAGE leverages node feature information to efficiently generate embeddings by sampling and aggregating features from a node's local neighborhood [44]. Finally, the refined embedding is fed into an output linear layer to produce probabilistic scores of the different classes for each sample, i.e.,

ŷ_j = softmax( W_output^⊤ h_j^(2) + b_output ),

where W_output ∈ R^{h_2×2} is a learnable weight matrix (assuming the labels are binary, i.e., label zero and label one) and b_output ∈ R^2 is the bias vector. We denote by ŷ ∈ R^n the predicted vector containing the probabilities of label one (the second entry in ŷ_j) for all samples.
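A NumPy sketch of this three-layer forward pass, under the dimensions stated above (W_input is h_1×(p+1), W_1 and W_2 are h_2×h_1, W_output is h_2×2); the mean-aggregator GraphSAGE variant, the ReLU nonlinearity, and all toy inputs are illustrative assumptions:

```python
import numpy as np

def forward(X, A, S, W_in, W1, W2, W_out, b_out):
    """X: (n, p+1) features, A: (n, n) adjacency, S: selected feature indices."""
    mask = np.zeros(W_in.shape[1])
    mask[list(S)] = 1.0                             # zero out non-selected columns
    H1 = X @ (W_in * mask).T                        # input linear layer, (n, h1)
    deg = np.clip(A.sum(axis=1, keepdims=True), 1, None)
    H_nbr = (A @ H1) / deg                          # mean over graph neighbors
    H2 = np.maximum(H1 @ W1.T + H_nbr @ W2.T, 0.0)  # GraphSAGE-style layer + ReLU
    logits = H2 @ W_out + b_out                     # output linear layer, (n, 2)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)         # softmax class probabilities

rng = np.random.default_rng(0)
n, p1, h1, h2 = 4, 6, 3, 3
X = rng.normal(size=(n, p1))
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]])
P = forward(X, A, S=[0, 2], W_in=rng.normal(size=(h1, p1)),
            W1=rng.normal(size=(h2, h1)), W2=rng.normal(size=(h2, h1)),
            W_out=rng.normal(size=(h2, 2)), b_out=np.zeros(2))
```

Masking the input weight columns (rather than slicing X) keeps every feature's gradient available, which is exactly what the selection step below relies on.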
To reduce the effect of high variance in the subsequent gradient computation, we adopt the same multiple-dropout strategy as proposed in [15]. After training the neural network based on the selected features, we randomly drop hidden neurons in the GCN layer and the output layer multiple times (the total number of dropouts m is a hyperparameter). In other words, we obtain multiple different dropout neural network models. The technique of multiple dropouts has proven to be stable and robust for deep learning-based feature selection under the HDLSS setting [15,41].
For gradient computation, we compute the gradient with respect to the input weight matrix for each dropout neural network model and take the mean, i.e.,

G = (1/m) Σ_{q=1}^{m} ∂L/∂W_input^(q),

where L is the optimization loss and W_input^(q) is the input weight matrix for the qth dropout model. Here we use the cross-entropy loss, i.e.,

L = −Σ_{j=1}^{n} [ y_j log ŷ_j + (1 − y_j) log(1 − ŷ_j) ],

where y_j and ŷ_j are the jth entries of y and ŷ, representing the true label and the predicted probability of label one for sample j, respectively. After obtaining the average gradient matrix G, the next selected feature is computed based on the magnitude of the column norms of G, i.e.,

S ← S ∪ { argmax_{j∉S} ‖g_j‖_2 },

where g_j is the jth column of G. The selected feature set is iteratively updated until the number of requested features is reached, and the final features selected by GRACES are given by S with the bias feature removed.
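The multiple-dropout averaging and the column-norm selection rule above can be sketched as follows. The per-dropout gradients here are stand-ins (random matrices with one dominant column), not real backpropagated gradients:

```python
import numpy as np

def next_feature(grads, S):
    """grads: (m, h1, p+1) input-weight gradients of the m dropout models."""
    G = np.mean(grads, axis=0)              # average over the m dropout models
    scores = np.linalg.norm(G, axis=0)      # ||g_j||_2 for each feature column j
    scores[list(S)] = -np.inf               # never re-select an already chosen feature
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
m, h1, p1 = 5, 8, 101                       # m dropout models, p + 1 feature columns
grads = 0.1 * rng.normal(size=(m, h1, p1))
grads[:, :, 42] += 1.0                      # pretend feature 42 dominates the loss
S = {0}                                     # bias feature is selected initially
j_star = next_feature(grads, S)
```

Averaging before taking norms is what suppresses the dropout-to-dropout variance: a column that is only occasionally large averages toward zero, while a consistently informative column survives.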
To further reduce the effect of overfitting due to the low sample size, we incorporate two additional strategies into GRACES. First, we consider introducing Gaussian noise to the weight matrices of the GCN layer, i.e., adding noise matrices generated from a Gaussian distribution with mean zero and variance σ^2 (which is a hyperparameter of GRACES) to W_1^(q) and W_2^(q) for the different dropout models in the gradient computation step. Studies have shown that the introduction of Gaussian noise can boost the stability and robustness of deep neural networks during training [45,46,47].
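A minimal sketch of this noise-injection step, perturbing a copy of a GCN weight matrix before a dropout model's gradient is computed (the function name and toy matrix are illustrative):

```python
import numpy as np

def perturb(W, sigma2, rng):
    """Return W plus zero-mean Gaussian noise with variance sigma2."""
    return W + rng.normal(0.0, np.sqrt(sigma2), size=W.shape)

rng = np.random.default_rng(0)
W1 = np.ones((4, 3))                      # stand-in for a GCN weight matrix
W1_noisy = perturb(W1, sigma2=0.01, rng=rng)
```

Each of the m dropout models receives an independent perturbation, so the averaged gradient reflects a small neighborhood of weight configurations rather than a single overfit one.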
Second, we consider correcting the feature scores (i.e., ‖g_j‖_2) by combining them with the ANOVA F-test, i.e., the final score for feature j is given by

s_j = (1 − α) g̃_j + α f̃_j,

where g̃_j is the normalized score computed from ‖g_j‖_2, f̃_j is the normalized score computed from the F-statistic, and α ∈ [0, 1] is the correction weight (which is a hyperparameter of GRACES).
Therefore, the selected feature set is updated as follows:

S ← S ∪ { argmax_{j∉S} s_j }.   (8)

The reasons we select the ANOVA F-test are: (1) it is computationally efficient; (2) it achieves relatively good performance in feature selection for some HDLSS data; (3) it does not suffer from overfitting, so including it can reduce the effect of overfitting in GRACES. These two overfitting-reducing strategies effectively improve the performance of GRACES on HDLSS data.
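The F-correction can be sketched as a convex blend of two normalized score vectors; the min-max normalization used here is an assumption (the paper does not spell out the normalization), and the toy scores are illustrative:

```python
import numpy as np

def normalize(v):
    """Min-max normalize a score vector to [0, 1]."""
    v = np.asarray(v, dtype=float)
    span = v.max() - v.min()
    return (v - v.min()) / span if span > 0 else np.zeros_like(v)

def corrected_scores(grad_norms, f_stats, alpha):
    """Blend gradient-based and F-test scores with correction weight alpha."""
    return (1.0 - alpha) * normalize(grad_norms) + alpha * normalize(f_stats)

s = corrected_scores([0.2, 0.9, 0.1], [5.0, 1.0, 3.0], alpha=0.5)
```

Setting α = 0 recovers the pure gradient score and α = 1 the pure F-test ranking, so α interpolates between the learned, overfitting-prone signal and the overfitting-free statistical one.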
Detailed steps of GRACES can be found in Algorithm 1. We list all the hyperparameters of GRACES in Table 1. Although GRACES is inspired by DNP, it differs from DNP in the following aspects: (1) GRACES constructs a dynamic similarity graph based on the selected features at each iteration; (2) GRACES exploits an advanced GCN (i.e., GraphSAGE) to refine sample embeddings according to the similarity graph, while DNP only uses an MLP, which fails to capture latent associations between samples; (3) in addition to the multiple dropouts proposed in DNP, GRACES utilizes further overfitting-reducing strategies, including the introduction of Gaussian noise and F-correction, to further improve the robustness of feature selection. In the following section, we will see that GRACES significantly outperforms DNP on both synthetic and real-world examples.

Experiments
We evaluated the performance of GRACES on both synthetic and real-world HDLSS datasets along with six representative feature selection methods: the ANOVA F-test [18], LR Lasso [31], HSIC Lasso [12], RF [32], CancelOut (a traditional deep learning-based feature selection method) [37], and DNP [15]. HSIC Lasso and DNP are recognized as the state-of-the-art methods for HDLSS feature selection. The reason we chose CancelOut is that it achieves relatively better performance than other deep learning-based methods (which are not designed specifically for HDLSS data). We did not compare with GBFS (due to its early stopping feature), deep feature screening (due to lack of code availability), or recursive feature elimination and sequential feature selection (due to infeasible computation). We used a support vector machine (SVM) as the final classifier and the area under the receiver operating characteristic curve (AUROC) as the evaluation metric. All the experiments presented were performed on a Macintosh machine with 32 GB RAM and an Apple M1 Pro chip in Python 3.9. The code of GRACES can be found at https://github.com/canc1993/graces.
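The evaluation protocol above (SVM on the selected features only, scored by AUROC) can be sketched as follows; the toy data and the pretend selected-feature indices are illustrative:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 120, 300
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, p))
X[:, :5] += 1.5 * y[:, None]                  # 5 informative features

selected = [0, 1, 2, 3, 4]                    # pretend a selector returned these
Xtr, Xte, ytr, yte = train_test_split(X[:, selected], y,
                                      test_size=0.3, random_state=0)
clf = SVC().fit(Xtr, ytr)                     # train only on selected features
auroc = roc_auc_score(yte, clf.decision_function(Xte))
```

Note that `decision_function` supplies the continuous scores AUROC expects; thresholded class predictions would discard the ranking information the metric measures.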

Synthetic Datasets
We used the scikit-learn function make_classification to generate synthetic data. The function creates clusters of points normally distributed about the vertices of a q-dimensional hypercube (q is the number of important features) and assigns an equal number of clusters to each class [48].
We set the number of samples to 60 and fixed the number of important features to 10. We varied the total number of features from 500 to 5000 and considered three synthetic datasets with easy, intermediate, and hard classification difficulty (controlled by the class_sep parameter). We randomly split each dataset into 70% training, 20% validation, and 10% testing with 20 replicates.
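One configuration of the synthetic setup above can be sketched as (the class_sep value shown is illustrative; the paper varies it across the easy, intermediate, and hard modes):

```python
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=60,        # low-sample size
                           n_features=500,      # varied 500-5000 in the paper
                           n_informative=10,    # the 10 important features
                           n_redundant=0,
                           shuffle=False,       # informative features come first
                           class_sep=1.0,       # lower values = harder problem
                           random_state=0)
```

With `shuffle=False` the informative features occupy the first 10 columns, which is what makes the "correction rate" of a selector directly checkable against ground truth.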
We performed a grid search to find the optimal key hyperparameters for each method. We reported the average test AUROC (over the 20 train-test splits) with respect to the total number of features. In the meantime, since we know the exact important features, we also reported the correction rate of the selected features during training.
The results are shown in Fig. 2. Clearly, GRACES achieves superb performance under all three modes. Notably, GRACES is able to capture more of the correct important features (i.e., its correction rate is significantly higher than that of the other methods), which leads to a better test AUROC.
Moreover, the performance of GRACES is remarkably stable as the total number of features increases (especially under the easy and intermediate modes). By contrast, the AUROC of the other methods (except DNP) fluctuates drastically. Under the easy mode, most of the methods (such as the ANOVA F-test, LR Lasso, and CancelOut) achieve comparable performance (i.e., AUROC > 90%) even though their correction rates are much lower than that of GRACES. Under the hard mode, however, these methods become ineffective (i.e., AUROC ≈ 50%). Finally, DNP achieves the second-best performance on the three synthetic datasets.

Real Datasets
We used the following six biological datasets:

- Colon: gene expression data from colon tumor patients and normal controls;
- Leukemia: gene expression data from acute lymphoblastic leukemia (ALL) patients and normal controls;
- ALLAML: gene expression data from acute lymphoblastic leukemia (ALL) patients and acute myeloid leukemia (AML) patients;
- GLI_85: gene expression data from glioma tumor patients and normal controls;
- Prostate_GE: gene expression data from prostate cancer patients and normal controls;
- SMK_CAN_187: gene expression data from smokers with lung cancer and smokers without lung cancer.
The statistics of the datasets are shown in Table 2. All the datasets can be downloaded from https://jundongl.github.io/scikit-feature/datasets.html.
We randomly split each dataset into 20% training, 50% validation, and 30% testing with 20 replicates. We chose such a low training proportion because a higher one would result in extremely high performance for every method (as can be seen in the DNP paper [15]). We performed a grid search to find the optimal key hyperparameters for each method. We reported the average test AUROC (over the 20 train-test splits) with respect to the number of selected features, from 1 to 10.
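The 20/50/30 train/validation/test split described above can be sketched with two chained `train_test_split` calls on a toy array (sizes chosen to make the proportions exact):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)
y = np.arange(100) % 2

# First carve out the 30% test portion...
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=30,
                                                  random_state=0)
# ...then split the remaining 70 samples into 20 train / 50 validation
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest,
                                                  test_size=50, random_state=0)
```

Passing integer `test_size` values avoids floating-point rounding when the target proportions (50/70 here) are not exact binary fractions.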
The results are shown in Fig. 3, where GRACES outperforms the other methods on all the datasets except SMK_CAN_187. In particular, on the GLI_85 and Prostate_GE datasets, the advantage of GRACES over the second-best method is statistically significant (p-value < 0.05, one-sample paired t-test on the total 200 data points). On the SMK_CAN_187 dataset, GRACES still achieves performance comparable to the well-performing methods. Moreover, the performance of GRACES is stable and robust across all the datasets, while the other methods (such as LR Lasso, HSIC Lasso, and DNP) fail on certain datasets (e.g., LR Lasso on ALLAML; HSIC Lasso on Colon; DNP on GLI_85); see Fig. 4 and Table 3. Combining the six datasets, the overall performance of GRACES is significantly better than that of all the other methods (p-value < 10^-6, one-sample paired t-test on the total 1200 data points). Surprisingly, the ANOVA F-test achieves relatively good and stable performance on the real-world datasets. RF and CancelOut, which are not suitable for HDLSS data, do not perform well. In summary, GRACES achieves comparable or improved performance over the baselines on the six biological datasets.

Discussion
Both the synthetic and real-world experiments provide compelling evidence that GRACES can achieve superb and stable performance on HDLSS datasets. We also computed the total computational time of each method on the six biological datasets with the number of selected features from 1 to 10; see Table 4. The ANOVA F-test is the most computationally efficient of the seven methods. On the other hand, GRACES requires more computational resources to find the optimal features due to its complex architecture. When the number of samples is small (e.g., Colon, Leukemia, ALLAML), the computational time of GRACES is still reasonable.
However, when the number of samples becomes large (e.g., SMK_CAN_187), the computational time increases drastically. Therefore, GRACES is only suitable for HDLSS data and cannot handle ordinary feature selection tasks with large sample sizes.
In this paper, we propose GRACES, a graph convolutional network-based model for feature selection on HDLSS data. By utilizing a GCN along with several overfitting-reducing strategies, including multiple dropouts, the introduction of Gaussian noise, and F-correction, GRACES achieves superior performance on both synthetic and real-world HDLSS datasets compared to other classical feature selection methods.
We plan to apply GRACES to more biological datasets that suffer from the HDLSS problem, such as different multi-omics data. It will be useful to investigate more sophisticated network architectures for learning the low-dimensional representations of the data. For example, hypergraph convolutional networks [49,50,51], which generalize GCNs, are able to exploit higher-order associations among samples, which might result in a more accurate representation of each sample. Further, additional overfitting-reducing techniques such as normalization can be considered.
Fig. 1: Workflow of GRACES. GRACES consists of feature initialization, which adds a bias feature serving as the initial selected feature; graph construction using cosine similarity on the selected features; a three-layer neural network with an input linear layer, a GraphSAGE layer, and an output linear layer (where gray disks represent the hidden neurons in the neural network); multiple dropouts on the hidden neurons for reducing the variance in the subsequent computation; and gradient computation (with the introduction of Gaussian noises or F-correction), which gives rise to the current optimal feature according to the gradient magnitude. Note that the activation of the hidden neurons in the input linear layer depends on S.

Algorithm 1: GRACES
1: Input: Feature matrix X ∈ R^{n×p}, label vector y ∈ R^n, number of requested features K, score threshold δ, hidden dimensions h_1 and h_2, learning rate l, number of dropouts m, Gaussian variance σ^2, and correction weight α
2: Introduce a bias feature into X and index it by 0
3: Initialize S = {0}
4: while |S| ≤ K + 1 do
5:   Construct a cosine similarity graph based on S with similarity score threshold δ
6:   Train a neural network on X and y with learning rate l, including an input layer (with W_input ∈ R^{h_1×(p+1)}), a GCN layer (with W_1, W_2 ∈ R^{h_2×h_1}), and an output layer (with W_output ∈ R^{h_2×2} and b_output ∈ R^2)
7:   Dropout m times in the GCN and output layers of the neural network
8:   Introduce Gaussian noises (generated from a Gaussian distribution with mean zero and variance σ^2) to the GCN layer
9:   Compute the average gradient regarding the input weight matrix
10:  Correct the feature scores by the ANOVA F-test with correction weight α
11:  Update the selected feature set by (8)
12: end while
13: Drop the bias feature (i.e., the first element) from S
14: Return: Selected feature set S

Figures

Fig. 2: Synthetic datasets. Average test AUROC and correction rate with respect to the total number of features for the easy, intermediate, and hard synthetic datasets. Error bars indicate the standard error of the mean.

Fig. 3: Real datasets. Average test AUROC with respect to the number of selected features for the six biological datasets.

Fig. 4: Boxplot of the overall AUROC mean ranking over the six datasets for all the feature selection methods.

Tables

Table 1: Hyperparameters of GRACES.

Table 2: Statistics of the real-world datasets.

Table 3: Overall AUROC means and rankings of the feature selection methods. We ranked the methods based on the overall AUROC.

Table 4: Total computational time (in seconds) of each method on the six biological datasets with the number of selected features from 1 to 10. We used the default hyperparameters of each method to obtain the computational time.