Should we really use graph neural networks for transcriptomic prediction?

Abstract The recent development of deep learning methods has undoubtedly led to great improvements in various machine learning tasks, especially prediction tasks. These methods have also been adapted to address various problems in bioinformatics, including automatic genome annotation, artificial genome generation, and phenotype prediction. In particular, a specific type of deep learning method, the graph neural network (GNN), has repeatedly been reported as a good candidate for predicting phenotypes from gene expression because of its ability to embed information on gene regulation or co-expression through the use of a gene network. However, to date, no complete and reproducible benchmark has been performed to analyze the trade-off between the cost and benefit of this approach compared to more standard (and simpler) machine learning methods. In this article, we provide such a benchmark, based on clear and comparable policies for evaluating the different methods on several datasets. Our conclusion is that GNNs rarely provide a real improvement in prediction performance, especially when compared to the computational effort required by these methods. Our findings on a limited but controlled simulated dataset show that this could be explained by the limited quality or predictive power of the input biological gene network itself.


DREAM5
Given network; simulated static expression data (obtained from the network); prediction of one gene's expression based on the expression of the other genes.
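For concreteness, the sketch below shows how such a leave-one-gene-out regression task can be set up; the toy data and names (expr, target_gene) are ours, not taken from the benchmark code.

```python
# Minimal sketch of the DREAM5-style task: predict one gene's expression from
# the expression of all other genes. Toy data; expr and target_gene are
# illustrative names, not from the benchmark code.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
expr = rng.normal(size=(100, 20))          # n = 100 samples, p = 20 genes
target_gene = 5                            # index of the gene to predict

X = np.delete(expr, target_gene, axis=1)   # expression of the other p - 1 genes
y = expr[:, target_gene]                   # expression of the target gene

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
print("held-out R^2:", model.score(X_te, y_te))
```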
1.2 Description of preprocessing performed on the different datasets

BreastCancer
In [1], the authors reported results based on "standardized" and "non-standardized" data for GNN and RF (the results of [2] seem to have been obtained from standardized data). If $(X_{ij})_{i=1,\dots,n,\ j=1,\dots,p}$ is the gene expression matrix for observation (patient) $i$ and variable (gene) $j$, "non-standardized" data correspond to subtracting the minimum expression from $X$:
$$\tilde{X}_{ij} = X_{ij} - \min_{i',j'} X_{i'j'},$$
such that all expressions lie in $[0, 8.35]$. "Standardized" data correspond to a centering and scaling of the original data:
$$\check{X}_{ij} = \frac{X_{ij} - \bar{X}_j}{\sigma_j}, \quad \text{with } \bar{X}_j = \frac{1}{n}\sum_{i=1}^{n} X_{ij} \text{ and } \sigma_j^2 = \frac{1}{n}\sum_{i=1}^{n} (X_{ij} - \bar{X}_j)^2. \tag{1}$$
Note that the random forest should give identical results using either $\tilde{X}$ or $\check{X}$, since both are linear transformations of the original variables and do not change the definition (variable and threshold) of the optimal split. Hence, the differences in performance reported between the two datasets in [1] are probably just the effect of the randomness of the method (when an identical random seed has not been set for the training on both datasets).
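The invariance argument above can be checked directly. The following sketch (toy data, not the BreastCancer dataset itself) fits a random forest with a fixed seed on the original, min-subtracted, and standardized versions of the same matrix and verifies that the predictions coincide.

```python
# Sketch illustrating the invariance argument: a random forest with a fixed
# seed produces identical predictions on X and on any per-feature (positive)
# affine transform of X, because the optimal split on each variable is
# unchanged. Toy data, not the BreastCancer dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_shifted = X - X.min()                           # "non-standardized" variant
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)   # "standardized" variant, Equation (1)

preds = [
    RandomForestClassifier(random_state=0).fit(Z, y).predict(Z)
    for Z in (X, X_shifted, X_scaled)
]
assert (preds[0] == preds[1]).all() and (preds[0] == preds[2]).all()
print("identical predictions on all three encodings")
```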

CancerType
No further preprocessing of the data (compared to the available dataset) was performed, either in [3] or in our experiments.

F1000
No further preprocessing of the data (compared to the dataset sent by the authors) was performed in our experiments.

Simulated
Compared with the output of the sismonr package, inputs and outputs were centered and scaled to unit variance, as described in Equation (1).
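As an illustration, Equation (1) applied column-wise to both inputs and outputs might look as follows; the standardize helper and the toy matrices are ours, the real data being produced by sismonr.

```python
# Sketch of the preprocessing in Equation (1), applied to both inputs and
# outputs of the simulated data: each column is centered and scaled to unit
# variance. Variable names are illustrative.
import numpy as np

def standardize(M):
    """Center each column of M and scale it to unit variance (Equation (1))."""
    return (M - M.mean(axis=0)) / M.std(axis=0)

# Toy stand-ins for the sismonr inputs/outputs (the real data comes from the package).
rng = np.random.default_rng(0)
X = rng.normal(2.0, 3.0, size=(50, 8))
Y = rng.normal(1.0, 2.0, size=(50, 3))
X_std, Y_std = standardize(X), standardize(Y)
assert np.allclose(X_std.mean(axis=0), 0) and np.allclose(X_std.std(axis=0), 1)
```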

DREAM5
Compared with the original dataset, inputs and outputs were centered and scaled to unit variance, as described in Equation (1).

Table S2: GNN: Chosen hyper-parameters for the different datasets. When the number of graph convolutional (GC) or dense layers is greater than one, brackets are used to indicate the value of the hyper-parameter for each layer.

Table S5: Perceptron and Random forest: Hyper-parameters as set in the original article [4] (empty fields indicate that the default value has been used).

3.3.9 Computational efficiency (time and memory) for full + primary site

3.1.8
Figure S8: BreastCancer. Unscaled data: Fit time with and without glmgraph (first row), prediction time (second row), and maximum or total memory load (in MiB), respectively for Python scripts (left) and R scripts (right).
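For reference, timings and peak memory of the kind reported in these figures can be obtained on the Python side with standard library tools; the sketch below is written under our own assumptions and does not reproduce the benchmark's actual instrumentation.

```python
# Sketch of how fit/prediction time and maximum memory load can be measured
# on the Python side; the actual benchmark tooling may differ.
import resource
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=50, random_state=0)

t0 = time.perf_counter()
model = RandomForestClassifier(random_state=0).fit(X, y)
fit_time = time.perf_counter() - t0

t0 = time.perf_counter()
model.predict(X)
pred_time = time.perf_counter() - t0

# Peak resident set size of the process (POSIX only; ru_maxrss is in KiB on Linux).
peak_mib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
print(f"fit: {fit_time:.2f} s, predict: {pred_time:.2f} s, peak memory: {peak_mib:.1f} MiB")
```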

3.2.3
Figure S12: CancerType. PPI+singleton data: Fit and prediction times (first row), effect of the implementation on fit and prediction times (second row), maximum memory load (in MiB) for Python scripts (third row), and total memory load (in MiB) for R scripts (fourth row).

Figure S21: F1000. Subtype data: Fit and prediction times (first row), effect of the implementation on fit and prediction times (second row), maximum memory load (in MiB) for Python scripts (third row), and total memory load (in MiB) for R scripts (fourth row).

Figure S24: F1000. Primary site data: Fit and prediction times (first row), effect of the implementation on fit and prediction times (second row), maximum memory load (in MiB) for Python scripts (third row), and total memory load (in MiB) for R scripts (fourth row).

Figure S27: F1000. MOA data: Fit and prediction times (first row), effect of the implementation on fit and prediction times (second row), maximum memory load (in MiB) for Python scripts (third row), and total memory load (in MiB) for R scripts (fourth row).

3.4.4
Figure S31: Simulated. Fit and prediction times (first row) and maximum or total memory load (in MiB), respectively for Python scripts (left) and R scripts (right, with varying input graphs).

3.4.5
Figure S32: Simulated. Fit and prediction times for varying implementations (first row) and varying graphs (second row). Maximum memory load (in MiB) for Python scripts (third row).

3.5.3
Figure S35: DREAM5. Scaled data: Fit and prediction times (first row), maximum or total memory load (second row; in MiB), respectively for Python scripts (left) and R scripts (right), effect of the implementation on fit and prediction times (third row), and on the maximum memory load for Python scripts (fourth row).

Table S1: Description of datasets.

Table S3: GNNo: Chosen hyper-parameters for the different datasets.

Table S4: Perceptron, Random forest, SVM, and glmgraph (whenever relevant): Chosen hyper-parameters for the different datasets. Unspecified hyper-parameters were set to their default values.