Integrating multiomics and prior knowledge: a study of the Graphnet penalty impact

Abstract

Motivation: In the field of oncology, statistical models are used to discover candidate factors that influence the development of a pathology or its outcome. These statistical models can be designed in a multiblock framework to study the relationships between different multiomic data, and variable selection is often achieved by imposing constraints on the model parameters. A priori graph constraints have been used in the literature to improve feature selection in the model, yielding more interpretability. However, it is still unclear how these graphs interact with the models and how they impact feature selection. Additionally, with the availability of different graphs encoding different information, one can wonder how the choice of graph meaningfully impacts the results obtained.

Results: We propose to study the impact of the graph penalty on a multiblock model, using SGCCA as the multiblock framework. We studied the effect of the penalty on the model using the TCGA-LGG dataset. Our findings are threefold. First, the graph penalty increases the number of genes selected from this dataset, while selecting genes already identified in other works as pertinent biomarkers of the pathology. Second, using different graphs leads to different though consistent results, with graph density being the main factor influencing the obtained results. Finally, the graph penalty increases both the performance of survival prediction from the model-derived components and the interpretability of the results.

Availability and implementation: Source code is freely available at https://github.com/neurospin/netSGCCA

We tested 12 different configurations, each consisting of a vector u_2 and a graph. These configurations allow us to assess the properties of the method in diverse situations. As a reminder, the aim of the simulation is not to compare the different graphs, since they are meant to encode a priori knowledge rather than be computed. Rather, we aim to assess the behaviour of the different graph types in the different cases.
Model performance was assessed using the correlation between the estimated components. Additionally, we computed the precision, recall and F1 metrics between the true u_j vectors and the weights w_j estimated by the model. For each configuration, we chose the hyper-parameter γ_G by running the model 20 times, with γ_G ranging from 10^-4 to 10^4 each time. The best γ_G was selected as the one with the best average F1 score, because our objective is mainly to recover the variables of interest. For each configuration, we also ran the model without a graph and compared the results. The sparsity value was fixed to √25 for the first block (resp. √20 for the second block) in all runs. Tables S1 and S2 show the results obtained on the simulated data. They show that the F1 scores are higher when the mean/variance ratio is high, meaning that the models using the graphs succeed in retrieving the underlying projector u_2. This was expected, since a low mean/variance ratio means noisier data.
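The γ_G selection procedure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `fit_model` is a hypothetical stand-in for a netSGCCA fit that returns the estimated weight vector of the block of interest, and the F1 score is computed on the supports (non-zero entries) of the true and estimated weights.

```python
import numpy as np

def f1(true_support, selected):
    """F1 score between two boolean support vectors."""
    tp = np.sum(true_support & selected)
    if tp == 0:
        return 0.0
    precision = tp / selected.sum()
    recall = tp / true_support.sum()
    return 2 * precision * recall / (precision + recall)

def select_gamma(fit_model, X_blocks, u_true, n_runs=20):
    """Pick gamma_G by best average F1 over repeated fits.

    `fit_model(X_blocks, gamma)` is a hypothetical stand-in for a
    netSGCCA fit returning the estimated weight vector.
    """
    gammas = np.logspace(-4, 4, 9)          # 10^-4 ... 10^4, as in the text
    true_support = u_true != 0
    best_gamma, best_f1 = None, -1.0
    for gamma in gammas:
        scores = [f1(true_support, fit_model(X_blocks, gamma) != 0)
                  for _ in range(n_runs)]
        if np.mean(scores) > best_f1:
            best_f1, best_gamma = float(np.mean(scores)), gamma
    return best_gamma, best_f1
```

In practice the model is stochastic across runs, hence the averaging over `n_runs` fits before comparing values of γ_G.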
Overall, the tables also show that netSGCCA outperformed SGCCA without a graph. When c = 2, SGCCA selected very few variables (about three), leading to high precision but very low recall. In contrast, the graph penalisation allowed the model to select more variables, retrieving all the variables of interest and achieving a high F1 score. However, this increase in F1 score came with a slight decrease, around 2%, in the correlation between the estimated components. When c = 0.5, the F1 score still improves with graph penalisation, though to a lesser degree; however, the correlation between the estimated components also increased, by 0.13 on average.
Comparing the graph types, we see that, when c = 2, the path graph recovers all the variables of interest perfectly, with an average F1 score of 1 in all cases. The star graph also achieved perfect recall, but with much lower precision, because it selects many more variables, about 45 on average. Since the sparsity level is the same for all configurations, the hub of the star graph seems to spread the weights onto its neighbours more than the path graph does. The correlations between estimated components are nonetheless comparable. When c = 0.5, the models seem to select variables randomly, as shown by an F1 score close to 0.2, and the weights do not resemble the original u_2. However, even this result is better than without the graph a priori, which selected only a couple of features and resulted in an F1 score close to 0. In this situation, by choosing a greater number of variables, the star graph achieved a better correlation score than the path graph. The union of the star and path graphs exhibited behaviour similar to those of the path and star graphs alone. Finally, the models with the complete graph always failed to outperform the other models. This result is expected, since the complete graph does not contribute any information.
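For illustration, the four graph types can be built as adjacency matrices. This is a sketch in numpy under assumed sizes, not the paper's exact simulation setup:

```python
import numpy as np

def path_adj(p):
    """Path graph: variable i linked to variable i+1."""
    A = np.zeros((p, p))
    i = np.arange(p - 1)
    A[i, i + 1] = A[i + 1, i] = 1
    return A

def star_adj(p):
    """Star graph: node 0 is the hub, linked to every other node."""
    A = np.zeros((p, p))
    A[0, 1:] = A[1:, 0] = 1
    return A

def complete_adj(p):
    """Complete graph: every pair linked (carries no information)."""
    return np.ones((p, p)) - np.eye(p)

p = 25                                             # illustrative block size
union = np.clip(path_adj(p) + star_adj(p), 0, 1)   # union of path and star edges
density = lambda A: A.sum() / (p * (p - 1))
```

The complete graph has density 1, while the path and star graphs both have density 2/p; this density gap is consistent with the finding that graph density is the main driver of the differing behaviours.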
If we fix the graph and the mean-to-variance ratio, for all the u_2 cases considered, we observe no significant difference in the precision of the variable selection process or in the extracted correlations. This observation holds for all graph types and mean-to-variance ratios. The correlations between neighbours in the graph did not change the selected variables.
Overall, netSGCCA outperformed SGCCA in retrieving the variables of interest in nearly all tested configurations. It appears that the graph influences the behaviour of the model through its properties and structure. We investigate these results on real oncological data in the next sections.

Table S1: Recovery performance for the configurations defined by the different u_2 cases and graphs. Corr is the correlation between the estimated components. Precision, Recall and F1 correspond to the evaluation of u_2 against the computed weights. Bold indicates the highest value between netSGCCA and SGCCA. Low mean-to-variance ratio (c = 0.5).

Additional datasets
To verify the robustness of the presented results, we analysed three other oncological datasets using the same experimental design as for TCGA-LGG. These datasets comprise the TCGA-KIRP dataset of 167 patients diagnosed with kidney renal papillary cell carcinoma, the pancreatic adenocarcinoma dataset TCGA-PAAD with 124 patients, and the TCGA-OV dataset of 219 patients diagnosed with ovarian serous cystadenocarcinoma. These datasets differ in tumour location, sample size, survival profile and event rate, as shown in Table S6. For each dataset, we compared the raw and normalised graph Laplacian using the Pathway Commons graph, in terms of variable selection and stability, as done earlier. Additionally, we analysed the effect of removing graph edges.

Comparison between normalised and raw graph Laplacian
Looking at the stability of the models using the Dice metric, as exhibited in Figure S9, stability increases as γ_G grows when the normalised graph Laplacian is used. In contrast, stability with the raw graph Laplacian often decreases sharply for high values of γ_G. Of the four datasets used in this work, only on TCGA-PAAD did the raw graph Laplacian outperform the normalised one when γ_G is greater than 1. However, the stability ranges differ between datasets, showing that the obtained results are data-dependent. Figure S10 shows the distribution of the selection rate of the genes selected at least once over the 100 runs, with γ_G = 10^3. This distribution is data-dependent; however, most genes are selected only a few times. On TCGA-PAAD and TCGA-OV, this distribution is reflected in the low stability of the models, as shown previously. On TCGA-KIRP, we obtained higher Dice stability scores when the normalised graph Laplacian was used, which is explained by a better distribution of the selection rate. Figure S8 shows that the number of selected variables increases as γ_G grows on the different studied datasets. This is in line with our findings on the TCGA-LGG dataset. Again, the increase is smoother when the normalised graph Laplacian is used.
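For reference, the two penalties compared here differ only in the Laplacian used. A minimal numpy sketch of both forms follows; the symmetrically normalised Laplacian has eigenvalues bounded in [0, 2], which is consistent with the smoother behaviour observed above.

```python
import numpy as np

def laplacians(A):
    """Raw (L = D - A) and symmetrically normalised (D^-1/2 L D^-1/2)
    graph Laplacians from an adjacency matrix A."""
    d = A.sum(axis=1)
    L = np.diag(d) - A
    with np.errstate(divide="ignore"):
        # isolated nodes (degree 0) get a zero scaling factor
        d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    L_norm = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    return L, L_norm
```

The raw Laplacian's eigenvalues grow with node degree, so on a graph with very uneven degrees (such as Pathway Commons) the raw penalty weighs hub genes much more heavily than the normalised one does.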
As previously done for TCGA-LGG, Figure S11 shows the degree distribution of the selected genes, on the whole graph and on the sub-graph. Both figures exhibit similar patterns across all the studied datasets, demonstrating that our observations from TCGA-LGG hold across datasets. Namely, the graph penalty neither favours highly connected nodes nor selects sub-regions of the graph.

Comparisons between different graphs
Again, we permuted the nodes of the graph to study the effect of graph semantics while keeping the same graph structure. The low Dice scores shown in Figure S13 indicate little consistency between the results obtained with the original PC graph and with the permuted graphs. This is in line with the results found on TCGA-LGG and demonstrates that the semantics of the chosen graph have a substantial impact on feature selection. Given a set of selected genes, we define inner edges as the direct links between selected genes, and outer edges as the links between a selected gene and a non-selected one. Both pruning strategies, removing inner edges only and removing both inner and outer edges, were used. Results are shown in Table S7. As seen on TCGA-LGG, removing direct edges between selected variables did not significantly impact variable selection. However, isolating the selected variables considerably reduced the number of selected variables. The existence of paths between variables is thus important for the variable selection process.
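The inner/outer distinction can be made concrete with a small helper. This is plain-Python illustration only; the edge list and gene names below are hypothetical, not the paper's data.

```python
def split_edges(edges, selected):
    """Partition edges w.r.t. a set of selected genes: inner edges join two
    selected genes, outer edges join a selected gene to a non-selected one."""
    selected = set(selected)
    inner = [(a, b) for a, b in edges if a in selected and b in selected]
    outer = [(a, b) for a, b in edges if (a in selected) != (b in selected)]
    return inner, outer

# Removing `inner` prunes direct links between selected genes; removing
# `inner + outer` isolates the selected genes from the rest of the graph.
edges = [("GENE_A", "GENE_B"), ("GENE_B", "GENE_C"), ("GENE_C", "GENE_D")]
inner, outer = split_edges(edges, {"GENE_A", "GENE_B"})
```

Edges touching no selected gene fall into neither list and are left untouched by both pruning strategies.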
Finally, we randomly removed edges from the graph and analysed the number of selected variables. On all datasets, including TCGA-LGG, the number of selected variables decreases with the number of graph edges, and thus with graph density.

Table S7: The effect of pruning edges that connect genes selected when using the full PC graph. Results obtained over 100 samples, on TCGA-KIRP, TCGA-PAAD and TCGA-OV.
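The random edge-removal experiment can be sketched as follows; this assumes a simple edge list, and the exact sampling scheme used in the paper is not specified here.

```python
import random

def drop_edges(edges, fraction, seed=0):
    """Return a copy of the edge list with `fraction` of the edges removed
    uniformly at random, lowering the graph density accordingly."""
    rng = random.Random(seed)
    n_remove = int(len(edges) * fraction)
    removed = set(rng.sample(range(len(edges)), n_remove))
    return [e for i, e in enumerate(edges) if i not in removed]
```

Rerunning the model on `drop_edges(edges, f)` for increasing `f` reproduces the qualitative trend reported above: fewer edges, hence lower density, yields fewer selected variables.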