Prioritizing genomic variants through neuro-symbolic, knowledge-enhanced learning

Abstract
Motivation Whole-exome and genome sequencing have become common tools in diagnosing patients with rare diseases. Despite their success, these approaches leave many patients undiagnosed. A common argument is that more disease variants still await discovery, or that novel disease phenotypes result from a combination of variants in multiple disease-related genes. Interpreting the phenotypic consequences of genomic variants relies on information about gene functions, gene expression, physiology, and other genomic features. Phenotype-based methods to identify variants involved in genetic diseases combine molecular features with prior knowledge about the phenotypic consequences of altering gene functions. While phenotype-based methods have been successfully applied to prioritizing variants, they rely on known gene–disease or gene–phenotype associations as training data and are applicable only to genes with associated phenotypes, which limits their scope. In addition, phenotypes are not assigned uniformly by different clinicians, and phenotype-based methods need to account for this variability.
Results We developed an Embedding-based Phenotype Variant Predictor (EmbedPVP), a computational method to prioritize variants involved in genetic diseases by combining genomic information and clinical phenotypes. EmbedPVP leverages a large amount of background knowledge from human and model organisms about the molecular mechanisms through which abnormal phenotypes may arise. Specifically, EmbedPVP incorporates phenotypes linked to genes, functions of gene products, and the anatomical site of gene expression, and systematically relates them to their phenotypic effects through neuro-symbolic, knowledge-enhanced machine learning. We demonstrate EmbedPVP’s efficacy on a large set of synthetic genomes and genomes matched with clinical information.
Availability and implementation EmbedPVP and all evaluation experiments are freely available at https://github.com/bio-ontology-research-group/EmbedPVP.


Variant prediction results across several models
We evaluated EmbedPVP on several benchmark datasets, including synthetic datasets built from clinical phenotypes and from OMIM phenotypes, using the transductive approach. The evaluation assesses the performance of different embedding methods on each of these datasets.
The following tables and figures compare their performance with that of other state-of-the-art methods.

ClinVar time-split
The results are in Table 4. Tables 5 and 6 show the results for novel and known genes or diseases, respectively. Table 7 provides the evaluation for exonic and non-exonic variants.
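A time-split evaluation of this kind can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Record` fields, the cutoff date, the toy records, and the rule used to separate "known" from "novel" test cases (a test record counts as known if its gene or its disease appears in the training portion) are all assumptions made for the example.

```python
from datetime import date
from typing import List, NamedTuple, Tuple


class Record(NamedTuple):
    """Hypothetical variant record; field names are illustrative only."""
    variant: str
    gene: str
    disease: str
    added: date


def time_split(records: List[Record], cutoff: date) -> Tuple[List[Record], List[Record]]:
    """Records added before the cutoff form the training set, the rest the test set."""
    train = [r for r in records if r.added < cutoff]
    test = [r for r in records if r.added >= cutoff]
    return train, test


def partition_test(train: List[Record], test: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Split test records into (known, novel).

    Assumption of this sketch: a record is 'known' if its gene or its
    disease was seen in training, and 'novel' if neither was seen.
    """
    seen_genes = {r.gene for r in train}
    seen_diseases = {r.disease for r in train}
    known = [r for r in test if r.gene in seen_genes or r.disease in seen_diseases]
    novel = [r for r in test if r.gene not in seen_genes and r.disease not in seen_diseases]
    return known, novel


# Toy records, invented for illustration only.
records = [
    Record("var1", "GENE_A", "DISEASE_1", date(2020, 5, 1)),
    Record("var2", "GENE_B", "DISEASE_2", date(2021, 3, 1)),
    Record("var3", "GENE_A", "DISEASE_3", date(2023, 1, 15)),  # known gene
    Record("var4", "GENE_C", "DISEASE_4", date(2023, 6, 30)),  # novel gene and disease
]

train, test = time_split(records, cutoff=date(2022, 1, 1))
known, novel = partition_test(train, test)
print([r.variant for r in known])  # ['var3']
print([r.variant for r in novel])  # ['var4']
```

Splitting by date rather than at random keeps later submissions out of training, so the novel subset measures how well a model generalizes to genes and diseases it has not seen.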

Evaluations for the variants in non-exonic regions
Evaluating variants in non-exonic regions is important because they can significantly affect gene expression and regulation. These regions contain regulatory elements, such as enhancers and silencers, that control the expression of neighboring genes. Analyzing these variants gives a more complete picture of the genetic factors that contribute to various diseases and conditions [1]. We therefore extended our evaluation to variants in non-exonic regions, using the phenotype annotations of the neighboring genes. In the PAVS benchmark dataset, we identified a total of 296 non-exonic variants, including intronic, splicing, UTR5, UTR3, upstream, and ncRNA variants. The results are shown in Supplementary Table 2. Table 7 shows the results obtained using the ClinVar dataset with 197 non-exonic variants. The results show that our EmbedPVP models continue to outperform other state-of-the-art methods, even for variants in these non-exonic regions.

Evaluations for variants in intergenic and overlapping genes
To further assess how well EmbedPVP handles variants located in overlapping or intergenic regions, we used the maximum similarity score among the genes surrounding each variant. To evaluate this approach, we collected a new set of 108 variants from ClinVar that lie within intergenic regions or overlapping genes. The results presented in Supplementary Table 9 demonstrate that our method outperforms other approaches across various metrics. Specifically, the EmbedPVP (TransD) model is the most effective at capturing the genomic context and achieves better performance than the other methods considered in this study. EmbedPVP also outperforms the other methods when used with other embedding methods, such as DL2vec and OWL2vec*. Incorporating information about the surrounding genes lets EmbedPVP take the genomic context of a variant into account, which improves its predictions for these variant types relative to other methods. By considering the genes in proximity to the variants, we ensure that the models capture the genomic context needed to predict the impact of these variants on phenotypes or diseases.
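The maximum-score aggregation over surrounding genes can be sketched as below. This is a minimal illustration under stated assumptions, not the EmbedPVP implementation: the function name `variant_score`, the `gene_scores` mapping from gene to phenotype-similarity score, the `default` returned when no surrounding gene has a score, and all gene names and values are hypothetical.

```python
from typing import Dict, Iterable, Optional


def variant_score(
    surrounding_genes: Iterable[str],
    gene_scores: Dict[str, float],
    default: Optional[float] = None,
) -> Optional[float]:
    """Return the maximum phenotype-similarity score among surrounding genes.

    Genes missing from `gene_scores` are skipped; `default` is returned when
    no surrounding gene has a score (an assumption of this sketch).
    """
    scores = [gene_scores[g] for g in surrounding_genes if g in gene_scores]
    return max(scores) if scores else default


# Toy values, invented for illustration only.
gene_scores = {"GENE_A": 0.12, "GENE_B": 0.87, "GENE_C": 0.40}
intergenic_variant = ["GENE_A", "GENE_B"]    # genes flanking the variant
overlapping_variant = ["GENE_C", "GENE_X"]   # GENE_X has no score

print(variant_score(intergenic_variant, gene_scores))   # 0.87
print(variant_score(overlapping_variant, gene_scores))  # 0.4
```

Taking the maximum means a variant is ranked by its best-matching neighboring gene, so one strongly phenotype-relevant gene is not diluted by unrelated neighbors.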

Evaluations for variants in genes with no phenotype annotations
Because we use several types of features characterized through ontologies, our method can be applied to a much larger number of genes: any gene for which the functions, sites of expression, phenotypes, or interactions with other genes are known. To evaluate this, we collected from ClinVar a subset of 397 variants in genes that have no human phenotype annotations but do have Gene Ontology (GO) annotations. Using the gene functions together with the enriched knowledge graph, we ranked the variants and assessed the ranking performance. The results are presented in Supplementary Table 8.

Table 1, Figure 1, or Figure 2, for the exonic variants and other types in Table 2

Table 2: EmbedPVP variant prediction results across several models for the exonic and non-

Table 3: EmbedPVP variant prediction results across several models using Phenopackets dataset.

Table 4: EmbedPVP variant prediction results across several models using ClinVar dataset.

Table 5: EmbedPVP variant prediction results across several models using ClinVar dataset for the novel genes or diseases.

Table 6: EmbedPVP variant prediction results across several models using ClinVar dataset for either known genes and/or diseases during training.

Table 7: EmbedPVP variant prediction results across several models using ClinVar dataset for the exonic and non-exonic variants.

Table 8: EmbedPVP evaluations for the variants in genes with no phenotype annotations.

Table 9: EmbedPVP evaluations for the variants within overlapping genes.

Table 10: Evaluation results for the ablation study (shaded rows) considering only the annotations without using gene-disease associations and additional taxonomies from uPheno ontology.

Based on the results in Supplementary Table 8, our method, EmbedPVP, outperformed the other methods that rely mainly on existing knowledge of gene-to-disease phenotype annotations.