DeepSVP: integration of genotype and phenotype for structural variant prioritization using deep learning

Abstract

Motivation: Structural genomic variants account for much of human variability and are involved in several diseases. Structural variants are complex and may affect coding regions of multiple genes, or affect the functions of genomic regions in different ways from single nucleotide variants. Interpreting the phenotypic consequences of structural variants relies on information about gene functions, haploinsufficiency or triplosensitivity, and other genomic features. Phenotype-based methods to identify variants involved in genetic diseases combine molecular features with prior knowledge about the phenotypic consequences of altering gene functions. While phenotype-based methods have been applied successfully to single nucleotide variants as well as short insertions and deletions, the complexity of structural variants makes it more challenging to link them to phenotypes. Furthermore, structural variants can affect a large number of coding regions, and phenotype information may not be available for all of them.

Results: We developed DeepSVP, a computational method to prioritize structural variants involved in genetic diseases by combining genomic information and information about gene functions. We incorporate phenotypes linked to genes, functions of gene products, gene expression in individual cell types, and anatomical sites of expression, and systematically relate them to their phenotypic consequences through ontologies and machine learning. DeepSVP significantly improves the success rate of finding causative variants in several benchmarks and can identify novel pathogenic structural variants in consanguineous families.

Availability and implementation: https://github.com/bio-ontology-research-group/DeepSVP

Supplementary information: Supplementary data are available at Bioinformatics online.


Data sources and ontologies
The training dataset of variants contains 14,197 (10,401 deletions, 3,796 duplications) pathogenic or likely pathogenic structural variants and 4,477 variants associated with one or more diseases (3,737 deletions, 586 duplications), as well as 25,890 (13,742 deletions, 12,148 duplications) benign or likely benign structural variants.
For each pathogenic structural variant, we defined variant-disease pairs with associated diseases from the Online Mendelian Inheritance in Man (OMIM) database [1]. There are 3,805 structural variants linked with one or more OMIM diseases; if a variant is associated with n OMIM diseases, we generate n variant-disease pairs. As a result, we obtained 5,907 causative pathogenic variant-disease pairs. As negative training pairs, we selected both benign and pathogenic variants and associated them with a randomly selected disease; we include pathogenic variants in the negative training pairs to simulate the case where variants may be pathogenic but not associated with the phenotypes observed in a patient (i.e., with different phenotypes). After this step we obtained 36,041 negative (not causative) variant-disease pairs. We downloaded phenotypes from the HPO database on 16 July 2020 and obtained 169,281 phenotype associations for 4,315 human genes. We downloaded phenotype associations from MGI on 20 March 2020 and obtained 228,214 associations between 13,529 mouse genes and classes in MP. We identify the human orthologs for mouse genes using the HMD_HumanPhenotype.rpt orthology file from MGI; the file contains 10,951 human orthologs for the 13,529 mouse genes, resulting in 168,550 associations between human genes and MP classes. We obtained the GO annotations for human gene products from the GO Annotation database [2] on 20 March 2020, covering 18,495 gene products with 495,719 annotations in total. We filtered out all GO annotations with an evidence code indicating that the annotation was inferred from electronic annotation (IEA) or that no biological data is available (ND). We map the UniProt accessions of the gene products to Entrez gene identifiers using the mappings provided by the Entrez database [3], and obtain 17,786 genes whose gene products have GO annotations, resulting in 208,630 associations between genes and GO classes.
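The construction of positive and negative training pairs described above can be sketched as follows; the function name and input data structures are illustrative, not the paper's actual code:

```python
import random

def make_training_pairs(variant_diseases, all_diseases, neg_per_variant=2, seed=42):
    """Build causative (positive) and randomly matched (negative) pairs.

    variant_diseases: dict mapping a variant id to its associated OMIM
    diseases (hypothetical structure, for illustration only).
    """
    rng = random.Random(seed)
    positives, negatives = [], []
    for variant, diseases in variant_diseases.items():
        # one positive pair per associated OMIM disease (n diseases -> n pairs)
        for d in diseases:
            positives.append((variant, d, 1))
        # negatives: the same variant paired with randomly drawn,
        # non-associated diseases, simulating "pathogenic but not causative"
        candidates = [d for d in all_diseases if d not in diseases]
        for d in rng.sample(candidates, min(neg_per_variant, len(candidates))):
            negatives.append((variant, d, 0))
    return positives, negatives
```

The same scheme extends to benign variants, which only ever contribute negative pairs.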
For the anatomical location of gene expression, we downloaded the GTEx Tissue Expression Profiles from the Gene Expression Atlas [4] which identifies gene expression across 53 tissues; 20,538 genes have an expression above the 4.0 threshold which was previously determined to be useful for predicting disease associations [5] resulting in 585,765 associations between genes and UBERON classes. We represent the tissues with their UBERON classes, excluding the tissues transformed skin fibroblast and EBV-transformed lymphocyte as these are not found in UBERON. For the cell type, we downloaded single-cell RNAseq data from the Tabula Muris project [6] in which genes are annotated with the CL. From this dataset, we obtain 6,559 human genes which have CL annotations, and 17,149 associations between genes and one or more classes from CL.
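Deriving gene-tissue associations from the expression profiles amounts to thresholding an expression matrix; a minimal sketch, assuming a toy genes-by-tissues matrix in place of the GTEx profiles:

```python
import numpy as np

def expression_associations(expr, genes, tissues, threshold=4.0):
    """Return (gene, tissue) association pairs wherever expression
    exceeds the threshold (4.0, as used in the text).

    expr: genes x tissues expression matrix (toy stand-in for the
    GTEx Tissue Expression Profiles)."""
    rows, cols = np.where(expr > threshold)
    return [(genes[i], tissues[j]) for i, j in zip(rows, cols)]
```

In the paper, tissues are represented by their UBERON classes before building these associations.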
We use the combined PhenomeNET ontology [7], downloaded on 6 October 2020 from the AberOWL ontology repository, as our phenotype ontology. PhenomeNET combines the phenotypes of humans and other model organisms as well as UBERON, GO and CL, and allows them to be compared.

Estimating variant pathogenicity by supervised prediction

Phenotype prediction model
We apply DL2Vec to generate the feature representation for the patient phenotypes and genes.
DL2Vec learns the "representation" for phenotypes and genes based on their annotations to ontology classes. The inputs to DL2Vec are associations of entities with ontology classes and the outputs are vectors (embeddings) of these entities. DL2Vec utilizes the axioms in ontologies to construct a graph representing phenotypes and their interrelations. DeepSVP incorporates biological background knowledge about the relation between phenotypes resulting from a loss of function in mouse/human genes, gene functions as defined using the GO, as well as the cell type and anatomical site of gene expression.
The phenotype model takes two vectors v_1 and v_2 as input, representing the embedding for the patient's phenotypes and the embedding for a gene, respectively. The embeddings are used as input for two neural network models ν_1 and ν_2. We then calculate the inner product of ν_1(v_1) and ν_2(v_2) and apply a sigmoid activation function to generate a prediction score between the embedding for the phenotypes, v_1, and the gene, v_2. We use binary cross-entropy as the loss function to train our model, defined as

L = −(1/N) Σ_{i=1}^{N} [ Y_i log P(Y_i) + (1 − Y_i) log(1 − P(Y_i)) ],

where N corresponds to the number of training samples, Y_i is the true value for sample i, and P(Y_i) is the predicted value for sample i.
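The two-branch scoring and the binary cross-entropy loss can be sketched with numpy; layer sizes here are toy values, the weights are illustrative parameters, and dropout (used in training) is omitted:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def tower(v, w1, w2):
    """One branch (nu): two dense layers with LeakyReLU activations."""
    return leaky_relu(w2 @ leaky_relu(w1 @ v))

def predict(v1, v2, params):
    """Sigmoid of the inner product of the two tower outputs."""
    h1 = tower(v1, params["w1a"], params["w2a"])  # phenotype branch nu_1
    h2 = tower(v2, params["w1b"], params["w2b"])  # gene branch nu_2
    return 1.0 / (1.0 + np.exp(-(h1 @ h2)))

def bce_loss(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy: -(1/N) sum [y log p + (1-y) log(1-p)]."""
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
```

In the actual model the two hidden layers have 256 and 50 units (see the next paragraph); the sketch only mirrors the structure.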
Each neural network ν_1 and ν_2 consists of two hidden layers: the first layer with 256 units and the second layer with 50 units. After each layer, we use dropout [8] with a rate of 20%, followed by a Leaky Rectified Linear Unit (LeakyReLU) [9] activation function. We use the Adam optimizer [10]. We rank genes by their DL2Vec prediction score, from smallest to highest, and represent the association between a gene and a set of phenotypes by its m-quantile in this distribution [11] (Figure 5 shows the distribution of the normalized quantile scores). This normalization and ranking aims to make prediction scores comparable across sets of phenotypes [12]. We use the quantile as one of the features of the combined prediction model (Table 2).
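The quantile normalization of gene scores can be sketched as follows; the exact m-quantile binning rule below is an assumption, only the rank-then-normalize idea comes from the text:

```python
import numpy as np

def quantile_scores(scores, m=1000):
    """Map raw prediction scores to their m-quantile rank in [0, 1).

    Scores are ranked from smallest to highest and each rank is
    binned into one of m quantiles, making scores comparable
    across different sets of phenotypes."""
    scores = np.asarray(scores)
    ranks = np.argsort(np.argsort(scores))      # 0-based rank of each score
    return np.floor(ranks * m / len(scores)) / m
```

With m equal to the number of genes, this reduces to the normalized rank itself.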

Combined prediction model
Given a structural variant τ affecting regions that contain genes G_1, ..., G_n, we obtain the phenotype prediction score φ(G_i) for the genes G_1, ..., G_n using each of the phenotype-based prediction models. We transform these scores into a feature for the variant τ using either the maximum or the average of all the gene scores, i.e., either

φ_max(τ) = max_{i=1,...,n} φ(G_i)  or  φ_avg(τ) = (1/n) Σ_{i=1}^{n} φ(G_i).

We normalize all the features using z-score normalization, in which the values v_i for a feature F are scaled based on the mean and standard deviation of F:

v′_i = (v_i − µ_F) / σ_F,

where v′_i is the z-score normalized value, v_i is the i-th value of feature F, and µ_F and σ_F are the mean and standard deviation of feature F. We use the same mean and standard deviation (computed on the training set) to normalize the testing set.
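A minimal numpy sketch of the max/average aggregation and the z-score normalization described above (function names are illustrative):

```python
import numpy as np

def variant_feature(gene_scores, how="max"):
    """Aggregate per-gene phenotype scores phi(G_i) into one
    variant-level feature, using the maximum or the average."""
    gene_scores = np.asarray(gene_scores)
    return gene_scores.max() if how == "max" else gene_scores.mean()

def zscore_fit(train):
    """Compute per-feature mean mu_F and standard deviation sigma_F
    on the training set only."""
    return train.mean(axis=0), train.std(axis=0)

def zscore_apply(x, mu, sigma):
    """v'_i = (v_i - mu_F) / sigma_F, reusing the training statistics
    so that the test set is normalized consistently."""
    return (x - mu) / sigma
```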
We use 22 features in total: 8 features for the variant itself, 9 derived from the genes overlapping the variant, and 5 from the ontology embeddings. Some features are missing for some variants; to account for missing values, we use imputation.
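The text does not specify the imputation rule; as a stand-in, a simple column-median imputation could look like this:

```python
import numpy as np

def impute_median(X):
    """Replace NaNs in each feature column with that column's median
    (illustrative only; the paper's actual imputation rule may differ)."""
    X = X.copy()
    med = np.nanmedian(X, axis=0)          # per-feature median, ignoring NaNs
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = np.take(med, cols)     # fill each NaN from its column
    return X
```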

Training and testing
We tuned the hyperparameters of the model, including the learning rate r; the full set of hyperparameters and their ranges is listed in Table 1. We used nested 5 × 2-fold cross-validation for training (Figure 3).
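A sketch of generating index splits for nested 5 × 2-fold cross-validation: an outer 5-fold split for evaluation, and within each outer training set an inner 2-fold split for hyperparameter tuning (a sketch of the scheme, not the paper's exact implementation):

```python
import numpy as np

def nested_5x2_splits(n, seed=0):
    """Yield (outer_train, outer_test, inner_folds) index arrays for
    nested 5 x 2 cross-validation over n samples."""
    rng = np.random.default_rng(seed)
    outer = np.array_split(rng.permutation(n), 5)
    for k in range(5):
        test = outer[k]
        train = np.concatenate([outer[j] for j in range(5) if j != k])
        # inner 2-fold split of the outer training set for tuning
        inner = np.array_split(rng.permutation(train), 2)
        yield train, test, inner
```

Hyperparameters are selected on the inner folds; the outer test fold is only used for the final evaluation.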

Variant-based features
We use AnnotSV v2.3 [13], which uses data from multiple external databases to annotate and rank SVs based on the overlap of the variants with known pathogenic variants in dbVar [14] and the Database of Genomic Variants (DGV) [15]. We further use AnnotSV to obtain information about genes with which a structural variant overlaps: the length of the coding DNA sequence (CDS length), the transcript length (tx length), haploinsufficiency ranks collected from the Deciphering Developmental Disorders (DDD) study [17] (HI DDDpercent), haploinsufficiency and triplosensitivity estimates from ClinGen [18], and gene intolerance annotations from ExAC [19], including six annotations for synonymous variants (synZExAC), missense variants (misZExAC), loss-of-function variants (pLIExAC), deletions (delZExAC), duplications (dupZExAC), and CNV intolerance (cnvZExAC). Table 2 summarizes all the features used in our predictive model.
While not used as features of our prediction model, we also use AnnotSV to obtain the allele frequencies of variants from the 1000 Genomes Project [20] and from gnomAD [21]. We use this information to filter out common variants before applying our prediction model.
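The frequency-based pre-filter can be sketched as follows; the dictionary keys are hypothetical stand-ins for the AnnotSV frequency columns:

```python
def filter_common(variants, max_af=0.01):
    """Drop variants whose highest reported allele frequency exceeds
    the cutoff; variants with no reported frequency are kept (rare).

    Each variant is a dict with optional keys "gnomad_af" and "kg_af"
    (illustrative names, not AnnotSV's actual column names)."""
    kept = []
    for v in variants:
        afs = [a for a in (v.get("gnomad_af"), v.get("kg_af")) if a is not None]
        if not afs or max(afs) <= max_af:
            kept.append(v)
    return kept
```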

Correlation between features
To test for redundancy among the different structural variant scores with respect to the phenotype susceptibility for the disease, we analyzed the features by evaluating the pairwise correlations between them (Figure 4). According to their correlation coefficients, the features broadly clustered into four major groups using the maximum and average scores. As expected, measures of the structure of variants, such as SV length and the number of genes or promoters, were highly correlated. We further assessed the importance of the features using two methods: first, an Extremely Randomized Trees ensemble classifier (Extra Trees Classifier) (Figure 6), and second, the Shapiro-Wilk algorithm [22] (Figure 7).

Variant-based annotations

GCcontent right* Breakpoint annotation: GC content around the right SV breakpoint (± 100 bp).
Number of genes* Number of genes within the SVs.

Number of promoters* Number of promoters within the SVs.
HI CGscore Haploinsufficiency score (categorical feature).

Gene-based annotations
DL2vec Score* Predicted associations between genes and sets of phenotypes using different ontologies (GO, HP, CL, MP, and UBERON).
CDS length* Length of the coding DNA sequence (CDS) overlapping with the SV (bp).
tx length* Length of transcript (bp) overlapping with the SV.
pLI ExAC* The probability that a gene is intolerant to a loss of function variation.
HI DDDpercent* Haploinsufficiency rank, wherein a single functional copy of a gene is insufficient to maintain normal function.

Table 3: Summary of the evaluation for predicting causative variants in our testing data set using nested 5-fold cross-validation. We report the mean of the different evaluation metrics along with the standard deviation.

Table 4: Summary of the evaluation for predicting causative variants in the synthetic whole-genome benchmark dataset derived from dbVar, using 90% of the phenotypes. AnnotSV results are computed based on ranking variants from "pathogenic" to "benign". We break ties uniformly at random for all methods and report the absolute number of variants recovered at each rank together with the recall, as well as the areas under the ROC curve (using micro-averages per synthetic genome) and the precision-recall curve. Best-performing results (using the maximum or average score) for each measure are indicated in bold.

Table 5: Summary of the evaluation for predicting causative variants in the synthetic whole-genome benchmark dataset derived from dbVar, using 80% of the phenotypes. AnnotSV results are computed based on ranking variants from "pathogenic" to "benign". We break ties uniformly at random for all methods and report the absolute number of variants recovered at each rank together with the recall, as well as the areas under the ROC curve (using micro-averages per synthetic genome) and the precision-recall curve. Best-performing results (using the maximum or average score) for each measure are indicated in bold.

Table 6: Summary of the evaluation for predicting causative variants in the synthetic whole-genome benchmark dataset derived from dbVar, using 70% of the phenotypes. AnnotSV results are computed based on ranking variants from "pathogenic" to "benign". We break ties uniformly at random for all methods and report the absolute number of variants recovered at each rank together with the recall, as well as the areas under the ROC curve (using micro-averages per synthetic genome) and the precision-recall curve. Best-performing results (using the maximum or average score) for each measure are indicated in bold.

Table 7: Summary of the evaluation for predicting causative variants in the synthetic whole-genome benchmark dataset derived from dbVar, using 50% of the phenotypes. AnnotSV results are computed based on ranking variants from "pathogenic" to "benign". We break ties uniformly at random for all methods and report the absolute number of variants recovered at each rank together with the recall, as well as the areas under the ROC curve (using micro-averages per synthetic genome) and the precision-recall curve. Best-performing results (using the maximum or average score) for each measure are indicated in bold.
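The pairwise feature-correlation analysis described above (Figure 4) can be sketched with numpy; the feature matrix and names below are toy inputs:

```python
import numpy as np

def feature_correlations(X, names):
    """Pairwise Pearson correlations between feature columns of X,
    returned as (name_i, name_j, r) triples for the upper triangle."""
    R = np.corrcoef(X, rowvar=False)   # columns are features
    out = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            out.append((names[i], names[j], R[i, j]))
    return out
```

Highly correlated pairs (e.g. SV length and the number of genes, in the paper's analysis) then stand out with r close to 1.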

Structural variant calling
We prepared 150-bp paired-end libraries using the TruSeq Nano DNA Sample Preparation kit (Illumina, USA). Sequencing was performed on an Illumina HiSeq 4000 at the Bioscience Core Laboratory, KAUST, to approximately 30X coverage. Following sequencing, we aligned reads to human genome build hg38 using the BWA-MEM algorithm [23], following the GATK standard workflows: we trim adapters using Trimmomatic (version 0.38), use BWA (version 0.7.17) for alignment, and use samtools (version 1.8) [24] to remove duplicates and sort BAM files. We use Manta (version 1.6) [25] to call structural variants. In total, we identified 8,723, 9,003, 9,608, 7,631, and 8,367 variants for the mother, father, first affected, second affected, and unaffected individuals, respectively. We assume an autosomal recessive mode of inheritance or de novo variants, and use the pedigree to filter variants. We further filter common variants (minor allele frequency greater than 0.01) using gnomAD (version 2.1.1) [21] and the SVs from the 1000 Genomes Project. After filtering by pedigree, 148 structural variants remain; removing common variants reduced this number to 47. We use the DeepSVP combined model with the maximum as the aggregation operation over the genes within a variant to prioritize disease-associated SVs.
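The pedigree-based filter described above can be sketched as follows; the genotype encoding and field names are illustrative assumptions, not the study's actual filtering code:

```python
def passes_pedigree(variant):
    """Keep a variant if it fits autosomal recessive inheritance
    (both affected siblings homozygous, parents heterozygous carriers,
    unaffected sibling not homozygous) or looks de novo (absent from
    both parents). Genotypes use VCF-style "0/0", "0/1", "1/1" strings."""
    gt = variant["genotypes"]  # e.g. {"father": "0/1", "mother": "0/1", ...}
    recessive = (
        gt["affected1"] == "1/1" and gt["affected2"] == "1/1"
        and gt["father"] == "0/1" and gt["mother"] == "0/1"
        and gt["unaffected"] != "1/1"
    )
    de_novo = (
        gt["affected1"] != "0/0"
        and gt["father"] == "0/0" and gt["mother"] == "0/0"
    )
    return recessive or de_novo
```

Variants passing this filter (and the allele-frequency filter) are then ranked by the DeepSVP combined model.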