DrivR-Base: a feature extraction toolkit for variant effect prediction model construction

Abstract
Motivation
Recent advancements in sequencing technologies have led to the discovery of numerous variants in the human genome. However, understanding their precise roles in diseases remains challenging due to their complex functional mechanisms. Various methodologies have emerged to predict the pathogenic significance of these genetic variants. Typically, these methods employ an integrative approach, leveraging diverse data sources that provide important insights into genomic function. Despite the abundance of publicly available data sources and databases, the process of navigating, extracting, and pre-processing features for machine learning models can be highly challenging and time-consuming. Furthermore, researchers often invest substantial effort in feature extraction, only to later discover that these features lack informativeness.
Results
In this article, we introduce DrivR-Base, an innovative resource that efficiently extracts and integrates molecular information (features) related to single nucleotide variants. These features encompass information about the genomic positions and the associated protein positions of a variant. They are derived from a wide array of databases and tools, including structural properties obtained from AlphaFold, regulatory information sourced from ENCODE, and predicted variant consequences from Variant Effect Predictor. DrivR-Base is easily deployable via a Docker container to ensure reproducibility and ease of access across diverse computational environments. The resulting features can be used as input for machine learning models designed to predict the pathogenic impact of human genome variants in disease. Moreover, these feature sets have applications beyond this, including haploinsufficiency prediction and the development of drug repurposing tools. We describe the resource’s development, practical applications, and potential for future expansion and enhancement.
Availability and implementation
DrivR-Base source code is available at https://github.com/amyfrancis97/DrivR-Base.


Introduction
The rapid advancement in Next Generation Sequencing technologies has facilitated the extensive identification of variants within the human genome. A significant number of these variants have an unknown functional impact. Among these, many could potentially contribute to disease phenotypes as driver variants, while others are likely to be passively involved and causatively neutral in nature.
In response, a diverse range of machine learning methodologies has been proposed, with the primary objective of integrating genome-level information (features) to identify driver variants. Notable tools in this context include DeepMind's recent AlphaMissense (Cheng et al. 2023), our FATHMM-MKL (Shihab et al. 2015) and CScape (Rogers et al. 2017) predictors, as well as CADD (Rentzsch et al. 2019), DANN (Quang et al. 2015), PolyPhen-2 (Adzhubei et al. 2013), and EVE (Frazer et al. 2021). While these tools employ diverse methodologies to tackle genomic prediction problems, the datasets, or features, integrated into the models prove equally crucial, and the utility of these classifiers relies heavily on the availability of feature data.
To our knowledge, DrivR-Base represents the first tool available to the research community that offers such a comprehensive and extensive compilation of annotations across the entire genome (Wang et al. 2010, McLaren et al. 2016, Liu et al. 2020). With its unique capability to integrate a wide array of detailed features and annotations from numerous databases, DrivR-Base stands out for its unparalleled breadth and depth of genomic and protein-level information accessible for extraction. Moreover, most modern tools focus on aggregating scores from machine learning models associated with a variant, rather than providing access to the raw annotations themselves (Liu et al. 2020). DrivR-Base, therefore, provides an unprecedented resource for direct application in machine learning models to accelerate the development of variant prediction tools.
To date, numerous features have demonstrated their effectiveness in assessing the likelihood of a variant driving disease. Conservation-based features, such as PhyloP and PhastCons scores (Siepel et al. 2005, Pollard et al. 2009), quantify sequence conservation across species. Studies have suggested that regions with lower conservation tend to be less functionally significant (Woodruff 2001). These features have proven informative in several predictors (Shihab et al. 2015, Rentzsch et al. 2019, Sun and Yu 2019, Cabrera-Alarcon et al. 2022).
Additionally, various other features have played vital roles in driver-variant prediction. For instance, the Variant Effect Predictor (VEP) (McLaren et al. 2016) has been instrumental in developing widely used prediction tools (Shihab et al. 2015, Rentzsch et al. 2019). VEP provides valuable insights into variant effects on transcripts within protein-coding regions, introns, and regulatory elements. Sequence-based similarity measures, which enable mathematical comparisons of wild-type and mutant string patterns (e.g. spectrum kernels), have also been used in this context, as have regulatory features from ENCODE (Dunham et al. 2012, Quang et al. 2015, Shihab et al. 2015, Rogers et al. 2017, Rentzsch et al. 2019). Additionally, information on GC content and CpG islands has proven valuable in these prediction tasks (Shihab et al. 2015, Rogers et al. 2017). Elevated GC content has been associated with increased bendability and the ability to undergo B-Z transitions, which are spatial features linked to open chromatin and active transcription (Vinogradov 2003).
While various feature groups are currently in use, additional molecular datasets could likely offer valuable insights in predicting driver variants. One example is the influence of single nucleotide variants (SNVs) on DNA shape properties. Multiple DNA shape properties have been implicated in DNA-protein interactions (Jones et al. 2003, Rohs et al. 2009, Chiu et al. 2017). Specifically, high electrostatic potentials have been linked to DNA binding sites (Jones et al. 2003, Chiu et al. 2017), and the narrowing of minor grooves has been associated with A-tracts, resulting in bending toward the minor groove (Rohs et al. 2009). As a result, SNVs occurring at these sites may disrupt these interactions and could lead to functional consequences.
Furthermore, other features that have not been extensively explored in this context include structural information sourced from the AlphaFold (Jumper et al. 2021) and PDB (Berman et al. 2000) databases. These databases contain a wealth of information that could prove valuable when assessing whether a genomic variant is likely to lead to disease. Other examples of feature groups that have not been widely employed thus far and are presented in this work include dinucleotide and amino acid properties.
In this paper, we introduce DrivR-Base, a novel repository designed to streamline the data acquisition process for constructing robust predictors of variant driver status. These datasets have broader applications, including the development of haploinsufficiency prediction models (Shihab et al. 2017) and potential adaptation for advancing drug repurposing tools (Irham et al. 2022). We focus on the human genome, providing users with a comprehensive toolkit of scripts, documentation, and links to original sources to build the required feature set. The deployment of bioinformatics tools across varied computational environments often presents a significant challenge due to dependency management and configuration issues. To address this, we have containerized DrivR-Base using Docker, so that researchers can deploy our toolkit without needing to manage individual software dependencies. Further details can be found in the Supplementary section.

Description and implementation
DrivR-Base is a feature extraction toolkit that enables efficient integration of genomic and protein-level annotations for all possible single nucleotide variants in the GRCh38 build of the human genome (including all four possible nucleotides at a given position). The resulting features have a wide range of applications, including direct integration into machine learning models for variant effect prediction. The output of DrivR-Base is a single file in which variants are represented as rows, with a column dedicated to the feature values for each of the attributes described below. The tool is fully containerized with Docker, facilitating straightforward installation and execution. The tool extracts information for ten different feature groups (FG) for human single nucleotide variants, mainly from public databases:
i) Conservation-based features: These include PhyloP and PhastCons scores (Siepel et al. 2005, Pollard et al. 2009), which assess whether nucleotide substitution rates deviate from the expectations under neutral drift. Each of these scores is obtained using seven different alignment methods. Additionally, we incorporate Umap and Bismap mappability data (Karimzadeh et al. 2018), measured using four different types of species alignment methods. These metrics assess the extent to which a genomic region can be accurately mapped during sequencing, providing insights into the reliability of genomic or epigenomic characteristics.
Regions exhibiting lower mappability readings may be more prone to error. To obtain these datasets for the entire genome, we retrieve data from the UCSC genome browser (Kent et al. 2002) and tailor our queries to specific input variants.
ii) Variant Effect Predictor: Features derived from the VEP (McLaren et al. 2016) are organized into three main groups. Firstly, we extract all predicted transcript consequences for each variant and encode them using one-hot encoding. The outcome is a file that displays a "1" in the corresponding row for each variant if the transcript consequence is predicted. Next, we retrieve the predicted wild-type and mutant amino acids, presenting the results in two files. The first file follows a BED+2 format, with the final two columns representing the wild-type and mutant amino acids, respectively. For synonymous variants, the amino acids will be the same. The second file is one-hot encoded, making it suitable for direct integration into the user's models. Finally, we extract distances to transcripts. When variants are predicted to affect multiple transcripts, we calculate their mean, maximum, and minimum distances.
iii) Dinucleotide properties: This feature dataset is sourced from DiProDB, an extensive database encompassing 125 conformational and thermodynamic dinucleotide properties (Friedel et al. 2009). We extract values for four dinucleotide configurations: (a) the wild-type allele paired with the adjacent allele on the left, (b) the wild-type allele paired with the adjacent allele on the right, (c) the mutant allele paired with the adjacent allele on the left, and (d) the mutant allele paired with the adjacent allele on the right. The resulting table contains columns, each representing one of the 125 different properties.
Column names include a prefix specifying which of the four configurations they pertain to. For example, "1_Propeller_Twist" denotes the value for the propeller twist property in the first configuration.
iv) DNA shape properties: Here, we incorporate five DNA shape properties from DNAShapeR (Chiu et al. 2016). DNAShapeR employs a sliding-window approach to calculate minor groove width (MGW), helix twist (HelT), propeller twist (ProT), roll (Roll), and electrostatic potential (EP). In our scripts, we extract DNA shape features within a window of +10 and −10 bases on either side of the variant, but this can be easily adjusted by the user.
The output is presented in a table, displaying the value for each DNA shape feature for every calculated base pair, where position 11 corresponds to the variant of interest.
v) GC content and CpG sites: DrivR-Base also calculates GC content, CpG counts, and observed CpG versus expected CpG ratios for nine different window sizes.
vi) Kernel-based sequence similarity: Our approach also employs sequence-based p-spectrum kernels to capture potential disruptions in sequences flanking a single nucleotide variant (Campbell and Ying 2011). Spectrum kernels allow us to assess the composition of k-mers within the genomic regions surrounding a mutation. We explore various window sizes ranging from 2 to 20 and k-mer sizes ranging from 1 to 20. For each chosen window size (w), we systematically generate all possible combinations of specified k-mer sizes for both wild-type and mutant sequences. We then determine the frequency of occurrence for each k-mer in the respective sequences using a mapping function in which u represents the sub-string k-mer of length p, v1 denotes the wild-type sequence, v2 refers to the mutant sequence, and s represents the sequence of interest. We subsequently derive a p-spectrum kernel by summing the products of corresponding row entries for the two sequences, where s corresponds to the wild-type sequence and t corresponds to the mutant sequence. We calculate the diagonals of the p-spectra by summing the squares of corresponding row entries within the mapping function matrix. For a more comprehensive explanation and detailed Python implementation, please refer to our Supplementary material and GitHub Repository.
vii) Amino acid substitution matrices: In this study, we extract amino acid substitution rates from a variety of matrices for non-synonymous variants sourced from the Bio2mds package in R (Pelé et al. 2012). The matrices used and their sources are shown in Table 1.
viii) Amino acid properties: DrivR-Base retrieves 532 amino acid properties for both wild-type and mutant amino acid sequences. These properties were sourced from the AAindex data within the AAsea package in R (Reddy 2019). They encompass information related to factors such as polarity, hydrophobicity, local flexibility, and helix-bend preferences.
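As an illustration of the p-spectrum kernel in feature group vi, the following minimal Python sketch counts k-mers in a wild-type and a mutant window and sums the products of matching counts. It assumes the standard p-spectrum definition (contiguous, overlapping k-mers of a single length p). The function names `kmer_counts` and `spectrum_kernel` and the 5-base example sequences are hypothetical and are not taken from the DrivR-Base scripts, which are documented in the Supplementary material and GitHub repository.

```python
from collections import Counter


def kmer_counts(seq: str, p: int) -> Counter:
    """Count every contiguous, overlapping k-mer of length p in seq."""
    return Counter(seq[i:i + p] for i in range(len(seq) - p + 1))


def spectrum_kernel(s: str, t: str, p: int) -> int:
    """Sum, over all k-mers u, the product of the counts of u in s and in t."""
    counts_s = kmer_counts(s, p)
    counts_t = kmer_counts(t, p)
    # k-mers absent from t contribute 0 because Counter returns 0 for missing keys.
    return sum(n * counts_t[u] for u, n in counts_s.items())


# Hypothetical 5-base windows centred on a C>T variant (illustrative only).
wild_type = "GACGT"
mutant = "GATGT"

k_wt_mut = spectrum_kernel(wild_type, mutant, p=2)     # cross term
k_wt_wt = spectrum_kernel(wild_type, wild_type, p=2)   # diagonal term
k_mut_mut = spectrum_kernel(mutant, mutant, p=2)       # diagonal term
print(k_wt_mut, k_wt_wt, k_mut_mut)
```

For these toy sequences with p = 2, the two windows share only the k-mers GA and GT, so the cross term is 2 and each diagonal term is 4. A substitution that disrupts more of the surrounding k-mers lowers the cross term relative to the diagonals.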
ix) ENCODE database features: ENCODE offers a wealth of functional information about the human genome (Dunham et al. 2012). In this work, we extract eight features potentially informative for variant pathogenicity: a) Transcription Factor ChIP-seq, b) Histone ChIP-seq, c) DNase-seq, d) Mint-ChIP-seq, e) ATAC-seq, f) eCLIP, g) ChIA-PET, and h) GM DNase-seq. To achieve this, we retrieve all available files for each feature group from ENCODE via the ENCODE API. Subsequently, we download, convert, and consolidate ENCODE peak files into comprehensive data frames for each feature group. These data frames include metadata such as accession, target (e.g. transcription factor), biosample (e.g. cell/tissue type), and output type (e.g. narrow peak). Note that this script downloads all ENCODE data locally, requiring approximately 160 GB of space.
Next, we cross-reference feature-specific databases with target SNVs, extracting relevant information overlapping with SNV locations. We then extract crucial data such as signal values, P-values, q-values, and peaks for each variant. For cases with multiple peaks, such as when replicate assays are involved, we also record minimum, maximum, mean, and range values.
x) AlphaFold structural features: DrivR-Base incorporates structural data from the AlphaFold database (Jumper et al. 2021) and PDB (Berman et al. 2000). Using the VEP query output, we identify genes and protein positions affected by coding variants. Gene names are converted to UniProtKB IDs, and an API retrieves corresponding crystallographic information files (CIF; .cif) from AlphaFold based on the UniProtKB ID. We extract structural information, including X, Y, and Z atom coordinates, isotropic atomic displacement parameters (IADP), and structural conformation types. The output includes two data frames: one containing the first four features (X, Y, Z coordinates, and IADP) for each variant, and another with one-hot-encoded structural conformation types indicating potential effects on amino acids, such as bends or helical structures.
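The cross-referencing step for feature group ix can be sketched as follows. This is a minimal illustration rather than the DrivR-Base implementation: it assumes peaks are held as BED-style tuples with 0-based, half-open coordinates and a single signal value, and that variant positions are 1-based. The peak records, the function names `overlapping_signals` and `summarise`, and the output keys are all hypothetical.

```python
from statistics import mean

# Hypothetical peak records: (chrom, start, end, signal_value).
# BED-style 0-based, half-open intervals; values are not real ENCODE data.
peaks = [
    ("chr1", 100, 200, 5.0),
    ("chr1", 150, 250, 9.0),
    ("chr1", 400, 500, 2.0),
    ("chr2", 100, 200, 7.0),
]


def overlapping_signals(chrom, pos, peak_records):
    """Return the signal values of all peaks covering a 1-based position."""
    return [signal for c, start, end, signal in peak_records
            if c == chrom and start < pos <= end]


def summarise(signals):
    """Summarise signals from multiple peaks; None when no peak overlaps."""
    if not signals:
        return None
    lowest, highest = min(signals), max(signals)
    return {
        "n_peaks": len(signals),
        "min": lowest,
        "max": highest,
        "mean": mean(signals),
        "range": highest - lowest,
    }


summary = summarise(overlapping_signals("chr1", 180, peaks))
no_hit = summarise(overlapping_signals("chr1", 300, peaks))
print(summary, no_hit)
```

A linear scan is used here only for readability. For genome-scale peak sets, an interval index or a dedicated interval-overlap tool would be the more practical choice.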
A detailed list of feature groups, their sources, and their implementation can be found in our Supplementary material.
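To make the GC content and CpG features of feature group v more concrete, the sketch below computes them for windows around a variant. It rests on stated assumptions: the observed/expected CpG ratio is taken to be the commonly used (CpG count × window length) / (C count × G count), which this section does not spell out, and the window sizes, example sequence, and function names `window` and `gc_cpg_features` are hypothetical (the nine window sizes used by DrivR-Base are not listed here).

```python
def window(seq: str, centre: int, size: int) -> str:
    """Return up to `size` bases on each side of a 0-based centre index."""
    start = max(0, centre - size)
    return seq[start:centre + size + 1]


def gc_cpg_features(seq: str) -> dict:
    """Compute GC content, CpG count, and observed/expected CpG for seq."""
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")
    gc_content = (c_count + g_count) / n if n else 0.0
    # Assumed definition: expected CpG = (C count * G count) / window length.
    expected = (c_count * g_count) / n if n else 0.0
    obs_exp = cpg_count / expected if expected else 0.0
    return {"gc_content": gc_content, "cpg_count": cpg_count,
            "cpg_obs_exp": obs_exp}


# Hypothetical sequence with the variant at 0-based index 5.
sequence = "ATCGCGTACGGA"
features = {size: gc_cpg_features(window(sequence, 5, size))
            for size in (1, 3)}  # illustrative window sizes only
print(features)
```

Each window size yields its own set of three values, so the number of output columns grows with the number of windows requested.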

Conclusions and future efforts
In summary, DrivR-Base is a versatile cross-database toolkit that consolidates diverse features for human SNVs. These features have various applications, including constructing high-dimensional machine-learning models for predicting variant driver status. As noted above, DrivR-Base can also be applied to predict haploinsufficient genes and to identify functional similarities to known drug targets, potentially aiding drug repurposing efforts. This tool streamlines feature extraction, saving researchers time and advancing their work. Our future goals include expanding the tool's capabilities to encompass a broader range of mutations, such as indels, deletions, and structural rearrangements, and diversifying the available feature groups for extraction. DrivR-Base is fully containerized for easy deployment using Docker, ensuring a reproducible and streamlined setup process. Detailed instructions for Docker deployment, including pulling the image, running the container, and executing the toolkit, are available in our comprehensive GitHub documentation at https://github.com/amyfrancis97/DrivR-Base. Researchers are encouraged to contact the authors to discuss the inclusion of additional feature groups in DrivR-Base or the enhancement of existing feature groups.

Table 1. Amino acid substitution matrices and their sources.