HmtVar: a new resource for human mitochondrial variations and pathogenicity data

Abstract Interest in human mitochondrial genetic data is constantly increasing among both clinicians and researchers, due to the involvement of mitochondrial DNA (mtDNA) in a number of physiological and pathological processes. Thanks to new sequencing technologies and modern databases, the large amount of information on mtDNA variability may be exploited to gain insights into the relationship between mtDNA variants, phenotypes and diseases. To facilitate this process, we have developed the HmtVar resource, a variant-focused database that allows the exploration of a dataset of over 40 000 human mitochondrial variants. Mitochondrial variation data, initially gathered from the HmtDB platform, are integrated with in-house pathogenicity assessments based on various evaluation criteria and with a set of additional annotations from third-party resources. The result is a comprehensive collection of information of crucial importance for human mitochondrial variation studies and investigation of common and rare diseases in which the mitochondrion may be involved. HmtVar is accessible at https://www.hmtvar.uniba.it and data may be retrieved using either a web interface through the Query page or a state-of-the-art API for programmatic access.


INTRODUCTION
The mitochondrion, traditionally defined as the powerhouse of the eukaryotic cell, plays a role in a number of biological processes, and shows extensive variation in proteomic composition and function, differentiated in tissues and cell types. By virtue of such pivotal involvement, its dysfunction causes or contributes to human pathologies, such as neurodegenerative diseases, diabetes, cancer and metabolic syndromes. Hence, there is currently great interest in the relationship between mitochondria and disease, as confirmed by clinical literature (1)(2)(3).
Recent advances in high-throughput sequencing techniques have provided an unprecedented amount of genetic data capable of offering invaluable insights into different life science questions. This becomes even more significant in the light of the high number of sequences and related metadata available in public databases. Most of these data regard genomic variability, which can be exploited to achieve a better understanding of the correlations between DNA variants, phenotypes and diseases.
Big public biological datasets can be used to assess human diversity. A large number of mtDNA variations may simply represent neutral population specific polymorphisms, as reported by Phylotree (4), and can therefore be flagged as polymorphic variants, while the rest may be more or less involved in pathogenic processes. A huge amount of online resources for mitochondrial data analysis has surfaced over the years, with the common aim of producing a comprehensive knowledge of the onset and development of diseases in which mitochondria are involved. Some examples include the Mitomap (5) portal with the MitoTip (6) pathogenicity predictor for tRNA variants, the MSe-qDR platform (7) and the HmtDB database (8); however, there is still no specific resource for specifically querying and retrieving mitochondrial variants annotated with dedicated functional, structural, population, disease-related information and classified according to well-assessed tiers of pathogenicity.
To this aim, we have designed and implemented the HmtVar resource (https://www.hmtvar.uniba.it), an online database that will guide clinicians and researchers in the assessment of variability and pathogenicity in human mitochondrial DNA.
HmtVar is a variant-centred database which offers mtDNA variability and pathogenicity data. The information available in HmtVar comes from several human mitochondrial genomes and exploits different third-party resources. Hmt-Var variants are gathered from variations found in completely sequenced human mitochondrial genomes, downloaded from GenBank (9) into HmtDB (8) (https://www. hmtdb.uniba.it) and annotated as either 'healthy', when derived from individuals with no description of disease, or 'pathologic' otherwise.
The HmtVar dataset is further enriched with nonsynonymous and tRNA variants that are not yet observed among the HmtDB genomes, but can potentially occur, considering every possible substitution for each site of the reference human mitochondrial genome. With respect to ribosomal RNAs and regulatory regions loci, variants observed in HmtDB or reported outside HmtDB are annotated with the minimum number of attributes retrieved from external resources. The discrimination of observed from potential variants is performed using the frequency of the allele causing the variation, ranging from 0 to 1.
Any nucleotidic change is defined according to the revised Cambridge Reference Sequence (rCRS, Accession Number NC 012920.1) (10).
A set of attributes has been defined for each variant, based on evidence extracted from external resources or estimated by applying, on the complete human mitochondrial genomes of both 'healthy' and 'pathologic' individuals, computational approaches such as the estimation of nucleotide and amino-acidic sites variability. These values are estimated using, respectively, the SiteVar (11) and the MitVarProt (12) algorithms. The resulting variability scores (nt var and aa var) range from 0 to 1, with a higher value indicating a lower functional constraint of the site. Alongside with variability data, allele frequencies are also calculated and stored in HmtVar. Table 1 reports the list of attributes distinguished according to the different functional loci: protein-coding (CDS), tRNAs, rRNAs and regulatory regions.

Disease score estimation
Disease Score Estimation (DSE) for non-synonymous variants is based on the algorithm implemented by Santorsola et al. (13), while DSE for mt-tRNA variants is based on criteria reported by Diroma et al. (14), derived from Yarham et al. (15) but normalized within the range 0-1 (Table 2A). mt-tRNA variants for which no functional evidence is available from literature are flagged as VUS (Variants of Undefined Significance).

Disease score and allele frequency threshold
Once defined, the Disease Score (DS) values represent a tool to discriminate benign from pathogenic variants. The Disease Score Threshold (DS T) was defined for this purpose.
The DS T value was set at 0.43 for non-synonymous variants, and at 0.35 for tRNA variants, according to protocols defined in (13) and (14), respectively. In order to reinforce the variant pathogenicity assignment, both protocols associate DS T values with the empirical cumulative distribution of the nucleotide variability which in HmtVar is replaced by the allele frequency. Such procedure made it possible to determine, according to the latest HmtDB update, an Allele Frequency threshold (AF T), estimated by considering only variants with DS greater than DS T, equal to 0.003264 for non-synonymous variants and to 0.005020 for tRNA variants.

Tier definition
In order to assign each non-synonymous or tRNA variant to a specific tier of pathogenicity, DS T and AF T were considered, according to the general rules listed in Table 3A.
Using the specific thresholds defined, the final pathogenicity tiers were assessed as detailed in Table  3B.
Further criteria relevant for an exhaustive interpretation of mt-tRNA variants were considered, estimated and annotated, conditional to the availability of specific information. The additional criteria are listed in Table 2B.
The decision to implement HmtVar back-end functionality using Python Flask was made due to the need for a lightweight yet efficient framework both to build a reliable web service and to retrieve and analyse information from the underlying database. Python Flask integrates very well with SQLite databases and offers a set of tools to perform database construction, querying, editing and versioning right out of the box. SQLite, on the other hand, allows for a simpler and faster data storage and retrieval than legacy database engines; this is fundamental given the large amount of genomic information collected.
In addition, it was possible to deploy a comprehensive Application Programming Interface (API) without relying on other services, so that researchers and developers can access HmtVar data in a programmatic mode and integrate them into their applications.
HmtVar was developed with one of the key ideas being easy access from every device, so the Bootstrap library was employed for developing its front-end. Consequently, Hmt-Var can be accessed from either desktop or mobile devices without any loss in functionality.
One of the aims of HmtVar is to provide users with data that are constantly updated. This means collecting new variant entries and variability data as soon as they are available in HmtDB, as well as gathering additional information from several third-party resources. A specific software to retrieve new information from each data source was initially   Figure 1. Based on the total length of rCRS (16 569 bp), the total number of potential variants, excluding insertions and deletions, is 49 707; HmtVar hosts annotations for 82.33% of them. The number of CDS and tRNA annotated variants matches the expected value based on rCRS gene lengths. The number of rRNA and regulatory regions variants annotated in HmtVar represent only a subset of the entire potential set due to the lack of specific information either in literature or from external resources. This information will be integrated with subsequent HmtVar updates as soon as new data will be published. Figure 2 shows the percentage of observed (AF > 0) and potential (AF = 0) variants as reported in HmtVar, while in Figure 3 it is possible to see, for each locus type, the percentage of variants annotated in HmtVar with unassigned DS, with DS equal to 0 and with DS greater than 0. Disease scores are not available for frameshift, stop-gain and stop-loss CDS variants.
Moreover, the unavailability of published functionals studies for a large number of tRNA variants did not allow to estimate a congruent disease score, whereby these variants are annotated as VUS despite DS > 0.

Variant pathogenicity
Tiers of pathogenicity based on criteria reported in Table 3 were estimated and annotated for 28 768 variants allowing their classification as pathogenic (18 857   For both tRNA and protein-coding genes, 92.8% of pathogenic variants are not observed in 'healthy' genomes, unlike the remaining 7.2%.

Assessment of pathogenicity
The best way to test HmtVar was to select a dataset of mitochondrial genomes that would harbour pathogenic variants, to be flagged properly by our database. Based on our extensive experience in the field of somatic mutations in oncocytic tumours, a subset of human neoplasms with the unique ability to accumulate damaging mtDNA mutations within their highly deranged mitochondria, was selected HmtDB. The occurrence of disruptive mtDNA mutations, some of which very rare, as a genetic hallmark of oncocytomas, is widely recognised in literature, and supported by functional studies that reveal how such homoplasmic or near-homoplasmic changes are favoured exclusively in oncocytic tumours and not in other human cancers, and determine a dramatic respiratory dysfunction, often involving oxphos complexes disassembly (16)(17)(18)(19). So, all genomes se-   quenced directly from the tumour tissue of oncocytoma patients (20)(21)(22), were retrieved from HmtDB and their variants were tested using HmtVar.
The genome analysis led to the identification of 313 substitutions in 106 patients (Supplementary Table S1). In the protein coding regions, 187 non-synonymous substitutions (59.7%), 22 nonsense (7%) and 34 frameshift mutations (10.9%) were identified using the MToolBox pipeline (23). Among all the non-synonymous variants found, HmtVar classified 72 of them as pathogenic or likely-pathogenic. As expected from the biology of oncocytic tumours, 32 out of 116 patients (28%) presented more than half of the nonsynonymous mutations (40/72, 55%) localised within the subunits of complex I (Supplementary Table S1 Table S1). Additionally, 52 patients harboured a substitution in tRNA genes.
In short, a total of 117 oncocytic neoplasms were rescreened to reveal whether additional somatic pathogenic mutations that had gone previously undetected due to a lack of information or to the use of a single pathogenicity predictor could be retrieved. Particularly for tRNA mutations, proper assessment of a mutation's disruptive effect was difficult in the past. The advantages of using HmtVar are the 11% increase in cases bearing one or more pathogenic mtDNA mutations (90/117, 77% with an average of 1.6 mutations per patient, including frameshift and nonsense mutations), underlining how the fraction of mutated samples had been previously underestimated (77/117, 66% with an average of 1.2 mutations per patient, including frameshift and nonsense mutations), helping strengthen the association between a genetic lesion and a well-characterised phenotype.

Interface
HmtVar is accessible at https://www.hmtvar.uniba.it/ and offers both a web interface and a RESTful API to query its content. Using the Query web page, the database can be interrogated using several search parameters, from broader criteria to narrower ones.
After performing a query, the list of variants retrieved is shown in an html table reporting variants and basic associated information; it is possible to sort this list as well as to show a given number of results on each page.  24,25,26,27,28,29), Download Data.
An extensive description regarding data, structure and usage of the database as well as variant statistics is presented in the About and Statistics sections (https://www.hmtvar. uniba.it/about, https://www.hmtvar.uniba.it/stats).

API
In addition to the Query page, HmtVar allows the retrieval of variants using a dedicated Application Programming Interface (API), allowing users to access and download data in a programmatic manner. Valid API calls will return one or more results formatted as a JSON string, for easy parsing of information.
When returning a single variant, the API will provide the complete set of information available for the specific variant, exactly like the data shown in the Download Data tab of a Variant Card.
When returning a list of variants, on the other hand, each element in the list will report the URL to directly access the variant's complete data, as well as a limited set of basic information.
To further distribute mitochondrial variant information, HmtVar also offers a second form of API in addition to that previously described, created to be compliant with the RD-Connect platform specifications, so that it can be exploited by RD-Connect to collect and integrate HmtVar variant data into their platform. RD-Connect (30) is a comprehensive platform that integrates databases, patient registries, data analysis tools and biobanks for rare disease research. In order to collect and integrate data from a broad range of bioinformatic resources, RD-Connect has established a common API that data providers can adopt; using a set of standardised arguments for API calls, RD-Connect is then able to retrieve, parse, integrate and redistribute third-party data.

Conclusions
HmtVar offers a wide range of information regarding human mitochondrial genome variants, representing one of the most comprehensive resources for mitochondrial genomic variation studies. The broad set of mtDNA data hosted on several different online sources was exploited to build a unique aggregated database to fulfill the clinician's needs when looking for pathogenicity information on mitochondrial variants.
Pathogenicity predictions for variants located in mitochondrial tRNA and protein-coding genes allow the assessment of the harmfulness of each variation based on a pathogenicity consensus obtained using different tools and experimental information. Thanks to the classification provided by HmtVar it was possible to reveal nonsynonymous and tRNA pathogenic mutations in oncocytic tumors, known to be characterized by disruptive mtDNA mutations, at a higher frequency than previously reported for the same samples.
The occurrence of pathogenic variants in healthy subjects (7%) should not be considered as due to the overestimation of pathogenicity derived from the algorithm implemented, but rather to the fact that such variants display a very low allele frequency in healthy subjects (see DST values in Table  2b). This implies a higher probability that the healthy subjects carrying the pathogenic variant may belong to haplogroups other than that in which the pathogenic-defined variant determines a pathologic phenotype. Moreover, the high number of likely polymorphic variants reflects the fact that these variants have an under threshold disease score but an allele frequency above the threshold; this is explained by the prevalence of a particular allele in specific population lineages. This information can be easily obtained from Hmt-Var annotations, together with the third-party data supplied to help clinicians with pathogenicity assessment.
As regards tRNA variants, it is worth mentioning that the MitoTip database provides detailed information and assignment of pathogenicity for tRNA variants. However, a close collaboration between HmtVar, Mitomap and MSe-qDR curators is agreed and the exchange of information It reports the variant's basic information such as its location, the consequent amino acidic change for non-synonymous variants or specific structural details for tRNA variants, haplogroups and macro-haplogroups code if the variant is associated to a specific haplogroup according to Phylotree build 17 (http://phylotree.org/index.htm), as well as HmtVar pathogenicity prediction. will be settled through the implementation of specific API interfaces.
Future implementations will focus on extending Hmt-Var's dataset beyond protein-coding and tRNA gene variants to embrace the whole set of potential variations with respect to the rCRS reference sequence. Calculations of pathogenicity prediction for rRNA and regulatory region variants will be performed, in order to offer a full overview of human mitochondrial variability and pathogenicity. Pathogenicity assignment will be further improved considering the guidelines that the MSeqDR-ClinGen mtDNA expert panel (https://goo.gl/68xaEN) is defining in agreement with the Mitomap and MSeqDR curators.