VarCards2: an integrated genetic and clinical database for ACMG-AMP variant-interpretation guidelines in the human whole genome

Abstract VarCards, an online database, combines comprehensive variant- and gene-level annotation data to streamline genetic counselling for coding variants. Recognising the increasing clinical relevance of non-coding variations, there has been an accelerated development of bioinformatics tools dedicated to interpreting non-coding variations, including single-nucleotide variants and copy number variations. Regrettably, most tools remain as either locally installed databases or command-line tools dispersed across diverse online platforms. Such a landscape poses inconveniences and challenges for genetic counsellors seeking to utilise these resources without advanced bioinformatics expertise. Consequently, we developed VarCards2, which incorporates nearly nine billion artificially generated single-nucleotide variants (including those from mitochondrial DNA) and compiles vital annotation information for genetic counselling based on ACMG-AMP variant-interpretation guidelines. These annotations include (I) functional effects; (II) minor allele frequencies; (III) comprehensive function and pathogenicity predictions covering all potential variants, such as non-synonymous substitutions, non-canonical splicing variants, and non-coding variations and (IV) gene-level information. Furthermore, VarCards2 incorporates 368 820 266 documented short insertions and deletions and 2 773 555 documented copy number variations, complemented by their corresponding annotation and prediction tools. In conclusion, VarCards2, by integrating over 150 variant- and gene-level annotation sources, significantly enhances the efficiency of genetic counselling and can be freely accessed at http://www.genemed.tech/varcards2/.


Introduction
Rapid advances in sequencing technology over the last few years have provided unprecedented opportunities and challenges for genetic counselling (1). To help clinicians and clinical laboratory geneticists address new challenges in sequence interpretation, standards and guidelines for interpreting sequence variants were developed by the American College of Medical Genetics and Genomics (ACMG) (2). These standards and guidelines are widely regarded as best practice for genetic counselling. However, most datasets and in silico algorithms recommended by the ACMG for sequence variant interpretation are dispersed across various online platforms and databases. In response, we introduced VarCards (http://www.genemed.tech/varcards/), a comprehensive online database, to equip users with essential genetic and clinical knowledge for genetic counselling on specific coding variants (3). Because it streamlines genetic counselling by offering the gene- and variant-level annotation information recommended by the ACMG, VarCards has received more than 372 000 visits since its launch.
As the clinical significance of non-coding single-nucleotide variants (SNVs) and copy number variants (CNVs) in genetic counselling has received increasing emphasis (4-8), a growing number of genomic tools and databases have been developed to facilitate the interpretation of these non-coding variations. Still, they are either locally installed databases (such as GREEN-DB (9) and regBase (28)) or command-line tools (such as ClassifyCNV (33) and DIVAN (14)). In addition, many important annotation sources, such as allele frequencies, expression quantitative trait loci (eQTL), and regulatory information, are dispersed across various online platforms. This dispersion makes it difficult for general clinicians, genetic counsellors and clinical laboratory geneticists to quickly access up-to-date data for interpreting the function and pathogenicity of variations across the whole human genome in line with the standards and guidelines of the ACMG.
Although comprehensive human variation annotation databases, such as VARAdb (34) and VannoPortal (35), exist, their primary emphasis is on providing detailed data on regulatory profiles and evolutionary signatures. This is convenient for biologists exploring the underlying molecular mechanisms but does not specifically address the needs of clinical genetic counselling. In addition, some of these databases cover only a fraction of possible variants: VARAdb (34), for example, compiled a total of 577 283 813 variations, of which the majority were single-nucleotide polymorphisms (SNPs) with a low likelihood of pathogenicity, whereas there should theoretically be nearly nine billion SNVs in the human genome. Moreover, these databases do not include CNVs or detailed gene-level annotations.
To support clinicians, genetic counsellors, and clinical laboratory geneticists in providing effective genetic counselling, we developed VarCards2, an intuitive online database. It houses nearly nine billion SNVs, over 360 million documented short insertions and deletions (INDELs), and more than two million CNVs. VarCards2 provides in-depth annotations at both the variant and gene levels, including in silico predictions of function and pathogenicity, minor allele frequencies (MAFs) across diverse populations, splicing predictions for both canonical and non-canonical splicing regions, and gene functionality, all in alignment with standards and guidelines from ACMG.

Variant-level data source
To optimise the support for genetic counselling, VarCards2 encompasses nearly nine billion SNVs, representing every base in the human reference genome GRCh38 (including mitochondrial DNA) mutated into each of the three other possible bases. Additionally, VarCards2 houses all reported short INDELs (length ≤ 50 bp) and CNVs (length > 50 bp) extracted from the following seven databases: (I) the Genome Aggregation Database (gnomAD) (36); (II) the International Cancer Genome Consortium (ICGC) (37); (III) the clinical variations database (ClinVar) (38); (IV) the Catalogue of Somatic Mutations In Cancer (COSMIC) (39); (V) the de novo mutations database Gene4Denovo (40); (VI) the NCBI database of genetic variation, dbSNP (41) and (VII) the NCBI database of human genomic structural variation, dbVar (42).
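The enumeration described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `enumerate_snvs`, the tuple layout and the decision to skip ambiguous bases such as N are assumptions made for illustration. It shows why a reference of roughly three billion resolved bases yields on the order of nine billion SNV records, since each A, C, G or T contributes exactly three substitutions.

```python
from typing import Iterator, Tuple

BASES = "ACGT"

def enumerate_snvs(chrom: str, sequence: str, start: int = 1) -> Iterator[Tuple[str, int, str, str]]:
    """Yield (chrom, pos, ref, alt) for every possible single-base substitution.

    Hypothetical helper: `start` is the 1-based coordinate of the first base.
    """
    for offset, ref in enumerate(sequence.upper()):
        if ref not in BASES:
            # Assumption: unresolved bases (e.g. N) produce no SNV records.
            continue
        for alt in BASES:
            if alt != ref:
                yield (chrom, start + offset, ref, alt)

# Toy example: four reference positions, one of them unresolved.
snvs = list(enumerate_snvs("chr1", "ACNT", start=100))
for record in snvs:
    print(record)
```

Three resolved bases give nine records here; the same factor of three applies genome-wide.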

Annotation and the conversion of genomic coordinate
Following the approach of VarCards (3), we utilised ANNOVAR (103), an efficient annotation tool, to annotate all SNVs and INDELs (including those in mitochondrial DNA) using our variant- and gene-level data sources. Additionally, we annotated all curated CNVs using AnnotSV, an integrated tool for CNV annotation (56). To facilitate queries, VarCards2 incorporates genomic coordinates for both GRCh37/hg19 and GRCh38/hg38. For raw data that provided coordinates in only one of these assemblies, we employed LiftOver (https://genome.ucsc.edu/cgi-bin/hgLiftOver) to convert them to the other.

Database construction and interface
To ensure that users quickly adapt to the functionality of VarCards2, we maintained the simple and popular user interface style characteristic of VarCards. The VarCards2 database was written in Java, JavaScript, Python, and Perl, with the front end and back end separated. The back end was based on Java Spring Boot (https://spring.io/projects/spring-boot), a server-side Java framework that provides services through Application Programming Interface (API) endpoints. The front end, namely the interactive web interface, was powered by the JavaScript libraries Vue (https://vuejs.org) and Element Plus (https://element-plus.org/), a Vue 3-based component library for designers and developers that supports all modern browsers across platforms, including Google Chrome, Firefox, Safari, and Microsoft Edge. Annotation of the genomic variants and calculation of all precomputed scores were performed using Python. The integrated data were stored in a MySQL database, and tab-delimited files were indexed using Tabix (104). The website, database, and search index were deployed on Alibaba Cloud (https://www.alibabacloud.com/).

Results and web interface
The best practices for offering high-quality services in clinical variant interpretation have been established by the ACMG (2).
To streamline genetic counselling in line with these best practices, VarCards2 integrates a wealth of variant-level and gene-level data sources (Figure 1). In the variant-level section, we include in silico predictions, allele frequencies across various populations, information on variants associated with diseases or phenotypes, reported de novo mutations and splice variants, and regulatory information such as eQTL, sQTL and epigenomics. In the gene-level section, we offer basic gene information, gene function, associations between genes and diseases or phenotypes, gene expression data, the number of variants in specific genes across diverse populations, and drug-gene interactions. All these features are presented via an intuitive web interface for user convenience. According to the ACMG guidelines, in silico prediction of function and pathogenicity is crucial for determining the potential pathogenicity of a variant. Several criteria, both pathogenic and benign, rely on these predictions, including (I) PVS1, which has a very strong pathogenic weight; (II) PS1, which carries a strong pathogenic weight; (III) PM4 and PM5, with a moderate pathogenic weight; (IV) PP3, with supporting pathogenic weight and (V) BP1, BP3, BP4 and BP7, each with supporting benign weight. To meet the requirements of these criteria, the number of in silico prediction algorithms or tools has been expanded from 23 in the predecessor, VarCards, to 105 in VarCards2 (Supplemental Table 1). These tools cater to various types of variation, including non-synonymous substitutions, non-coding SNVs, canonical and non-canonical splicing variants, short INDELs, and CNVs. Additionally, allele frequency (AF) is a crucial metric according to the ACMG guidelines. If a variant is not detected in several large-scale public population databases, such as gnomAD, 1000genomes, and HRC, this can be considered moderate evidence (PM2) supporting the pathogenicity of the variant. Furthermore, several assessment criteria set by the ACMG guidelines require information regarding other pathogenic variants at identical positions, reported de novo mutations, identified splicing sites, and whether the variant is situated on or proximate to a recognised pathogenic or risk gene.
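How annotation data of this kind map onto evidence codes can be sketched in Python. This is an illustrative sketch only, not VarCards2 code, and VarCards2 itself presents the evidence rather than rating variants automatically. The function names, the dictionary layout and the "more than half of the tools" rule for PP3 are assumptions; the ACMG guidelines do not prescribe a fixed tool-count threshold.

```python
from typing import Dict, List, Optional

def pm2_absent_from_populations(allele_freqs: Dict[str, Optional[float]]) -> bool:
    """PM2 candidate: no population database reports the variant.

    Assumption: None (or 0) means the variant was not observed in that database.
    """
    return all(af is None or af == 0 for af in allele_freqs.values())

def pp3_majority_deleterious(predictions: Dict[str, bool]) -> bool:
    """PP3 candidate: more than half of the tools with a call predict 'deleterious'.

    Assumption: the majority rule is illustrative, not an ACMG requirement.
    """
    if not predictions:
        return False
    return sum(predictions.values()) > len(predictions) / 2

def collect_evidence(allele_freqs: Dict[str, Optional[float]],
                     predictions: Dict[str, bool]) -> List[str]:
    """Return the candidate evidence codes for a reviewer to confirm."""
    evidence = []
    if pm2_absent_from_populations(allele_freqs):
        evidence.append("PM2")
    if pp3_majority_deleterious(predictions):
        evidence.append("PP3")
    return evidence

# Hypothetical annotation values for one variant.
afs = {"gnomAD": None, "1000genomes": None, "HRC": None}
preds = {"tool_a": True, "tool_b": True, "tool_c": False}
evidence = collect_evidence(afs, preds)
print(evidence)
```

The output is a list of candidate codes only; the final classification still requires expert review of the underlying evidence.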

Gene-level implications
In addition to variant-level annotations, VarCards2 offers the corresponding gene-level information to assist with genetic counselling. Gene-level information is presented in six distinct panels showing annotation details for genes containing or close to the given variant (Figure 3). The 'Basic Information' panel includes details such as: (I) gene names, encompassing the official symbol, full official name, and synonyms sourced from NCBI Gene (68); (II) a summary of the molecular functions of proteins encoded by the specified gene, as sourced from UniProtKB (71) and (III) the genetic intolerance scores from six studies (43,79-83). The 'Gene Function' panel aggregates information, including GO terms, protein length, mass, subunit structure, domains, biological pathways, gene constraint metrics from gnomAD, and protein-protein interactions corresponding to the protein encoded by the specified gene. The 'Phenotype and disease' panel retrieves the reported disease-associated variants or genes from OMIM (75), ClinVar (38), GeneReviews (84), ClinGen (85), HPO (86), GenCC (87), DECIPHER (88), Orphadata (89,90), GTR (92), NONCODE (93), MGI (94) and Gene4Denovo (40). In the 'Gene expression' panel, the expression data sourced from Brainspan (95), the GTEx project (66) and the Allen Brain Atlases (96) are illustrated separately using heatmaps or bar plots. Users can view variant counts based on functional effects and observe the overall mutation rates across various populations in the 'Variants in Different Populations' panel. In the drug-gene interaction panel, the drugs that affect the given gene are retrieved from DGIdb (98), DrugCentral (99), DTC (100), PharmGKB (101) and CTD (102). Compared with its predecessor, VarCards, VarCards2 has enriched its gene-level annotation resources by integrating additional sources such as gene function, gene expression, gene-drug interactions, and phenotype and disease information (Supplemental Table 1).

Customised annotations
VarCards2 incorporates a feature that allows users to upload genetic data files in the VCF4 format for customised annotations, akin to its predecessor, VarCards. In addition to selecting specific annotations and setting threshold values for in silico prediction scores, VarCards2 can pinpoint co-segregated mutations in non-trio-based samples and identify de novo, homozygous, compound heterozygous, and X-linked hemizygous mutations in trio-based samples. This functionality can be achieved using a straightforward four-step process: (I) users provide an email address to receive annotation results; (II) they choose between the Trio or Non-trio options for the VCF4 data; (III) VCF4 genetic data files are uploaded and (IV) for the Trio option, users input the sample IDs for the father, mother, and proband, including the proband's gender. If the Non-trio option is selected, users specify the genotype information for each sample, such as heterozygous, homozygous, and wild type.
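The inheritance patterns named for trio-based samples can be made concrete with a small Python sketch. It is a simplified illustration under stated assumptions, not the VarCards2 implementation: genotypes are encoded as the number of alternate alleles (0, 1 or 2), pseudoautosomal regions and genotype quality are ignored, and the function names `classify_single` and `compound_het_pairs` are hypothetical.

```python
from typing import List, Optional, Tuple

def classify_single(chrom: str, proband: int, father: int, mother: int,
                    proband_is_male: bool) -> Optional[str]:
    """Classify one variant in a trio; genotypes are alternate-allele counts."""
    if proband == 0:
        return None  # proband does not carry the variant
    if father == 0 and mother == 0:
        return "de_novo"  # present in the proband, absent from both parents
    if chrom in ("chrX", "X") and proband_is_male:
        # Assumption: a male proband inherits his single X from the mother.
        if mother >= 1 and father == 0:
            return "x_linked_hemizygous"
        return None
    if proband == 2 and father >= 1 and mother >= 1:
        return "homozygous"  # one alternate allele inherited from each parent
    return None

def compound_het_pairs(variants: List[Tuple[str, int, int, int]]) -> List[Tuple[str, str]]:
    """Pair heterozygous variants within ONE gene inherited from different parents.

    Each item is (variant_id, proband, father, mother).
    """
    paternal = [v for v, p, f, m in variants if p == 1 and f >= 1 and m == 0]
    maternal = [v for v, p, f, m in variants if p == 1 and m >= 1 and f == 0]
    return [(a, b) for a in paternal for b in maternal]

# Hypothetical genotypes for three variants in the same gene.
gene_variants = [("v1", 1, 1, 0), ("v2", 1, 0, 1), ("v3", 1, 1, 1)]
pairs = compound_het_pairs(gene_variants)
print(classify_single("chr1", 1, 0, 0, False), pairs)
```

Variants heterozygous in both parents (such as `v3`) are left out of the pairing because their parental origin in the proband cannot be resolved from genotypes alone.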

Other sections in VarCards2
VarCards2 also provides additional sections, including (I) the upload, which permits users to upload additional annotation datasets for customised annotations; (II) the data source, which provides a summary of the integrated data sources; (III) the updates, which provide the latest news about VarCards2 and (IV) the tutorial, which provides a further description of VarCards2 and how to use it.

Figure 2. Snapshot of variant-level implications in VarCards2. VarCards2 offers three methods for accessing variant-level implications: 'Quick search', 'Advanced search' and 'Annotate'. For instance, a quick search for the variant 'chr1:11845727 T > G (GRCh38)' yields results that include the damaging severity of the variant, allele frequencies across various populations, and relevant information from disease-associated databases.

Case studies
To assess the precision and utility of VarCards2 in detecting a broad range of potential causative variations, we examined several well-established and emerging loci based on published literature. (I) For pathogenic SNVs in non-coding regions, we queried chr1:11845727 T > G (GRCh38), located in the 3′ UTR (untranslated region) of the NPPA gene, which is associated with cardiovascular disorders (105). As expected, more than half of the non-coding prediction tools categorised this variant as deleterious, with a Phred-scaled score (−10 × log10(rank of raw score / total number of raw scores)) > 15. According to the ACMG guidelines, this variant therefore has a supporting pathogenic weight in genetic counselling (PP3). Moreover, the variant was not detected in several large-scale public population databases, such as gnomAD, 1000genomes, ExAC and HRC, which, according to the ACMG guidelines, can be considered moderate evidence for the pathogenicity of the variant (PM2). In addition, our gene-level annotation data indicated that NPPA is associated with cardiovascular diseases and is highly expressed in the heart. (II) For non-canonical splicing sites, we examined BBS1:c.G1339A. This mutation, a missense variant at a non-canonical splicing site, impairs the splicing process (106). In VarCards2, all 16 splicing-site prediction tools with available data supported the conclusion that this variant alters splicing. However, only approximately 20% of the missense mutation prediction tools deemed this variant detrimental. This underscores the benefit of integrating diverse prediction tools in a database.
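The Phred-scaled score used in the first case study can be worked through in a few lines of Python. This is a sketch of the formula as stated in the text, −10 × log10(rank / total); the function names are hypothetical, and the convention that rank 1 denotes the most deleterious raw score is an assumption. Under that convention, a cut-off of 15 corresponds to roughly the top 3.2% of scores, since 10^(−1.5) ≈ 0.0316.

```python
import math

def phred_scaled(rank: int, total: int) -> float:
    """Phred-scale a rank: -10 * log10(rank / total).

    Assumption: rank 1 is the most deleterious raw score among `total` scores.
    """
    if total <= 0 or not 1 <= rank <= total:
        raise ValueError("rank must satisfy 1 <= rank <= total")
    return -10.0 * math.log10(rank / total)

def exceeds_cutoff(rank: int, total: int, cutoff: float = 15.0) -> bool:
    """True when the Phred-scaled score is above the cut-off (15 in the case study)."""
    return phred_scaled(rank, total) > cutoff

# A variant ranked 1 000th among 1 000 000 raw scores lies in the top 0.1%.
example = phred_scaled(1_000, 1_000_000)
print(round(example, 2), exceeds_cutoff(1_000, 1_000_000))
```

A variant ranked in the top 5% would score about 13 and fall below the cut-off, which shows how strict a threshold of 15 is.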

Discussion
It is becoming increasingly evident that variants in the non-coding regions of the human genome significantly impact hereditary diseases (6,7). However, providing clinical interpretations of variants in non-coding areas remains challenging for clinicians and genetic counsellors (4). For general clinicians and genetic counsellors, the optimal approach for interpreting non-coding sequences is to adhere to the ACMG guidelines (8). To facilitate the interpretation of whole-genome sequencing, we incorporated over 150 annotation sources essential for genetic counselling by building VarCards2 within the framework of VarCards. For users seeking additional details, we provide a link that redirects them to the corresponding website for more comprehensive information.
Although many existing tools and databases can annotate non-coding sequences, VarCards2 presents distinct differences (Supplemental Table 2). Compared with seven existing databases, namely FAVOR (107), VannoPortal (35), VarSome (108), CADD (10), wAnnovar (109), VEP (110), and SnpEff (111), only VarCards2 can identify co-segregated variants, de novo mutations, homozygous variants, compound heterozygous variants, and X-linked hemizygous variants from user-provided VCF files for batch annotation. This feature not only helps clinicians and genetic counsellors, who may lack bioinformatics skills, to filter potential pathogenic variants from extensive data, but also provides evidential support for the interpretation of variant pathogenicity in genetic counselling based on the ACMG guidelines. Furthermore, most existing tools and databases lack comprehensive gene-level annotation. Although VarSome (108) and VEP (110) are exceptions, VarSome (108) operates as a commercial database, whereas VEP (110) only provides linkage information between genes and diseases or phenotypes at the gene level. In contrast, VarCards2 provides users with more than 40 gene-level functional annotations, including 'Gene function', 'Gene expression', 'Gene-drug interaction', and 'Phenotype and disease information', through an intuitive web interface. Additionally, VarCards2 not only provides the most in silico functional or pathogenic predictions among the compared databases but is also the only one that offers distinct prediction tools for various types of variants, including SNVs, short INDELs, CNVs, splicing variants and mitochondrial variants. VarCards2 is thus a unique, non-commercial, one-stop online database capable of supporting genetic counselling for SNVs, CNVs, short INDELs and mitochondrial variants. VarCards2 focuses primarily on the clinical interpretation of genetic mutations. It integrates commonly used essential tools and data while discarding less useful and redundant datasets, making it convenient for genetic counselling.
As a comprehensive one-stop online database designed to facilitate genetic counselling, VarCards2 exhibits distinct advantages over the traditional resources used in genetic counselling. For instance, ClinVar (38), an online database, is a valuable and widely used resource for genetic counselling. However, it does not represent all genetic variants owing to its dependency on voluntary submissions. To include as many genetic variants as possible, VarCards2 has not only artificially generated close to nine billion SNVs, representing all conceivable SNVs throughout the genome, but has also aggregated reported short INDELs and SVs from a multitude of databases, including dbVar (42), dbSNP (41), ICGC (37), COSMIC (39), gnomAD (36), Gene4Denovo (40) and ClinVar (38). Additionally, despite ClinVar guidelines, inconsistencies in how different laboratories interpret and classify genetic variants may still arise. This may lead to conflicting classifications of a single variant within the database, and certain submissions may lack comprehensive evidence or interpretations. Consequently, VarCards2 not only aggregates various variant- and gene-level databases for disease information but also provides multiple in silico pathogenic prediction scores and allele frequencies across diverse populations based on the ACMG-AMP guidelines, thereby providing comprehensive evidence to assist users in genetic counselling.
Although VarCards2 offers extensive data to support genetic counselling in line with the ACMG standards and guidelines, users should be aware of the following precautions. First, although we have incorporated over 150 annotation sources into VarCards2, we present the datasets used for rating to users rather than assigning ratings automatically, as automated rating could lead to a high number of false positives. Second, because we cannot automate the interpretation and extraction of key information from a large volume of open-access (OA) literature, the vast majority of annotation resources in VarCards2 originate from public databases. Consequently, some crucial information concealed within the most recent publications might be overlooked. We therefore encourage users to contribute their in-house annotation datasets because sharing them can benefit a wider user community. Third, disease- and phenotype-related data were collated from several databases, including ClinVar (38), OMIM (75), COSMIC (39) and HPO (86). Consequently, evidence of the clinical significance of variations was obtained from diverse teams that employed various criteria and may carry methodological biases. Users must remain vigilant of potential false positives in disease- and phenotype-related data (50,112). Finally, VarCards2 offers over 100 computational prediction scores for determining the pathogenicity or function of variations, including SNVs, INDELs, and CNVs; users should recognise that these methods vary in their specificity and sensitivity (62,63,113).
In the transition from VarCards to VarCards2, we refreshed our integrated data sources and incorporated additional datasets vital for the clinical interpretation of non-coding region variants. Although VarCards2 has a vast array of annotation resources, it refrains from directly pinpointing disease-causing variations owing to the intricacy of genetic testing criteria. However, we aim to enhance the VarCards2 database in a subsequent phase focused on automated genetic testing. We also invite users to share their feedback, suggestions, or valuable data sources. VarCards2 offers a user-friendly gateway for genetic, genomic, and clinical insights into the human genome, expediting the identification and prioritisation of critical variants and genes.

Figure 3. Snapshot of gene-level implications in VarCards2. For instance, details provided for the NPPA gene include basic information, gene functions, associated phenotypes and diseases, gene expression patterns, variant distributions across populations, and drug-gene interactions.
Figure 1. A general workflow of VarCards2. VarCards2 enables the identification of candidate variants from user-uploaded VCF files or through a quick search. For effective prioritization of these variants and the genes associated with genetic diseases, a comprehensive assessment of genomic, genetic, and clinical data sources is imperative. Accordingly, VarCards2 has integrated a range of variant-level and gene-level implications.
tional annotations for the respective variant. The new page displays all variant-level implications, including (I) summary for genetic counselling, (II) in silico prediction of function and pathogenicity, (III) AF data sourced from several public population databases, (IV) disease-related information, (V) additional variant insights, such as whether a particular variant is reported as a de novo mutation or splicing variant and (VI) regulatory information (Figure 2).