VCFshiny: an R/Shiny application for interactively analyzing and visualizing genetic variants

Abstract

Summary: Next-generation sequencing generates variants that are typically documented in variant call format (VCF) files. However, comprehensively examining variant information from VCF files can pose a significant challenge for researchers lacking bioinformatics and programming expertise. To address this issue, we introduce VCFshiny, an R package that features a user-friendly web interface enabling interactive annotation, interpretation, and visualization of variant information stored in VCF files. VCFshiny offers two annotation methods, Annovar and VariantAnnotation, to add annotations such as genes or functional impact. Annotated VCF files are accepted as input for summarizing and visualizing variant information. The summaries include the total number of variants, overlaps across sample replicates, base alterations of single nucleotides, length distributions of insertions and deletions (indels), high-frequency mutated genes, variant distribution across the genome and across genome features, variants in cancer driver genes, and cancer mutational signatures. VCFshiny makes VCF files easier to interpret by offering an interactive web interface for analysis and visualization.

Availability and implementation: The source code is available under an MIT open source license at https://github.com/123xiaochen/VCFshiny with documentation at https://123xiaochen.github.io/VCFshiny.


Introduction
Recent advances in sequencing technologies have enabled the detection of a large number of genetic variants at the whole genome level (Metzker 2010, van Dijk et al. 2014). Genetic variants can be acquired by cells during development, and may be caused by DNA replication errors or exposure to environmental mutagens (Pei et al. 2021). The most common scenario for genetic variant detection is cancer genomics research, because most cancers are caused by genetic variants in driver genes, and harmful genetic variants continue to accumulate during the development of cancer (Nakagawa and Fujita 2018, Xiao et al. 2021). Thus, the crucial first step in the analysis of cancer sequencing data is identifying genetic variants (Koboldt 2020). Another use for genetic variant detection is gene editing research, because the wide application of clinical gene therapy has led to increasing concerns about its safety. The off-target effects of CRISPR/Cas9-mediated gene editing may bring potential risks (Kuscu et al. 2014, Aquino-Jarquin 2021, Höijer et al. 2022). Therefore, genetic variant detection can be used as an unbiased method of detecting off-target effects at the whole genome level (Veres et al. 2014, Kim et al. 2015, Wang et al. 2021, Luo et al. 2023). The 1000 Genomes Project and the 100 000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations (Siva 2008, Peplow 2016). These studies described the distribution of genetic variation across the global sample and discussed the implications for common disease studies. Re-analysis of the genetic variant data generated by these projects may also lead to new biological insights.
To identify mutations in DNA sequencing data, a series of variant callers and computational pipelines have been developed, each with its own characteristics (Barnell et al. 2019, Cameron et al. 2019, Krusche et al. 2019, Koboldt 2020, He et al. 2021). Despite differences in calling algorithms and applications, most use genome sequencing data aligned to a reference as input and output single nucleotide variants and indels recorded in variant call format (VCF) (Danecek et al. 2011). The VCF file stores the details of variations, including the chromosome location, base sequence, base quality, read depth, and genotype. An annotated VCF file, such as one annotated by Annovar (Wang et al. 2010), also has information columns containing the corresponding gene name and genomic features. VCF files are usually used by end-users to search for variants of interest and evaluate the potential impact of these variants. Although some command line tools have been developed to filter, annotate, and visualize VCF files, these programs may require programming skills and a bioinformatics background, limiting their use by researchers without a computational background.
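To make the record layout described above concrete, the following minimal Python sketch parses one tab-separated VCF data line into the fields mentioned in the text (position, alleles, quality, INFO annotations, and per-sample genotype fields). It is an illustration only and is not part of VCFshiny, which is written in R. The names `VariantRecord` and `parse_vcf_line`, the example line, and the Annovar-style INFO key `Func.refGene` are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class VariantRecord:
    """One VCF data line, reduced to the fields discussed in the text."""
    chrom: str
    pos: int
    ref: str
    alts: List[str]
    qual: Optional[float]
    info: Dict[str, Union[str, bool]]
    genotypes: Dict[str, Dict[str, str]]


def parse_vcf_line(line, sample_names):
    """Parse a single tab-separated VCF data line (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    # The first eight columns are fixed: CHROM POS ID REF ALT QUAL FILTER INFO.
    chrom, pos, _id, ref, alt, qual, _filter, info = fields[:8]

    # INFO holds semicolon-separated key=value pairs; keys without "=" are flags.
    info_dict = {}
    for entry in info.split(";"):
        if entry in ("", "."):
            continue
        key, _, value = entry.partition("=")
        info_dict[key] = value if value else True

    # Column 9 (FORMAT) names the colon-separated fields of each sample column.
    genotypes = {}
    if len(fields) > 9:
        format_keys = fields[8].split(":")
        for name, sample in zip(sample_names, fields[9:]):
            genotypes[name] = dict(zip(format_keys, sample.split(":")))

    return VariantRecord(
        chrom=chrom,
        pos=int(pos),
        ref=ref,
        alts=alt.split(","),  # multiple ALT alleles are comma-separated
        qual=None if qual == "." else float(qual),
        info=info_dict,
        genotypes=genotypes,
    )


# Illustrative line (not real data); "Func.refGene" mimics an Annovar-style key.
EXAMPLE_LINE = (
    "chr1\t12345\t.\tA\tG\t50.0\tPASS\t"
    "DP=30;Func.refGene=exonic\tGT:DP\t0/1:30"
)
record = parse_vcf_line(EXAMPLE_LINE, ["sample1"])
print(record.chrom, record.pos, record.ref, record.alts, record.info)
```

In this sketch INFO values are kept as strings, since their types are declared in the VCF header, which the example does not read.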
Recently, many efforts have been made to develop graphical tools to process VCF files for researchers with limited bioinformatics backgrounds. Tools such as vcfView (O'Sullivan and Seoighe 2020), VCF/Plotein (Ossio et al. 2019), shinyCircos (Yu et al. 2018), shinyChromosome (Yu et al. 2019), BrowseVCF (Salatino and Ramraj 2017), and IGV (Thorvaldsdottir et al. 2013) have been developed to enable researchers to browse and filter variants in VCF files. However, they skip the annotation step, so users may need to annotate the VCF file with other annotation tools prior to use. Other tools, including VCF-Server (Jiang et al. 2019), VCF-Miner (Hart et al. 2016), and Ensembl-VEP (McLaren et al. 2016), focus on annotating and filtering variants but lack visualization functions for exploring the variant information. Moreover, some of these tools are obsolete or no longer maintained, making them unavailable. In addition, a major disadvantage of web tool solutions such as VEP is that the transmission of large amounts of genetic data over public networks raises confidentiality and performance issues and requires a dedicated server that may not be available to every end user.
To fill this void, we developed VCFshiny, an interactive R/Shiny application for analyzing and visualizing VCF files. It allows non-bioinformatician researchers to upload VCF files and to annotate and visualize detailed variant information without writing any code. VCFshiny allows users to annotate VCF files using Annovar or VariantAnnotation with commonly used databases. VCFshiny also accepts annotated VCF files for comparing and visualizing variants between different samples. Furthermore, VCFshiny supports the summarization of cancer driver gene-relevant variants and cancer mutational signatures, improving its ability to predict the biological consequences of variants. Collectively, it enables researchers without a bioinformatics background to explore and interpret variant data, thereby facilitating research in the field of genetics.

Graphical user interface
The program provides a graphical user interface that lets users interactively analyze and visualize variant information

Custom VCF
The program accepts a user-provided VCF file

No pre-processing steps
The program does not require the VCF file to be pre-processed or converted into a database format

Data annotation
The program can annotate the variant data in a VCF file using a variety of annotation databases

Sample repeatability analysis
The program can assess the reproducibility of sample replicates within each experimental group

SNV analysis
The program can extract SNVs from the variant data and analyze the types and frequencies of the base substitutions

Indel analysis
The program can extract indels from the variant data and analyze their lengths and frequencies

Genomic Circos plot
The program can show the distribution of SNVs and indels across the genome

Genomic feature analysis
The program can analyze the genomic features (functional regions) in which the variants are located

Key gene screening
The program can screen for high-frequency mutated genes in each VCF file based on genomic features

Screening for cancer driver genes
The program can screen for variants in potential cancer driver genes using a cancer database

Mutational signature analysis
The program can analyze the mutational signatures of selected samples

Available for free
The program can be used freely

Figure 1. Overview of the full workflow performed by VCFshiny (annotation and visualization of genetic variant data analysis). (A) The analysis pipeline consists of two function modules: (i) variant annotation, and (ii) variant data analysis. The variant annotation module is supported by Annovar and VariantAnnotation, allowing users to download annotation databases (such as dbSNP) and annotate variants with the corresponding genes, genomic regions, or related diseases. The variant data analysis module allows users to summarize the detailed information of VCF files and perform statistical analysis and comparison of variants between samples. (B) Result visualization. Once the analysis is done, users can interactively explore and export the results. For example, they can explore the total variant numbers (B1), base substitution bias of single nucleotides (B2), length distribution of indels (B3), location of variants in the genome and genome features (B4), variants in cancer driver genes (B5), and cancer mutational signatures (B6). The variant dataset used in this figure is an RNA-seq dataset of three breast cancer subtypes (TNBC, Non-TNBC, and HER2-positive) and normal human breast organoid (epithelium) samples (NBS), available under GEO accession number GSE52194 (Horvath et al. 2013).
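The summaries shown in panels B2 and B3 rest on simple tallies over REF/ALT allele pairs. As a rough illustration, the Python sketch below classifies each pair as an SNV, insertion, or deletion, then counts base substitution types and signed indel lengths. It is a sketch under stated assumptions and not VCFshiny's R implementation: the function names `classify_variant` and `summarize_variants`, the rule for skipping symbolic alleles, and the example allele pairs are all hypothetical.

```python
from collections import Counter


def classify_variant(ref, alt):
    """Classify one REF/ALT pair by allele length (hypothetical helper)."""
    # Symbolic or spanning alleles such as "<DEL>" or "*" carry no sequence,
    # so this sketch sets them aside rather than guessing a length.
    if alt.startswith("<") or alt == "*":
        return "other"
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "other"  # equal-length multi-base substitutions


def summarize_variants(variants):
    """Count substitution types (e.g. "A>G") and signed indel lengths."""
    substitutions = Counter()
    indel_lengths = Counter()
    for ref, alt in variants:
        kind = classify_variant(ref, alt)
        if kind == "SNV":
            substitutions[f"{ref}>{alt}"] += 1
        elif kind in ("insertion", "deletion"):
            # Positive lengths are insertions, negative lengths are deletions.
            indel_lengths[len(alt) - len(ref)] += 1
    return substitutions, indel_lengths


# Illustrative allele pairs (not taken from GSE52194).
EXAMPLE_VARIANTS = [("A", "G"), ("A", "G"), ("C", "T"), ("AT", "A"), ("G", "GCC")]
substitutions, indel_lengths = summarize_variants(EXAMPLE_VARIANTS)
print(dict(substitutions))
print(dict(indel_lengths))
```

Counting per sample and passing the resulting tables to a plotting library would give bar charts comparable in spirit to B2 and B3. How VCFshiny itself groups or normalizes these counts is not specified here.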

Table 1. Comparison of applications for analyzing, filtering, annotating, and visualizing VCF files.