Mete Akgün, Hüseyin Demirci, VCF-Explorer: filtering and analysing whole genome VCF files, Bioinformatics, Volume 33, Issue 21, 1 November 2017, Pages 3468–3470, https://doi.org/10.1093/bioinformatics/btx422
Abstract
The decreasing cost of high-throughput technologies has led to a number of sequencing projects consisting of thousands of whole genomes. The paradigm shift from exome to whole genome brings a significant increase in the size of output files. Most existing tools, which were developed to analyse exome files, are not adequate for the larger VCF files produced by whole genome studies. In this work we present VCF-Explorer, a variant analysis software capable of handling large files. Its memory efficiency and its avoidance of a computationally costly pre-processing step allow the analysis to be performed on ordinary computers. VCF-Explorer provides an easy-to-use environment where users can define various types of queries based on variant-level and sample genotype-level annotations. VCF-Explorer can be run in different environments and on computational platforms ranging from a standard laptop to a high-performance server.
VCF-Explorer is freely available at: http://vcfexplorer.sourceforge.net/.
Supplementary data are available at Bioinformatics online.
1 Introduction
As the goal of sequencing a whole genome for under 1000 USD becomes a reality, the use of whole genome sequencing in medicine has increased progressively. Large-scale genome projects such as Genomics England (http://www.genomicsengland.co.uk/) have been launched with the aim of better understanding health and disease using the information gathered from thousands of genomes. Various types of software have been developed to analyse data obtained from next generation sequencing (NGS) studies. Whole genome VCF files are on average ten times larger than exome VCF files. However, most existing tools do not account for the increase in file size in whole genome studies and hence become inadequate because of extra requirements such as large memory. There is an increasing need for efficient software tools for whole genome data analysis.
Different platforms have been proposed for the analysis of human NGS data. GEMINI (Paila et al., 2013) and CanvasDB (Ameur et al., 2014) provide a database framework in which VCF files are first pre-processed, after which queries can be performed efficiently. An experienced user can perform filtering operations in a short time on the database-structured data. However, managing and querying these databases may require an experienced bioinformatician, which limits their use by researchers from a non-IT background. VarSifter (Teer et al., 2012) is a powerful, publicly available tool that allows users to define a broad range of queries through a graphical user interface (GUI). However, its ease of use is limited by the available memory: VarSifter loads the whole input file into memory before any analysis can begin, so it takes a long time, or fails outright, when opening a VCF file larger than the available memory. VCF-Miner (Hart et al., 2016) is a recently proposed tool based on the MongoDB database engine with an easy-to-use interface. However, it takes hours to load a large VCF file into the program. BrowseVCF (Salatino and Ramraj, 2016) is a recently proposed variant filtering tool that uses Wormtable as its data storage system.
In this work we present VCF-Explorer, a software tool designed to analyse large VCF files. VCF-Explorer is lightweight and can process very large files without running out of memory. The GUI makes it easy to filter variants with respect to sample-level or variant-level annotations, so clinicians and researchers without an IT background can easily use the tool. There is no initial file loading or pre-processing, which saves CPU time and memory. It can be used on a simple personal computer or on a large server with greater capabilities. We provide performance metrics for VCF-Explorer by analysing large public and in-house data in different computational environments, and we compare the usage of the software with existing tools in detail.
2 Methods
2.1 VCF pre-processing
NGS analysis tools such as GEMINI (Paila et al., 2013), CanvasDB (Ameur et al., 2014) and VCF-Miner (Hart et al., 2016) maintain a database structure in order to handle any kind of VCF file. Before a file can be analysed, these programs require a pre-processing step that may take a long time for large input files. VarSifter (Teer et al., 2012), on the other hand, loads the VCF file into memory and performs the filtering operations there. VarSifter requires approximately three times as much memory as the file size, so working with large files becomes impractical on ordinary computers.
In contrast to the existing programs, VCF-Explorer neither pre-processes the data nor loads it into memory. Instead, it processes the file line by line for each query. This approach eliminates the need for a large amount of memory or for a database structure. Efficient file reading, writing and filtering allow each query to be executed in a reasonable time, which makes it possible to analyse whole genome files easily on an ordinary computer. A general comparison of VCF-Explorer with some of the existing software is presented in Table 1 and Table 2.
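The line-by-line approach described above can be illustrated with a minimal C++ sketch. It is not taken from the VCF-Explorer source code: the function names `filterVcfStream` and `isPass` are hypothetical, and the example predicate (keep records whose FILTER column is `PASS`) is chosen only for illustration. The point is that only the current line is held in memory, so memory use does not grow with file size.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper (not VCF-Explorer's actual code): copy header lines and
// every record accepted by `keep` from `in` to `out`, one line at a time.
// Returns the number of records that were kept.
std::size_t filterVcfStream(std::istream& in, std::ostream& out,
                            const std::function<bool(const std::string&)>& keep) {
    assert(static_cast<bool>(keep));  // a predicate must be supplied
    std::size_t kept = 0;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;   // skip blank lines
        if (line[0] == '#') {         // meta-information and column header lines
            out << line << '\n';
            continue;
        }
        if (keep(line)) {
            out << line << '\n';
            ++kept;
        }
    }
    return kept;
}

// Example predicate: keep records whose FILTER column (the 7th tab-separated
// field of a VCF record) is exactly "PASS".
bool isPass(const std::string& record) {
    std::size_t start = 0;
    for (int field = 0; field < 6; ++field) {   // skip CHROM..QUAL
        start = record.find('\t', start);
        if (start == std::string::npos) return false;
        ++start;
    }
    const std::size_t end = record.find('\t', start);
    const std::size_t len = (end == std::string::npos) ? std::string::npos : end - start;
    return record.substr(start, len) == "PASS";
}
```

In practice `in` and `out` would be `std::ifstream` and `std::ofstream` objects opened on the input and output files; the same function works unchanged because it depends only on the stream interfaces.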
Table 1. Comparison of VCF-Explorer with existing tools
| Tool | Large file handling | Elimination of informatics expertise | GUI | Elimination of pre-processing | Elimination of large memory requirement | Robust query definition | Custom VCF support | SQL-like queries | Sorting | Web based | Indexing |
|---|---|---|---|---|---|---|---|---|---|---|---|
| VarSifter | − | + | + | − | − | + | + | − | + | − | − |
| VCF-Miner | + | + | + | − | + | − | + | − | − | + | + |
| GEMINI | + | − | − | − | + | + | − | + | + | − | + |
| BrowseVCF | + | + | + | − | + | − | + | − | − | + | + |
| VCF-Explorer | + | + | + | + | + | + | + | − | − | − | − |
Table 2. Filtering options of VCF-Explorer and existing tools
| Tool | Per variant: Location | Per variant: Gene name | Per variant: Mutation type | Per variant: MAF | Per variant: A | Per variant: B | Logical operators | Per sample: GT | Per sample: GQ | Per sample: DP |
|---|---|---|---|---|---|---|---|---|---|---|
| VarSifter | + | + | + | + | + | + | &, \|, ⊕ | + | − | − |
| VCF-Miner | + | + | + | + | + | − | & | + | − | − |
| BrowseVCF | + | + | + | + | + | − | & | + | − | − |
| VCF-Explorer | + | + | + | + | + | + | &, \|, ⊕ | + | + | + |
A: Exact keyword search in annotation fields.
B: Partial keyword search in annotation fields.
& = AND, | = OR, ⊕ = XOR.
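To make the legend concrete, the following C++ sketch shows one way exact (A) and partial (B) keyword searches could be expressed as predicates and combined with AND, OR and XOR. All names here (`Predicate`, `exactMatch`, `partialMatch`, `allOf`, `anyOf`, `exactlyOne`) are hypothetical and are not taken from any of the tools in the table; the annotation strings are illustrative only.

```cpp
#include <cassert>
#include <functional>
#include <string>

// A filter is modelled as a function from an annotation value to true/false.
using Predicate = std::function<bool(const std::string&)>;

// A: exact keyword search, the whole annotation value must equal the keyword.
Predicate exactMatch(const std::string& keyword) {
    return [keyword](const std::string& value) { return value == keyword; };
}

// B: partial keyword search, the keyword may occur anywhere in the value.
Predicate partialMatch(const std::string& keyword) {
    return [keyword](const std::string& value) {
        return value.find(keyword) != std::string::npos;
    };
}

// & (AND): both filters must accept the value.
Predicate allOf(Predicate a, Predicate b) {
    return [a, b](const std::string& v) { return a(v) && b(v); };
}

// | (OR): at least one filter must accept the value.
Predicate anyOf(Predicate a, Predicate b) {
    return [a, b](const std::string& v) { return a(v) || b(v); };
}

// XOR: exactly one of the two filters must accept the value.
Predicate exactlyOne(Predicate a, Predicate b) {
    return [a, b](const std::string& v) { return a(v) != b(v); };
}

// Small usage example with made-up annotation values.
void demoKeywordFilters() {
    const Predicate missense = exactMatch("missense_variant");
    const Predicate anyVariant = partialMatch("variant");
    assert(missense("missense_variant"));
    assert(!missense("missense_variant&splice_region_variant"));
    assert(anyVariant("missense_variant&splice_region_variant"));
    assert(exactlyOne(missense, anyVariant)("intron_variant"));
}
```

Because each combinator returns another `Predicate`, filters can be nested to arbitrary depth, which is one simple way to support the kind of compound queries the table refers to.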
2.2 Program usage
The software is designed to provide an easy-to-use environment through its GUI. The GUI is divided into three main menus: File Chooser, Filters and Variant Viewer. A detailed explanation of the program usage is provided in the Supplementary Material. Figure 1 shows the workflow of VCF-Explorer.
VCF-Explorer is developed in C++ with the Qt framework and can be executed under the Windows and Linux operating systems. The GUI of the program is developed using Qt Creator 3.1.2, based on Qt 5.3.1. All necessary external libraries are bundled with the installer, which simplifies installation.
3 Results
We have demonstrated the effectiveness of VCF-Explorer on public VCF files of 1, 3.1 and 10 GB from the 1000 Genomes Project (1000 Genomes Project Consortium, 2015) and on in-house VCF files of 10, 26, 50 and 100 GB. We ran the software on a desktop PC. We executed a standard query (find all loss-of-function variants with minor allele frequency less than 0.01 that are either heterozygous or homozygous variant) and produced the output file. We observe that the execution time of VCF-Explorer is spent mainly on file reading and writing operations, not on the filtering. The results are summarized in Supplementary Tables S1 and S2. We have also compared the execution time of VCF-Explorer with that of VarSifter and VCF-Miner in Supplementary Table S3. We observe that when the VCF file is larger than 10 GB, which is typical for whole genome studies, the alternative tools need hours before an analysis can be performed. In contrast, VCF-Explorer executes each query in minutes, if not seconds.
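As an illustration only, the standard query above could be written as a per-record predicate along the following lines. This sketch rests on several assumptions that are not stated in the article: the allele frequency is read from an INFO key named `AF`, loss-of-function variants are recognised by Sequence Ontology consequence terms in an INFO key named `CSQ`, and a record matches if at least one sample is heterozygous or homozygous variant. The function names are hypothetical, and the real annotation keys depend on how the VCF file was annotated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Split a string on a single-character delimiter.
std::vector<std::string> split(const std::string& s, char delim) {
    std::vector<std::string> parts;
    std::string item;
    std::istringstream stream(s);
    while (std::getline(stream, item, delim)) parts.push_back(item);
    return parts;
}

// Return the value of `key` in a semicolon-separated INFO string, or "" if absent.
std::string infoValue(const std::string& info, const std::string& key) {
    const std::string prefix = key + "=";
    for (const std::string& entry : split(info, ';')) {
        if (entry.compare(0, prefix.size(), prefix) == 0) return entry.substr(prefix.size());
    }
    return "";
}

// True when a GT value such as "0/1" or "1|1" contains a non-reference allele.
// The reference allele is always written "0" and a missing allele ".", so any
// digit from 1 to 9 implies a non-zero allele index.
bool carriesVariant(const std::string& gt) {
    for (char c : gt) {
        if (c >= '1' && c <= '9') return true;
    }
    return false;
}

// Assumed consequence terms treated as loss of function (illustrative list).
const char* const kLofTerms[] = {"stop_gained", "frameshift_variant",
                                 "splice_acceptor_variant", "splice_donor_variant"};

// Hypothetical predicate for the standard query: loss of function, AF < 0.01,
// and at least one sample heterozygous or homozygous variant.
bool matchesStandardQuery(const std::string& record) {
    const std::vector<std::string> cols = split(record, '\t');
    if (cols.size() < 10) return false;                  // need FORMAT plus one sample
    const std::string af = infoValue(cols[7], "AF");     // first value if multi-allelic
    if (af.empty() || std::strtod(af.c_str(), nullptr) >= 0.01) return false;

    const std::string csq = infoValue(cols[7], "CSQ");
    bool lossOfFunction = false;
    for (const char* term : kLofTerms) {
        if (csq.find(term) != std::string::npos) lossOfFunction = true;
    }
    if (!lossOfFunction) return false;

    if (cols[8].compare(0, 2, "GT") != 0) return false;  // GT must be the first FORMAT key
    for (std::size_t i = 9; i < cols.size(); ++i) {
        const std::string gt = cols[i].substr(0, cols[i].find(':'));
        if (carriesVariant(gt)) return true;
    }
    return false;
}
```

A predicate of this shape does only a few string scans per record, which is consistent with the observation that run time is dominated by reading and writing the file rather than by the filtering itself.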
We conclude that VCF-Explorer has major advantages over existing tools for processing large VCF files. It provides a powerful alternative environment for whole genome studies. In future work, we will add several features to VCF-Explorer. We will implement a Query Manager tool that enables queries and worksheets to be saved, so that users can reuse a previous configuration of the program.
Conflict of interest: none declared.
References
