vcfpp: a C++ API for rapid processing of the variant call format

Abstract Motivation Given the widespread use of the variant call format (VCF/BCF) coupled with continuous surge in big data, there remains a perpetual demand for fast and flexible methods to manipulate these comprehensive formats across various programming languages. Results This work presents vcfpp, a C++ API of HTSlib in a single file, providing an intuitive interface to manipulate VCF/BCF files rapidly and safely, in addition to being portable. Moreover, this work introduces the vcfppR package to demonstrate the development of a high-performance R package with vcfpp, allowing for rapid and straightforward variants analyses. Availability and implementation vcfpp is available from https://github.com/Zilong-Li/vcfpp under MIT license. vcfppR is available from https://cran.r-project.org/web/packages/vcfppR.


Introduction
Computational biologists have made numerous efforts and contributions to facilitate analyses of genomic variants.The variant call format (VCF) has become the standard for storing genetic variant information, consisting of detailed specifications (Danecek et al. 2011).With big data on the rise, the binary variant call format (BCF) was later designed to query and store large datasets efficiently.The C API of HTSlib (Bonfield et al. 2021) provides a full set of functionalities to manipulate the VCF/BCF for both compressed and uncompressed files.Given that the C API is challenging for less proficient programmers to use, APIs derived from other languages have been created to fill in the gap.Existing popular libraries include vcfR (Knaus and Gru ¨nwald 2017) for R, cyvcf2 (Pedersen and Quinlan 2017) for Python, hts-nim (Pedersen and Quinlan 2018) for Nim, and vcflib (Garrison et al. 2022) for Cþþ.All are valuable to each respective community, but not without their disadvantages.In particular, vcflib is both an API and a large collection of command line tools, with the primary pitfall being that it does not support the BCF format.It is noteworthy that many methods written in Cþþ designed for large sample size cannot input the compressed VCF or BCF, as in the case of Syllbable-PBWT (Wang et al. 2023).The motivation behind vcfpp is to offer full functionalities as HTSlib, and provide a simple and safe API in a single header file, which can be easily integrated for programming in Cþþ as well as other languages that can call C/Cþþ codes, such as R (https://www.r-project.org/) and Python (https://www.python.org/).

The VCF and HTSlib
A VCF file consists of a header section and a body section.The header contains lines with meta-information, each starting with the characters "##" or "#," while the body consists of TAB-delimited lines.A VCF file differs from a typical TAB-delimited file in several ways.Firstly, the header is too important to be ignored.A VCF file that contains a tag in the body without its declaration in the header violates the specification.Secondly, the VCF can be compressed and randomly accessed using bgzip and tabix, which were originally part of SAMtools library (Li et al. 2009) but were later separated into HTSlib (Bonfield et al. 2021).Additionally, the VCF standard is established and regularly updated by the SAMtools team.Therefore, it would be more advantageous to use a specialized library, namely HTSlib, to process the VCF rather than a customized parser.HTSlib offers full functionalities to interact with the VCF, including support for BCF, format validation, compression, random access, identification of variant types, and support for URL links as filenames, among others.Furthermore, HTSlib has demonstrated the best performance among many competitors (Bonfield et al. 2021).However, HTSlib was written in C language, which can be challenging for beginner programmers to use, especially when it comes to memory management.

A C11 API
As a Cþþ API of HTSlib, vcfpp inherits all functionalities from HTSlib and is implemented in a single header file that can be easily integrated and used safely.There are four core classes in vcfpp, as summarized in Table 1, all of which allocate and free memory automatically and safely, enabling users to program with ease.To illustrate commonly used features and demonstrate the simplicity of vcfpp, here, I showcase an example that will be used throughout the article.In Listing 1, we count the number of heterozygous sites for each sample in a VCF file.The following code first includes a single vcfpp.hfile (Line 1), opens a compressed VCF file constrained to three samples and region "chr21" (Line 2), and creates a variant record associated with the header information in the VCF file (Line 3).Then, it defines several types of objects to collect the results we want (Lines 4 and 5).Taking advantage of generic templates in Cþþ, the genotype (GT) value can be of bool, char, or int type so that users may control the memory consumed by their program.Then, it iterates over the VCF file and processes each variant record in the loop (Line 6).Here, we ignore variants of other types (INDEL, SV), or if FILTER does not display "PASS," or if the QUAL value is smaller than nine (Lines 8 and 9), though the API also allows us to do more complicated filtering should.Finally, we count the number of heterozygous variants for each diploid sample (Lines 10 and 11).The core is only 12 lines.
Listing 1: Counting the number of heterozygous GTs for three samples on chr21

Results
To demonstrate the simplicity, portability, and performance of vcfpp, the following sections include the benchmarking results and highlight the vcfppR package as an example of vcfpp working with R (R Core Team 2023).

Working with R
While vcfpp is very simple for writing a Cþþ program, a single Cþþ header file can be easily integrated into popular script languages like R and Python.Particularly, R is designed for statistical modeling and visualization, widely used for data analyses.Therefore, I developed the vcfppR package to demonstrate how vcfpp can seamlessly work with R using Rcpp (Eddelbuettel and Francois 2011).For instance, with the basic familiarity of Cþþ and Rcpp, we can turn the Cþþ code in Listing 1 into an Rcpp function to return the heterozygosity counted per sample along with the sample's name (Listing 2), which can then be compiled and called dynamically in R using sourceCpp (Listing 3).As such, we can further process and visualize these results straightforwardly within the R ecosystem.For example, we can analyze the data by genomic region in parallel using the parallel package and then stratify the results by population, leveraging the external labels of each sample, to finally visualize them in R.

The vcfppR package
The vcfppR package is developed and powered by the vcfpp API.To parse the VCF with vcfppR, the vcfreader and vcftable functions can rapidly read contents of the VCF into the R data types with fine control over the region, samples, variant types, FORMAT column, and filters.For instance, the code in Listing 4 parses the read depth per variant (DP) in the raw called VCF by the 1000 Genomes Project through the URL link.It restricts the analysis to three samples with variants in "chr21:1-10000000" region of SNP type, passing the FILTER, and discarding the INFO column in the returned list.Subsequently, the visual summary can be generated by using boxplot() in R (see Fig. 1).
Listing 4: Example of vcfppR::vcftable Furthermore, as characterizing variants is an essential task in genomic analyses, I showcase the vcfsummary function in Fig. 1, which summarizes the variants found in the latest VCF released by the 1000 Genome Project (Byrska-Bishop et al. 2022).

Benchmarking
In addition to simplicity and portability, I showcase here the performance of vcfpp and vcfppR.For the benchmarking, I developed scripts (https://github.com/Zilong-Li/vcfpp/tree/main/scripts) to perform a common analysis of counting heterozygous GTs per sample on a Linux server with AMD EPYC 7643 48-Core Processor.As shown in Table 2, when using the compressed (gzipped) VCF of 3202 samples and

Figure 1 .
Figure 1.Analyzing variants discovered in the 1000 Genome Project with vcfppR (see code in the Supplementary Material).

Table 1 .
(Davies et al. 2021)nd implemented Cþþ class.variants,whichincludesonlyGT in the FORMAT, the Rcpp function vcfppR::heterozygosity in Listing 2 demonstrates comparable performance to the compiled Cþþ code of vcfpp in Listing 1 with a minor overhead.This overhead is attributed to a list of sample names being returned to R. The dynamic script using vcfppR::vcfreader is only 1:3Â slower than its compiled Cþþ counterpart, whereas the cyvcf2::VCF is 1:9Â slower.With the streaming strategy, all scripts use little RAM, given that they only load one variant into memory at a time.However, R packages like vcfR and data.tableusuallyloadallVCFdatainto memory first and perform analyses later, which is referred here as the "two-step" strategy.To this end, I have also developed vcftable function in vcfppR to load the entire contents of the VCF into R for the two-step comparison.Notably, the vcfppR::vcftable is only 2:0Â slower compared to the 19Â slower vcfR::read.vcfR and the 119Â slower data.table::fread.This discrepancy arises because the GT values returned by both vcfR and data.tablearecharacters,whichareinefficient to further process in R. On the other hand, with vcfppR, an integer matrix of GTs can be returned to R directly for fast computation.If we exclude the elapsed time of loading data, which means ignoring time marked by the *, then vcfppR demonstrates a 101Â speed improvement over vcfR.Importantly, vcfpp and vcfppR offer users the full functionalities of HTSlib, including support for compressed VCF/BCF, selection of samples, regions, and variant types.Here, I have developed vcfpp, a fast and flexible Cþþ API for high-performance genetic variant analyses with the VCF/ BCF input.Its simplicity and portability can be very valuable for both developing packages and writing in-house scripts.The vcfppR package is a great example of vcfpp working with R, and there are also examples of developing a Python API for vcfpp available on GitHub.In vcfppR, there are vcfreader and vcftable functions that can process the variants.The vcftable is specifically designed for reading the VCF into the tabular data structure in R, but it can only read a single FORMAT item in one pass over the VCF.In contrast, vcfreader serves as the full R-bindings of vcfpp and allows iterative parsing of variants, giving users the flexibility to decide the information to retrieve for each variant.As such, many packages written in Cþþ using a customized VCF parser can be simply replaced with vcfpp to offer more functionalities.For instance, vcfpp can be found successfully implemented in the imputation software programs STITCH(Davies et al. 2016)and QUILT(Davies et al. 2021)to parse large reference panels in the compressed VCF/BCF.

Table 2 .
Performance of counting heterozygous GTs per sample with VCF of 3202 samples and 1 002 753 variants.a a þ11243 119.1 77.3 Two-step Time is in seconds.RAM is in gigabytes.aUsed by loading data in two-step strategy.