Pygenomics: manipulating genomic intervals and data files in Python

Abstract

Summary: We present pygenomics, a Python package for working with genomic intervals and bioinformatic data files. The package implements interval operations, provides both an API and a CLI, and supports reading and writing data in widely used bioinformatic formats, including BAM, BED, GFF3, and VCF. The source code of pygenomics is provided with in-source documentation and type annotations and adheres to the functional programming paradigm. These features facilitate seamless integration of pygenomics routines into scripts and pipelines. The package is implemented in pure Python using only its standard library and includes a property-based testing framework. A comparison of pygenomics with other Python bioinformatic packages with respect to features and performance is presented. The performance comparison covers operations with genomic intervals, read alignments, and genomic variants and demonstrates that pygenomics is suitable for computationally efficient analysis.

Availability and implementation: The source code is available at https://gitlab.com/gtamazian/pygenomics.


Package features
In this section, we present a head-to-head feature comparison (where possible) between pygenomics and the aforementioned packages.

Data formats
Pygenomics supports reading and writing data in multiple bioinformatic formats in accordance with their specifications. Some output formats, however, are software-specific; therefore, routines related to these formats are implemented in the separate extension package pygenomics-ext (see Subsection 3.1). The formats supported by pygenomics and their implementation statuses in other Python packages are listed in Table 1.

Implementation specifics
Pure Python implementation. Pygenomics is implemented in pure Python using routines from the Python standard library only. Importantly, clean Python code with no dependencies on third-party libraries or compilers facilitates easy deployment and maintenance of pygenomics and also enables a performance boost when using PyPy [5]. Unlike pygenomics, the other packages considered here include code fragments in other programming languages or require external libraries:
• pysam requires the HTSlib library [6] for parsing files;
• pybedtools calls an external program, BEDTools [7], for operations with genomic intervals;
• cyvcf2 is implemented in Cython [8] and uses routines from HTSlib;
• PyRanges is based on the pandas library [9] and, consequently, depends on the NumPy library [10].
Functional programming paradigm. Pygenomics routines are designed in accordance with the functional programming paradigm: objects are immutable, and functions produce no side effects (i.e., they are pure functions), except for routines that raise exceptions or perform input-output. For that purpose, we utilized the means for functional programming provided by Python: first-class and higher-order functions, immutable tuples, enumerations, and iterators. Immutability guarantees that pygenomics objects cannot violate the invariants of their classes after creation. Immutability also provides thread safety and thus facilitates parallel programming.
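As a minimal sketch of this idea (using a frozen dataclass rather than pygenomics' actual classes, whose definitions are not shown here), immutability lets a class invariant be checked once at construction and never violated afterwards:

```python
from dataclasses import dataclass

# Illustration only: an immutable interval type whose invariant
# cannot be violated after the object is created.
@dataclass(frozen=True)
class Interval:
    start: int
    end: int

    def __post_init__(self) -> None:
        # The invariant is checked exactly once, at construction time.
        if not (0 <= self.start <= self.end):
            raise ValueError(f"invalid interval: [{self.start}, {self.end})")

iv = Interval(10, 20)
# Mutating a frozen instance raises FrozenInstanceError, so the
# invariant established above can never be broken later.
try:
    iv.start = -5  # type: ignore[misc]
except Exception as exc:
    print(type(exc).__name__)  # FrozenInstanceError
```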
Due to the absence of side effects, combining pure functions yields a pure function as well. Because pure functions return their result, the returned object can be passed on to another function, enabling function call chaining. Besides combining and chaining, pure functions in pygenomics can be optimized by replacing their Python implementations with calls to external libraries implemented in other programming languages (e.g., C or Fortran).
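A toy illustration of such chaining, with hypothetical pure interval functions (plain tuples stand in for interval objects; these are not pygenomics routines):

```python
# Two pure functions on (start, end) tuples; because neither has side
# effects, their composition is itself pure and calls can be chained.
def shift(iv, offset):
    start, end = iv
    return (start + offset, end + offset)

def clamp(iv, size):
    start, end = iv
    return (max(0, start), min(size, end))

# The result of one pure function feeds directly into the next.
result = clamp(shift((90, 120), 10), 110)
print(result)  # (100, 110)
```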
Other packages discussed here use regular Python classes with methods that allow in-place modification of their objects. In particular, PyRanges objects are based on pandas data frames, whose methods modify their contents in place, destroying the previous state.
Property-based testing framework. In addition to source code validation, a testing framework can be used both for deploying a package and for maintaining its integrity. When deploying a package, an administrator launches tests to ensure that the environment is suitable for the package. Whenever a developer introduces changes to the code, tests can be run to check that the proposed modifications do not break other components of the package.
Due to pygenomics' adherence to the functional programming paradigm, the package is provided with a property-based testing framework as implemented in the Hypothesis library [11]. Hypothesis allows test cases to be created by specifying properties of the routines being tested and by passing randomly generated data to the routines to check whether the properties hold. Randomly generated test data eases development and deployment of pygenomics, since a developer does not need to design tests manually. Moreover, edge cases of input parameters to the tested routines are covered automatically.
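The idea can be sketched with plain standard-library code (a simplified stand-in for what Hypothesis automates; Hypothesis additionally provides input strategies, shrinking of failing examples, and systematic edge-case coverage). The interval intersection function below is invented for the sketch:

```python
import random

# Property-based testing in miniature: generate random inputs and check
# that stated properties of the routine under test hold for each.
def intersect(a, b):
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start < end else None

def random_interval(rng, limit=1000):
    start = rng.randrange(limit)
    return (start, start + 1 + rng.randrange(limit))

rng = random.Random(0)
for _ in range(1000):
    a, b = random_interval(rng), random_interval(rng)
    # Property 1: intersection is commutative.
    assert intersect(a, b) == intersect(b, a)
    # Property 2: a non-empty intersection lies within both inputs.
    r = intersect(a, b)
    if r is not None:
        assert a[0] <= r[0] and r[1] <= a[1]
        assert b[0] <= r[0] and r[1] <= b[1]
print("all properties held")
```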
Pysam, pybedtools, and cyvcf2 use test cases based on static testing datasets provided with the packages. PyRanges includes a property-based testing framework that, like pygenomics, uses Hypothesis. However, its implementation is less efficient because PyRanges does not follow the functional programming paradigm and contains complex classes whose random sampling requires considerable computation and time.
Type annotations and static type checking. Although Python is a dynamically typed language, it allows a developer to annotate object types. In addition to source code validation, the type annotations facilitate code development by providing hints about package entities and by enabling detection of errors caused by passing parameters of incorrect types to the package routines.
Every entity in the source code of pygenomics has its type explicitly annotated. Consistency between entity types is checked by the mypy static type checker [12]; this procedure does not require calling any pygenomics routines. The type annotations are also installed when pygenomics is deployed, allowing users to refer to the types of pygenomics routines when developing their own code.
The other packages do not have type annotations.
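A small sketch of the annotated style described above; parse_length is an invented example, not a pygenomics routine. Running mypy over such code catches, without executing it, callers that forget to handle the Optional result:

```python
from typing import Optional

def parse_length(field: str) -> Optional[int]:
    """Return the field as a non-negative length, or None if invalid."""
    try:
        value = int(field)
    except ValueError:
        return None
    return value if value >= 0 else None

length = parse_length("150")
# Without the None check, mypy would flag `length + 1` as an error,
# because `length` may be None.
if length is not None:
    print(length + 1)  # 151
```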
Both API and CLI provided. Pygenomics provides both an application programming interface (API) and a command-line interface (CLI) to its routines. The API gives access to all routines of the package, and the CLI enables running the package routines as ordinary programs outside a development environment. This is a significant advantage, as the packages mentioned above do not provide a CLI.

Consistent API.
The API of pygenomics is developed with the intention of keeping it consistent and easy to comprehend. The following principles are observed to maintain the API integrity:
1. Routines for parsing data are organized into modules or subpackages (e.g., subpackage pygenomics.vcf contains routines related to reading various parts of a VCF file).
2. Exception classes are organized into a downward hierarchy that starts from the base class for exceptions in the package (pygenomics.error.Error). Base exception classes are also present in modules and subpackages of pygenomics (e.g., pygenomics.vcf.error.Error). Classes for specific exceptions inherit from their base classes to allow catching the exceptions at any level of the package (e.g., exceptions for a specific subpackage or all exceptions possibly generated by pygenomics).
3. Routines shared between modules and subpackages are organized into a downward hierarchy of modules named common (e.g., pygenomics.common). Similar to the exception classes, the modules for common routines import only higher-level modules to avoid circular dependencies (e.g., pygenomics.vcf.meta.common imports pygenomics.common).
(b) Names of classes representing module-level errors end with Error. Such classes might be accompanied by enumeration-based classes whose names end with ErrorType and that allow the specification of the error type (e.g., pygenomics.fastq.ReadingError and pygenomics.fastq.ReadingErrorType).
(c) Methods that implement conversion of a line or a list of strings into class objects are called of_string() or of_list(), respectively. These methods return the class object if the conversion was successful, or None otherwise.
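For illustration, a hypothetical class following this convention (BedRange and its fields are invented for the sketch and are not pygenomics' actual API):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class BedRange:
    chrom: str
    start: int
    end: int

    @classmethod
    def of_list(cls, fields: List[str]) -> Optional["BedRange"]:
        # Return a class object on success, None on any failure.
        if len(fields) < 3:
            return None
        try:
            start, end = int(fields[1]), int(fields[2])
        except ValueError:
            return None
        return cls(fields[0], start, end) if 0 <= start <= end else None

    @classmethod
    def of_string(cls, line: str) -> Optional["BedRange"]:
        return cls.of_list(line.rstrip("\n").split("\t"))

print(BedRange.of_string("chr1\t100\t200"))
# BedRange(chrom='chr1', start=100, end=200)
print(BedRange.of_string("malformed line"))  # None
```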
Pysam and cyvcf2 wrap code from HTSlib and their APIs depend on the HTSlib API, which is implemented in C. The API of pybedtools is designed to pass data to programs from BEDTools and parse their output. The API of PyRanges seems to be influenced by pandas' API and is based on the data frame objects.
Structured CLI. The CLI of pygenomics organizes its commands into categories by the kind of data the commands are applied to (e.g., file formats or operation types). Each command is provided with a help message that describes its command-line arguments. The CLI commands are described in Subsection 3.2.
The other packages we consider here do not provide CLI.

Source code design.
Pygenomics is developed following best practices for Python programming: arranging routines in subpackages and modules, annotating types, documenting entities in their source code, and, finally, assessing code complexity to avoid bloated classes or functions. The development framework established for pygenomics is described in Subsection 1.3. The source code of pysam, cyvcf2, and pybedtools is affected by their dependencies on external tools or libraries (that is, BEDTools and HTSlib). PyRanges is based on the pandas package, and thus its source code adheres to the design principles used to develop pandas and NumPy. These principles aim to maximize the performance of operations on large data sets (series and data frames in pandas and multidimensional arrays in NumPy) by allowing modification of existing instances of the data objects. Such in-place modifications increase performance but also introduce side effects that may have undesirable consequences. For example, pandas (version 1.5.1) is known not to be completely thread safe [13]. Furthermore, NumPy introduces its own development roadmap and a series of enhancement proposals called NEPs (NumPy Enhancement Proposals) [14] in addition to the roadmap and proposals for Python [15]. Tracking NEPs might be excessive for a user or developer of bioinformatic software that does not explicitly involve operating with multidimensional arrays.

Stream-based input and output.
Pygenomics routines use streams for specifying sources and destinations of input and output (I/O). Using streams provides a user with a flexible way to process I/O errors outside the routines. The streams can be connected to files, network resources or located in memory.
The other packages mentioned here use file names for specifying I/O sources.
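A short sketch of the benefit: a routine that accepts any text stream works equally well with a file object or an in-memory buffer (count_records is an invented example, not a pygenomics routine):

```python
import io

# Because the routine takes a stream rather than a file name, the
# caller decides where the data comes from and how I/O errors are
# handled outside the routine.
def count_records(stream) -> int:
    """Count non-empty, non-comment lines in a text stream."""
    return sum(1 for line in stream
               if line.strip() and not line.startswith("#"))

# The same routine consumes an in-memory stream without any changes.
data = io.StringIO("# comment\nchr1\t100\t200\nchr1\t300\t400\n")
print(count_records(data))  # 2
```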
Flexible parsing of bioinformatic data. Methods in pygenomics' classes that parse bioinformatic data from a line or a series of strings return None if the parsing fails. This behavior allows a user to specify their own strategy for dealing with data that cannot be parsed: e.g., terminate the program, collect the parsing failures for further processing, try to correct the failures, or ignore them completely. A user can also create an intermediate function for processing non-standard input data (e.g., spaces in lines from a non-standard BED file can be replaced with tab characters before passing these lines to the function that parses BED records). This approach is far more efficient than implementing the handling of multiple variants of a data format (e.g., allowing several column separators): it adapts to irregularities in the input data and imposes no computational burden from processing redundant format variants.
Pysam, cyvcf2, and pybedtools raise exceptions when incorrect input data is parsed. PyRanges allows variants of input data formats but does not provide a way to specify custom variants without changing the source code of the package.
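The preprocessing strategy described above can be sketched as follows; parse_bed_line and normalize are hypothetical stand-ins for a tab-based parser and a user-written intermediate function:

```python
from typing import Iterable, Iterator, Optional, Tuple

def parse_bed_line(line: str) -> Optional[Tuple[str, int, int]]:
    """Parse a tab-separated BED-like line; return None on failure."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        return None
    try:
        return fields[0], int(fields[1]), int(fields[2])
    except ValueError:
        return None

def normalize(lines: Iterable[str]) -> Iterator[str]:
    """User-written shim: turn space-separated lines into tab-separated."""
    for line in lines:
        yield "\t".join(line.split())

# Non-standard, space-separated input passes through the shim first.
records = [parse_bed_line(line)
           for line in normalize(["chr1 100 200", "chr2 5 50"])]
print(records)  # [('chr1', 100, 200), ('chr2', 5, 50)]
```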
Classes allow easy querying. Pygenomics classes are designed to enable easy exploration of their content during a read-eval-print loop (REPL) in the Python interpreter. Each class defines a human-readable representation (returned by the repr() function), and type annotations facilitate perusal of the source code.
Pysam, cyvcf2, and pybedtools wrap data structures from other libraries or programs that cannot be conveniently viewed in the REPL. PyRanges supports viewing its objects in the REPL in a manner similar to pandas.
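For instance, a frozen dataclass (a stand-in for pygenomics' classes, whose definitions are not part of this text) yields a human-readable repr() automatically, which is what makes objects convenient to inspect interactively:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GenomicInterval:
    chrom: str
    start: int
    end: int

# The generated repr() shows every field by name, so the object is
# self-describing at the REPL prompt.
print(repr(GenomicInterval("chr20", 100, 250)))
# GenomicInterval(chrom='chr20', start=100, end=250)
```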

Development framework
Pygenomics has been developed using a number of programming tools that facilitate writing clear code and maintaining the package integrity:
• Black [16] keeps the source code of pygenomics formatted in a consistent style and in accordance with widely used conventions for Python source code formatting.
• isort [17] manages the hierarchy of the import statements in the source code.
• pylint [18] detects incorrect or improper code and assists with refactoring the codebase of pygenomics (e.g., by reporting duplicated fragments of code).
• mypy [12] checks type annotations and reports errors caused by inconsistent types.
• radon [19] assesses source code complexity for the package entities (classes, methods, and functions) and estimates maintainability of the package codebase based on cyclomatic complexity [20].
• Hypothesis [11] implements property-based testing, an approach introduced in the Haskell library QuickCheck [23].
• Sphinx [24] extracts in-source documentation of the package entities and converts it to a number of output formats, including HTML and PDF.
• tox [25] organizes running tests and static code checks in a way that enables convenient integration with external continuous integration (CI) tools.
• Poetry [26] manages packaging and dependencies of pygenomics in a portable way. Although pygenomics routines do not depend on third-party libraries, its development framework does, and Poetry guarantees that there are no version conflicts between the development framework packages and their dependencies. Poetry can also be used for managing a virtual environment for developing and testing pygenomics.
The presented development framework and the package properties listed in Subsection 1.2 facilitate continuous integration and continuous delivery (CI/CD) of pygenomics, as well as its incorporation into production-grade bioinformatic packages and pipelines.

Comparison with general-purpose numeric interval libraries
A number of general-purpose libraries that implement numeric interval operations have been developed. These libraries include Boost Interval Arithmetic Library [27] and intervaltree [28].
Boost Interval Arithmetic Library is a template-based C++ library that implements manipulation of mathematical intervals. The library is intended for various applications, including computer graphics and interval calculations for imprecise data. Intervaltree is a Python package that provides a mutable self-balancing interval tree originally designed for text and time intervals. Both libraries allow a developer to specify the type of the interval bounds, which might be a floating-point number type.
Unlike general numeric intervals, genomic intervals are associated with assembled genome sequences (chromosomes, scaffolds, or contigs), and their start and end positions are specified by non-negative integers bounded by the sizes of the sequences. Pygenomics uses these features of genomic intervals for an effective and robust implementation of the associated operations. For example, an attempt to create a BED record range with a negative start position raises an exception:

>>> import pygenomics.bed
>>> pygenomics.bed.Range(-1, 1)
pygenomics.bed.BedError: BedErrorType.INCORRECT_RANGE

Boundaries for end positions of genomic intervals vary according to a genome assembly or might be undefined. A developer can implement extra routines for restricting the interval end coordinates. For example, the pygenomics interval complement command from pygenomics' CLI allows a user to specify genome sequence sizes: if the sizes are specified, then intervals at the beginning and end of the sequences are included in the tool output; otherwise, only intervals located between the given ones are reported.
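The described complement behavior can be sketched as follows (assuming disjoint input intervals on a single sequence; this is an illustration, not pygenomics' implementation):

```python
from typing import List, Optional, Tuple

def complement(intervals: List[Tuple[int, int]],
               size: Optional[int] = None) -> List[Tuple[int, int]]:
    """Complement of disjoint half-open intervals on one sequence.

    With a sequence size, flanking intervals at the sequence ends are
    included; without it, only gaps between the given intervals are
    reported.
    """
    result: List[Tuple[int, int]] = []
    intervals = sorted(intervals)
    if size is not None and intervals and intervals[0][0] > 0:
        result.append((0, intervals[0][0]))
    for (_, prev_end), (next_start, _) in zip(intervals, intervals[1:]):
        if next_start > prev_end:
            result.append((prev_end, next_start))
    if size is not None and intervals and intervals[-1][1] < size:
        result.append((intervals[-1][1], size))
    return result

ivs = [(10, 20), (30, 40)]
print(complement(ivs))      # [(20, 30)]
print(complement(ivs, 50))  # [(0, 10), (20, 30), (40, 50)]
```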
Boost Interval Arithmetic Library and intervaltree do not provide routines for working with genomic intervals, although such routines can be implemented on top of these libraries. One example is intervaltree-bio [29], a Python package based on intervaltree. However, intervaltree-bio is designed primarily for working with UCSC Genome Browser annotation records and does not support multiple bioinformatic data formats.

Extra repositories
In addition to the main GitLab repository of pygenomics, we present three repositories with routines related to the package.
pygenomics-examples stores a collection of stand-alone scripts that utilize routines of pygenomics and can be run in a command line (https://gitlab.com/gtamazian/pygenomics-examples).

Command-line interface
Pygenomics provides the command-line interface (CLI) to its routines related to genomic intervals or to processing FASTA files. Once the package is installed, the CLI can be invoked by calling the pygenomics script or by passing the module option to Python in the following way: python3 -m pygenomics. The second method allows specifying the Python interpreter used for running pygenomics (e.g., one may use PyPy instead of the default Python interpreter installed in the environment: pypy3 -m pygenomics) or running the CLI routines from the source code tree without having the package installed:

git clone https://gitlab.com/gtamazian/pygenomics.git
cd pygenomics/src
python3 -m pygenomics

Commands of the CLI are arranged by the type of data they process: genomic intervals or FASTA files. The list of the CLI commands is given in Table 2. Commands in the interval group support input files in the BED, GFF3, GTF, and VCF formats. Files of different formats can be combined in a single command; conversions between interval coordinate conventions (e.g., zero-based half-open for the BED format and one-based closed for the GFF3 format) are performed automatically.

Table 2. Commands of the pygenomics CLI.

size                 Print names and lengths of sequences.
unmask               Remove soft masking from sequences.
interval complement  Get complement genomic intervals.
find                 Find genomic intervals that overlap with the target.
find_all             Print query genomic intervals that overlap with the target; the printed intervals are followed by shared intervals between the query and the target.
intersect            Intersect genomic intervals from two or more files.
merge                Merge genomic intervals from one or several files.
subtract             Subtract genomic intervals; several files of intervals to be subtracted can be specified.

Examples
This section presents three pipelines that demonstrate pygenomics usage in bioinformatics data analysis. The pipelines are implemented in Snakemake [30,31] and stored in a separate repository (see Subsection 3.1). In addition to Python scripts that use pygenomics, the pipelines include scripts in R [32] for visualizing data generated by the Python scripts. The R scripts use routines from ggplot2 [33] and tidyverse [34] packages.

Repeat coverage by aligned reads
The pipeline calculates coverage of genomic repeats in the human genome by aligned reads. For each repeat, the base-level coverage profile is produced and the mean coverage is derived. The input data include the repeat annotation by RepeatMasker [35] and read alignments to chromosome 20 for individual HG00448, sequenced in the framework of the 1000 Genomes project [36]. The mean coverage values are summarized for each repeat class from the RepeatMasker annotation, and the per-class coverage distribution is shown by a joint box plot in Figure 1.

Distribution of transition to transversion ratio
The pipeline computes the Ti/Tv ratio across genomic regions of different types (namely, exons, introns, long non-coding RNAs, and intergenic regions). The input data for the pipeline are a) biallelic SNPs detected on chromosome 20 from the 1000 Genomes project [37], and b) gene annotation from the NCBI Eukaryotic Genome Annotation Pipeline [38]. For every genotyped individual, heterozygous SNPs are counted for every type of genomic region, and the transition-to-transversion ratios (Ti/Tv) are summarized by the ancestry of the individuals as presented in [39]. The box plots that visualize the Ti/Tv distribution by genomic region and ancestry are shown in Figure 2.

Comparing performance of CPython and PyPy
We assessed pygenomics performance using two Python implementations: CPython (version 3.10.1) and PyPy (version 7.3.9). The average time required to read RepeatMasker entries for the human genome and to merge the repeat intervals was tracked for nine sets of various sizes sampled from the RepeatMasker annotation file. The benchmark was performed on a Dell™ Precision™ 3440 SFF workstation equipped with an Intel® Core™ i7-10700 CPU and 128 GB of RAM. The running time was measured using the hyperfine program [40], and the results are shown in Figure 3. Error bars in the figure designate two-sided 95% confidence intervals for the mean running time based on the normal distribution.
For each input set, there was one warm-up run to reduce delay effects caused by reading from a disk and ten subsequent runs to record running time. Figure 3 shows that PyPy accelerates running routines of pygenomics compared to CPython and the difference in performance grows with the increase of the input data size.

Comparative performance analysis
We compared the performance of pygenomics with the aforementioned Python packages: cyvcf2, pybedtools, pysam, and PyRanges. Running time and maximum memory usage were recorded. All runs were performed without using the parallelization capabilities of the packages. Each run was repeated thirty times on a Dell™ OptiPlex™ 5070 personal computer equipped with an Intel® Core™ i5-9600 CPU and 16 GB of RAM, and two-sided 95% confidence intervals for the mean values of the time and memory usage were obtained. In the following figures, the mean values and the confidence intervals are shown by points and error bars, respectively. Ten samples of input data with evenly spaced sizes were prepared for each run. Versions of the packages and their dependencies used in the comparison analysis are given in Table 3.

Genomic intervals
We compared the performance of genomic interval operations between pygenomics, pybedtools, and PyRanges. The operations included intersecting, subtracting, and detecting overlapping intervals. The BEDTools toolkit was also installed for running pybedtools.
Intervals of genomic repeats annotated by RepeatMasker in the human genome were used as a query set. Two target sets were used: intervals of protein-coding genes from the NCBI human genome annotation and intervals of the gene exons. The gene set served as a medium-size collection of contiguous genomic intervals, while the exon set represented numerous but short intervals.
Performance measurement results for the gene and exon query sets are shown in Figures 4 and 5. The results show different rankings based on the query set, the interval operation, and the measured performance value (time or memory). Unlike pygenomics and PyRanges, the pybedtools package created temporary files for passing data to BEDTools and thus performed extra I/O operations.

Read alignments
Performance of routines for reading a BAM file was compared between pygenomics and pysam. The benchmarking script iterated read alignments in a BAM file and counted properly aligned reads. The BAM file of low-coverage sequencing reads from the 1000 Genomes project was used as input data.
Pysam outperformed pygenomics in both running time and maximum memory usage, as shown in Figure 6. We assume that the high performance of pysam can be explained by its being a Python wrapper for routines from the HTSlib library, which is implemented in C.

Genomic variants
Performance of routines for reading a VCF file was compared between pygenomics, pysam, and cyvcf2. The benchmarking script iterated variant records in a VCF file, filtered biallelic SNPs, and printed them in the HGVS notation [41]. The VCF file of genomic variants for multiple individuals from the 1000 Genomes project was used as input data.
As shown in Figure 7, pygenomics outperformed other packages in running time but used the greatest amount of memory.