Explore, edit and leverage genomic annotations using Python GTF toolkit

Abstract

Motivation

While Python has become very popular in bioinformatics, a limited number of libraries exist for fast manipulation of gene coordinates in Ensembl GTF format.

Results

We have developed the GTF toolkit Python package (pygtftk), which aims to provide easy and powerful manipulation of gene coordinates in GTF format. For optimal performance, the core engine of pygtftk is a C dynamic library (libgtftk), while the Python API provides usability and readability for developing scripts. Based on this Python package, we have developed the gtftk command line interface, which contains 57 sub-commands (v0.9.10) to ease handling of GTF files. These commands may be used to (i) perform basic tasks (e.g. selections, insertions, updates or deletions of features/keys), (ii) select genes/transcripts based on various criteria (e.g. size, exon number, transcription start site location, intron length, GO terms) or (iii) carry out more advanced operations such as coverage analyses of genomic features using bigWig files to create faceted read-coverage diagrams. In conclusion, the pygtftk package greatly simplifies the annotation of GTF files with external information while providing advanced tools to perform gene analyses.

Availability and implementation

pygtftk and gtftk have been tested on Linux and MacOSX and are available from https://github.com/dputhier/pygtftk under the MIT license. The libgtftk dynamic library written in C is available from https://github.com/dputhier/libgtftk.

1 Introduction

Several formats exist to store genomic features. The standard BED format stores basic information (chromosome, start, end, name, score and strand) related to generic genomic features (BED6) or composite genomic features (BED12). The GTF/GFF2 format (hereafter referred to as GTF) can describe more exhaustively defined genomic features (genes, transcripts, exons, etc.) by taking advantage of the ‘attributes’ column, which contains a set of key/value pairs to store various kinds of annotations. Some composition relationships are implicitly declared in the GTF file, making it possible to describe, for instance, the exons of the transcripts corresponding to a gene. This relationship is more explicit in the GFF3 format, which can be viewed as a directed acyclic graph with nodes corresponding to features (gene, transcript, exon, etc.) and edges corresponding to part-of relationships. Only a few libraries are specifically dedicated to GTF files, and most of them address very focused tasks. The GenomeTools suite is a collection of bioinformatic tools based on the libgenometools C library that handles the GTF and GFF3 formats (Gremme et al., 2013). However, this library extends well beyond these annotation formats, and the development framework may appear rather complicated for novice developers as it requires deep knowledge of the C programming language. Regarding R/Bioconductor, the rtracklayer package provides fast access to GTF/GFF files by providing the user with a GRanges object (Lawrence et al., 2009).
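To make the role of the ‘attributes’ column concrete, the following minimal sketch parses a single GTF line into its eight fixed columns plus a dictionary of key/value pairs. It is an illustration only and is not part of pygtftk: the function name parse_gtf_line and the regular expression are assumptions introduced here, and repeated keys (which can occur in Ensembl files) are collapsed to their last value.

```python
import re

# A GTF line has 9 tab-separated columns; the last one holds key/value pairs.
GTF_COLUMNS = ("seqid", "source", "feature", "start", "end",
               "score", "strand", "frame", "attributes")

# Matches pairs such as `gene_id "G1";` and unquoted ones such as `exon_number 1;`.
_ATTR_RE = re.compile(r'(\S+)\s+"?([^";]*)"?\s*;')


def parse_gtf_line(line):
    """Split one GTF line into its fixed columns and an attribute dict.

    Hypothetical helper for illustration; repeated keys keep the last value.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields))
    # GTF coordinates are 1-based and inclusive.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    record["attributes"] = dict(_ATTR_RE.findall(record["attributes"]))
    return record


example = ('chr1\tensembl\texon\t100\t200\t.\t+\t.\t'
           'gene_id "G1"; transcript_id "T1"; exon_number 1;')
record = parse_gtf_line(example)
```

Here record["attributes"]["transcript_id"] links the exon to its transcript, which is the implicit composition relationship mentioned above.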

While the Python language has gained a lot of popularity among bioinformaticians, only a handful of tools are available for manipulating GTF files. The gffutils package can parse and store GTF/GFF files in SQLite databases. The subsequent creation of a hierarchical model of genomic features, while highly useful, can be relatively time consuming. We developed the pygtftk package with the objective of providing a fast and readable way to load and manipulate GTF files within Python scripts. This package comes with the gtftk command line interface (CLI), which provides various operations to write workflows based on GTF files.

2 Implementation

2.1 The core libgtftk C library

The core of the package is written in C and exposed through a dynamic library called libgtftk. The GTF format is represented without hierarchical relationships to maximize performance. More complex operations are carried out by the libgtftk Python client.
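As a rough illustration of what a flat, non-hierarchical representation allows, the sketch below derives per-transcript exon counts and spliced sizes in a single pass over the rows, without building a gene/transcript/exon tree. This is not the libgtftk implementation; the rows list and the exon_stats function are hypothetical names used only for this example.

```python
from collections import defaultdict

# Flat representation: one dict per GTF line, no parent/child links.
rows = [
    {"feature": "transcript", "start": 100, "end": 900, "transcript_id": "T1"},
    {"feature": "exon", "start": 100, "end": 250, "transcript_id": "T1"},
    {"feature": "exon", "start": 700, "end": 900, "transcript_id": "T1"},
    {"feature": "transcript", "start": 50, "end": 180, "transcript_id": "T2"},
    {"feature": "exon", "start": 50, "end": 180, "transcript_id": "T2"},
]


def exon_stats(rows):
    """Return {transcript_id: (exon_count, spliced_size)} in a single pass."""
    stats = defaultdict(lambda: [0, 0])
    for row in rows:
        if row["feature"] == "exon":
            entry = stats[row["transcript_id"]]
            entry[0] += 1
            # GTF coordinates are 1-based and inclusive, hence the +1.
            entry[1] += row["end"] - row["start"] + 1
    return {tx: tuple(values) for tx, values in stats.items()}


stats = exon_stats(rows)
```

Grouping on the transcript_id key is enough to recover the exon-to-transcript relationship on demand, so no hierarchy has to be built at load time.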

2.2 The pygtftk Python package

The GTF class of pygtftk comes with a large number of methods. Most of these methods return a new GTF object so that they can be chained intuitively. This object can also produce two additional objects from the gtftk library: a TAB object (representation of a matrix) and a FASTA object (representation of a FASTA file). The GTF object is integrated within the scientific Python ecosystem and can produce pybedtools.BedTool objects, Bio.SeqRecord generators or a pandas.DataFrame (Cock et al., 2009; McKinney, 2010; Quinlan, 2014). A typical use case is proposed in Figure 1, where the transcription start site (TSS) coordinates of lincRNAs are extracted with the conditions that (i) the transcript size is above 200 nt, (ii) the number of exons is greater than 2 and (iii) the coding potential (imported from a separate file) is lower than 0.2. The TSSs are then obtained using the get_tss() method, which returns a pybedtools.BedTool object that can be used to extend coordinates by 1000 nucleotides in the 5′ and 3′ directions. Regarding performance, the human genome annotation in GTF format from Ensembl release 92 ($\sim 2.7 \times 10^{6}$ lines) is loaded in about 30 s, while the creation of a hierarchical model using gffutils takes about 11 min [performed on an Intel(R) Xeon(R) CPU E5-2640 v3, 2.60 GHz]. In addition, the search engine is highly optimized, since it takes 0.6 s to select all lincRNAs from the human genome.
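The coordinate arithmetic behind the TSS extraction and the 1000-nucleotide extension can be sketched in plain Python. The package itself obtains these regions through get_tss() and a pybedtools.BedTool object; the functions below (tss_position and promoter_window) are hypothetical helpers that only illustrate the strand-aware logic, assuming 1-based inclusive GTF coordinates and clipping at position 1 but not at the chromosome end.

```python
def tss_position(start, end, strand):
    """Return the 1-based TSS of a transcript given its GTF coordinates."""
    if strand == "+":
        return start
    if strand == "-":
        return end
    raise ValueError(f"unknown strand: {strand!r}")


def promoter_window(start, end, strand, upstream=1000, downstream=1000):
    """Return (window_start, window_end) around the TSS, 1-based inclusive.

    The left bound is clipped at 1; the right bound is not clipped because
    chromosome sizes are not known in this sketch.
    """
    tss = tss_position(start, end, strand)
    if strand == "+":
        left, right = tss - upstream, tss + downstream
    else:
        # On the minus strand, upstream lies at higher coordinates.
        left, right = tss - downstream, tss + upstream
    return max(1, left), right


plus_window = promoter_window(5000, 9000, "+")
minus_window = promoter_window(5000, 9000, "-")
```

For a transcript on the minus strand the TSS is the end coordinate, so an asymmetric window must swap the upstream and downstream offsets, which is the step most easily overlooked when doing this by hand.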

Fig. 1.

Use case for the pygtftk package. These few lines of code are used to extract the promoter regions [(−1000, 1000) around the TSS] of lincRNAs, with the conditions that the transcripts have a size greater than 200 nt, at least two exons and a coding potential (assessed by CPAT and joined from an external file) below 0.2 (Wang et al., 2013).

2.3 The gtftk CLI

The pygtftk package provides a gtftk CLI with 57 subcommands. These subcommands can be used to: (i) download GTF files, (ii) edit them, (iii) mine the GTF files in various ways (select transcripts by genomic/exonic/intronic size, number of exons, associated GO term, etc.), (iv) annotate the GTF files (flagging divergent/convergent/overlapping transcripts, etc.), (v) convert them to other formats or (vi) perform epigenomic analyses by producing faceted coverage diagrams through the plotnine Python package (i.e. the recently developed Python port of ggplot2).

3 Conclusion

The pygtftk package and the associated gtftk CLI provide a new way to easily handle gene coordinates with Python. They are regularly updated, and users familiar with Python and/or command-line programmes should quickly become comfortable and productive with (py)gtftk. As the GTF/GFF format is now also used for storing regulatory features and variants, this paves the way for future developments of (py)gtftk, which could be an interesting framework for the integration of heterogeneous genomic data (Reese et al., 2010; Zerbino et al., 2018).

Acknowledgements

We thank Jacques van Helden for helpful discussion.

Funding

G.C. was supported by a fellowship from the “Fondation pour la Recherche Médicale” (FRM). S.S. and D.P. were supported by recurrent funding from INSERM and Aix Marseille Univ and by the Foundation for Cancer Research ARC [ARC PJA 20151203149] and A*MIDEX [ANR-11-IDEX-0001-02], Plan Cancer 2015 [C15076AS] and Ligue contre le Cancer Equipe Labellisée. Y.K. was supported by the Franco-Algerian partenariat Hubert Curien (PHC) Tassili [15MDU935].

Conflict of Interest: none declared.

References

Cock P.J. et al. (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25, 1422–1423.

Gremme G. et al. (2013) GenomeTools: a comprehensive software library for efficient processing of structured genome annotations. IEEE/ACM Trans. Comput. Biol. Bioinform., 10, 645–656.

Lawrence M. et al. (2009) rtracklayer: an R package for interfacing with genome browsers. Bioinformatics, 25, 1841–1842.

McKinney W. (2010) Data structures for statistical computing in python. In: van der Walt S., Millman J. (eds) Proceedings of the 9th Python in Science Conference, pp. 51–56.

Quinlan A.R. (2014) BEDTools: the Swiss-Army Tool for Genome Feature Analysis. Curr. Protoc. Bioinformatics, 47, 1–34.

Reese M.G. et al. (2010) A standard variation file format for human genome sequences. Genome Biol., 11, R88.

Wang L. et al. (2013) CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model. Nucleic Acids Res., 41, e74.

Zerbino D.R. et al. (2018) Ensembl 2018. Nucleic Acids Res., 46, D754–761.

© The Author(s) 2019. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
