Abstract

Summary

Studies of gene clusters have proved to be an excellent source of information for understanding genome evolution and for identifying specific metabolic pathways or gene families. Improvements in sequencing methods have resulted in a large increase in the number of sequenced genomes for which cluster annotation could be performed and standardized. Currently available programs are designed to search for specific cluster types, and none of them is suitable for a broad range of user-defined choices. We have developed ClusterScan, which identifies clusters of any kind of feature based simply on their genomic coordinates and user-defined categorical annotations.

Availability and implementation

The tool is written in Python, distributed under the GNU General Public License (GPL) and available on GitHub at http://bit.ly/ClusterScan or as a Docker image at sangeslab/clusterscan:latest. It is supported through a mailing list at http://bit.ly/ClusterScanSupport.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

A ‘genomic cluster’ commonly indicates a group of genes that share a genomic location and belong to a common category, such as involvement in the same pathway (Barona-Gómez et al., 2004; Yu et al., 2000), function (Hourcade et al., 1992), co-expression (Sémon and Duret, 2006), binding or cellular localization (Jamieson et al., 2004), to name a few. This definition suits not only groups of genes but also any group of genomic features that can be described categorically and positionally. With the decreasing cost of sequencing technologies, new genomes can be produced even in small laboratories, and the identification of clusters of features could become a valuable analysis to adopt as a standard annotation step. Several tools are capable of identifying clusters in a given genome (Cimermancic et al., 2014; Cruz-Morales et al., 2016; Khaldi et al., 2010; Li et al., 2009; Medema et al., 2011; Röttig et al., 2011; Starcevic et al., 2008; Umemura et al., 2013; Vesth et al., 2016; Wolf et al., 2016; Yi et al., 2007). They are generally specialized for a given type of cluster and use algorithms that take into account the specific organization of the cluster they search for. In addition, many of them have been developed for bacterial genomes (Chavali and Rhee, 2017). To overcome these limitations we developed ClusterScan, a general-purpose tool for identifying clusters that allows flexibility in the choice of features and categories. The tool identifies clusters of genomic features that lie close together on the genome and are associated with the same category. Both the positional and the categorical information are user defined in a tabular format, allowing the user to search for any kind of cluster, such as PFAM domains, GO classifications, SNPs, conserved regions, binding sites, transposable elements and so on.

2 Implementation

ClusterScan is developed in Python and makes use of bedtools (Quinlan and Hall, 2010) for positional analysis and R for plotting. It requires as input a tab-delimited text file with the genomic coordinates of the features to analyze and a file annotating those features according to one or more categories. To identify clusters, the tool searches, for each category, for groups of features sharing the same annotation and location. It produces several output files storing cluster coordinates and composition, as well as a list of bystanders for each cluster, if any, and singletons if required. The minimum number of features in a cluster and the maximum distance between them are user configurable, and the search strategy can be chosen between two different algorithms (Fig. 1a).
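The grouping of features by category and proximity can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions and not ClusterScan's own code, which relies on bedtools for the positional analysis. The `Feature` tuple, the `cluster_by_distance` function and its `max_dist` and `min_features` arguments are hypothetical names introduced here, and the gap is assumed to be measured from the end of the growing cluster to the start of the next feature.

```python
from collections import defaultdict
from typing import NamedTuple


class Feature(NamedTuple):
    """Hypothetical record: one genomic feature with a single category."""
    chrom: str
    start: int
    end: int
    name: str
    category: str


def cluster_by_distance(features, max_dist, min_features=2):
    """Group same-category features separated by gaps of at most max_dist bp."""
    # Search each (category, chromosome) pair independently.
    buckets = defaultdict(list)
    for feat in features:
        buckets[(feat.category, feat.chrom)].append(feat)

    clusters = []

    def close(category, chrom, members, end):
        # Keep a group only if it reaches the minimum number of features.
        if len(members) >= min_features:
            clusters.append(
                (category, chrom, members[0].start, end, [m.name for m in members])
            )

    for (category, chrom), group in sorted(buckets.items()):
        group.sort(key=lambda f: (f.start, f.end))
        members, end = [group[0]], group[0].end
        for feat in group[1:]:
            if feat.start - end <= max_dist:
                # Close enough: extend the current group.
                members.append(feat)
                end = max(end, feat.end)
            else:
                # Gap too large: emit the current group and start a new one.
                close(category, chrom, members, end)
                members, end = [feat], feat.end
        close(category, chrom, members, end)
    return clusters


example = [
    Feature("chr19", 100, 200, "g1", "PF00096"),
    Feature("chr19", 5_000, 6_000, "g2", "PF00096"),
    Feature("chr19", 2_000_000, 2_001_000, "g3", "PF00096"),
    Feature("chr1", 100, 300, "g4", "PF00096"),
    Feature("chr19", 250, 400, "g5", "PF00001"),
]
print(cluster_by_distance(example, max_dist=500_000))
```

In this toy input only g1 and g2 share a category and a chromosome and lie within the maximum distance of each other, so they form the single reported group.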

Fig. 1.

(a) ClusterScan pipeline scheme. (b) ClusterScan summary table. Top-10 PFAM domains in clusters. The table shows, for each domain, the total number of features (n_ft) and bystanders (n_bs) in clusters, and the maximum and minimum number of features (max_ft and min_ft) and bystanders (max_bs and min_bs). The ZnF domain (PFAM: PF00096) is the one defining the highest number of gene clusters in the human genome. (c) Circos representation of the human top-10 clusters for ZnF genes, tRNAs and FlnI-L1 retrotransposons. From outer to inner: the first track represents the locations of the 10 largest ZnF clusters; the second track represents the locations of the 10 largest tRNA clusters; the third track represents the locations of the 10 largest FlnI-L1 clusters; each track is associated with an inward-oriented histogram that displays the number of features of each type in bins of 1 Mb.

3 Materials and methods

ClusterScan can use two different algorithms to identify clusters: clusterdist and clustermean. Clusterdist searches for clustered features belonging to the same category that are separated by at most a maximum distance, selected by the user with the –distance parameter. For example, studies of gene families in human (Niimura and Nei, 2003) and mouse (Tadepally et al., 2008) have estimated that members of the same gene family within 500 kb of each other can be considered to form a cluster. The second method, clustermean, splits the genome into sliding windows of a user-selected size (the –window parameter), counts category occurrences per window and applies a Z-score statistic, followed by a final extension step. After cluster identification, ClusterScan identifies bystanders, i.e. features contained within a cluster that are not associated with the category for which the cluster has been called. If requested, ClusterScan can also identify singletons, i.e. features that belong to a category for which at least one cluster is annotated but that lie outside any cluster for that category.
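The window-based strategy can be pictured with the following Python sketch. The text does not specify the exact statistic, the significance threshold or the extension step, so everything below is an assumption: counts are compared with the mean and population standard deviation over all windows of one chromosome, the threshold `z_min` is arbitrary, and merging touching significant windows is only a simple stand-in for the extension step. The names `window_zscores`, `significant_regions`, `step` and `z_min` are hypothetical.

```python
from statistics import mean, pstdev


def window_zscores(positions, chrom_length, window, step=None):
    """Return (window_start, count, z) for one category on one chromosome.

    positions: start coordinates of the features of that category.
    step: shift between consecutive windows (assumed; defaults to window).
    """
    step = step or window
    starts = list(range(0, chrom_length, step))
    if not starts:
        return []
    # Count how many features start inside each window.
    counts = [sum(1 for p in positions if s <= p < s + window) for s in starts]
    mu, sigma = mean(counts), pstdev(counts)
    # A flat count profile has no spread, so every Z-score is set to 0.
    return [
        (s, c, (c - mu) / sigma if sigma > 0 else 0.0)
        for s, c in zip(starts, counts)
    ]


def significant_regions(zscores, window, z_min=2.0):
    """Merge overlapping or touching windows whose Z-score reaches z_min."""
    regions = []
    for start, _count, z in zscores:
        if z < z_min:
            continue
        end = start + window
        if regions and start <= regions[-1][1]:
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    return [tuple(r) for r in regions]


# Toy chromosome of 10 kb with four features packed into the first 1 kb window.
scores = window_zscores([10, 20, 30, 40, 1050], chrom_length=10_000, window=1_000)
print(significant_regions(scores, window=1_000))
```

With a step smaller than the window, consecutive windows overlap and a feature is counted in more than one window, which is why significant windows are merged before being reported.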

4 Results

We tested ClusterScan by searching for clusters of human protein-coding genes from Ensembl (Aken et al., 2017) categorized by their PFAM (Finn et al., 2014) domain annotations. Gene locations (chromosome, start and end) and categorical annotations (PFAM domains) were obtained through BioMart (Kinsella et al., 2011). ClusterScan took 160 s to run clusterdist with –distance set to 500 kb, analyzing the locations of 19 919 unique protein-coding genes annotated with 6056 PFAM domains and resulting in the annotation of 2287 clusters from 1010 PFAM domains, each composed of at least 2 genes. As expected, the results show that the domain forming the highest number of clusters is the C2H2 zinc finger (C2H2 ZnF) domain (PFAM: PF00096, Fig. 1b) and that the largest group of these clusters is located on chromosome 19, in agreement with Grimwood et al. (2004) (Fig. 1c). Analyzing the same dataset using clustermean with a window of 500 kb took 746 s and resulted in 2364 significant clusters from 1030 PFAM domains. To test our tool with different sets of data, we downloaded the human tRNA gene table in BED format from GtRNAdb (Chan and Lowe, 2016) and the table of full-length non-intact L1 (FlnI-L1) retrotransposons from L1Base2 (Penzkofer et al., 2017). To search for tRNA clusters we used clustermean with a window size of 500 kb, finding 17 clusters. The top-10 clusters in terms of number of features are each composed of at least 10 tRNAs (Supplementary Table S1). The search for FlnI-L1 clusters was also performed using clustermean with the same parameters. We found 435 clusters, 38 of them containing at least 10 FlnI-L1s and 8 containing 20 or more elements (Supplementary Table S2). The top-10 clusters in terms of number of features found with ClusterScan in the three analyses discussed above are depicted in a Circos plot (Krzywinski et al., 2009) (Fig. 1c).
Finally, we performed a survey of commonly used tools for identifying genomic clusters and found that ClusterScan is the only tool able to build clusters of any mappable feature, for any type of category, and also capable of identifying bystanders and singletons (Supplementary Table S3).
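The bystander and singleton notions referred to above follow directly from their definitions in Section 3 and can be made concrete with a small Python sketch. It is an illustration only, not the tool's implementation: the `Feature` and `Cluster` tuples and the two functions are hypothetical names, and a feature is assumed to be "within" a cluster when it lies entirely inside the cluster span on the same chromosome.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    """Hypothetical record: a feature may carry several categories."""
    chrom: str
    start: int
    end: int
    name: str
    categories: frozenset


class Cluster(NamedTuple):
    """Hypothetical record: span of a cluster called for one category."""
    category: str
    chrom: str
    start: int
    end: int


def _inside(feat, cluster):
    # Assumption: full containment of the feature in the cluster span.
    return (
        feat.chrom == cluster.chrom
        and feat.start >= cluster.start
        and feat.end <= cluster.end
    )


def bystanders(cluster, features):
    """Features inside the cluster that lack the cluster's category."""
    return [
        f.name
        for f in features
        if _inside(f, cluster) and cluster.category not in f.categories
    ]


def singletons(category, clusters, features):
    """Features of `category` lying outside every cluster of that category."""
    own = [c for c in clusters if c.category == category]
    if not own:
        # Singletons are only defined when the category has a cluster.
        return []
    return [
        f.name
        for f in features
        if category in f.categories and not any(_inside(f, c) for c in own)
    ]


features = [
    Feature("chr19", 100, 200, "znf1", frozenset({"PF00096"})),
    Feature("chr19", 300, 400, "other", frozenset({"PF00001"})),
    Feature("chr19", 500, 600, "znf2", frozenset({"PF00096"})),
    Feature("chr1", 100, 200, "znf3", frozenset({"PF00096"})),
]
clusters = [Cluster("PF00096", "chr19", 100, 600)]
print(bystanders(clusters[0], features))
print(singletons("PF00096", clusters, features))
```

Here `other` sits inside the PF00096 cluster without carrying that domain, so it is a bystander, while `znf3` carries the domain but lies on another chromosome and is therefore a singleton.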

5 Conclusions

ClusterScan is a general-purpose tool capable of identifying genomic clusters of any type of feature in a given genome. It is highly configurable, distributed on GitHub and as a Docker image, extensively documented and complemented by a mailing list to support the user community. These features make ClusterScan easy to adopt in bioinformatics pipelines for genome annotation, making it possible to establish cluster identification as a standard procedure.

Funding

Massimiliano Volpe was supported by a SZN PhD fellowship.

Conflict of Interest: none declared.

References

Aken B.L. et al. (2017) Ensembl 2017. Nucleic Acids Res., 45, D635–D642.

Barona-Gómez F. et al. (2004) Identification of a cluster of genes that directs desferrioxamine biosynthesis in Streptomyces coelicolor M145. J. Am. Chem. Soc., 126, 16282–16283.

Chan P.P., Lowe T.M. (2016) GtRNAdb 2.0: an expanded database of transfer RNA genes identified in complete and draft genomes. Nucleic Acids Res., 44, D184–D189.

Chavali A.K., Rhee S.Y. (2017) Bioinformatics tools for the identification of gene clusters that biosynthesize specialized metabolites. Brief. Bioinform., doi: 10.1093/bib/bbx020.

Cimermancic P. et al. (2014) Insights into secondary metabolism from a global analysis of prokaryotic biosynthetic gene clusters. Cell, 158, 412–421.

Cruz-Morales P. et al. (2016) Phylogenomic analysis of natural products biosynthetic gene clusters allows discovery of arseno-organic metabolites in model streptomycetes. Genome Biol. Evol., 8, 1906–1916.

Finn R.D. et al. (2014) Pfam: the protein families database. Nucleic Acids Res., 42, D222–D230.

Grimwood J. et al. (2004) The DNA sequence and biology of human chromosome 19. Nature, 428, 529–535.

Hourcade D. et al. (1992) Analysis of the human regulators of complement activation (RCA) gene cluster with yeast artificial chromosomes (YACs). Genomics, 12, 289–300.

Jamieson S.E. et al. (2004) Evidence for a cluster of genes on chromosome 17q11-q21 controlling susceptibility to tuberculosis and leprosy in Brazilians. Genes Immun., 5, 46–57.

Khaldi N. et al. (2010) SMURF: genomic mapping of fungal secondary metabolite clusters. Fungal Genet. Biol., 47, 736–741.

Kinsella R.J. et al. (2011) Ensembl BioMarts: a hub for data retrieval across taxonomic space. Database, 2011, bar030.

Krzywinski M. et al. (2009) Circos: an information aesthetic for comparative genomics. Genome Res., 19, 1639–1645.

Li M.H. et al. (2009) Automated genome mining for natural products. BMC Bioinformatics, 10, 185.

Medema M.H. et al. (2011) antiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences. Nucleic Acids Res., 39, W339–W346.

Niimura Y., Nei M. (2003) Evolution of olfactory receptor genes in the human genome. Proc. Natl. Acad. Sci. USA, 100, 12235–12240.

Penzkofer T. et al. (2017) L1Base 2: more retrotransposition-active LINE-1s, more mammalian genomes. Nucleic Acids Res., 45, D68–D73.

Quinlan A.R., Hall I.M. (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26, 841–842.

Röttig M. et al. (2011) NRPSpredictor2—a web server for predicting NRPS adenylation domain specificity. Nucleic Acids Res., 39, W362–W367.

Sémon M., Duret L. (2006) Evolutionary origin and maintenance of coexpressed gene clusters in mammals. Mol. Biol. Evol., 23, 1715–1723.

Starcevic A. et al. (2008) ClustScan: an integrated program package for the semi-automatic annotation of modular biosynthetic gene clusters and in silico prediction of novel chemical structures. Nucleic Acids Res., 36, 6882–6892.

Tadepally H.D. et al. (2008) Evolution of C2H2-zinc finger genes and subfamilies in mammals: species-specific duplication and loss of clusters, genes and effector domains. BMC Evol. Biol., 8, 176.

Umemura M. et al. (2013) MIDDAS-M: motif-independent de novo detection of secondary metabolite gene clusters through the integration of genome sequencing and transcriptome data. PLoS One, 8, e84028.

Vesth T.C. et al. (2016) FunGeneClusterS: predicting fungal gene clusters from genome and transcriptome data. Synth. Syst. Biotechnol., 1, 122–129.

Wolf T. et al. (2016) CASSIS and SMIPS: promoter-based prediction of secondary metabolite gene clusters in eukaryotic genomes. Bioinformatics, 32, 1138–1143.

Yi G. et al. (2007) Identifying clusters of functionally related genes in genomes. Bioinformatics, 23, 1053–1060.

Yu J. et al. (2000) Cloning of a sugar utilization gene cluster in Aspergillus parasiticus. Biochim. Biophys. Acta, 1493, 211–214.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: John Hancock