ClusterScan: simple and generalistic identification of genomic clusters

Volpe, Massimiliano; Miralto, Marco; Gustincich, Stefano; Sanges, Remo

doi:10.1093/bioinformatics/bty486

Abstract

Summary

Studies on gene clusters have proved to be an excellent source of information for understanding genome evolution and identifying specific metabolic pathways or gene families. Improvements in sequencing methods have resulted in a large increase in the number of sequenced genomes for which cluster annotation could be performed and standardized. Currently available programs are developed to search for specific cluster types, and none of them is suitable for a broad range of user-defined choices. We have developed ClusterScan, which identifies clusters of any kind of feature based simply on their genomic coordinates and user-defined categorical annotations.

Availability and implementation

The tool is written in Python, distributed under the GNU General Public License (GPL) and available on GitHub at http://bit.ly/ClusterScan or as a Docker image at sangeslab/clusterscan:latest. It is supported through a mailing list at http://bit.ly/ClusterScanSupport.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

A ‘genomic cluster’ commonly indicates a group of genes sharing genomic location and belonging to a common category, such as involvement in the same pathway (Barona-Gómez et al., 2004; Yu et al., 2000), function (Hourcade et al., 1992), co-expression (Sémon and Duret, 2006), binding or cellular localization (Jamieson et al., 2004), to name a few. This definition suits not only groups of genes but also any group of genomic features that can be described categorically and positionally. With the decreasing cost of sequencing technologies, new genomes can be produced even in small laboratories, and the identification of clusters of features could become a valuable analysis to adopt as a standard annotation step. Several tools are capable of identifying clusters in a given genome (Cimermancic et al., 2014; Cruz-Morales et al., 2016; Khaldi et al., 2010; Li et al., 2009; Medema et al., 2011; Röttig et al., 2011; Starcevic et al., 2008; Umemura et al., 2013; Vesth et al., 2016; Wolf et al., 2016; Yi et al., 2007). They are generally specialized to identify a given type of cluster and use algorithms that take into account the specific organization of the cluster they search for. In addition, many of them have been developed for bacterial genomes (Chavali and Rhee, 2017). To overcome these limitations we developed ClusterScan, a generalistic tool for identifying clusters that allows flexibility in the choice of features and categories. The tool identifies clusters of genomic features that lie close together on the genome and are associated with the same category. Both the positional and the categorical information are user defined in a tabular format, allowing the user to search for any kind of cluster, such as clusters of PFAM domains, GO classifications, SNPs, conserved regions, binding sites, transposable elements and so on.

2 Implementation

ClusterScan is developed in Python and uses bedtools (Quinlan and Hall, 2010) for positional analysis and R for plotting. As input it requires a tab-delimited text file with the genomic coordinates of the features to analyze and a file annotating those features according to one or more categories. To identify clusters, the tool searches, for each category, for groups of features sharing the same annotation and location. It produces several output files storing cluster coordinates and composition, as well as a list of bystanders for each cluster, if any, and singletons if required. The minimum number of features in a cluster and the maximum distance between them are user configurable, and the search strategy can be chosen between two different algorithms (Fig. 1a).
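The search described above can be pictured with a minimal Python sketch of a distance-based grouping. It is an illustration under stated assumptions, not ClusterScan's own code: the `Feature` record, the function names `distance_clusters` and `bystanders`, and the toy coordinates are all hypothetical, and each feature carries a single category for simplicity. Here a bystander is taken to be a feature lying inside a cluster's span without carrying the cluster's category.

```python
from collections import defaultdict
from typing import NamedTuple


class Feature(NamedTuple):
    # Hypothetical record: one row per feature/category pair.
    chrom: str
    start: int
    end: int
    name: str
    category: str


def distance_clusters(features, max_dist, min_features=2):
    """Group same-category features whose gaps do not exceed max_dist."""
    by_key = defaultdict(list)
    for f in features:
        by_key[(f.category, f.chrom)].append(f)

    clusters = []
    for (category, chrom), group in by_key.items():
        group.sort(key=lambda f: (f.start, f.end))
        current = [group[0]]
        reach = group[0].end  # right-most end seen in the current run
        for f in group[1:]:
            if f.start - reach <= max_dist:
                current.append(f)
                reach = max(reach, f.end)
            else:
                if len(current) >= min_features:
                    clusters.append((category, chrom, current))
                current = [f]
                reach = f.end
        if len(current) >= min_features:
            clusters.append((category, chrom, current))
    return clusters


def bystanders(cluster, features):
    """Names of features inside the cluster span with a different category."""
    category, chrom, members = cluster
    lo = min(f.start for f in members)
    hi = max(f.end for f in members)
    member_names = {f.name for f in members}
    return sorted({
        f.name for f in features
        if f.chrom == chrom and f.start >= lo and f.end <= hi
        and f.category != category and f.name not in member_names
    })


# Toy data (made up for illustration only).
toy = [
    Feature("chr1", 100, 200, "g1", "A"),
    Feature("chr1", 300, 400, "g2", "B"),
    Feature("chr1", 500, 600, "g3", "A"),
    Feature("chr1", 5000, 5100, "g4", "A"),
    Feature("chr2", 100, 200, "g5", "A"),
]
clusters = distance_clusters(toy, max_dist=500, min_features=2)
for cluster in clusters:
    category, chrom, members = cluster
    print(category, chrom, [f.name for f in members], bystanders(cluster, toy))
```

In this toy run, g1 and g3 are close enough to be grouped, g2 falls inside their span with a different category, and g4 and g5 remain ungrouped because no same-category neighbour lies within the chosen distance on the same chromosome.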

Fig. 1.

(a) ClusterScan pipeline scheme. (b) ClusterScan summary table. Top-10 PFAM domains in clusters. The table shows, for each domain, the total number of features (n_ft) and bystanders (n_bs) in clusters, and the maximum and minimum number of features (max_ft and min_ft) and bystanders (max_bs and min_bs). The ZnF domain (PFAM: PF00096) is the one defining the highest number of gene clusters in the human genome. (c) Circos representation of the human top-10 clusters for ZnF genes, tRNAs and FlnI-L1 retrotransposons. From outer to inner: the first track represents the locations of the 10 largest ZnF clusters; the second track represents the locations of the 10 largest tRNA clusters; the third track represents the locations of the 10 largest FlnI-L1 clusters. Each track is associated with an inward-oriented histogram that displays the number of features of each type in bins of 1 Mb.

3 Materials and methods

ClusterScan can use two different algorithms to identify clusters: clusterdist and clustermean. Clusterdist searches for features belonging to the same category that are separated by no more than a maximum distance, which the user selects with the –distance parameter. For example, studies of gene families in human (Niimura and Nei, 2003) and mouse (Tadepally et al., 2008) have estimated that members of the same gene family lying within 500 kb of each other can be considered to form a cluster. The second method, clustermean, splits the genome into sliding windows of a user-selected size (the –window parameter), counts the categories in each window and applies a Z-score statistic, followed by a final extension step. After cluster identification, ClusterScan identifies bystanders: features contained within a cluster that are not associated with the category for which the cluster was called. If requested, ClusterScan can also identify singletons: features belonging to a category for which at least one cluster is annotated, but lying outside any cluster of that category.
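The window-based idea can be sketched in Python as follows. This is a rough illustration and not the clustermean implementation: the step size, the Z-score threshold `z_min`, the use of the population standard deviation, and the merging of overlapping windows (a simple stand-in for the extension step, whose details are not given here) are all assumptions, as are the function names and the toy positions.

```python
from statistics import mean, pstdev


def window_counts(starts, chrom_len, window, step):
    """Count feature start positions in sliding windows on one chromosome."""
    counts = []
    for w_start in range(0, max(chrom_len - window, 0) + 1, step):
        w_end = w_start + window
        n = sum(1 for s in starts if w_start <= s < w_end)
        counts.append((w_start, w_end, n))
    return counts


def enriched_windows(counts, z_min=2.0):
    """Keep windows whose count lies at least z_min deviations above the mean."""
    values = [n for _, _, n in counts]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # every window has the same count: nothing stands out
    return [(s, e, n, (n - mu) / sigma)
            for s, e, n in counts
            if (n - mu) / sigma >= z_min]


def merge_windows(windows):
    """Merge overlapping or touching windows into candidate regions."""
    merged = []
    for s, e, *_ in sorted(windows):
        if merged and s <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    return [tuple(m) for m in merged]


# Toy start positions for one category on one chromosome (made up).
starts = [1050, 1100, 1150, 1200, 1250, 9000]
counts = window_counts(starts, chrom_len=10000, window=1000, step=500)
regions = merge_windows(enriched_windows(counts, z_min=2.0))
print(regions)
```

With these toy positions, the five tightly packed features produce two overlapping high-count windows that merge into one candidate region, while the isolated feature at 9000 does not stand out from the background.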

4 Results

We tested ClusterScan by searching for clusters of human protein-coding genes from Ensembl (Aken et al., 2017) categorized by their PFAM (Finn et al., 2014) domain annotations. Gene locations (chromosome, start and end) and categorical annotations (PFAM domains) were easily obtained through BioMart (Kinsella et al., 2011). ClusterScan took 160 s to run clusterdist with –distance set to 500 kb, analyzing the locations of 19 919 unique protein-coding genes annotated with 6056 PFAM domains and annotating 2287 clusters from 1010 PFAM domains, each composed of at least 2 genes. As expected, the ClusterScan results show that the domain forming the highest number of clusters is the C2H2 zinc finger (C2H2 ZnF) domain (PFAM: PF00096, Fig. 1b) and that the largest group of these clusters is located on chromosome 19, in agreement with Grimwood et al. (2004) (Fig. 1c). Analyzing the same dataset using clustermean with a window of 500 kb took 746 s and resulted in 2364 significant clusters from 1030 PFAM domains. To test our tool with different sets of data, we downloaded the human tRNA gene table in bed format from GtRNAdb (Chan and Lowe, 2016) and the table of full-length non-intact L1 (FlnI-L1) retrotransposons from L1Base2 (Penzkofer et al., 2017). To search for tRNA clusters we used clustermean with a window size of 500 kb, finding 17 clusters. The top-10 clusters in terms of number of features are each composed of at least 10 tRNAs (Supplementary Table S1). The search for FlnI-L1 clusters was again performed using clustermean with the same parameters. We found 435 clusters, 38 of them containing at least 10 FlnI-L1s and 8 containing 20 or more elements (Supplementary Table S2). The top-10 clusters in terms of number of features found with ClusterScan for the three analyses discussed above are depicted in a Circos plot (Krzywinski et al., 2009) (Fig. 1c).

Finally, we performed a survey of commonly used tools for identifying genomic clusters and found that ClusterScan is the only tool able to build clusters of any mappable feature, for any type of category, and also capable of identifying bystanders and singletons (Supplementary Table S3).

5 Conclusions

ClusterScan is a generalistic tool capable of identifying genomic clusters of any type of feature in a given genome. It is highly configurable, distributed on GitHub and as a Docker image, extensively documented and complemented by a mailing list to support the user community. These features make ClusterScan easy to adopt in bioinformatics pipelines for genome annotation, opening the possibility of adopting cluster identification as a standard procedure.

Funding

Massimiliano Volpe was supported by a SZN PhD fellowship.

Conflict of Interest: none declared.

References

Aken B.L., et al. (2017) Ensembl 2017. Nucleic Acids Res., 45, D635–D642.

Barona-Gómez F., et al. (2004) Identification of a cluster of genes that directs desferrioxamine biosynthesis in Streptomyces coelicolor M145. J. Am. Chem. Soc., 126, 16282–16283.

Chan P.P., Lowe T.M. (2016) GtRNAdb 2.0: an expanded database of transfer RNA genes identified in complete and draft genomes. Nucleic Acids Res., 44, D184–D189.

Chavali A.K., Rhee S.Y. (2017) Bioinformatics tools for the identification of gene clusters that biosynthesize specialized metabolites. Brief. Bioinform., doi: 10.1093/bib/bbx020.

Cimermancic P., et al. (2014) Insights into secondary metabolism from a global analysis of prokaryotic biosynthetic gene clusters. Cell, 158, 412–421.

Cruz-Morales P., et al. (2016) Phylogenomic analysis of natural products biosynthetic gene clusters allows discovery of arseno-organic metabolites in model streptomycetes. Genome Biol. Evol., 8, 1906–1916.

Finn R.D., et al. (2014) Pfam: the protein families database. Nucleic Acids Res., 42, D222–D230.

Grimwood J., et al. (2004) The DNA sequence and biology of human chromosome 19. Nature, 428, 529–535.

Hourcade D., et al. (1992) Analysis of the human regulators of complement activation (RCA) gene cluster with yeast artificial chromosomes (YACs). Genomics, 12, 289–300.

Jamieson S.E., et al. (2004) Evidence for a cluster of genes on chromosome 17q11-q21 controlling susceptibility to tuberculosis and leprosy in Brazilians. Genes Immun., 5, 46–57.

Khaldi N., et al. (2010) SMURF: genomic mapping of fungal secondary metabolite clusters. Fungal Genet. Biol., 47, 736–741.

Kinsella R.J., et al. (2011) Ensembl BioMarts: a hub for data retrieval across taxonomic space. Database, 2011, bar030.

Krzywinski M., et al. (2009) Circos: an information aesthetic for comparative genomics. Genome Res., 19, 1639–1645.

Li M.H., et al. (2009) Automated genome mining for natural products. BMC Bioinformatics, 10, 185.

Medema M.H., et al. (2011) antiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences. Nucleic Acids Res., 39, W339–W346.

Niimura Y., Nei M. (2003) Evolution of olfactory receptor genes in the human genome. Proc. Natl. Acad. Sci. USA, 100, 12235–12240.

Penzkofer T., et al. (2017) L1Base 2: more retrotransposition-active LINE-1s, more mammalian genomes. Nucleic Acids Res., 45, D68–D73.

Quinlan A.R., Hall I.M. (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26, 841–842.

Röttig M., et al. (2011) NRPSpredictor2—a web server for predicting NRPS adenylation domain specificity. Nucleic Acids Res., 39, W362–W367.

Sémon M., Duret L. (2006) Evolutionary origin and maintenance of coexpressed gene clusters in mammals. Mol. Biol. Evol., 23, 1715–1723.

Starcevic A., et al. (2008) ClustScan: an integrated program package for the semi-automatic annotation of modular biosynthetic gene clusters and in silico prediction of novel chemical structures. Nucleic Acids Res., 36, 6882–6892.

Tadepally H.D., et al. (2008) Evolution of C2H2-zinc finger genes and subfamilies in mammals: species-specific duplication and loss of clusters, genes and effector domains. BMC Evol. Biol., 8, 176.

Umemura M., et al. (2013) MIDDAS-M: motif-independent de novo detection of secondary metabolite gene clusters through the integration of genome sequencing and transcriptome data. PLoS One, 8, e84028.

Vesth T.C., et al. (2016) FunGeneClusterS: predicting fungal gene clusters from genome and transcriptome data. Synth. Syst. Biotechnol., 1, 122–129.

Wolf T., et al. (2016) CASSIS and SMIPS: promoter-based prediction of secondary metabolite gene clusters in eukaryotic genomes. Bioinformatics, 32, 1138–1143.

Yi G., et al. (2007) Identifying clusters of functionally related genes in genomes. Bioinformatics, 23, 1053–1060.

Yu J., et al. (2000) Cloning of a sugar utilization gene cluster in Aspergillus parasiticus. Biochim. Biophys. Acta, 1493, 211–214.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
