Motif scraper: a cross-platform, open-source tool for identifying degenerate nucleotide motif matches in FASTA files

Roberson, Elisha D O

doi:10.1093/bioinformatics/bty437

Abstract

Summary

Many genomic features are defined not by exact sequence matches, but by degenerate nucleotide motifs that represent multiple compatible matches. While there are databases cataloging genomic features, such as the location of transcription factor motifs, for commonly used model species, identifying the locations of novel motifs, known motifs in non-model genomes, or known motifs in personal whole-genomes is difficult. I designed motif scraper to overcome this limitation, allowing for efficient, multiprocessor motif searches in any FASTA file.

Availability and implementation

The motif scraper package (MIT license) is available via PyPI, and the Python source is available on GitHub at https://github.com/RobersonLab/motif_scraper.

1 Introduction

Genomic features can often be described by sequence motifs, rather than exact sequence matches. Particularly important examples of this property are proximal promoter elements that bind transcription factors, and proteins that bind at enhancers and insulators. In these cases, the binding protein does not find an exact sequence match, but rather binds a range of sequences with compatible charge profiles for the protein binding interface. Using methods such as ChIP-Seq, the binding sequences for these factors can be determined and represented as a sequence motif using IUPAC-approved degenerate nucleotide codes. Some important features are exact matches, such as the match between a microRNA (miR) and seed sequences in the 3′ untranslated region (UTR) of a targeted gene. Others have well-defined degeneracy, such as genome-editing target sites. Many databases exist cataloging the location of transcription factor motifs (Kaplun et al., 2016; Kel et al., 2003; Knuppel et al., 1994; Matys et al., 2006; Wingender et al., 1996; Wingender, 1988; Wingender, 2008), miRNA binding sites (Andres-Leon et al., 2015; Dweep et al., 2014; Griffiths-Jones, 2004; Griffiths-Jones et al., 2006; Griffiths-Jones et al., 2008; Kozomara and Griffiths-Jones, 2014; Lagana et al., 2012; Prabahar and Natarajan, 2017), and genome-editing sites (Gratz et al., 2014; Heigwer et al., 2014; Liu et al., 2015; Montague et al., 2014; Naito et al., 2015; Stemmer et al., 2015; Xiao et al., 2014). However, these databases are often restricted to commonly used model species. Newly sequenced species are likely never to be included, and model species may lag behind the release of new genome drafts. Furthermore, many individual, phased whole genomes are being generated. The databases of sequence motifs are designed relative to a reference sequence, rather than to personal genomes. 
Other tools exist that can identify motifs based on a position-weight matrix or other information, such as ChIP-Seq peaks and DNase hypersensitivity, including HOMER (Heinz et al., 2010) and MEME-Suite (Bailey et al., 2009). The downside of de novo motif identification is often a substantial time trade-off.

Inspired by previous work to identify a specific subset of CRISPR/Cas9 sites (Roberson, 2015), my goal for motif scraper was to develop a more general-purpose motif searching tool with broader use. Motif scraper fills this annotation gap by allowing for the specification of degenerate sequence motifs and reporting the location and composition of all matches in a FASTA file, which could be a personal genome, a reference genome, or a set of genomic slices, such as all the 3′ UTRs of protein coding genes. This tool therefore functions more as a FASTA degenerate sequence ‘grep’ that is easy to install and use, and scales well with full genome sequence files.

2 Materials and Methods

Motif scraper was designed in Python, and is compatible with both Python 2 and 3. The ability to read FASTA formatted files and generate FASTA indexes is provided by pyfaidx (Shirley et al., 2015). Motifs are specified as a text string using IUPAC degenerate bases, which is converted internally into a regular expression and compiled by the regex package. This allows for detection of overlapping motifs. One or more specific regions or a specific strand relative to the reference can be specified for a targeted search. By default, all contigs in the FASTA file are searched on both the + and − strands. The Python multiprocessing package handles the use of multiple computer cores, searching each target region/strand separately. Each hit is reported with the contig, start position, end position, strand, sequence, and matching motif in the output file. The code is available under an MIT license, stored on GitHub, and distributed through the Python Package Index (PyPI). Compatibility with Python 2 and 3 is assessed with every repository commit using the Travis CI service. This paper used motif scraper v1.0.1.
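The conversion from an IUPAC motif to a regular expression, and the reporting of overlapping hits on both strands, can be illustrated with a minimal sketch. This is not the motif scraper source code. It assumes the standard library re module with a zero-width lookahead in place of the regex package, assumes 1-based inclusive coordinates, and uses hypothetical function names (motif_to_pattern, reverse_complement, find_hits).

```python
import re

# IUPAC degenerate nucleotide codes mapped to the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Complement of each IUPAC code (for example, R = A/G pairs with Y = T/C).
COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R", "S": "S", "W": "W", "K": "M", "M": "K",
    "B": "V", "D": "H", "H": "D", "V": "B", "N": "N",
}


def motif_to_pattern(motif):
    """Translate an IUPAC motif into a regular expression string."""
    parts = []
    for code in motif.upper():
        bases = IUPAC[code]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


def reverse_complement(motif):
    """Reverse complement a motif or sequence written in IUPAC codes."""
    return "".join(COMPLEMENT[code] for code in reversed(motif.upper()))


def find_hits(contig, sequence, motif):
    """Yield (contig, start, end, strand, site, motif) for every match.

    Coordinates are 1-based and inclusive (an assumption of this sketch).
    The minus strand is searched by scanning the forward sequence with the
    reverse-complemented motif.
    """
    sequence = sequence.upper()
    for strand, query in (("+", motif), ("-", reverse_complement(motif))):
        # The lookahead consumes no characters, so overlapping sites are found.
        pattern = re.compile("(?=(" + motif_to_pattern(query) + "))")
        for match in pattern.finditer(sequence):
            site = match.group(1)
            if strand == "-":
                site = reverse_complement(site)
            yield (contig, match.start() + 1, match.start() + len(query),
                   strand, site, motif)
```

For example, list(find_hits("chr1", "AAAA", "AA")) reports three overlapping plus-strand sites, which a non-overlapping regular expression search would reduce to two.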

3 Results

3.1 Identification of mock transcription factor binding motifs

As a benchmark, I calculated a faux consensus sequence for two DNA binding proteins: CCAAT/Enhancer Binding Protein Beta (CEBPB) and CCCTC-Binding Factor (CTCF). I downloaded their position-weight matrices (PWMs) for Homo sapiens from JASPAR (Mathelier et al., 2016). I then calculated the fraction of weight at each position attributable to each base. At each position, I considered any base contributing at least 5% of the overall weight to be a possible match. I then converted these possible base matches per position into degenerate IUPAC bases to form an estimated degenerate motif. The calculated CEBPB (MA0466.2) motif was VTKDYRHAAY, and the calculated CTCF (MA0139.1) motif was NNNMCDSNAGRDGDHRVNN. I also downloaded the MEME-formatted PWM for both motifs for use with MEME-Suite. I compiled MEME-Suite v4.12.0 from the source code using gcc/g++ v5.4.0. I used the FIMO (Find Individual Motif Occurrences) tool to search the human genome (Ensembl GRCh38 release 91) for binding sites for both motifs with default settings. I tested the performance of multiple processors for the faux motifs using 1–10 processors on a machine with an Intel i7-3929k 3.20 GHz processor and 32 GB RAM running Ubuntu 16.04.1 64-bit and Python 2.7.12. In this benchmark, motif scraper had decreased run time with additional processors (saturating at ∼6), and required more time for longer motifs (Fig. 1).
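The thresholding step described above can be sketched as follows. This is an illustration under stated assumptions rather than the script used for the paper: the function name degenerate_motif and the toy matrix are hypothetical, and the toy weights are not the JASPAR values for CEBPB or CTCF.

```python
# Map each sorted set of allowed bases to its IUPAC code.
BASES_TO_IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def degenerate_motif(matrix, min_fraction=0.05):
    """Collapse a weight matrix into an IUPAC motif.

    matrix is a list with one dict per motif position, mapping each of
    A, C, G and T to its weight. A base is kept at a position when it
    contributes at least min_fraction of that position's total weight.
    """
    codes = []
    for column in matrix:
        total = float(sum(column.values()))
        allowed = "".join(
            base for base in "ACGT"
            if column.get(base, 0) / total >= min_fraction
        )
        codes.append(BASES_TO_IUPAC[allowed])
    return "".join(codes)


# Hypothetical three-position matrix, for illustration only.
toy_matrix = [
    {"A": 40, "C": 30, "G": 28, "T": 2},   # T contributes 2%, so A/C/G remain
    {"A": 0, "C": 1, "G": 0, "T": 99},     # only T passes the threshold
    {"A": 3, "C": 2, "G": 50, "T": 45},    # G and T pass the threshold
]
print(degenerate_motif(toy_matrix))  # prints VTK
```

A lower threshold admits more bases per position and therefore produces a more degenerate motif with more genomic matches.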

Fig. 1.

(A, B) Runtimes for variable processor usage. Above are the runtimes for two motifs on the same system using 1–10 processors. The dots represent means, and the ribbons show the standard deviation for ten iterations of each condition

3.2 Comparison to MEME-suite

FIMO is designed not just to identify potential matches to a motif, but also to enrich for potential matches present more often than expected by chance given the genomic background. FIMO therefore requires significant computational time. For CEBPB, motif scraper identified 4 568 172 potential sites based on my definition of the binding degeneracy, whereas FIMO found 61 123 significantly enriched binding sites. For CTCF, motif scraper found 496 026 sites and FIMO found 53 566 sites. This highlights the major differences between the tools. FIMO is designed to give you a likely binding site based on the PWMs. The final lists are relatively small and likely to be non-random. However, this operation is slow. For CEBPB, FIMO took 1435.0 s ± 19.8 s to find the enriched sites. Out of the enriched sequences, ATTACACAAT was the most common (10 927/61 123). Searching for that specific sequence with motif scraper using only 1 processor took only 209.6 s ± 0.5 s, with 100% overlap with the FIMO results. Therefore, for transcription factor binding sites, finding significantly enriched motifs clearly benefits from taking background sequence into account and requires additional computational time. However, for sequences not based on a PWM, motif scraper can significantly decrease processing time.
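To give a sense of how permissive the faux motifs are, the number of concrete sequences that a degenerate motif admits is the product of the number of bases allowed at each position. The short sketch below computes this for the motifs used here; the function name count_sequences is hypothetical and the calculation is not part of motif scraper.

```python
# Number of bases represented by each IUPAC code.
IUPAC_SIZES = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3, "N": 4,
}


def count_sequences(motif):
    """Return how many exact sequences match an IUPAC motif."""
    total = 1
    for code in motif.upper():
        total *= IUPAC_SIZES[code]
    return total


print(count_sequences("VTKDYRHAAY"))           # 432
print(count_sequences("NNNMCDSNAGRDGDHRVNN"))  # 15925248
print(count_sequences("ATTACACAAT"))           # 1
```

The exact sequence ATTACACAAT is one of the 432 sequences admitted by the faux CEBPB motif.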

4 Summary

The lack of portable, general-purpose motif-finding tools for uses such as genome annotation is a significant barrier to the discovery of motifs in new/non-model genomes. The rapid increase in the number of available whole-genomes only amplifies this problem. Motif scraper aims to fill this gap. This tool has cross-platform compatibility and a permissive license for broad reuse. The runtime for annotation of relatively degenerate nucleotide sequences is fast, on the order of minutes for a whole-genome using multiple processors. The FASTA format allows for flexible input, ranging from whole genomes down to cDNA sequences and plasmids. It could also be used to search for potential microRNA binding seed sequences in 3′ UTRs to predict potential partners for organisms not available in TargetScan (Agarwal et al., 2015).

It is worth noting that this tool cannot, and does not aim to, replace probabilistic binding models. For interactions best specified by a position-specific weighted matrix, other tools that quantify enrichment over background are more apt. But for sequences that are well-defined and exact, such as restriction enzyme sites and microRNA binding sites, or that have defined degeneracy, such as genome-editing motifs, motif scraper can annotate their location with ease.

It is also worth noting that the performance of parallel processing is highest with few relatively large contigs, i.e. reference genomes. The algorithm can be applied to smaller contigs, such as 3′ UTRs from a whole-genome to identify microRNA binding sites. However, the performance decreases appreciably with many short contigs. This limitation could be overcome by instead processing a batch of contigs per core to limit the number of data transfer operations. Overall, the broad operating system compatibility, use of a standard input format, and relative speed help support motif scraper as an important tool for non-model organisms and annotation of non-standard motifs.
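The batching idea mentioned above could look roughly like the following sketch. It is a proposal-level illustration, not code from motif scraper: it assumes Python 3, the standard library re and multiprocessing modules, contigs already loaded as (name, sequence) pairs, and hypothetical names and defaults (make_batches, search_batch, search_in_batches, batch_size=1000).

```python
import re
from functools import partial
from multiprocessing import Pool


def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def search_batch(batch, pattern_text):
    """Search every (name, sequence) pair of one batch inside one worker.

    Returns (name, start, end, site) tuples with 1-based inclusive
    coordinates (an assumption of this sketch), plus strand only.
    """
    # Compile once per batch rather than once per contig.
    pattern = re.compile("(?=(" + pattern_text + "))")
    hits = []
    for name, sequence in batch:
        for match in pattern.finditer(sequence.upper()):
            site = match.group(1)
            hits.append((name, match.start() + 1,
                         match.start() + len(site), site))
    return hits


def search_in_batches(contigs, pattern_text, processes=2, batch_size=1000):
    """Send whole batches of contigs to each worker process.

    One task per batch means one transfer per batch instead of one per
    contig, which is the saving the text above refers to.
    """
    batches = make_batches(contigs, batch_size)
    worker = partial(search_batch, pattern_text=pattern_text)
    pool = Pool(processes)
    try:
        per_batch = pool.map(worker, batches)  # order of batches is preserved
    finally:
        pool.close()
        pool.join()
    return [hit for batch_hits in per_batch for hit in batch_hits]
```

A suitable batch size would need to be tuned empirically; the value used here is a placeholder. On platforms that spawn rather than fork worker processes, search_in_batches should be called from under an if __name__ == "__main__": guard.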

Acknowledgements

Special thanks to Dr. Karyn Meltz Steinberg for her helpful discussions during the development of this tool, and to the reviewers for significantly improving the manuscript.

Funding

This work was partially supported by the National Institutes of Health, National Institute of Arthritis and Musculoskeletal and Skin Diseases (P30-AR048335).

Conflict of Interest: none declared.

References

Agarwal V. et al. (2015) Predicting effective microRNA target sites in mammalian mRNAs. Elife, 4, e05005.

Andres-Leon E. et al. (2015) miRGate: a curated database of human, mouse and rat miRNA-mRNA targets. Database, 2015, bav035.

Bailey T.L. et al. (2009) MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res., 37, W202–W208.

Dweep H. et al. (2014) miRWalk database for miRNA-target interactions. In: Alvarez M.L., Nourbakhsh M. (eds.) RNA Mapping: Methods and Protocols. Springer, New York, pp. 289–305.

Gratz S.J. et al. (2014) Highly specific and efficient CRISPR/Cas9-catalyzed homology-directed repair in Drosophila. Genetics, 196, 961–971.

Griffiths-Jones S. (2004) The microRNA registry. Nucleic Acids Res., 32, D109–D111.

Griffiths-Jones S. et al. (2006) miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res., 34, D140–D144.

Griffiths-Jones S. et al. (2008) miRBase: tools for microRNA genomics. Nucleic Acids Res., 36, D154–D158.

Heigwer F. et al. (2014) E-CRISP: fast CRISPR target site identification. Nat Methods, 11, 122–123.

Heinz S. et al. (2010) Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell, 38, 576–589.

Kaplun A. et al. (2016) Establishing and validating regulatory regions for variant annotation and expression analysis. BMC Genomics, 17, 393.

Kel A.E. et al. (2003) MATCH™: a tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res., 31, 3576–3579.

Knuppel R. et al. (1994) TRANSFAC retrieval program: a network model database of eukaryotic transcription regulating sequences and proteins. J. Comput. Biol., 1, 191–198.

Kozomara A., Griffiths-Jones S. (2014) miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Res., 42, D68–D73.

Lagana A. et al. (2012) miR-EdiTar: a database of predicted A-to-I edited miRNA target sites. Bioinformatics, 28, 3166–3168.

Liu H. et al. (2015) CRISPR-ERA: a comprehensive design tool for CRISPR-mediated gene editing, repression and activation. Bioinformatics, 31, 3676–3678.

Mathelier A. et al. (2016) JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic Acids Res., 44, D110–D115.

Matys V. et al. (2006) TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res., 34, D108–D110.

Montague T.G. et al. (2014) CHOPCHOP: a CRISPR/Cas9 and TALEN web tool for genome editing. Nucleic Acids Res., 42, W401–W407.

Naito Y. et al. (2015) CRISPRdirect: software for designing CRISPR/Cas guide RNA with reduced off-target sites. Bioinformatics, 31, 1120–1123.

Prabahar A., Natarajan J. (2017) ImmunemiR - a database of prioritized immune miRNA disease associations and its interactome. MicroRNA, 6, 71–78.

Roberson E.D.O. (2015) Identification of high-efficiency 3′GG gRNA motifs in indexed FASTA files with ngg2. PeerJ Comput. Sci., 1, e33.

Shirley M.D. et al. (2015) Efficient “Pythonic” Access to FASTA Files Using Pyfaidx. PeerJ PrePrints, e1196.

Stemmer M. et al. (2015) CCTop: an intuitive, flexible and reliable CRISPR/Cas9 target prediction tool. PLoS One, 10, e0124633.

Wingender E. et al. (1996) TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Res., 24, 238–241.

Wingender E. (1988) Compilation of transcription regulating proteins. Nucleic Acids Res., 16, 1879–1902.

Wingender E. (2008) The TRANSFAC project as an example of framework technology that supports the analysis of genomic regulation. Brief. Bioinform., 9, 326–332.

Xiao A. et al. (2014) CasOT: a genome-wide Cas9/gRNA off-target searching tool. Bioinformatics, 30, 1180–1182.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
