Stefano Castellana, Tommaso Biagini, Luca Parca, Francesco Petrizzelli, Salvatore Daniele Bianco, Angelo Luigi Vescovi, Massimo Carella, Tommaso Mazza, A comparative benchmark of classic DNA motif discovery tools on synthetic data, Briefings in Bioinformatics, Volume 22, Issue 6, November 2021, bbab303, https://doi.org/10.1093/bib/bbab303
Abstract
Hundreds of human proteins were found to establish transient interactions with rather degenerate consensus DNA sequences or motifs. Identifying these motifs and the genomic sites where interactions occur represents one of the most challenging research goals in modern molecular biology and bioinformatics. The last twenty years witnessed an explosion of computational tools designed to perform this task, whose performance was last compared fifteen years ago. Here, we survey sixteen of them, benchmark their ability to identify known motifs nested in twenty-nine simulated sequence datasets, and finally report their strengths, weaknesses, and complementarity.
Introduction
Complex networks of DNA-protein interactions drive most human cellular functions. The interaction efficacy depends on the affinity between DNA-binding proteins and specific nucleotide sequences. Around 2500 human proteins were found experimentally to have DNA-binding activity (GO:0003677), and hundreds of them exhibited, in fact, the ability to establish transient interactions with rather degenerate sequences or motifs. The study of these interactions represents one of the most challenging topics in modern molecular biology and bioinformatics [1, 2].
Motifs are short patterns of nucleotides or amino acids with a putative or ascertained biological significance. At the genome level, motifs are ~5–31 nucleotides long (mean 9.9 for eukaryotes and 15.9 for prokaryotes) [3] and are interspersed throughout the genome, although most are located within intergenic regions and near promoters. They can be described by either consensus sequences, which are composed of the most frequent nucleotide at each position of a set of aligned sequences, or Position Weight Matrices (PWMs), which are 4 × w matrices (w being the motif width) obtained from local alignments of binding-site sequences and which can be considered models for the binding specificity of DNA-binding proteins. Today, the JASPAR CORE 2020 [4] collection contains 1646 PWMs with widths ranging from 5 to 24 nucleotides.
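The two motif representations described above can be illustrated with a minimal sketch. It assumes a handful of already aligned, equal-width sites; the example sites and the function names (`count_matrix`, `frequency_matrix`, `consensus`) are hypothetical and are not taken from JASPAR or from any of the benchmarked tools.

```python
NUCLEOTIDES = "ACGT"

def count_matrix(sites):
    """Build a 4 x w count matrix (nucleotide -> per-column counts) from aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all aligned sites must have the same width")
    matrix = {nt: [0] * width for nt in NUCLEOTIDES}
    for site in sites:
        for pos, nt in enumerate(site.upper()):
            matrix[nt][pos] += 1
    return matrix

def frequency_matrix(counts, pseudocount=0.0):
    """Turn counts into per-column frequencies; the pseudocount is an optional assumption."""
    width = len(counts["A"])
    freqs = {nt: [0.0] * width for nt in NUCLEOTIDES}
    for pos in range(width):
        total = sum(counts[nt][pos] + pseudocount for nt in NUCLEOTIDES)
        for nt in NUCLEOTIDES:
            freqs[nt][pos] = (counts[nt][pos] + pseudocount) / total
    return freqs

def consensus(counts):
    """Pick the most frequent nucleotide per column (ties resolved in A, C, G, T order)."""
    width = len(counts["A"])
    return "".join(
        max(NUCLEOTIDES, key=lambda nt: counts[nt][pos]) for pos in range(width)
    )

# Hypothetical aligned binding sites, for illustration only.
sites = ["TGACGTCA", "TGACGTAA", "TGATGTCA", "TTACGTCA"]
counts = count_matrix(sites)
freqs = frequency_matrix(counts)
print(consensus(counts))  # TGACGTCA
```

The consensus string discards the per-position variability that the matrix retains, which is why most of the tools surveyed here report matrices rather than strings.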
The identification of motifs is not an easy task. Numerous factors increase the complexity of the problem: genomic complexity, incomplete knowledge of both the evolutionary forces shaping motifs and the structural interaction between DNA and DNA-binding protein domains, the variable size of motifs, and the possibility of motif co-occurrence and overlap. For this reason, the last twenty years witnessed an explosion of computational tools designed to perform the most diverse tasks: (i) discovering patterns that are overrepresented relative to background sets of shuffled sequences; (ii) searching for known transcription factor binding sites (TFBSs); (iii) computing the similarity of a pattern with all sequences contained in a database of known regulatory elements; (iv) generating and drawing logos or PWMs from input sequences.
In parallel, the number of genomic sequences has dramatically increased over time [5]. Today, the ENCODE Project [6] contains data for thousands of experimental assays, most of which derive from DNA binding experiments. The HOmo sapiens COmprehensive MOdel COllection (HOCOMOCO) [7] collects binding models for 680 human and 453 mouse transcription factors (TFs), obtained using the ChIPMunk DNA motif discovery software [8] and stored in the Gene Transcription Regulation Database (GTRD) [9]. RegulonDB [10] holds experimentally validated and predicted regulatory elements in the E. coli genome, reconstructed regulatory networks, and operon structures. TRANSFAC [11] is a large database of TFs, associated DNA motifs in eukaryotic genomes, and analysis tools. JASPAR [12] is the largest open bioinformatics resource of TFBSs in the form of PWMs for eukaryotic genomes. It allows users to browse, query, and analyze amino acid or nucleotide sequences to detect known regulatory motifs associated with TFs.
The first comparative study appeared in 2005 when Hu et al. [13] discussed the limitations and potentials of AlignACE [14], MEME [15], BioProspector [16], MDScan [17], and MotifSampler [18] on two E. coli sequence datasets retrieved from RegulonDB. They evaluated the sequence length, variability in the background models, and motif width as confounding factors in assessing the sensitivity, specificity, and other general performance metrics. The same tools were also run with real ChIP-Seq data to assess their ability to identify PWMs of a set of validated TFBSs [19]. Their overall performance was low, and motif length turned out to be the major determinant, with MEME exhibiting the best performance with long sequences. The tools were generally incongruent, leading to the conclusion that their results were not always ‘true positives.’ In the same year, Tompa et al. [20] conducted a large-scale comparative study using 13 popular motif discovery tools, whose authors were asked to run their algorithms on various simulated sequence datasets. The study concluded that: (i) the generation of good-quality input datasets, both real and simulated, is problematic, because the evolution and functionalization of regulatory elements along genomes are poorly understood and extremely difficult to formalize in a computational framework; furthermore, real genomic sequences can contain unknown functional elements, making the definition of a ‘true-positive result’ a puzzling concept; (ii) Weeder [21] was the best-performing algorithm with almost all test datasets, especially with the yeast sequences; (iii) generally, all tools performed worse with real than with simulated sequences. Two years later, Sandve et al. [18] reevaluated Tompa’s results using machine learning and generated two improved benchmark datasets. One was characterized by motifs that are easy to find with current methods, and the other by real binding sites, which are notably hard to distinguish from background sequences (https://tare.medisin.ntnu.no/pages/tools.php).
The most recent and extensive comparative work dates back to 2008, when Quest and colleagues presented the Motif Tool Assessment Platform (MTAP) [22], which included nine popular motif discovery tools, and used it to detect real TFBSs in upstream genomic sequences of human, mouse, fly, yeast, E. coli, and B. subtilis retrieved from RegulonDB. The performance of the tested tools was generally poor, with Weeder, AlignACE, and MEME performing better than the others. As expected, sequence length was a penalizing factor, as the identification of TFBSs in long genomic sequences was generally more complex.
Since this work, several other tools have been implemented, and local comparative studies were performed. Marschall and Rahmann proposed a new method for exact motif discovery and compared its performance with MEME and Weeder [23]. In 2013, Kulakovskiy and Makeev discussed numerous tasks, from basic DNA motif finding and discovery to their application in exploring various features of experimental data, and reviewed the existing software in the field [24]. Similarly, Ma et al. presented MEME-ChIP, their web-based tool for motif analysis of large nucleotide datasets [25]. All new tools were, in fact, progressively designed to deal with ChIP-Seq data [26–29]. In particular, Lihu and Holban reviewed seven ensemble tools designed to process ChIP-Seq data and observed their limitations and strengths [30]. Lee et al. carried out a comparative study of tools based on genetic algorithms [31]. They provided a detailed technical description of 18 different implementations of genetic algorithms, together with a performance evaluation of four of them on small real datasets taken from [32]. In the last five years, strategies based on Machine Learning have been applied to the motif discovery problem, especially in the task of identifying genomic regulatory motifs across massive experimental datasets [33, 34]. Since the results of an accurate evaluation of these tools have recently been made available [35], in this work, we focused on classic methods and benchmarked a total of 16 tools on 29 different synthetic test datasets by performing 464 analysis runs.
Materials and methods
More than one hundred and fifty DNA motif discovery tools exist [2, 26, 34, 36, 37]. Initially, we resorted to their standalone versions to avoid uploading datasets to the web, to monitor their execution, and to avoid internet latency and errors. Whenever we encountered configuration problems or other compatibility issues, we turned to their web versions, if available. We also considered web-based tools. We retained those that ran fast (≤ 1 day) and returned easy-to-parse outputs for at least one benchmark dataset. To summarize, we used the standalone versions of tools 3–5, 7, 10, and 12–16 and the web interfaces of the remaining tools listed in Table 1. Details of each tool, command lines, and parameters are reported in Supplementary File 1.
Table 1. DNA motif discovery tools considered in this comparative study. *Accessed in March 2019. #Accessed in April 2019.
# | Tool | Version | Note |
---|---|---|---|
1 | BaMM* | 1.4.0 | Motif analysis web portal |
2 | DMINDA2* | - | Motif analysis web portal |
3 | Gimmemotifs | 0.13.1 | Standalone, aggregator; it can return a consensus motif list |
4 | Gimsan | 20100830 | Standalone |
5 | Homer2 | 4.10.3 | Standalone |
6 | Improbizer* | - | Motif analysis web portal |
7 | MEME | 4.11.4 | Standalone |
8 | Modside – XXMotif* | 1.0 | Motif analysis web portal; output in proprietary format |
9 | Modside – ChIPMunk* | 1.0 | |
10 | MotifSampler | 3.2.2 | Standalone |
11 | RSAT# | - | Motif analysis web portal |
12 | Seeder | 0.0.1 | Standalone |
13 | Tmod – AlignACE | 1.1.1 | Standalone, aggregator; output in proprietary format |
14 | Tmod – SeSiMCMC | 1.1.1 | |
15 | Tmod – MDScan | 1.1.1 | |
16 | Weeder2 | 2.0.1 | Standalone |
Motif analysis tools
BaMM [38] (Bayesian Markov Model) is a suite of tools to search and compare motifs of known TFBSs in DNA FASTA sequences. It applies a three-stage strategy to (i) discover de-novo motifs, (ii) generate PWMs from optimal w-mers (w = pattern length) using an Expectation–Maximization strategy, and (iii) transform PWMs into BaMM models. The web interface is easy to use while still giving expert users the possibility to fine-tune parameters, such as the value of w and the statistical enrichment cutoffs, and to choose how to generate the background models from control sequences. Here, BaMM was used to search motifs in the direct DNA strand only. When two or more hits for the same input sequence were found, we selected the one with the largest overlap with the known nested DNA pattern.
GimmeMotifs [39] is a Python-based command-line and API suite that predicts de-novo motifs, searches for known motifs, identifies differentially represented motifs, calculates motif enrichment statistics, and plots sequence logos. It allows users to run up to 14 different motif discovery algorithms and assemble their output into a single comprehensive report. It takes FASTA sequences or BED-formatted genomic coordinates as input files. We ran the gimme motifs and gimme scan modules of this tool to obtain the PWMs of our benchmark datasets and to identify matching sites on the tested sequences, respectively.
MotifSampler is a probabilistic de-novo motif detection tool for DNA sequences. It uses a stochastic optimization strategy based on a Gibbs sampling method to search for all possible sets of short DNA segments in sequence datasets. MotifSampler can be optionally used with the MotifRanking and FuzzyClustering post-processing tools to filter a short list of high-ranking motifs from the MotifSampler output. An instance of this tool was installed and run locally. It processed each input sequence and returned a text file containing the PWMs of the predicted motifs and a GFF-like file of all identified motifs, the corresponding input sequences, and the minimum and maximum Log-Likelihood and Information Content scores. In our evaluation, we considered only the highest score motifs.
MEME (Multiple EM for Motif Elicitation) is the core algorithm of the MEME motif analysis suite, which is equipped with a rich array of algorithms for motif discovery, scanning, comparison, and discriminative motif enrichment, as well as various other utility functions. DNA motifs are represented as PWMs. MEME discovers novel, ungapped motifs in input sequences, splitting variable-length patterns into two or more distinct motifs and yielding as many motifs as requested, sorted by E-value (expected value). The best width and number of occurrences are determined by each motif’s information content, evaluated by an Expectation–Maximization (EM) statistical approach. The web interface of MEME is easy to use, but we used the standalone version.
Improbizer [40] implements a variation of the EM method to detect asymmetric, non-random patterns in the form of PWMs. It is flexible in terms of parameters that can be tuned, e.g., the strand to be searched and the number of expected occurring patterns per sequence, and can additionally generate random data to evaluate the significance of results obtained with user-specific sequences. Improbizer was run with default parameters via its web interface. The HTML output files contained tables of predicted motif sites, together with reliability scores, sequence names, and FASTA-like relative sequences, with predicted sites highlighted in capital letters.
RSAT [41] (Regulatory Sequence Analysis Tools) is a suite of sequence analysis tools that (i) performs gapped and ungapped motif discovery on sets of input sequences; (ii) detects sequences that match with a consensus string; (iii) generates various types of random control sets; (iv) maps variants to sequences; (v) extracts sequences from primary databanks; (vi) calculates motif enrichment. This study used the oligo-analysis method to detect over-represented oligonucleotides in an exhaustive, fast, and rigorous way. It returned a series of PWMs, one for each benchmark dataset, which we fed into the pattern matching tool, matrix-scan, which in turn returned a list of candidate motif sites, reporting the start-end coordinates of the motifs within the input sequences, p-values, and ranks.
Gimsan [42] (GibbsMarkov with Significance Analysis) is a tool for de novo motif discovery based on GibbsMarkov, a variation of the popular Gibbs Sampler algorithm. It implements a hybrid model that accounts for a Bayesian prior of the percentage of sequences containing candidate sites and uses a maximum likelihood approach for the definition of PWMs. Gimsan is available as a standalone application on Unix and PBS clusters. It generated complex output folders, where we retrieved PWMs and matching sites from ‘.stdout’ files.
Weeder is a consensus-based algorithm for the automatic discovery of conserved motifs in a set of related regulatory DNA sequences. It enumerates all the oligonucleotides of a set of input FASTA sequences to produce a list of candidate consensus sequences, sorted by frequency, the number of substitutions or motif conservation, and motif width. The final ranks depend on the relative expectation of the detected patterns with respect to a background model. Two textual output files were generated. One reported the detected PWMs, while the other contained summary tables of sequence names, start positions of the detected motifs, motif sequences, and associated scores for each PWM.
Seeder [43] is designed for the efficient and reliable prediction of regulatory motifs in eukaryotic promoters. It starts by enumerating all words of a given length and, for each word, it calculates the substring minimal distance (SMD) between the word and its best matching subsequence in each sequence of a background set. For each word, it then calculates the sum of SMDs in a positive set and a p-value, using the word-specific background probability distribution. The word for which the p-value is minimal is retained, and a seed PWM is built and extended to full motif width. The process is iterated until convergence, or a maximum number of iterations is reached. For each dataset, we have obtained a list of detected motifs and, for each motif, their general features (e.g., seed width, sequence width, and information content), nucleotide matrices of counts and frequencies, their start and end positions, strands, and nucleotide sequences.
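The substring minimal distance at the heart of the seeding step can be sketched as follows. This is an illustrative reading of the description above, not Seeder's code: it assumes the Hamming distance as the distance measure, and the function names and toy sequences are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def substring_minimal_distance(word, sequence):
    """Smallest Hamming distance between `word` and any same-length window of `sequence`."""
    w = len(word)
    if len(sequence) < w:
        raise ValueError("sequence is shorter than the word")
    return min(hamming(word, sequence[i:i + w]) for i in range(len(sequence) - w + 1))

def smd_sum(word, sequences):
    """Sum of SMDs over a set of sequences (e.g. the positive set)."""
    return sum(substring_minimal_distance(word, seq) for seq in sequences)

# Toy positive set: an exact match, a one-mismatch match, and a poor match.
positive_set = ["TTACGTTT", "TTACCTTT", "GGGGGGGG"]
print(smd_sum("ACGT", positive_set))  # 4  (0 + 1 + 3)
```

A low SMD sum in the positive set, judged against the word-specific distribution obtained from the background set, is what marks a word as a promising seed.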
Modside [44] is a web platform that integrates four popular motif discovery algorithms: ChIPMunk, Weeder, MEME, and XXMotif [45]. ChIPMunk is an iterative algorithm mainly designed for finding strong regulatory motifs in ChIP-Seq, HT-SELEX, and DNase footprint datasets. It accepts extended multi-FASTA as the input format. PWMs are generated, and all the aligned sequences are classified into signal or noise according to their PWM scores. We extracted the highest-scoring PWMs from the textual output, together with the motif occurrences in the input sequence set (only those reported on the plus strand). XXMotif is a three-step seed-based algorithm that performs sequence polishing, detects significantly enriched short oligomers (seeds), and builds matrix profiles from the enriched strings. Matrices are iteratively optimized, extended or shortened, until they reach the lowest possible enrichment p-value, i.e., they are over-represented within the input sequences compared to the background. The output consists of a long list of degenerate strings sorted by Bonferroni-corrected p-value, together with a list of sequences where motifs were found. MEME and Weeder were installed independently and were therefore not queried through Modside.
DMINDA [46] is a one-stop web server for DNA motif analysis. It performs de-novo motif discovery, searches for known TFBSs, and groups similar motifs. The motif discovery step implements the BoBRO algorithm [47], which generates PWMs using a two-stage alignment process (matrix approximation and matrix consolidation). The list of predicted motifs is represented as an unweighted graph, where cliques are the most critical components. They undergo an extension/refinement procedure to generate significant motifs. The output of our analyses consisted of lists of significant motifs, as consensus strings and count matrices and lists of matching sequences for each resulting motif.
HOMER [48] is a standalone package for motif finding and scanning. It reduces input FASTA sequences into normalized oligo tables, where the occurrence of each oligo is assessed both in the input and control sequences. Then, it performs a global search for enriched oligos, converts them into PWMs, and finally refines the PWMs using a local optimization algorithm. This process is repeated until the specified number of motifs is found. HOMER’s output contains a list of candidate motifs as IUPAC strings and PWMs. This study used the homer2 find option and considered only the most significant motifs, i.e., those tagged as ‘1-XXXXXX’ in the Positives.txt output files.
Tmod [49] aggregates 12 DNA motif discovery tools. It is a standalone package written in C++ that provides non-expert users with a practical Windows-based GUI. We queried only three of them, AlignACE, SeSiMCMC, and MDScan, since we had already installed MEME, MotifSampler, and Weeder locally, and the others were no longer compatible with the most recent versions of Windows. AlignACE predicts DNA motifs, in the form of weight matrices, that are overrepresented in a FASTA input file. It is based on the original Gibbs sampling algorithm but with some variations. AlignACE uses the MAP (maximum a posteriori log-likelihood) score to estimate the degree of overrepresentation of a predicted motif compared to the expected random occurrence in an input file. In the output text file, the resulting motifs are ranked by MAP and presented as nucleotide strings. Sites belonging to the predicted gapped or ungapped motifs are shown at the bottom of the string list and marked with a ‘*’ symbol. MDScan (Motif Discovery Scan) is a seed-based method tailored to analyze ChIP–array sequences. MDScan enumerates a set of non-redundant oligomers of size ‘w’ (seeds) in both strands of the top-scoring aligned sequences. Seeds are used to compose a motif weight matrix and scan all the w-mers in the remaining sequences. New w-mers are iteratively added for a maximum of ten iterations. It returns five top-ranking count matrices and motif matching strings for both strands of the input sequences. SeSiMCMC [50] (Sequence Similarities by Markov Chain Monte Carlo) identifies gapped, ungapped, or palindromic motifs. First, it optimizes the PWM of a motif using a Gibbs-like Markov Chain and then ranks the obtained motifs by each motif’s information content. We extracted a table of DNA patterns from the output file, each described by the pattern length, start position of the pattern, and associated score, together with a summary of the configuration parameters.
Benchmark datasets
We generated 19 datasets of sequences using Markov Chain simulation, as recommended in [20]. Modeling DNA sequences as Markovian processes is, in fact, a consolidated and efficient approach [51]. In particular, each starting sequence was a zero-order sequence of length s, obtained by sampling the four nucleotides s times with one of the following sets of probabilities: PA = 0.2, PC = 0.3, PG = 0.3, PT = 0.2 (high GC content), PA = 0.3, PC = 0.2, PG = 0.2, PT = 0.3 (low GC content), and PA = 0.25, PC = 0.25, PG = 0.25, PT = 0.25 (normal GC content). We made a nucleotide substitution matrix for each sequence, setting two parameters: 0.25 for transitions and 0.05 for the four possible transversions. Then, given the zero-order sequence, a substitution matrix, and the desired sequence length, a final background sequence was generated after three simulation steps (commands in Supplementary File 3). Finally, the DNA sequences sampled from known PWMs were inserted into each generated sequence. Our benchmark dataset is thus composed of: (i) sets 1–12, which vary in the size of the background sequences (50–1000 bp), base composition, and positional regularity of the nested string; (ii) sets 13–15, which differ from the previous ones in their extreme sequence lengths, ranging from 100 to 2000 bp, and in the random placement of Motif1 (Table 2, Supplementary Figure 1) in the sequences; (iii) sets 16 and 17, which are characterized by extreme sequence lengths (100–2000 bp) and normal GC content, and where we nested Motif2 either in the middle or randomly in the background sequences, respectively; (iv) sets 18 and 19, which are made of twenty sequences of length 500 bp, with normal GC composition, where we nested both Motif3 and Motif4. In the former, the two patterns were equidistant from a constant spacer region of 343 bases. In the latter, they were inserted randomly within the first and third quarters of the sequences.
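A simplified sketch of the first and last steps of this procedure is given below: drawing a zero-order background with the stated base probabilities and nesting a site sampled from a frequency matrix. It is not the authors' pipeline (their commands are in Supplementary File 3). The substitution-matrix simulation steps are omitted, the site overwrites the background so that the sequence length is preserved (an assumption), and `TOY_MOTIF` is a made-up matrix, not one of the JASPAR motifs of Table 2.

```python
import random

# Base probabilities as stated in the text.
BASE_PROBS = {
    "high_gc": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "low_gc": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "normal_gc": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}

# Hypothetical per-position frequencies (one dict per motif column).
TOY_MOTIF = [
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    {"A": 0.9, "C": 0.0, "G": 0.1, "T": 0.0},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    {"A": 0.1, "C": 0.0, "G": 0.0, "T": 0.9},
]

def zero_order_sequence(length, probs, rng):
    """Sample `length` independent nucleotides with the given probabilities."""
    nucleotides = list(probs)
    weights = [probs[nt] for nt in nucleotides]
    return "".join(rng.choices(nucleotides, weights=weights, k=length))

def sample_site(freq_columns, rng):
    """Draw one motif instance, column by column, from the frequency matrix."""
    return "".join(
        rng.choices(list(col), weights=list(col.values()), k=1)[0]
        for col in freq_columns
    )

def plant(background, site, position):
    """Overwrite `background` with `site` at `position` (length is unchanged)."""
    if position < 0 or position + len(site) > len(background):
        raise ValueError("site does not fit at the requested position")
    return background[:position] + site + background[position + len(site):]

rng = random.Random(42)  # fixed seed so the sketch is reproducible
background = zero_order_sequence(50, BASE_PROBS["high_gc"], rng)
site = sample_site(TOY_MOTIF, rng)
sequence = plant(background, site, 20)
print(sequence)
```

Because every site is sampled independently from the matrix, repeated calls to `sample_site` yield variants of the same motif, which mirrors the pattern variability discussed in the text.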
Table 2. Selected motifs from the JASPAR 2020 CORE dataset. JASPAR accession numbers, corresponding human Transcription Factor, and motif length are provided. Nucleotide frequency matrices in MEME format are available in Supplementary File 3. Logos are available in Supplementary Figure 1.
# | JASPAR ID | TF | Motif length (nt) |
---|---|---|---|
Motif1 | MA0018.3 | CREB1 | 12 |
Motif2 | MA0849.1 | FOXO6A | 7 |
Motif3 | MA0670.1 | NF1A | 10 |
Motif4 | MA0106.3 | TP53 | 18 |
It is important to remark that using nucleotide substitution matrices allowed us to generate background sequences different from each other. Nested sequences were also different since they were generated by sampling nucleotides from the JASPAR frequency matrices of the four motifs listed in Table 2. From a biological perspective, single nucleotide variations of two distinct nested patterns deriving from the same matrix could be seen as independent mutational events occurring in distinct genomic regions. Alternatively, these variations could be interpreted as sequencing errors, while the background strings can be considered sequencing reads. The generated sequences are described in Supplementary Table 1, Custom_seq sheet. They all differ by at least one nucleotide. For example, the nested patterns of dataset 6 (Motif_1_Type_1_3_2.fasta file in Supplementary File 3) are all different and differ from those planted in other datasets. Only datasets 16 and 17 do not consider the problem of ‘pattern variability.’ They were deliberately generated by nesting the same exact pattern sequence (Motif2), with no site variations, so as to be simple to detect.
Even if comprehensive, these benchmark sets cannot completely cover the dramatic variability of DNA motifs, nor can they correctly model the complex and relatively unknown evolutionary processes that give rise to, spread, modify, and remove functional regulatory elements throughout genomes. Sequence length, complexity, and the number of sequences within a multi-FASTA file have theoretically no limits.
We thus limited ourselves to generating small- to medium-sized datasets to comply with the technical requirements of the software packages used in this study, irrespective of whether they were web servers or standalone programs. However, we included ten other sequence datasets publicly available from [18], generated by Markov Chain simulation (Table 4) and randomly selected from all those available. The features of these benchmarking sequence sets are provided in Table 4 and Supplementary Table 1. Raw files are available in Supplementary File 3. The chosen datasets contained similar numbers of sequences. Seven of them were GC-rich. The lengths of the nested patterns differed among datasets, ranging from 7 (M00799) to 23 (M01036) nucleotides.
In this work, we did not include real genomic sequences since they could contain unknown TFBSs that may be considered false positives and, in turn, affect the specificity of the tools.
Performance evaluation
How to sort real from false motifs and uncover false-negative results is currently a matter of debate. In principle, a true-positive motif should fully overlap with the real motif. However, common practice suggests tolerating a certain degree of inaccuracy to increase sensitivity. In particular, Tompa et al. [20] indicated 25% overlap as the minimum threshold to identify putative motifs, thereby hypothesizing that the deletion of one-quarter of a site would be enough to significantly hamper the canonical interaction between the site and a transcription factor. However, we could not comply with this threshold since some of the nested motifs were short, ranging from 7 to 18 bp, and differently degenerated: a 25% overlap between a predicted sequence and a short, degenerate motif might not be enough to identify a real binding site. We therefore opted for a nucleotide-level evaluation. We counted the nucleotides predicted to belong to a motif that overlap a real motif (true positives, TP), the nucleotides correctly predicted as not belonging to a motif (true negatives, TN), the nucleotides of a real motif that were not predicted (false negatives, FN), and the nucleotides predicted to belong to a motif that do not overlap a real one (false positives, FP). We then calculated sensitivity, specificity, positive predictive value, performance coefficient, Matthews correlation coefficient (MCC), accuracy, and false-positive rate (Supplementary File 1) and chose MCC as the reference metric to assess the extent of agreement between identified and real motifs. When MCC was 1, the agreement was maximum. When it was −1, there was complete disagreement. MCC was calculated either on all datasets pooled together or as the average of the per-dataset values, as in [20].
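The nucleotide-level counting and the two ways of summarizing MCC can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code used in the study: sites are given as 0-based, half-open (start, end) intervals on a single sequence, an undefined MCC (zero denominator) is reported as 0 by convention, and all function names and numbers are hypothetical.

```python
import math

def confusion_counts(seq_len, true_sites, predicted_sites):
    """Count TP, FP, FN, TN at the nucleotide level for one sequence."""
    true_pos = {i for start, end in true_sites for i in range(start, end)}
    pred_pos = {i for start, end in predicted_sites for i in range(start, end)}
    tp = len(true_pos & pred_pos)
    fp = len(pred_pos - true_pos)
    fn = len(true_pos - pred_pos)
    tn = seq_len - tp - fp - fn
    return tp, fp, fn, tn

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def pooled_mcc(count_list):
    """Sum the counts over all datasets first, then compute a single MCC."""
    tp, fp, fn, tn = (sum(c[i] for c in count_list) for i in range(4))
    return mcc(tp, fp, fn, tn)

def averaged_mcc(count_list):
    """Compute one MCC per dataset, then take the arithmetic mean."""
    return sum(mcc(*c) for c in count_list) / len(count_list)

# A 12 nt site at positions 40-51 predicted with a 4 nt shift on a 100 nt sequence.
shifted = confusion_counts(100, [(40, 52)], [(44, 56)])
exact = confusion_counts(100, [(40, 52)], [(40, 52)])
print(shifted)                  # (8, 4, 4, 84)
print(round(mcc(*shifted), 3))  # 0.621
print(mcc(*exact))              # 1.0
```

Pooled and averaged values can diverge noticeably when a tool does very well on a few large datasets and poorly on many small ones, or the other way round.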
Results
Generally, the investigated tools behaved differently depending on the dataset they were tested on (Figure 1). MEME and Seeder exhibited good sensitivity estimates (pooled: 0.79, averaged: 0.69 and pooled: 0.80, averaged: 0.75, respectively). Improbizer performed well (0.78, 0.80), although it did not return any result for five datasets. The sensitivity of RSAT, MotifSampler, Homer2, SeSiMCMC, and ChIPMunk exceeded 0.5, although the latter did not provide any output for five datasets. The best performer in terms of MCC was MEME (0.81, 0.81), followed by Seeder (0.7, 0.82) and Improbizer, which, however, did not give results for five datasets (13, 14, 15, 16, and 17). ChIPMunk’s MCC was uneven between pooled (0.78) and averaged (0.46) values. Similar discrepancies were observed for SeSiMCMC (0.51, 0.7). ChIPMunk failed to analyze datasets 13, 14, 15, 18, and 19, which are the largest and most complex. Datasets 13, 14, and 15 are, in fact, around 80 kilobytes in size, while 18 and 19 contain two nested patterns. SeSiMCMC failed with datasets 12 and 13, which contained nested patterns deriving from Motif1, with different numbers of sequences and background base composition. Gimmemotifs exhibited lower performance scores (~0.5 pooled and averaged MCC). Even lower scores were achieved by Homer2, MotifSampler, and RSAT (Supplementary Table 1, Results_custom sheet).

Distribution of tools sensitivity (A) and MCC (B) values calculated for all simulated datasets. Values for pooled datasets in orange; averaged values in gray.
MCC values were compared among tools, datasets, and the following features: base composition, motif position, motif type, sequence length, and length range. A Kruskal-Wallis H test showed a statistically significant difference among tools, χ²(15) = 100.99, p = 8.44 × 10⁻¹⁵. Post-hoc pairwise analysis highlighted numerous differences between tools; the most distinct were MEME, Seeder, and Improbizer, which exhibited the highest MCC values (Figure 2A).
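For readers unfamiliar with the test, the H statistic can be computed from pooled ranks as in the following standard-library Python sketch. The software actually used for the analysis is not stated here, so this is only an illustration; the function name and the toy data are hypothetical. With 16 tools as groups, the degrees of freedom are 16 − 1 = 15, matching the χ²(15) reported above.

```python
def kruskal_wallis_h(groups):
    """Return (H, degrees of freedom) for a list of samples.

    H is computed on average ranks with the usual tie correction.
    The p-value (chi-square upper tail) is deliberately not computed.
    """
    # Pool all values, remembering which group each one came from.
    pooled = sorted((v, g) for g, vals in enumerate(groups) for v in vals)
    n = len(pooled)
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2      # mean of the 1-based ranks i+1..j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i                       # size of this block of ties
        tie_term += t ** 3 - t
        i = j
    rank_sums = [0.0] * len(groups)
    for (_, g), r in zip(pooled, ranks):
        rank_sums[g] += r
    h = 12 / (n * (n + 1)) * sum(
        rs ** 2 / len(vals) for rs, vals in zip(rank_sums, groups)
    ) - 3 * (n + 1)
    correction = 1 - tie_term / (n ** 3 - n)  # equals 1 when there are no ties
    return h / correction, len(groups) - 1

# Toy example: three fully separated groups of per-dataset MCC values.
h, df = kruskal_wallis_h([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]])
print(round(h, 3), df)
```

For the toy data the rank sums are 6, 15, and 24, giving H = 7.2 with 2 degrees of freedom. In practice a library routine such as `scipy.stats.kruskal`, which also returns the p-value, would be the more convenient choice.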

MCC distributions drilled down into (A) tools, (B) datasets, (C) motif types, (D) motif length, (E) GC composition of motifs, (F) variability of the length of sequences, and (G) position of motifs within the sequences. Networks in (A) and (B) wire tools or datasets with edges if the MCC distributions of the connected nodes are significantly different (red if p-value <0.05, blue if adj. p-value <0.05). Datasets 1–6 blue, 7–12 green, 13–15 orange, 16–17 black, 18–19 yellow.
The distribution of MCC values differed significantly among datasets, χ²(18) = 31.172, p = 0.013 (Figure 2B, Supplementary Table 2). However, datasets 1–6, consisting of a 12 bp DNA motif inserted in different sequence contexts, were comparable. Datasets 7–12, containing Motif1 (Table 2) nested into increasingly long sequences, up to 1 Kbp, were also comparable. Likewise, MCC values were similar within datasets 13–15, where Motif1 was inserted into sequences with varying levels of GC content, within datasets 16 and 17, characterized by the presence of Motif2 in the middle or at random positions of the sequences, and within datasets 18 and 19, where Motifs 3 and 4 co-occurred at fixed or variable positions in 500 bp long sequences. Generally, the median MCC values were comparable for most pairwise tests (Supplementary Table 1, MCC_pairwise sheet). Datasets 12 and 16 are noteworthy: the median MCC value of the former was significantly lower than that of most datasets (1, 3–6, 16–17), while that of the latter was significantly higher than that of almost all other datasets (7–13, 15, 18–19). Moreover, almost no tool was able to identify randomly interspersed patterns (Motif 1) within long sequences. Dataset 16 probably represents the easiest scenario since most tools succeeded in finding a 7 bp-long pattern inserted in the exact middle of all the fifty sequences composing the dataset.
MCC differed by motif type (χ²(2) = 11.306, p = 0.003), and Motif2 was recognized significantly better than the others (Figure 2C). Moreover, the length of the background sequences affected the performance of tools, χ²(3) = 19.479, p = 0.0002. In particular, motifs in short sequences (50 bp) were identified significantly better than motifs nested in longer sequences (Figure 2D, Supplementary Table 2). Finally, the base composition of motifs (High AT, AT = GC, High GC, Figure 2E), the variability of the length of the sequences (fixed versus variable, Figure 2F), and the position (middle versus random) of the motifs in the simulated sequences (Figure 2G) did not affect performance.
Similar results were obtained when assessing sensitivity in place of MCC (Supplementary Table 2). Base composition and relative motif position were also studied in datasets 1–6 and 7–12, and no significant differences emerged (Supplementary File 2).
Performance optimization by tool combination
The previous results indicated MEME, Seeder, and Improbizer as the most reliable tools in terms of average MCC across all 19 generated datasets (Figure 2A). However, considering the variability of the tools’ performance across datasets (Figure 2B) and their generally short lifetime or maintenance, we investigated whether combining them would increase performance (Figure 3A). We considered all 120 possible pairs of tools and, for each pair A and B, calculated a unified score $\mathrm{MCC}_{AB}$. In detail, we compared $\mathrm{MCC}_{A,i}$ and $\mathrm{MCC}_{B,i}$ for every benchmark dataset $i$. If $\mathrm{MCC}_{A,i} > \mathrm{MCC}_{B,i}$, then $TP_i = TP_{A,i}$, $FP_i = FP_{A,i}$, $TN_i = TN_{A,i}$, and $FN_i = FN_{A,i}$, namely the counts of A were taken for dataset $i$; otherwise, the counts of B were taken. Finally, all counts were summed, e.g., $TP = \sum_{i=1}^{19} TP_i$, $TN = \sum_{i=1}^{19} TN_i$, etc., and the unified $\mathrm{MCC}_{AB}$ score was calculated from the total TP, TN, FP, and FN counts for the pair AB. For example, subtable 1 of Supplementary Table 1, Unified_MCC sheet, compares AlignACE and Weeder2. For dataset 1, Weeder2’s MCC (0.735) was higher than AlignACE’s (0.721). Thus, the unified MCC ($\mathrm{MCC}_{AB}$) used the TP, TN, FP, and FN counts of Weeder2 for this dataset. This process was repeated for all 19 datasets. Bold values in row 22 represent the sums of all TP, TN, FP, and FN counts from which $\mathrm{MCC}_{AB}$ was calculated. This strategy allowed us to highlight ‘dataset complementarity,’ namely the ability of tool A to perform well with a dataset where B performed poorly, and to measure their joint ability to find the nested motifs.
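The pairing procedure described above can be sketched in a few lines of Python. This is an illustrative reading of the text rather than the authors' code: the function names and the toy counts are hypothetical, ties are resolved in favor of tool B (as implied by "in the opposite case"), and the sketch assumes that a tuple of counts is available for every dataset, including those where a tool returned nothing.

```python
from itertools import combinations
from math import sqrt

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when undefined (assumed convention)."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def unified_mcc(counts_a, counts_b):
    """Unified MCC for a pair of tools.

    counts_a and counts_b are lists of (TP, FP, TN, FN) tuples, one per
    dataset and in the same dataset order. For each dataset the counts of
    the tool with the strictly higher MCC are kept (tool B otherwise), the
    kept counts are summed, and a single MCC is computed on the totals.
    """
    totals = [0, 0, 0, 0]
    for ca, cb in zip(counts_a, counts_b):
        best = ca if mcc(*ca) > mcc(*cb) else cb
        for k in range(4):
            totals[k] += best[k]
    return mcc(*totals)

# Number of unordered pairs among 16 tools, as stated in the text.
n_pairs = len(list(combinations(range(16), 2)))  # 120

# Toy example with two datasets: tool A wins on the first, tool B on the second.
tool_a = [(9, 3, 35, 3), (0, 12, 26, 12)]
tool_b = [(6, 6, 32, 6), (12, 0, 38, 0)]
print(n_pairs, round(unified_mcc(tool_a, tool_b), 3))
```

In the toy example the unified score (about 0.836) exceeds the pooled score of either tool alone, which is the kind of complementarity the procedure is meant to reveal.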

(A) Heatmap of combined scores. Scores greater than 0.7 are explicitly reported. Individual scores are on the diagonal. (B) Heatmap of the absolute increase or decrease from individual scores. Scores greater than 0.2 or less than −0.2 are explicitly reported.
Characteristics and composition of datasets. For each dataset, the number of FASTA sequences, sequence lengths, GC composition, the identity of the nested motifs, and position of the motifs within the background sequences are reported. Finer details in Supplementary Table 1, Custom_seq sheet
# | Dataset name | # of sequences | Initial sequence length (bp) | Base composition | Nested motif | Motif position |
---|---|---|---|---|---|---|
1 | Motif_1_Type_1_1_1 | 20 | 50 | 60% GC 40% AT | Motif1 | Middle |
2 | Motif_1_Type_1_1_2 | 20 | 50 | 60% GC 40% AT | Motif1 | Random |
3 | Motif_1_Type_1_2_1 | 20 | 50 | 40% GC 60% AT | Motif1 | Middle |
4 | Motif_1_Type_1_2_2 | 20 | 50 | 40% GC 60% AT | Motif1 | Random |
5 | Motif_1_Type_1_3_1 | 20 | 50 | 50% GC 50% AT | Motif1 | Middle |
6 | Motif_1_Type_1_3_2 | 20 | 50 | 50% GC 50% AT | Motif1 | Random |
7 | Motif_1_Type_2_1_1 | 20 | 1000 | 60% GC 40% AT | Motif1 | Middle |
8 | Motif_1_Type_2_1_2 | 20 | 1000 | 60% GC 40% AT | Motif1 | Random |
9 | Motif_1_Type_2_2_1 | 20 | 1000 | 40% GC 60% AT | Motif1 | Middle |
10 | Motif_1_Type_2_2_2 | 20 | 1000 | 40% GC 60% AT | Motif1 | Random |
11 | Motif_1_Type_2_3_1 | 20 | 1000 | 50% GC 50% AT | Motif1 | Middle |
12 | Motif_1_Type_2_3_2 | 20 | 1000 | 50% GC 50% AT | Motif1 | Random |
13 | Motif_1_Type_3_1 | 76 | 100–2000 | 60% GC 40% AT | Motif1 | Random |
14 | Motif_1_Type_3_2 | 76 | 100–2000 | 40% GC 60% AT | Motif1 | Random |
15 | Motif_1_Type_3_3 | 76 | 100–2000 | 50% GC 50% AT | Motif1 | Random |
16 | Motif_2_1 | 50 | 100–2000 | 50% GC 50% AT | Motif2 | Middle |
17 | Motif_2_2 | 50 | 100–2000 | 50% GC 50% AT | Motif2 | Random |
18 | Motif_3_1 | 20 | 500 | 50% GC 50% AT | Motif3 + Motif4 | Middle |
19 | Motif_3_2 | 20 | 500 | 50% GC 50% AT | Motif3 + Motif4 | Random |
In this setting, the best-performing tools, MEME and Seeder, did not benefit from being paired with others, meaning that no other tool significantly outperformed them on any dataset. Comparing MEME with Seeder, we verified that Seeder outperformed MEME on 6 out of 19 datasets and that their combined score was only marginally higher (+0.026) than the original MEME score (0.813). Interestingly, five of these six datasets (1, 3, 5, 7, 11) were generated by placing Motif1 in the middle of background sequences (Table 3), thereby indicating a better ability of Seeder to discover this kind of pattern. However, the best-performing combination of tools was MEME and Improbizer, which achieved a score of 0.846, an increase of 0.033 over the individual MEME MCC score (Supplementary Table 1, Unified_MCC, subtable 115). Like all other tools that failed to yield results for some datasets (13–17 in this case), Improbizer was penalized when paired with Weeder, MDScan, and SeSiMCMC. In our tests, Improbizer could not manage large FASTA files, generally greater than 50 kb, and thus it did not produce any results for datasets 13–17. Furthermore, Improbizer performed better with datasets 1, 3, 5, 7, 9, and 11, where patterns were nested in the middle of 50 or 1000 bp long background sequences. ChIPMunk achieved outstanding performance alone (combined MCC: 0.78) and was penalized when paired with any other tool except MEME. This tool had excellent performance with datasets 1–9 and 11, while it did not return any result for datasets 13–15 (the largest sequence datasets) and 18–19. No other combination of tools provided appreciably higher unified MCC values, except the pair SeSiMCMC and RSAT, which scored 0.723 and significantly improved on the individual scores (0.513 for SeSiMCMC and 0.547 for RSAT, Figure 3B). All unified MCC scores and pairwise variations are summarized in Supplementary Table 1, Unified_MCC_var.
Evaluating performance on third-party datasets
Performance of the same tools run on ten Markovian datasets available in Supplementary File 3 was generally poor. Performance differed from that reported above for the length and complexity of nested patterns and global base composition. We observed very low MCC and sensitivity scores for all datasets and methods (Figure 4; Supplementary Table 1, Sandve sheet). Furthermore, BaMM and SeSiMCMC did not run correctly with most of these datasets. In particular, BaMM found motifs in only three out of ten Sandve datasets, while SeSiMCMC repeatedly hung with all datasets and did not yield any results.

Distribution of tools sensitivity (A) and MCC (B) values calculated for all ten third-party datasets. Values for pooled datasets in orange; averaged values in gray.
Characteristics of selected datasets from the Sandve website. Details in Supplementary Table 1, Sandve_seq sheet
Dataset name | Motif length (nt) | Number of sequences per dataset | GC content (%) |
---|---|---|---|
M01007 | 19 | 17 | 56.6 |
M00919 | 11 | 18 | 57.3 |
M00920 | 12 | 18 | 57.7 |
M01011 | 21 | 17 | 45.6 |
M00982 | 14 | 13 | 61.5 |
M00774 | 16 | 15 | 48.1 |
M00939 | 9 | 14 | 59.1 |
M00799 | 7 | 10 | 57.4 |
M01036 | 23 | 8 | 47.5 |
M01035 | 11 | 11 | 52.8 |
The best-performing tools were MEME and Weeder2, with combined MCC scores of 0.19 and 0.17, respectively, and mean values of 0.13 and 0.15. The highest sensitivity was achieved by MEME (0.179), followed by Weeder2 (0.135). Overall, all tools failed to recognize most motif sites. No results were obtained with SeSiMCMC, while BaMM and Gimmemotifs did not work with 7 and 1 datasets out of 10, respectively. Among all 160 runs (16 tools on 10 datasets), we obtained decent MCC values (≥ 0.2) for only 14 of them: four for Weeder2 (datasets M01007, M00919, M00920, M00939); two for RSAT (M01007, M00799); two for MEME (M01007, M00919); two for XXMotif (M00919, M01007); and one each for Homer2 (M00982), ChIPMunk (M01007), Improbizer (M01007), and MotifSampler (M00982). All results are reported in Supplementary Table 1, Sandve_seq and Sandve sheets. The unified MCC of the best pair of tools in our previous comparisons, i.e., MEME and Weeder2, was still low (0.16). This is explained by the fact that both tools identified a low number of motif sites in all datasets and could therefore complement each other only to a limited extent (Supplementary Table 1, Unified_MCC_Sandve). Thus, we did not perform this analysis for the remaining tools.
Discussion
More than a decade ago, several authors assessed the ability of their algorithms to detect known TFBSs in simulated and real sequences. Their performance was generally low, with nucleotide Correlation Coefficient generally <0.3 and Sensitivity <0.1 (cf. Figure 1a in [20]). Weeder outperformed the thirteen other competing methods. That work already brought multiple issues to light. The generation of test sets of sequences, tool configuration, output file parsing, and the interpretation and summarization of outcomes proved to be critical tasks, as confirmed in the present work. Every tool requires a number of warm-up runs to settle some preliminary issues: whether to use control sequences (when possible), whether to limit the search to a certain motif width, which statistics to use to determine the enriched motifs, and how to rank results. These technicalities make benchmarking initiatives tough and affect comparability.
In this study, we opted to consider only artificial sequences to prevent, or at least minimize, the presence of unknown real motifs, which would otherwise cause the specificity of the tested tools to be underestimated. We also kept the effort of configuring tools to a minimum to reflect a basic user experience and recruited only tools that were easy to configure, either standalone or online, and that responded to our queries in a reasonable time. Tools were selected among those whose output was automatically parsable. However, given the complex and unstructured nature of some outputs, manual intervention was sometimes needed. Parsing and collecting results was therefore a laborious task and a critical issue that, for some tools, requires the developers’ attention.
Tool | Pros | Cons |
---|---|---|
BaMM | Easy to use interface; results obtained in short time; simple raw output; it produces images for the detected motifs. | Poor performance with short-length background sequences and/or long/complex nested motifs. |
DMINDA2 | Simple raw output; the interface is easy-to-use. | Poor performance with long background sequences. |
Gimmemotifs | Easy to configure; it produces images for the detected motifs; it implements many third-party algorithms. | Poor performance with long background sequences; quite complex output. |
Gimsan | Good performance with long background sequences. | Hard to configure; complex output; high computing time. |
Homer2 | Easy to configure; short computing time; excellent performance only when searching for short and conserved motifs. | Complex output; poor performance when searching other than short and conserved motif types. |
Improbizer | Simple web interface; simple output; generally good performance in the custom benchmark. | It does not work with large sequence datasets. |
MEME | Easy to configure; excellent performance in different scenarios; very short computing time. | It does not perform very well when the number of sequences per dataset is relatively low. |
MODSIDE – ChIPMunk | Web interface easy to use and configure; short computing time. | It does not return results for large datasets; quite complex output. |
MODSIDE – XX motif | Web interface easy to use and configure; short computing time. | It does not return results for large datasets; quite complex output; it does not perform well with long background sequences and/or complex nested motifs. |
MotifSampler | Decent performance in some scenarios. | Hard to configure; long computing time. |
RSAT | Decent performance; fast computing time. | Web-interface not so easy-to-use; complex raw output. |
Seeder | Decent performance in most scenarios; simple output. | Hard to configure. |
Tmod – MDScan | Simple output. | Hard to configure; It performs poorly when background sequences have variable sizes. |
Tmod – AlignACE | Simple output; decent performance only with short-length background sequences. | Hard to configure; it did not return any result for many datasets. |
Tmod – SeSiMCMC | Simple output; short computing time; decent performance in some scenarios. | Hard to configure; it did not return any result for some datasets. |
Weeder | Simple output. | Hard to configure; it performs poorly, although it works well with short nested motifs. |
Among all calculated indices of performance, our assessments were primarily based on MCC and sensitivity since the former embodies the whole confusion matrix (TP, FP, TN, and FN counts), while the latter gives a quick view of how many real motifs were found. On this basis, MEME was the best tool in terms of performance, ease of configuration, and manageability of output, followed by Seeder, which failed to return any TP sites for datasets 18 and 19 and was not as easy to configure as MEME. Improbizer proved promising, even though it was clearly designed for small sequence files (< 50 kb), an evident limitation in the modern genomics era. We encourage the Improbizer authors to improve their method with regard to the submission and analysis of large sequence data.
RSAT, ChIPMunk, Homer2, and MotifSampler performed sufficiently well, with decent MCC and sensitivity values for half of the datasets. While MEME and RSAT can be used both online and standalone, Seeder, Homer2, and Gimmemotifs must be installed locally and require minimal technical skills. As expected, the impact of the base composition of background sequences and of the relative motif position was negligible. A plausible reason is that most tools integrate normalization/correction techniques to deal with sequence variability. Instead, performance varied when datasets were grouped by length range and motif type. MCC and sensitivity scores differed among groups, highlighting the gradually decreasing capacity of all tools to identify motifs in longer regions and more complex motifs.
Considering dataset-specific MCC distributions, the best performance was obtained with dataset 16, where a short motif was inserted in the exact middle of sequences of variable size. Good performance was also obtained with dataset 17: nine tools, i.e., BaMM, DMINDA2, Homer2, MEME, MotifSampler, RSAT, Seeder, SeSiMCMC, and Weeder2, efficiently detected short exact motifs within datasets 16 and 17. Dataset 12, characterized by degenerate DNA motifs sampled from Motif1 and randomly nested in 1000 bp sequences, represented our worst-case scenario.
Combining tools did little to increase individual performance. In particular, partnering MEME with any other tool resulted in only slightly increased performance. Similar results were obtained with Improbizer and Seeder. The former improved its performance only when paired with MEME, reaching the absolute maximum MCC value, and with Seeder. The latter increased its scores when paired with any other tool except BaMM, ChIPMunk, XXMotif, and Weeder2. Considering that these results were specific to the features of the generated datasets, we tested the same 16 tools on ten other third-party benchmark datasets. Contrary to what was reported above, all tools performed poorly, with decent MCC scores provided only by Weeder2, MEME, RSAT, and XXMotif. These results were consistent with [19].
The differences in performance on the two series of datasets raise several concerns about their construction and the generation/localization of patterns. The Sandve datasets consist of very few sequences, i.e., <20 per dataset, with lengths ranging from 200 to 2000 nucleotides. Nested patterns are 7–23 bp-long strings and are interspersed through sequences. These features make this dataset heterogeneous and, then, motif detection a challenging task for all tools, as it can be noted when comparing M00799 and dataset 17. The former comprises only ten very long sequences with a 7 bp roughly conserved motif (CACGTGG) located at variable positions; the latter contains 50 sequences with a 7 bp long conserved and interspersed motif (GTAAACA). In the first case, motif discovery was poor, with only RSAT achieving a good performance (MCC = 0.545, sensitivity 0.6). However, with a broader sequence space to explore, most tools succeeded in identifying nested patterns, as observed for dataset 17 (cf. Supplementary Table 1, Custom sheet).
Another critical point is the choice between real-world or simulated data and, in the latter case, how to simulate data reliably. As evidenced in [19], natural genomic sequences can contain zero, one, or dozens of regulatory elements, and their number is only partially related to the size of sequences. Thus, it is rarely possible to exclude the coexistence of undiscovered but biologically functioning regulatory elements with annotated DNA motifs. This makes the definition of TP regulatory site hard. To better explain this concept, we sought 20 known TFBS of the GATA2 protein taken from JASPAR (details in Supplementary Table 3) using MEME. For each binding site, we created a ± 100 bp genomic interval and asked MEME to detect into it any significant pattern spanning 10–12 bp to ease the detection of the 11 bp-long GATA2-related motifs. MEME not only identified the most known motifs (MCC = 0.733, Sensitivity = 0.727) but also returned a set of patterns that partially overlapped several known regulatory elements from the JASPAR Genome Tracks. These could be considered incidental findings because they are unintentional TPs, given that they are biologically relevant but not what we were searching for. Hence, real-world datasets can invalidate the definition and quantification of TP, FP, TN, and FN counts and, consequently, tool performance assessment.
The use of simulated sequence datasets minimizes the possibility that unintentional motifs could be found and simplifies the analytical steps of comparing known with predicted patterns, counting overlapping and non-overlapping sites, and combining or averaging dataset-specific results. More importantly, the absence of secondary hits would allow measuring the performance of a DNA motif discovery tool more reliably and realistically. This is linked to the second aspect, i.e., generating artificial datasets that are sufficiently close to real-world DNA sequences. In our opinion, it is hard to claim that one dataset is more realistic than another. Concerning ours, we generated nested patterns from real PWMs. We considered realistic base composition for our background sequences and simulated the coexistence of two very proximal regulatory elements (datasets 18 and 19): conditions commonly observed in genomes. Further efforts in this direction would require modeling insertion/deletion events and low-complexity regions in background sequences. However, these additional features would make simulated datasets even less comparable, and tools’ performance strictly dataset-specific.
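The general idea of nesting PWM-derived patterns in background sequences with a chosen base composition can be sketched as follows. This is a simplified illustration, not the procedure used to build the datasets of this study: the PWM layout (one row of A, C, G, T probabilities per motif position), the function names, the composition values, and the single-site-per-sequence design are all assumptions.

```python
import random

BASES = "ACGT"


def sample_site(pwm, rng):
    """Draw one site from a PWM given as a list of [pA, pC, pG, pT] rows."""
    return "".join(rng.choices(BASES, weights=row, k=1)[0] for row in pwm)


def background(length, composition, rng):
    """Random background with the given [pA, pC, pG, pT] base composition."""
    return "".join(rng.choices(BASES, weights=composition, k=length))


def plant(pwm, n_seqs, length, composition, seed=0):
    """Return (sequence, start, site) tuples with one planted site each."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    width = len(pwm)
    dataset = []
    for _ in range(n_seqs):
        bg = background(length, composition, rng)
        site = sample_site(pwm, rng)
        start = rng.randrange(0, length - width + 1)
        # Overwrite the background at the chosen position with the site.
        seq = bg[:start] + site + bg[start + width:]
        dataset.append((seq, start, site))
    return dataset


# Hypothetical, fully conserved PWM spelling GTAAACA (illustration only).
exact_pwm = [
    [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [1, 0, 0, 0],
    [1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0],
]
# Hypothetical AT-rich background composition.
data = plant(exact_pwm, n_seqs=5, length=200, composition=[0.3, 0.2, 0.2, 0.3])
for seq, start, site in data:
    print(start, site, len(seq))
```

Because the planted positions are recorded alongside each sequence, the known sites are available without ambiguity when predictions are later scored.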
To summarize, the current classic methodologies perform well when the genomic regions under investigation are small, as in the analysis of TFBS within ChIP-sequencing data, where the screened genomic regions are short and fairly homogeneous in length. Their performance decreases as the regions become larger, for example, in the analysis of gene promoters or introns in search of transcriptional regulatory elements. Moreover, as expected, motif discovery tools perform well in detecting short ultra-conserved (or exact) motifs, while long and highly degenerate binding sites are hard to identify. In this case, it is recommended to run more than one tool and make a consensus of the results. We suggest using MEME together with BaMM, RSAT, or Improbizer for less experienced users. If users have computational skills, MEME may be coupled with Seeder, GimmeMotifs, or Homer2. Finally, we suggest using standalone packages with large sequence datasets. MEME, Seeder, GimmeMotifs, and Homer2, for example, are easy to configure. Although Weeder2 generally performed poorly, it returned pretty decent results when searching for short exact motifs in datasets 16, 17 of our custom series or datasets M01007, M00919, M00920, M00939 of the Sandve series. MEME exhibited outstanding performance with two of them. Thus, we recommend using these tools for their versatility, although Weeder2 requires some programming skills (Table 5).
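One possible way to make a consensus of the results of several tools is a simple per-position vote, sketched below in Python. The voting rule, the `min_tools` threshold, the tool labels, and the coordinates are illustrative assumptions; no specific consensus procedure is prescribed here.

```python
from collections import Counter


def consensus_sites(predictions, min_tools=2):
    """Keep positions predicted by at least `min_tools` tools.

    predictions: dict mapping a tool label to a list of half-open
    (start, end) intervals on the same sequence.
    Returns merged half-open intervals of the retained positions.
    """
    votes = Counter()
    for intervals in predictions.values():
        # Each tool votes at most once per position.
        covered = {p for s, e in intervals for p in range(s, e)}
        votes.update(covered)

    kept = sorted(p for p, v in votes.items() if v >= min_tools)

    merged = []
    for p in kept:
        if merged and p == merged[-1][1]:
            merged[-1][1] = p + 1        # extend the current interval
        else:
            merged.append([p, p + 1])    # open a new interval
    return [tuple(iv) for iv in merged]


# Hypothetical predictions from three tools on one sequence.
preds = {
    "tool_a": [(10, 20)],
    "tool_b": [(15, 25)],
    "tool_c": [(40, 47)],
}
print(consensus_sites(preds, min_tools=2))
```

Raising `min_tools` trades sensitivity for fewer false positives, so the threshold is best chosen with the complementarity of the selected tools in mind.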
As mentioned above, recent years have witnessed the development of several deep learning-based tools to identify TFBS in genomes and transcriptomes, e.g., DeepBind [33] and several more recent tools [52–54]. They have been shown to perform well, especially when applied to large datasets [34, 35]. Today, high-throughput experiments (e.g., whole-genome sequencing, chromatin immunoprecipitation sequencing, and cross-linking immunoprecipitation sequencing) have, in fact, dramatically increased the quantity and quality of molecular data in a vast range of conditions. Furthermore, with the support of powerful hardware infrastructures, deep-learning methods meet the needs of modern research well, i.e., efficiently extracting biological information from massive, very heterogeneous sequence datasets. However, we did not include these methods in this work for two reasons. First, they are hard to configure, requiring complex tuning of hyper-parameters and specific skills in machine learning and, especially, neural networks. Second, we explicitly based our investigation on limited-size artificial datasets, where deep-learning-based methods would not work at their best by definition. Nevertheless, comparing classic and deep-learning-based methods is interesting and might be the object of future comparative works.
Key Points
- Motifs are short patterns of nucleotides or amino acids with a putative or ascertained biological significance.
- Identifying DNA motifs is not an easy task: numerous factors, such as genomic complexity, the incomplete knowledge of the evolutionary forces shaping motifs, their variable size, and the possibility of motif co-occurrence and overlap, increase the complexity of the problem.
- Artificial datasets cannot capture the overall complexity of the problem but can reduce the bias introduced by any unknown secondary finding present in real sequences.
- Base composition, pattern location, and background sequence length have no significant impact on prediction results.
- Across the benchmark datasets implemented in this study, tools complement one another to different degrees, with relevant anti-synergistic effects for a few of them.
Funding
Italian Ministry of Health (Ricerca Corrente 2018-2020) and 5x1000 voluntary contribution.
Stefano Castellana holds a Master's degree in Cellular and Molecular Biology and a PhD in Genetics and Molecular Evolution. His research interests include molecular evolution and NGS data analysis.
Tommaso Biagini holds a Master's degree in Bioinformatics and a PhD in Cellular and Molecular Biology. His research interests include molecular dynamics simulation methods and oncogenomics.
Luca Parca holds a Master's degree in Bioinformatics and a PhD in Cellular and Molecular Biology. His research interests include proteogenomics and data science.
Francesco Petrizzelli holds a Master's degree in Bioinformatics and is a PhD student in human biology and medical genetics. His research interests include the study of bio-inspired molecular dynamics simulation techniques.
Salvatore Daniele Bianco holds a Master's degree in Bioinformatics and is a PhD student in human biology and medical genetics. His research interests include Artificial Intelligence and Data Science.
Angelo Luigi Vescovi is Scientific Director of the IRCCS Casa Sollievo della Sofferenza, owner of the biotech StemGen, member of the scientific committee of Revert Onlus and professor at Bicocca University of Milan. His research interests include stem cell biology, regenerative medicine and innovative therapies.
Massimo Carella is deputy Scientific Director and head of the medical genetics unit of the IRCCS Casa Sollievo della Sofferenza. His research interests include the study of the causes and mechanisms of pathogenesis of rare genetic diseases.
Tommaso Mazza holds a Master's degree in Computer Science Engineering and a PhD in Computer Science and Biomedical Engineering. His main research focus is on network biology and biomarker discovery through unconventional software and hardware solutions. He leads the Bioinformatics Unit at CSS.