Oriol Fornes, Jaime A Castro-Mondragon, Aziz Khan, Robin van der Lee, Xi Zhang, Phillip A Richmond, Bhavi P Modi, Solenne Correard, Marius Gheorghe, Damir Baranašić, Walter Santana-Garcia, Ge Tan, Jeanne Chèneby, Benoit Ballester, François Parcy, Albin Sandelin, Boris Lenhard, Wyeth W Wasserman, Anthony Mathelier, JASPAR 2020: update of the open-access database of transcription factor binding profiles, Nucleic Acids Research, Volume 48, Issue D1, 08 January 2020, Pages D87–D92, https://doi.org/10.1093/nar/gkz1001
Abstract
JASPAR (http://jaspar.genereg.net) is an open-access database of curated, non-redundant transcription factor (TF)-binding profiles stored as position frequency matrices (PFMs) for TFs across multiple species in six taxonomic groups. In this 8th release of JASPAR, the CORE collection has been expanded with 245 new PFMs (169 for vertebrates, 42 for plants, 17 for nematodes, 10 for insects, and 7 for fungi), and 156 PFMs were updated (125 for vertebrates, 28 for plants and 3 for insects). These new profiles represent an 18% expansion compared to the previous release. JASPAR 2020 comes with a novel collection of unvalidated TF-binding profiles for which our curators did not find orthogonal supporting evidence in the literature. This collection has a dedicated web form to engage the community in the curation of unvalidated TF-binding profiles. Moreover, we created a Q&A forum to ease the communication between the user community and JASPAR curators. Finally, we updated the genomic tracks, inference tool, and TF-binding profile similarity clusters. All the data is available through the JASPAR website, its associated RESTful API, and through the JASPAR2020 R/Bioconductor package.
INTRODUCTION
Transcription factors (TFs) are proteins involved in the regulation of gene expression at the transcriptional level (1). They interact with DNA in a sequence-specific manner through their DNA-binding domains (DBDs), which are used to classify TFs into structural families (2). The genomic locations where TFs bind to DNA are known as TF binding sites (TFBSs), which are typically short (6–20 bp) and exhibit sequence variability (3). Genome-wide identification of TFBSs is key to understanding transcriptional regulation. As it is not possible to identify all TFBSs for every cell type and cellular condition experimentally, computational modeling of TF-binding specificities has been instrumental to predict TFBSs in the genome. These computational models aim at representing the complex interplay between nucleotide and/or DNA shape readout at TFBSs (4), and can be used to predict not only the precise location where TFs interact in the genome (5), but also TFs with enriched TFBSs in a set of sequences (6), or the impact of mutations on TF binding (7,8), amongst others.
From the plethora of existing computational models (9), position frequency matrices (PFMs) (10) are one of the simplest and (still) most commonly used, although more complex models, for instance based on hidden Markov models or deep learning (11–13), are becoming more common. A PFM is a TF-binding profile that models the DNA-binding specificity of a TF by summarizing the frequencies of each nucleotide at each position from observed TF-DNA interactions. These interactions are usually derived from in vitro assays (e.g. SELEX (14) or protein binding microarrays (15)), which assess the binding affinity of TFs to DNA sequences, or from ChIP-based experiments (e.g. ChIP-seq (16), ChIP-exo (17), or ChIP-nexus (18)), which capture TF-DNA interactions in vivo, by looking for over-represented DNA sequences in regions bound by the ChIP’ed TF.
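To make the definition above concrete, the following minimal sketch counts nucleotides per position over a set of aligned binding sites. The function name `build_pfm` and the example sites are hypothetical and were invented for illustration; they are not taken from JASPAR, and real pipelines first have to discover and align the sites.

```python
NUCLEOTIDES = "ACGT"


def build_pfm(sites):
    """Count each nucleotide at each position of equal-length, aligned sites."""
    if not sites:
        raise ValueError("at least one site is required")
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    # One row of counts per nucleotide, one column per motif position.
    pfm = {nt: [0] * length for nt in NUCLEOTIDES}
    for site in sites:
        for pos, nt in enumerate(site.upper()):
            if nt not in pfm:
                raise ValueError(f"unexpected symbol {nt!r}")
            pfm[nt][pos] += 1
    return pfm


# Hypothetical aligned binding sites (illustration only, not real data).
example_sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]
example_pfm = build_pfm(example_sites)
```

Each column of the resulting matrix sums to the number of sites, so positions with one dominant nucleotide (here the first `T`) are easy to tell apart from variable positions.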
With the advent of high-throughput sequencing more than a decade ago, the number of PFMs derived from in vivo and in vitro experiments has increased dramatically, leading to the creation of multiple databases storing PFMs or more complex TF-binding profiles such as JASPAR (19), CIS-BP (20) and HOCOMOCO (21) (see (22) for a comprehensive review). The JASPAR database (http://jaspar.genereg.net/) is one of the most popular databases of TF-binding profiles, and has been maintained for over 15 years (23). As such, many computational tools dedicated to the study of gene regulation incorporate profiles from JASPAR (e.g. TFBSshape (24,25), RSAT (26), MEME (27) or i-cisTarget (6)). At the heart of JASPAR is its CORE collection, which contains TF-binding profiles that are: (i) manually curated (meaning that orthogonal supporting evidence from the literature is required for each profile); (ii) non-redundant (one profile per TF with the exception of TFs with multiple DNA-binding sequence preferences (28)); (iii) associated with TFs from one of six taxa (vertebrates, nematodes, insects, plants, fungi, and urochordata) and (iv) freely available to the community through a user-friendly web interface, a RESTful API (29), and a dedicated R/Bioconductor data package (‘JASPAR2020’).
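Profiles downloaded from the resources listed above can be read with a few lines of code. The sketch below assumes the plain-text "JASPAR" matrix format (a `>ID NAME` header followed by one bracketed row of counts per nucleotide); this format is not described in the text, and the profile ID `MA0000.1` and name `TOY` are hypothetical placeholders, not real database entries.

```python
def parse_jaspar(text):
    """Parse JASPAR-formatted text into a list of (matrix_id, name, pfm) tuples."""
    profiles = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split(None, 1)
            if not header:
                raise ValueError("header line without a matrix ID")
            name = header[1] if len(header) > 1 else ""
            current = (header[0], name, {})
            profiles.append(current)
        elif current is not None:
            # Row layout assumed here: "A  [ 0 0 4 0 ]"
            nucleotide = line[0].upper()
            tokens = line[1:].replace("[", " ").replace("]", " ").split()
            current[2][nucleotide] = [float(token) for token in tokens]
    return profiles


# Hypothetical profile used only to exercise the parser.
toy_text = """>MA0000.1 TOY
A  [ 0 0 4 0 ]
C  [ 0 0 0 3 ]
G  [ 0 3 0 1 ]
T  [ 4 1 0 0 ]
"""
toy_profiles = parse_jaspar(toy_text)
```

Counts are stored as floats so that files containing non-integer values are not silently truncated.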
Here, we present the 8th release of JASPAR, which comes with a major expansion and update of its CORE collection. Moreover, we introduce a new collection of unvalidated profiles, which stores quality-controlled PFMs for which our curators could not find orthogonal support. This collection has a dedicated web interface to engage the community of users in the curation of TF-binding profiles. Finally, we have updated the hierarchical clusters of TF-binding profiles, the genomic tracks of predicted TFBSs (now available for 8 genomes), and the profile inference tool.
EXPANSION AND UPDATE OF THE JASPAR CORE COLLECTION
For this 8th release of JASPAR, we added to the CORE collection 245 new TF-binding profiles for TFs in the following taxa: vertebrates (169 profiles, corresponding to an expansion of 29% for this taxon), plants (42 profiles, 9% expansion), nematodes (17 profiles, 65% expansion), insects (10 profiles, 8% expansion) and fungi (7 profiles, 4% expansion). We also updated 156 profiles (Table 1). The new PFMs were derived from HT-SELEX (30), PBMs (20), ChIP-seq and DAP-seq experiments (data sourced from CistromeDB (31), ReMap (32,33), GTRD (34), ChIP-atlas (35) and ModERN (36); see Supplementary Text for method details). As previously described, the newly introduced profiles were manually curated to be supported by an orthogonal reference from the literature, which is provided in the metadata of the profiles. Moreover, the TF DBD class and family (following the TFClass classification (2)), the TF UniProt ID (37), and links to the TFBSshape (24,25), ReMap (32,33) and UniBind (38) databases are provided in the profile metadata (whenever possible). Finally, the profiles previously associated with ID2, ID4 and TRB2 were removed from the CORE collection as these proteins are not TFs (1).
Table 1. Overview of the growth of the number of PFMs in the JASPAR 2020 CORE and unvalidated collections compared to the JASPAR 2018 CORE collection
Taxonomic group | Non-redundant PFMs in JASPAR 2018 | New non-redundant PFMs in JASPAR 2020 | Removed profiles | Updated PFMs in JASPAR 2020 | Total PFMs (non-redundant) in JASPAR 2020 | Total PFMs (all versions) in JASPAR 2020 |
---|---|---|---|---|---|---|
Vertebrates | 579 | 169 | 2 | 125 | 746 | 1011 |
Plants | 489 | 42 | 1 | 28 | 530 | 572 |
Insects | 133 | 10 | 0 | 3 | 143 | 153 |
Nematodes | 26 | 17 | 0 | 0 | 43 | 43 |
Fungi | 176 | 7 | 0 | 0 | 183 | 184 |
Urochordata | 1 | 0 | 0 | 0 | 1 | 1 |
Total CORE | 1404 | 245 | 3 | 156 | 1646 | 1964 |
unvalidated |  |  |  |  | 337 | 337 |
Overall, the JASPAR 2020 CORE collection includes 1646 non-redundant PFMs (746 for vertebrates, 530 for plants, 183 for fungi, 143 for insects, 43 for nematodes and 1 for urochordates) (Table 1; Figure 1). Moreover, we continued with the incorporation of novel transcription factor flexible models (TFFMs), which are hidden Markov-based models capturing dinucleotide dependencies in TF–DNA interactions (11). We introduced new TFFMs for 217 TFs (136 for vertebrates, 38 for plants, 21 for insects, 17 for nematodes, and 5 for fungi) and updated TFFMs for 20 vertebrate TFs, which represents a 50% increase in the number of TFFMs available. All data is available on the JASPAR website, its associated RESTful API, and through the JASPAR2020 R/Bioconductor package.
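The per-taxon counts in Table 1 can be cross-checked with a short calculation. The sketch below assumes that each 2020 total equals the 2018 count plus new profiles minus removed profiles, and that the expansion percentage of a taxon is the number of new profiles divided by its 2018 count; both assumptions reproduce the per-taxon figures quoted in this section. Variable and function names are illustrative.

```python
# Rows transcribed from Table 1:
# (PFMs in JASPAR 2018, new PFMs, removed profiles, total PFMs in JASPAR 2020)
table1 = {
    "Vertebrates": (579, 169, 2, 746),
    "Plants": (489, 42, 1, 530),
    "Insects": (133, 10, 0, 143),
    "Nematodes": (26, 17, 0, 43),
    "Fungi": (176, 7, 0, 183),
    "Urochordata": (1, 0, 0, 1),
}


def expansion_percent(previous, new):
    """Percentage growth, assuming it is measured against the 2018 count."""
    return 100.0 * new / previous


# Check that every row balances: previous + new - removed == total.
rows_balance = all(
    previous + new - removed == total
    for previous, new, removed, total in table1.values()
)

# Rounded expansion per taxon.
expansions = {
    taxon: round(expansion_percent(previous, new))
    for taxon, (previous, new, _removed, _total) in table1.items()
}

core_total_2018 = sum(row[0] for row in table1.values())
core_total_2020 = sum(row[3] for row in table1.values())
```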

Figure 1. JASPAR CORE growth. The number of profiles in each taxon and overall (see legend) through all JASPAR releases.
A NEW COLLECTION OF UNVALIDATED PROFILES FOR COMMUNITY ENGAGEMENT
We introduced a novel ‘unvalidated’ collection to store high-quality (i.e. passing multiple quality controls, see Supplementary Text) TF-binding profiles for which no independent support was found in the literature by our curators. This collection contains 337 PFMs. As these profiles are not yet supported by orthogonal evidence, we recommend that users treat this collection with caution. We encourage the community to engage in the curation of these profiles by providing the JASPAR curators with complementary supporting evidence (from their own work or that of others) whenever possible. This is facilitated by an individual submission form for each profile in the ‘unvalidated’ collection (Figure 2).

Figure 2. Unvalidated TF-binding profile collection. Example with the ZNF793 profile. This high-quality PFM was derived from a ChIP-seq experiment and was built from thousands of potential TFBSs. Further, the TFBSs are enriched around the ChIP-seq peak summits. However, no orthogonal evidence supporting this profile was found by our curators. Users can upload relevant information about the profile in the unvalidated collection through the ‘Community curation’ box.
Further, we started a Q&A forum (https://groups.google.com/forum/#!forum/jaspar) to ease the communication between JASPAR curators and the community; we welcome the community to send us their questions and suggestions, or to report errors in JASPAR.
CLUSTERED PROFILES, GENOMIC TRACKS AND PROFILE INFERENCE TOOL
In the previous releases, we introduced novel features such as hierarchical clustering of TF-binding profiles in the CORE collection to visualize profile similarities, genomic tracks of predicted TFBSs, and an inference tool to predict TF-binding profiles likely recognized by TFs not available in the JASPAR CORE. We improved the profile inference tool using our own implementation of a recently described similarity regression method (20). We updated the generation of genomic tracks that are publicly available through the UCSC Genome Browser data hub (39) for 7 organisms: human (hg19, hg38), mouse (mm10), zebrafish (danRer11), Drosophila melanogaster (dm6), Caenorhabditis elegans (ce10), Arabidopsis thaliana (araTha1) and baker's yeast (sacCer3). For more details on the updated genomic tracks and inference tool, refer to the Supplementary Text. Finally, we generated the hierarchical clusters of available TF-binding profiles for each taxon with RSAT matrix-clustering (40). Users can explore the CORE/unvalidated collection through the trees and directly access the corresponding profiles by clicking on the TF name.
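As a rough illustration of how TFBSs can be predicted from a PFM, the sketch below converts counts into log-odds weights and scans a sequence for high-scoring windows. It is not the procedure used to build the JASPAR genomic tracks (see the Supplementary Text for that). The uniform background, the pseudocount of 1, the score threshold, the forward-strand-only scan, and the toy matrix and sequence are all assumptions made for this example.

```python
import math

NUCLEOTIDES = "ACGT"


def pfm_to_pwm(pfm, pseudocount=1.0, background=0.25):
    """Convert counts to log2-odds weights against a uniform background."""
    length = len(pfm["A"])
    pwm = []
    for pos in range(length):
        total = sum(pfm[nt][pos] for nt in NUCLEOTIDES) + pseudocount
        column = {}
        for nt in NUCLEOTIDES:
            # The pseudocount is split across nucleotides by background.
            frequency = (pfm[nt][pos] + pseudocount * background) / total
            column[nt] = math.log2(frequency / background)
        pwm.append(column)
    return pwm


def scan(sequence, pwm, threshold):
    """Return (start, score) for forward-strand windows scoring >= threshold."""
    width = len(pwm)
    seq = sequence.upper()
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(nt not in NUCLEOTIDES for nt in window):
            continue  # skip windows containing N or other symbols
        score = sum(pwm[i][nt] for i, nt in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Hypothetical 4-bp profile and sequence (illustration only).
toy_pfm = {
    "A": [0, 0, 4, 0],
    "C": [0, 0, 0, 3],
    "G": [0, 3, 0, 1],
    "T": [4, 1, 0, 0],
}
toy_pwm = pfm_to_pwm(toy_pfm)
toy_hits = scan("AATGACTTGAC", toy_pwm, threshold=4.0)
```

A genome-wide scan would also need to score the reverse complement and use a threshold calibrated per profile, for example from a score distribution or a p-value.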
CONCLUSIONS AND PERSPECTIVES
Similar to previous releases, we substantially expanded the CORE collection of the JASPAR database. For this 8th release, we processed more than 18,000 ChIP-seq datasets. Because a large number of the resulting high-quality TF-binding profiles lacked orthogonal supporting evidence, we created the novel ‘unvalidated’ collection of profiles. We expect that upcoming experiments and publications will provide additional supporting evidence for some of these profiles, allowing them to be incorporated into the JASPAR CORE collection. Meanwhile, we invite the research community to (i) help us curate these unvalidated profiles (e.g. by pointing us to supporting literature), and (ii) send us their own novel profiles (e.g. determined experimentally) for incorporation in the next release of JASPAR.
The JASPAR CORE vertebrates collection now contains 746 profiles, 637 of which are associated with human TFs with known DNA-binding profiles (1), corresponding to 58% of the 1,107 reported by Lambert et al. (1). While this is an impressive collective achievement by the field (the original JASPAR database only contained 81 profiles, a ∼7% coverage for human TFs), it suggests that targeted experimental efforts to find the binding preferences of the remaining TFs will be important. Although computational approaches can be used to infer missing TF-binding profiles (20,41), especially for non-model organisms, the JASPAR approach is conservative, including only profiles supported by at least two experiments in the literature. This is very important as we stand by the reliability of our data. Since its initial publication in 2004 (23), the JASPAR database has been committed to providing the research community with high-quality, manually curated, non-redundant TF-binding profiles.
Lastly, although PFMs have dominated the field of gene regulation for decades, new profile representations have emerged. Examples include profiles with expanded alphabets to represent methylated bases (42,43), profiles modelling binding energy (44), and profiles derived from deep learning importance scores (45). Depending on how the field evolves and how popular these profiles become, we will consider them for inclusion in JASPAR in the future.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
ACKNOWLEDGEMENTS
We thank the user community for useful input and the scientific community for performing experimental assays of TF–DNA interactions and for publicly releasing the data. We thank Giovanna Ambrosini for her help with PWMScan, the UCSC Genome Browser Project Team for their assistance with the genome tracks, WestGrid (https://www.westgrid.ca), Compute Canada (https://www.computecanada.ca), Georgios Magklaras and Georgios Marselis for their IT support, Jacques van Helden and Adam Handel for contacting us to add and validate TF binding profiles, and Dora Pak and Ingrid Kjelsvik for administrative support.
FUNDING
Norwegian Research Council [187615]; Helse Sør-Øst; University of Oslo through the Centre for Molecular Medicine Norway (NCMM) (to A.M., J.A.C.-M., A.K., M.G.); Norwegian Research Council [288404 to J.A.C.-M. and Mathelier group]; The Norwegian Cancer Society [197884 to Mathelier group]; O.F., X.Z., P.A.R., S.C. and W.W.W. were supported by grants from the Canadian Institutes of Health Research [BOP-149430 and PJT-162120]; Genome Canada and Genome British Columbia [255ONT and 275SIL]; Michael Smith Foundation for Health Research [17746]; Natural Sciences and Engineering Research Council of Canada Discovery Grant [RGPIN-2017-06824]; CREATE programs; Weston Brain Institute [20R74681]; BC Children's Hospital Foundation and Research Institute; Netherlands Organization for Scientific Research [Rubicon fellowship to R.v.d.L., 452172015]; Genome British Columbia [SIP007 to B.P.M.]; A.S. was supported by grants from the Lundbeck Foundation, the Danish Cancer Foundation, the Danish Innovation Fund and the Danish Council for Independent Research. F.P. was supported by the French National Agency for Research [FloPiNet ANR-16-CE92-0023-01; GRAL, ANR-10-LABX-49-01]; D.B. is a recipient of a Rutherford Fund Fellowship.
Conflict of interest statement. None declared.
This paper is linked to: https://doi.org/10.1093/nar/gkz945.
REFERENCES
Author notes
The authors wish it to be known that, in their opinion, the first three authors should be regarded as Joint First Authors.