-
PDF
- Split View
-
Views
-
Annotate
-
Cite
Cite
Markus Hoffmann, Nico Trummer, Leon Schwartz, Jakub Jankowski, Hye Kyung Lee, Lina-Liv Willruth, Olga Lazareva, Kevin Yuan, Nina Baumgarten, Florian Schmidt, Jan Baumbach, Marcel H Schulz, David B Blumenthal, Lothar Hennighausen, Markus List, TF-Prioritizer: a Java pipeline to prioritize condition-specific transcription factors, GigaScience, Volume 12, 2023, giad026, https://doi.org/10.1093/gigascience/giad026
- Share Icon Share
Abstract
Eukaryotic gene expression is controlled by cis-regulatory elements (CREs), including promoters and enhancers, which are bound by transcription factors (TFs). Differential expression of TFs and their binding affinity at putative CREs determine tissue- and developmental-specific transcriptional activity. Consolidating genomic datasets can offer further insights into the accessibility of CREs, TF activity, and, thus, gene regulation. However, the integration and analysis of multimodal datasets are hampered by considerable technical challenges. While methods for highlighting differential TF activity from combined chromatin state data (e.g., chromatin immunoprecipitation [ChIP], ATAC, or DNase sequencing) and RNA sequencing data exist, they do not offer convenient usability, have limited support for large-scale data processing, and provide only minimal functionality for visually interpreting results.
We developed TF-Prioritizer, an automated pipeline that prioritizes condition-specific TFs from multimodal data and generates an interactive web report. We demonstrated its potential by identifying known TFs along with their target genes, as well as previously unreported TFs active in lactating mouse mammary glands. Additionally, we studied a variety of ENCODE datasets for cell lines K562 and MCF-7, including 12 histone modification ChIP sequencing as well as ATAC and DNase sequencing datasets, where we observe and discuss assay-specific differences.
TF-Prioritizer accepts ATAC, DNase, or ChIP sequencing and RNA sequencing data as input and identifies TFs with differential activity, thus offering an understanding of genome-wide gene regulation, potential pathogenesis, and therapeutic targets in biomedical research.
Introduction
Understanding how genes are regulated remains a major research focus of molecular biology and genetics [1]. In eukaryotes, gene expression is controlled by cis-regulatory elements (CREs) such as promoters, enhancers, or suppressors, which are bound by transcription factors (TFs) promoting or repressing transcriptional activity depending on their accessibility [2]. TFs play an important role not only in development and physiology but also in diseases; for example, it is known that at least a third of all known human developmental disorders are associated with deregulated TF activity and mutations [3–5]. An in-depth investigation of TF regulation could help to gain deeper insights into the gene-regulatory balance found in normal physiology. Since most complex diseases involve aberrant gene regulation, a detailed understanding of this mechanism is a prerequisite to developing targeted therapies [6, 7]. This is a daunting task, as multiple genes in eukaryotic genomes may affect the disease, each of which is possibly controlled by candidate CREs.
TF chromatin immunoprecipitation sequencing (ChIP-seq) experiments are the gold standard for identifying and understanding condition-specific TF binding at a nucleotide level. However, since there are approximately 1,500 active TFs in humans [8] and about 1,000 in mice [9], and additionally considering the need to establish TF patterns separately for each tissue and physiological condition, this approach is logistically prohibitive. Alternatively, histone modification (HM) ChIP-seq but also ATAC sequencing (ATAC-seq) and DNAse sequencing (DNAse-seq) offer a broader view of the chromatin state due to their individual capability (i.e., ChIP-seq identifies protein–DNA interactions, ATAC-seq detects open chromatin regions via Tn5 transposase cuttings, and DNAse-seq maps accessible chromatin sites by digesting chromatin with DNase I) to highlight open chromatin regions aligned with active genes, hence allowing the identification of condition-specific CREs [10]. Computational methods can then be used to prioritize TFs likely binding to these CREs, leading to hypotheses and defining the most promising TF ChIP-seq experiments. This narrows the scope of TF ChIP-seq experiments needed to confirm working hypotheses about gene regulation [11–13].
Several general approaches have been proposed to identify key TFs that are responsible for gene regulation. Among them, for example, is (i) a basic coexpression or mutual information analysis of TFs and their target genes combined with computational binding site predictions [14]. (ii) Some tools use a combination of TF ChIP-seq data—providing genome-wide information about the exact locations of TF binding—with predicted target genes that can enhance coexpression analyses [15]. (iii) Other tools employ a combination of genome-wide chromatin accessibility (e.g., HM ChIP-seq data) or activity information, putative TF binding sites, and gene expression data. This combination can be powerful in determining key TF players and is used by the state-of-the-art tool diffTF [16]. Most of the proposed approaches require substantial preprocessing, computational knowledge, adjustment of the method to a new use case (e.g., more than 2 conditions and/or time-series data), and manual evaluation of the results (e.g., manual search and visualization for TF ChIP-seq data to provide experimental evidence for the predictions). Hence, to streamline this process, we present TF-Prioritizer, a Java pipeline to prioritize TFs that show condition-specific changes in their activity. TF-Prioritizer falls into the third category of the previously described approaches and automates several time-consuming steps, including data processing, TF affinity analysis, machine learning predicting relationships of CREs to target genes, prioritization of relevant TFs, data visualization, and visual experimental validation of the findings using public TF ChIP-seq data (i.e., ChIP-Atlas [17]).
Figure 1 depicts a general overview of the pipeline. TF-Prioritizer accepts 2 types of input data: (i) histone modification peak ChIP-seq/ATAC-seq/DNase-seq data indicating accessible regulatory regions showing differential activity (peak data are typically generated by MACS2 [18]) and (ii) gene expression data from RNA-seq, which allows the identification of differentially expressed genes that are potentially regulated by TFs at specific time points or physiological condition. If peaks from ATAC-seq or DNase-seq were provided, we generate footprints (i.e., specific regions of the peaks within hypersensitive sites that could indicate the regulatory region of genes [19]) by employing HINT (i.e., HINT uses hidden Markov models to identify footprints by using strand-specific, nucleosome-sized signals with corrections for ATAC-seq and DNase-seq protocol-specific biases to successfully target CREs) for further processing [20–22]. Our pipeline searches for TF binding sites using TRAP [23] within CREs around accessible genes and calculates an affinity score for each known TF to bind at these particular loci using TEPIC [24,25]. TEPIC uses an exponential decay model that was built under the assumption that regulatory elements close to a gene are more likely important than more distal elements and weighs this relationship accordingly. This allows us to assess TF binding site specific probabilities by using TF binding affinities calculated by TRAP, which uses a biophysical model to assess the strength of the binding energy of a TF to a CRE’s total sequence [23]. Beginning with these CRE candidates, we search for links to possible regulated putative target genes that are differentially expressed between given conditions (e.g., disease and healthy). Approaching the task of linking CREs to target genes, we employ the framework of TEPIC2 [25] and DYNAMITE [25] (feature comparison Supplementary Table S1), which uses a logistic regression model predicting differentially expressed genes across time points and conditions based on TF binding site information to score different TFs according to their contribution to the model and their expression (for a more technical description, see “Technical workflow” section). In general, TF-Prioritizer uses TEPIC and DYNAMITE pairwise of the provided data (i.e., pairwise for each condition and each time point). Based on a background distribution of the scores (combination of differential expression, TEPIC, and DYNAMITE—see “Discovering cis-regulatory elements using a biophysical model” section), TF-Prioritizer computes an empirical P value reflecting the significance of the results (see “An aggregated score to quantify the contribution of a TF to gene regulation” section). TF-Prioritizer offers automated access to complementary ChIP-seq data of the prioritized TFs in ChIP-Atlas [17] for validation and shows predicted regulatory regions of target genes using the Integrative Genomics Viewer (IGV) [27–29]. Then, TF-Prioritizer automatically generates a user-friendly and feature-rich web application that could also be used to publish the results as an online interactive report.
![General overview of the TF-Prioritizer pipeline. TF-Prioritizer uses peaks from ChIP-seq or ATAC-seq/DNase-seq and gene counts from RNA-seq. If peaks from the protocols ATAC-seq or DNase-seq were provided, we treat them by using the footprinting method HINT and use the footprints for further processing [20–22]. It then (1) calculates TF binding site affinities using the tool TRAP [23], (2) links candidate regions to potential target genes by employing TEPIC [24], (3) performs machine learning (by using the framework of TEPIC2 [25] and DYNAMITE) to find relationships between TFs and their target genes, (4) calculates background and TF distributions, (5) picks TFs that significantly differ from the background using the Mann–Whitney U test [26] and a comparison between the mean and the median of the background and TF distribution, (6) searches for tissue-specific TF ChIP-seq evaluation data in ChIP-ATLAS [17], (7) creates screenshots using the Integrative Genomics Viewer from predicted regions of interest [27–29], and (8) creates a feature-rich web application for researchers to share and evaluate their results.](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/gigascience/12/10.1093_gigascience_giad026/3/m_giad026fig1.jpeg?Expires=1747891200&Signature=xH5seBOcmFUkIOkQw7Jh0oF1YfYtVqiDhjWZqMvhTRaxSB3W9JKKWc6v8I-J-PiDiAiZ33BITZh4aVGy6Rs8DmzV0ktVVYoTd8JQE52uNtW2yQQVTUha5-SmC0QtkIofPZR1qdDIGwy~xCSXxGp2b1BMdWBEI5XAQpc09BTOP~jKnCOSGKlGQyJo~aXLzJU-qO0v-nW2QWJ6~lGnjbMma7OqFVvu7tBcE2LYpuIiRtyPQMrNUitkbCIK8t~uVaL-cCdCNEkJYilcZQkTdpjRSQ2gSMMcM8VQLsD6Nb1xFNK-QKXe5AojUBBwy6XGqfj5p32WkTM~Xe4BXBdya5i8mQ__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
General overview of the TF-Prioritizer pipeline. TF-Prioritizer uses peaks from ChIP-seq or ATAC-seq/DNase-seq and gene counts from RNA-seq. If peaks from the protocols ATAC-seq or DNase-seq were provided, we treat them by using the footprinting method HINT and use the footprints for further processing [20–22]. It then (1) calculates TF binding site affinities using the tool TRAP [23], (2) links candidate regions to potential target genes by employing TEPIC [24], (3) performs machine learning (by using the framework of TEPIC2 [25] and DYNAMITE) to find relationships between TFs and their target genes, (4) calculates background and TF distributions, (5) picks TFs that significantly differ from the background using the Mann–Whitney U test [26] and a comparison between the mean and the median of the background and TF distribution, (6) searches for tissue-specific TF ChIP-seq evaluation data in ChIP-ATLAS [17], (7) creates screenshots using the Integrative Genomics Viewer from predicted regions of interest [27–29], and (8) creates a feature-rich web application for researchers to share and evaluate their results.
To demonstrate the potential and usability of TF-Prioritizer, we use genomic data describing mammary glands in pregnant and lactating mice and compare our analysis to established knowledge [30]. Employing the web application generated by TF-Prioritizer, we found well-studied TFs involved in the mammary gland development process, and we identified additional TFs, which are candidate key factors in mammary gland physiology. Additionally, we use ENCODE cell line data (K562 and MCF-7) to demonstrate the potential and usability of TF-Prioritizer using ATAC-seq, DNase-seq, and HM ChIP-seq data.
Materials and Methods
Implementation
The main pipeline protocol is implemented in Java version 11.0.14 on a Linux system (Ubuntu 20.04.3 LTS). The pipeline uses subprograms written in Python version 3.8.5, R version 4.1.2, C++ version 9.4.0, and CMAKE (RRID:SCR_015875) version 3.16 or higher. External software that needs to be installed before using TF-Prioritizer can be found on GitHub (see Availability Section). We also provide a bash script “install.sh,” that automatically downloads and installs necessary third-party software and R/Python packages. The web application uses Angular CLI version 14.0.1 and Node.js version 16.10.0. We also provide a dockerized version of the pipeline; it uses Docker version 20.10.12 and Docker-Compose version 1.29.2 (Availability Section). TF-Prioritizer is available as a docker that can be pulled from docker hub and GitHub packages (“Availability of source code and requirements” section).
Data processing
Mammary gland development and lactation in mice
Datasets (GEO accession ID: GSE161620) are processed with the nf-core/RNA-seq [31] and nf-core/ChIP-seq pipelines in their default settings, respectively [32, 33]. The FASTQ files of pregnant and lactating mice are processed by Salmon (RRID:SCR_017036) [34] and MACS2 (RRID:SCR_013291) [35] to retrieve raw gene counts and broad peak files.
The dataset spans several time points in mammary gland development from pregnancy to lactation. For each stage, 2 distinct time points are available: pregnancy day 6 (p6), pregnancy day 13 (p13), lactation day 1 (L1), and lactation day 10 (L10). For each time point, the dataset contains RNA-seq data and ChIP-seq data for histone modifications H3K27ac and H3K4me3, as well as Pol2 ChIP-seq data (Table 1). We used H3K27ac, H3K4me3, and Pol2 data for creating the model.
Overview of datasets covering mammary gland development from pregnancy to lactation
. | p6 . | p13 . | L1 . | L10 . | Sum . |
---|---|---|---|---|---|
ChIP-seq H3K27ac | 3 | 1 | 8 | 4 | 16 |
ChIP-seq H3K4me3 | 2 | 3 | 5 | 0 | 10 |
ChIP-seq Pol2 | 2 | 0 | 5 | 4 | 11 |
RNA-seq | 6 | 8 | 3 | 4 | 21 |
. | p6 . | p13 . | L1 . | L10 . | Sum . |
---|---|---|---|---|---|
ChIP-seq H3K27ac | 3 | 1 | 8 | 4 | 16 |
ChIP-seq H3K4me3 | 2 | 3 | 5 | 0 | 10 |
ChIP-seq Pol2 | 2 | 0 | 5 | 4 | 11 |
RNA-seq | 6 | 8 | 3 | 4 | 21 |
Overview of datasets covering mammary gland development from pregnancy to lactation
. | p6 . | p13 . | L1 . | L10 . | Sum . |
---|---|---|---|---|---|
ChIP-seq H3K27ac | 3 | 1 | 8 | 4 | 16 |
ChIP-seq H3K4me3 | 2 | 3 | 5 | 0 | 10 |
ChIP-seq Pol2 | 2 | 0 | 5 | 4 | 11 |
RNA-seq | 6 | 8 | 3 | 4 | 21 |
. | p6 . | p13 . | L1 . | L10 . | Sum . |
---|---|---|---|---|---|
ChIP-seq H3K27ac | 3 | 1 | 8 | 4 | 16 |
ChIP-seq H3K4me3 | 2 | 3 | 5 | 0 | 10 |
ChIP-seq Pol2 | 2 | 0 | 5 | 4 | 11 |
RNA-seq | 6 | 8 | 3 | 4 | 21 |
ENCODE cell lines
ATAC-seq, DNase-seq, ChIP-seq, and RNA-seq data are downloaded from the ENCODE project for the cell lines K562 (human chronic myelogenous leukemia cell line) and MCF-7 (human breast adenocarcinoma cell line), which are both often used to study cancer biology and have been subjected to a large number of different experimental protocols and assays (Table 2, file identifiers in Supplementary Material S1) [90].
Overview of the dataset covering several HM ChIP-seq, ATAC-seq, DNase-seq, and RNA-seq for the cell lines K562 and MCF-7
Protocol . | K562 . | MCF-7 . | Sum . | |
---|---|---|---|---|
ATAC-seq | 4 | 1 | 5 | |
DNase-seq | 4 | 4 | 8 | |
ChIP-seq | H3K27ac | 1 | 2 | 3 |
H3K27me3 | 2 | 2 | 4 | |
H3K36me3 | 2 | 2 | 4 | |
H3K4me3 | 4 | 2 | 6 | |
H3K9me3 | 1 | 2 | 3 | |
H2AFZ | 1 | 1 | 2 | |
H3K4me1 | 2 | 1 | 3 | |
H3K4me2 | 1 | 1 | 2 | |
H3K79me2 | 1 | 1 | 2 | |
H3K9ac | 2 | 1 | 3 | |
H4K20me1 | 1 | 1 | 2 | |
RNA-seq | 15 | 4 | 19 |
Protocol . | K562 . | MCF-7 . | Sum . | |
---|---|---|---|---|
ATAC-seq | 4 | 1 | 5 | |
DNase-seq | 4 | 4 | 8 | |
ChIP-seq | H3K27ac | 1 | 2 | 3 |
H3K27me3 | 2 | 2 | 4 | |
H3K36me3 | 2 | 2 | 4 | |
H3K4me3 | 4 | 2 | 6 | |
H3K9me3 | 1 | 2 | 3 | |
H2AFZ | 1 | 1 | 2 | |
H3K4me1 | 2 | 1 | 3 | |
H3K4me2 | 1 | 1 | 2 | |
H3K79me2 | 1 | 1 | 2 | |
H3K9ac | 2 | 1 | 3 | |
H4K20me1 | 1 | 1 | 2 | |
RNA-seq | 15 | 4 | 19 |
Overview of the dataset covering several HM ChIP-seq, ATAC-seq, DNase-seq, and RNA-seq for the cell lines K562 and MCF-7
Protocol . | K562 . | MCF-7 . | Sum . | |
---|---|---|---|---|
ATAC-seq | 4 | 1 | 5 | |
DNase-seq | 4 | 4 | 8 | |
ChIP-seq | H3K27ac | 1 | 2 | 3 |
H3K27me3 | 2 | 2 | 4 | |
H3K36me3 | 2 | 2 | 4 | |
H3K4me3 | 4 | 2 | 6 | |
H3K9me3 | 1 | 2 | 3 | |
H2AFZ | 1 | 1 | 2 | |
H3K4me1 | 2 | 1 | 3 | |
H3K4me2 | 1 | 1 | 2 | |
H3K79me2 | 1 | 1 | 2 | |
H3K9ac | 2 | 1 | 3 | |
H4K20me1 | 1 | 1 | 2 | |
RNA-seq | 15 | 4 | 19 |
Protocol . | K562 . | MCF-7 . | Sum . | |
---|---|---|---|---|
ATAC-seq | 4 | 1 | 5 | |
DNase-seq | 4 | 4 | 8 | |
ChIP-seq | H3K27ac | 1 | 2 | 3 |
H3K27me3 | 2 | 2 | 4 | |
H3K36me3 | 2 | 2 | 4 | |
H3K4me3 | 4 | 2 | 6 | |
H3K9me3 | 1 | 2 | 3 | |
H2AFZ | 1 | 1 | 2 | |
H3K4me1 | 2 | 1 | 3 | |
H3K4me2 | 1 | 1 | 2 | |
H3K79me2 | 1 | 1 | 2 | |
H3K9ac | 2 | 1 | 3 | |
H4K20me1 | 1 | 1 | 2 | |
RNA-seq | 15 | 4 | 19 |
Technical workflow
Preprocessing
TF-Prioritizer uses peak data from ChIP-seq, ATAC-seq, or DNase-seq and a gene count matrix from RNA-seq as input files (see GitHub repository for detailed formatting instructions). Initially, the pipeline downloads necessary data (gene lengths, gene symbols, and short descriptions of the genes) from BioMart (RRID:SCR_019214) [36]. Optionally, genes with low expression can be removed. TF-Prioritizer uses a transcripts per million (TPM) filter of 1 as default to remove TFs that show very low expression and are most probably not relevant. Subsequently, we use DESeq2 to normalize read counts and calculate the log2-fold change (log2fc) [37]. In parallel, TF-Prioritizer preprocesses the peaks by first employing HINT if the provided peak data are labeled as ATAC-seq or DNase-seq to perform footprinting to correct for the biases (i.e., by analyzing chromatin accessibility data in terms of histone modification state, enabling more accurate comparison between the 2 data types) between the ChIP-seq, ATAC-seq, and DNase-seq protocols [20, 38]. TF-Prioritizer then filters blacklisted regions that would likely lead to false positives [39]. Peak files from the same sample group can be merged to significantly reduce the total runtime of the pipeline without affecting the ability of the TF-Prioritizer to identify candidate CREs.
Discovering cis-regulatory elements using a biophysical model
TEPIC links CREs to target genes using a window-based approach (default: 50,000 bp) [24, 25] using TRAP, a biophysical model to quantify transcription factor affinity [23]. The window-based approach can be enhanced by providing Hi-C loop data, where the prediction window is extended or limited to a chromatin loop around potential CREs and target genes. TEPIC interprets ChIP-seq signal intensity as a quantitative measure of TF binding strength, which also helps in recovering low-affinity binding sites that would be missed in a classical presence/absence model [24]. The default TEPIC framework searches for dips on top of peaks. However, numerous studies have shown that CREs are often enriched between histone peaks (peak–dip–peak or peak–valley–peak model) [40]. To better accommodate histone modification of ChIP-seq data, we thus extended the TEPIC framework to search for transcription factor binding sites (TFBSs) between 2 peaks that have close (default 500 bp) genomic positions. TEPIC aggregates individual TF affinities into a TF-Gene score, which is the sum of the individual affinities normalized by the length of the considered CREs.
According to the description in Schmidt et al. [41], the TF-Gene score |${a}_w(g,t)$| for a gene g and a TF t in window sizew is calculated as in Equation 1:
Equation 1: calculation of the TF-Gene score
In Equation 1, |${a}_{p,t}$| is the affinity of TF t in peak p. The set of peaks |${P}_{g,w}$| contains all open-chromatin peaks in a window of size w around the gene g.|${d}_{p,g}$| is the distance from the center of the peak p to the transcription start site of the gene g, and |${d}_0$| is a constant fixed at 50,000 bp [42]. The affinities are normalized by peak and motif length, where |$|p|$| is the length of the peak p and l is the total length of the motif of TF t (see Schmidt et al. [24, 25, 41] for more specific information on how the TF-Gene score is calculated). Since proximal CREs are expected to have a larger influence on gene expression compared to distal ones, these contributions are weighted following an exponential decay function of genomic distance [25].
We want to point out that the biophysical model calculated by TRAP only returns the center of a potentially large area of high binding energy. The TF is supposed to bind somewhere in this area. In our IGV screenshot, the center of the high binding energy area can appear at a distance up to the window defined by TEPIC. We consider predicted TF peaks as matching if we find TF ChIP-seq peaks inside this window. Following this, we do not expect the predicted TF bindings to overlap exactly with the TF ChIP-seq peaks.
An aggregated score to quantify the contribution of a TF to gene regulation
To determine which TFs have a significant contribution to a condition-specific change between 2 sample groups, we want to consider multiple lines of evidence in an aggregated score. We introduce TF–target gene (TG) scores (Fig. 2) which combine (i) the absolute log2-fold change of differentially expressed genes since genes showing large expression differences are more likely affected through TF regulation than genes showing only minor expression differences and (ii) the TF-Gene scores from TEPIC indicating which TFs likely influence a gene. To further quantify this link, we also consider the total coefficients of a logistic regression model computed with DYNAMITE [25]. DYNAMITE predicts (high/low) expression of a gene based on the fold changes of TF-Gene scores reported by TEPIC and thus helps to prioritize among multiple potential TFs regulating a gene. We calculate TF-TG scores (|$\omega $|) for each time point and each type of ChIP-seq data (e.g., different histone modifications) as in Equation 2:
![Workflow of the distribution analysis to prioritize TFs in a global context by using TF-TG scores. We use several scores conducted by previously performed analysis (see Supplementary Fig. S1), specifically the total log2-fold change (DESeq2), the TF-Gene score (TEPIC), and the total TF regression coefficient (DYNAMITE). We then calculate the TF-TG score for each time point for each TF on each of the TF predicted target genes (TG) and save it to separate files for the background of each histone modification and for each TF in each histone modification. In the next step, we perform a Mann–Whitney U [43] test between the distribution of the background of the histone modification and the distinct TF distribution of the same histone modification. If the TF passes the Mann–Whitney U test and the median and mean of the TF are higher than the background median and mean, we consider this TF as prioritized for the histone modification. We perform a discounted cumulative gain to receive one list with all prioritized TFs and overall histone modifications.](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/gigascience/12/10.1093_gigascience_giad026/3/m_giad026fig2.jpeg?Expires=1747891200&Signature=Gwr0izYE8uoRROhoWaa3NacR~g-j6VqdtL00ZtDdRtC0ZCiEhpSSZ5zxNLKRaL0DZdUk6uWrxLyq4ZeHH1Ub~GirZJ1NkyM~Agr4g51oGdL6X~~nfk7YBP8nHtH6y1xfkIoxlyPGjwInt0T0QUidIpvz26M10FDsnWx7UbI1TH-ph~53KWbHFU4jrAZxUfdBFKJvDUrpbwQGK9WhcgWonFjNSXUUgawvGz0bg94c08F8ckrCp3LRKsYlO5XKqki8Ir4BkirpmU-XKw-XXckTr19m3JD-5Ly9Pje1TtaGdpo99vFqwciybBHRZc1gs6nJlHhsQJwRn-6T8iTHeuECnA__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
Workflow of the distribution analysis to prioritize TFs in a global context by using TF-TG scores. We use several scores conducted by previously performed analysis (see Supplementary Fig. S1), specifically the total log2-fold change (DESeq2), the TF-Gene score (TEPIC), and the total TF regression coefficient (DYNAMITE). We then calculate the TF-TG score for each time point for each TF on each of the TF predicted target genes (TG) and save it to separate files for the background of each histone modification and for each TF in each histone modification. In the next step, we perform a Mann–Whitney U [43] test between the distribution of the background of the histone modification and the distinct TF distribution of the same histone modification. If the TF passes the Mann–Whitney U test and the median and mean of the TF are higher than the background median and mean, we consider this TF as prioritized for the histone modification. We perform a discounted cumulative gain to receive one list with all prioritized TFs and overall histone modifications.
Equation 2: Calculation of the TF-TG score ω for each time point and each type of ChIP-seq data:
where |$fc(g)$| represents the fold change of the target gene g between the 2 conditions, |${a}_w(g,t)$| is the TF-Gene score retrieved by TEPIC as detailed above, and |$\eta (g,t)$| is the total regression coefficient of DYNAMITE's linear model of the expression of the target gene g as a function of the expression of the TF t.
A random background distribution allows TF-Prioritzier to exclude spurious results
The ultimate goal of TF-Prioritizer is to identify those TFs that are most likely involved in regulating condition-specific genes. To judge if a specific TF-TG score is meaningful, we generate a background distribution under the hypothesis that most TFs will not be condition specific. Therefore, we generate 2 different kinds of distributions (see Fig. 2): (i) for each HM m, a background distribution containing all positive TF-TG scores associated with m: |$BG(m) = \{ {\omega }_w(g,t)\mid t \in TF(m),g \in TG(t),{\omega }_w(g,t) > 0\} $|. Here, |$TF( m )$| denotes the set of TFs that can bind to strands of the DNA modified by m, and |$TG(t)$| is the set of target genes of the TF t. (ii) For each HM-TF pair |$(m,t)$| with |$t \in TF(m)$|, a foreground distribution containing all positive TF-TG scores associated with |$(m,t)$|: |$FG(t,m) = \{ {\omega }_w(g,t)\mid g \in TG(t),{\omega }_w(g,t) > 0\} $|. Note that |$FG(t,m) \subseteq BG(m)$| holds for all HM-TF pairs |$(m,t)$|. We then test each TF distribution of each ChIP-seq against the global distribution matching the ChIP-seq data type. If the P value of a Mann–Whitney U (MWU) test [43] is below the threshold (default: 0.05) and the median and mean of the TF are higher than the background distribution, the TF is recognized as a potential candidate. In the last step, we sort the TFs based on the mean of the TF-TG scores and report the ranks.
We obtain a global list of prioritized TFs across several ChIP-seq data types (e.g., different histone modifications) as follows:
Let |$S(m)$| be the set of transcriptions factors t such that the 1-sided MWU test between the foreground distribution |$FG(t,m)$| and the background distribution |$BG(m)$| yields a significant P value. For a fixed TF |$t \in S(m)$|, let |$ran{k}_m(t) = \sum\limits_{t^{\prime} \in S(m)} {[ {{\rm{mea}}{{\rm{n}}}_{g \in TG(t^{\prime})}{\omega }_w(g,t^{\prime}) \le {\rm{mea}}{{\rm{n}}}_{g \in TG(t)}{\omega }_w(g,t)} ]} $| be the rank of t in |$S(m)$| with respect to the mean TF-TG scores across all target genes, where |$[ \cdot ]$| is the Iverson bracket (i.e., |$[{\rm{true}}] = 1$| and |$[{\rm{false}}] = 0$|). We now compute an overall TF score |$f(t)$| by aggregating the HM-specific ranks as follows in Equation 3:
where |$HM(t)$| denotes the set of histone modifications on strands of the DNA where the TF t can bind. Note that if |$t \notin S(m)$|, |${\rm{ran}}{{\rm{k}}}_m(t)$| is not defined. In this case, we set |${\rm{ran}}{{\rm{k}}}_m(t) = |S(m)|$| such that the summand for t equals 0. Last, we sort TFs in ascending order according to the scores |$f(t)$|.
Discovering each score's contribution to the global score
To analyze the impact of the different parts of the TF-TG score, we permute its components (TF score from TEPIC, regression coefficient of DYNAMITE, log2fc of DESeq2). We execute TF-Prioritizer with the exact same configuration but with all possible combinations of the components and compare the prioritized TFs (e.g., solely TF score from TEPIC, a combination of TF score from TEPIC with the regression coefficient of DYNAMITE).
Validation using independent data from ChIP-Atlas
TF-Prioritizer is able to download and visualize experimental tissue-specific TF ChIP-seq data for prioritized TFs from ChIP-Atlas [17], a public database for ChIP-seq, ATAC-seq, DNase-seq, and Bisulfite-seq data. ChIP-Atlas provides more than 362,121 datasets for 6 model organisms (i.e., human, mouse, rat, fruit fly, nematode, and budding yeast) [44]. TF-Prioritizer automatically visualizes TF ChIP-seq peaks on predicted target sites of prioritized TFs to experimentally validate our predictions. TF-Prioritizer also visualizes experimentally known enhancers and super-enhancers from the manually curated database ENdb [45]. Additionally, experimental data from other databases or experimental data retrieved by own experiments can be supplied and processed by TF-Prioritizer.
By employing TF ChIP-seq data from ChIP-Atlas, TF-Prioritizer is capable of performing a TF co-occurrence analysis of prioritized TFs by systematically comparing the experimentally validated peaks of pairs of prioritized TFs. In a co-occurrence analysis, it is checked what percentage of available peaks of one TF is also found in another TF. TF-Prioritizer returns the percentage of similar peaks between prioritized TFs to discover the coregulation of TFs. We investigate the co-occurrence of TFs |${t}_1$| and |${t}_2$| in terms of statistical significance by calculating a log-likelihood score. Let B be the set of all TF binding sites and |$\Pi (t)$| be the set of peaks for TF t. For TF t, let |$count(t)$| be the number of binding sites |$b \in B$| such that there is a peak |$\pi \in \Pi (t)$| within b. For a TF-TF pair |$({t}_1,{t}_2)$|, let |$count({t}_1,{t}_2)$| be the number of binding sites |$b \in B$| such that there is a peak |${\pi }_1 \in \Pi ({t}_1)$| and a peak |${\pi }_2 \in \Pi ({t}_2)$| within b, and then the log-likelihood score |${G}^2$| is calculated for the 4 observations: (i) |$count({t}_1,{t}_2)$| (i.e., |${t}_1$| and |${t}_2$| are co-occurring), (ii) |$count({t}_1) - count({t}_1,{t}_2)$| (i.e., |${t}_1$| is occurring but |${t}_2$| is not), (iii) |$count({t}_2) - count({t}_1,{t}_2)$| (i.e., |${t}_2$| is occurring but |${t}_1$|is not), and (iv) |$count({t}_1,{t}_2) - count({t}_1) - count({t}_2) + |B|$| (i.e., neither |${t}_1$| nor |${t}_2$| is occurring), with their corresponding expectation values (i) |$count({t}_1) \cdot count({t}_2)$|, (ii) |$count({t}_1)*(|B| - count({t}_2))$|, (iii) |$(|B| - count({t}_1))*count({t}_2)$|, and (iv) |$(|B| - count({t}_1))*(|B| - count({t}_2))$| as follows [46–48]:
Note that when interpreting, each log-likelihood score needs to be brought into relation with the number of peaks found in the respective TFs and also set in relation with the other number of peaks determined in the entire log-likelihood table, as the log-likelihood score may differ from TF pair to TF pair. A high log-likelihood score, in combination with a high number of peaks, with respect to the entire log-likelihood table, generally indicates that the co-occurrence relationship is statistically significant and that the 2 TFs could be functionally related. For further details and explanation of the formula and interpretation, consult [46–48].
Explorative analysis of differentially expressed genes
TF-Prioritizer allows users to manually investigate the ChIP-seq signal in the identified CREs of differentially expressed genes. To this end, TF-Prioritizer generates a compendium of screenshots of the top 30 upregulated or downregulated loci (sorted by their total log2-fold change) between 2 sample groups. Additionally, we allow the user to specify loci that are of special interest (e.g., the CSN family or the Socs2 locus in lactating mice). TF-Prioritizer then produces screenshots using the TF ChIP-seq data from ChIP-Atlas and visualizes them in a dynamically generated web application. Screenshots are produced using the IGV standalone application [27–29]. TF-Prioritizer also automatically saves the IGV session so the user can further research the shown tracks.
Handling missing data
In some cases, not all assay types are available for all samples, or the data do not have the same high quality as the rest of the samples. TF-Prioritizer then skips the grouping of missing data points and can still find meaningful results in the rest of the data. For example, the data for 3 time points for 1 histone modification are available, but 1 time point is missing or discarded. TF-Prioritizer then uses only the 3 available time points for grouping and downstream processing and analysis.
Using TF-prioritizer to investigate gene regulation
We use 3 approaches to evaluate the biological relevance and statistical certainty of our results: (i) literature research to validate whether the reported TFs are associated with the phenotype of interest, (ii) considering the top 30 target genes with highest affinity values and determining if their expression cluster by condition (note: we do not preselect differentially expressed genes for this analysis but focus on affinities to avoid a circular line of reasoning; we also review the literature and report whether these genes are known to be involved in either pregnancy or mammary gland development/lactation), and (iii) validation using independent TF ChIP-seq data from ChIP-Atlas. To conduct the third evaluation, we built region search trees, a balanced binary search tree where the leaves of the tree have a start and end position, and the tree returns all leaves that overlap with a searched region for all chromosomes of the tissue-specific ChIP-Atlas peaks for each available prioritized TF [49]. We then iterate over all predicted regions within the window size defined in TEPIC and determine if we can find any overlapping peaks inside the ChIP-Atlas peaks. If we can find an overlap with a peak defined by the ChIP-Atlas data, we count the predicted peak as a true positive (TP) or a false positive (FP). Next, we randomly sample the same number of predicted peaks in random length-matched regions not predicted to be relevant for a TF. If we find an overlap in the experimental ChIP-Atlas data, we consider this region as a false negative (FN) or a true negative (TN). Notably, we expect the FN count to be inflated since we considered condition-specific peaks of active CREs. Inactive CREs may very well have TFBSs that are not active. Nevertheless, we expect to find more such TFBSa in active regions compared to random samples, allowing us to compute sensitivity, specificity, precision, accuracy, and the harmonic mean between precision and sensitivity (F1-score) (see Supplementary Material S2).
Choice of parameters
In a pipeline like TF-Prioritizer, the choice of parameters is crucial to retrieve meaningful results. In this section, we explain our choice of parameters. We filter the RNA-seq data by a mean DESeq2 normalized gene count of 50 and a TPM of 1 to exclude noise of very weakly expressed target genes and TFs that are probably not important for the condition but would negatively impact the predictive models. We use the default configurations of TEPIC with the exception of the TF binding site search—that is, in the histone modification ChIP-seq data, it is important to search for TF binding sites between 2 peaks that are in close proximity (max. 500 bp) to each other (peak–dip–peak or peak–valley–peak model) [40]). The TEPIC2 framework and DYNAMITE were executed in default configurations as provided by the authors. We provide all default parameters in our configuration file.
Results and Discussion
We present TF-Prioritizer, which combines data to identify candidate CREs (e.g., ChIP-seq, ATAC-seq, DNase-seq) and RNA-seq to identify condition-specific TF activity. TF-Prioritizer is built on several existing state-of-the-art tools for peak calling, TF-affinity analysis, differential gene expression analysis, and machine learning tools. TF-Prioritizer is the first to jointly consider multiple types of modalities (e.g., different histone marks and/or time-series data), provide a joint list of active TFs, and enable the user to see a visualized validation of the predictions in an interactive and feature-rich web application.
Exploring TFs in mammary tissue during pregnancy and lactation in mice
We used TF-Prioritizer to identify TFs that are known to control mammary gland development and lactation. The tool also identifies TFs that are important in pregnancy, as well as new candidate TFs that have not yet been widely studied. TF-Prioritizer reported 104 TFs, many of which control Rho family GTPase-associated target genes and Casein family genes. TF-Prioritizer was evaluated using experimental TF ChIP-seq data where it showed high sensitivity, specificity, precision, and accuracy (Supplementary Fig. S2, Supplementary Material S2).
Prioritized TFs are known to play a role in mammary gland development and lactation
TF-Prioritizer prioritized STAT5, a transcription factor that plays an important role in mammary gland development [30, 50, 51]. Stat5 messenger RNA (mRNA) levels are highly upregulated during the last days of pregnancy and at the beginning of lactation, supporting experimental findings that STAT5 is a key driver of mammary gland development. The predicted target genes of STAT5 show a clear expression separation between pregnancy and lactation (Fig. 3A, B). Peaks were predicted with a sensitivity of 57.8%, a specificity of 66.3%, a precision of 78.1%, an accuracy of 60.6%, and an F1-score of 66.5% (Supplementary Fig. S2). Additionally, STAT5 is known to activate the expression of the Socs2 gene during mammary gland development [52, 53]. We can observe predicted peaks of STAT5 near Socs2, which could explain the regulation of its expression by STAT5 (Fig. 3C). STAT5 is further known to regulate the expression of the Casein gene family. Csn2, Csn1s2a, and Csn1s2b [54] mRNA levels are strongly upregulated during lactation, which could be explained by an activator role of STAT5 at the predicted peaks in their close proximity [55–57] (Fig. 3D, Supplementary Fig. S3, Supplementary Material S3, sec. STAT5).
![Validation of selected STAT5 target genes. (A, B) Heatmaps of predicted target genes. We select Socs2 and Csn family genes (black arrows) as they are known to be crucial in either mammary gland development or lactation. In the heatmaps, we can observe a clear separation of these target genes between the time points p13 and L1. (C, D) IGV screenshots of the loci of Socs2 and the Csn family. We included a predicted track in the IGV screenshot that indicates high-affinity binding regions for the TF that are represented by a tick and a black box surrounding it. In (C), we see that we predict peaks in p13 near Socs2. From these data, we suggest that Socs2 mRNA expression is controlled by STAT5 [52, 53]. In (D), we can observe Pol2 tracks that show a distinct change in the expression of Csn family proteins between pregnancy and lactation. This indicates that STAT5 controls the expression of milk proteins.](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/gigascience/12/10.1093_gigascience_giad026/3/m_giad026fig3.jpeg?Expires=1747891200&Signature=orzeg3r1FjGFu7Uh39yhXLSb-rkXPKzu7xRy9DK1ufoz5Pkskw-haQkEbHkoIyCBhxGfqxS5BKlTteFbckZF63CCxIrMl~BgJTg3-OSVYBuCyiVQBFgU5R8Oa2CuTnIJVGdasFSgDTcCc1HsIyHjWq~oDWZOH2gL5h94nUlJf8O8JBaIByWMEWrFV~2fdMDtx4PxoEgVxi8FUzMKcYZDWjaxnfEb37~acgGoQPUVHmmdJNCDdEI3tqUIRDJjnCGpFKH3DhvNf~cka~8A0yfWMXIBTXmk~aYsTiN344yREfV8J7di9RJwQMAxKRcIXGyQQs851tHN~ZercUje3NiMLQ__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
Validation of selected STAT5 target genes. (A, B) Heatmaps of predicted target genes. We select Socs2 and Csn family genes (black arrows) as they are known to be crucial in either mammary gland development or lactation. In the heatmaps, we can observe a clear separation of these target genes between the time points p13 and L1. (C, D) IGV screenshots of the loci of Socs2 and the Csn family. We included a predicted track in the IGV screenshot that indicates high-affinity binding regions for the TF that are represented by a tick and a black box surrounding it. In (C), we see that we predict peaks in p13 near Socs2. From these data, we suggest that Socs2 mRNA expression is controlled by STAT5 [52, 53]. In (D), we can observe Pol2 tracks that show a distinct change in the expression of Csn family proteins between pregnancy and lactation. This indicates that STAT5 controls the expression of milk proteins.
Additionally, ELF5, another transcription factor that plays an important role in mammary gland development, was predicted to be relevant by TF-Prioritizer. Elf5 mRNA levels increase at the end of pregnancy and the beginning of lactation, hence supporting ELF5’s role in mammary gland development. Peaks were predicted with a sensitivity of 77.5%, a specificity of 80.5%, a precision of 81.6%, an accuracy of 79%, and an F1-score of 79.5% (Supplementary Fig. S2). TF-Prioritizer predicts ELF5 binding sites near Gli1. Gli1 mRNA levels are downregulated during lactation, and ELF5 is thus probably acting as a suppressor for Gli1. Fiaschi et al. [58] showed experimentally that Gli1-expressing females were unable to lactate, and milk protein gene expression was essentially absent (Supplementary Figs. S4 and S5, Supplementary Material S3, sec. ELF5).
TF-Prioritizer further prioritized ESR1 [59] and NFIB [30], both known for their essential function in mammary gland development and lactation (Supplementary Material S3, sec. ESR1 and NFIB). Our results suggest that the mechanisms of pregnancy, mammary gland development, and lactation could be dependent on Rho GTPase [60, 61] and its regulation by several TFs reported here. Experimental validation is needed to elucidate those complex processes further (see Supplementary Material S3, sec. Rho GTPase's role in pregnancy, mammary gland development, and lactation) [62].
Prioritized novel TFs with a predicted role in pregnancy, mammary gland development, and lactation
We predict 2 TFs, CREB1 and ARNT, suggesting a role in the processes of pregnancy, mammary gland development, and lactation.
CREB1 binding sites show considerable overlap with binding sites of other TFs known to be involved in mammary gland development and lactation, such as ELF5 (22% of binding sites overlap, log-likelihood score 6,914 with a sample size of 16,531), NFIB (29% binding sites overlap, log-likelihood score 15,793 with a sample size of 23,923), and STAT5A (21% binding sites overlap, log-likelihood score 5,902 with a sample size of 15,180) (see Supplementary Fig. S6A–C). The co-occurrences could be significant due to the high log-likelihood values with a high sample size in comparison to the whole co-occurrence table. We hypothesize that a correlation of association strength may offer additional evidence for a functional association between TFs. Indeed, CREB1 shows a moderate correlation of binding site affinities with NFIB, STAT5A, STAT5B, and ELF5 (Supplementary Fig. S7). Our results suggest that CREB1 regulates a member of the Rho GTPase gene family and a member of the Casein gene family. Since CREB1 has not yet been recognized to contribute to aspects of mammary development and physiology, further experimental validation of our findings is needed (Supplementary Material S3, sec. CREB1).
Furthermore, the TF ARNT is prioritized along with 2 cofactors and predicted to be more involved in mammary gland development but less involved in lactation due to its high expression levels during the last state of pregnancy and lower expression during lactation. However, experimental mouse genetics demonstrated that ARNT is not required for mammary development and function [63], suggesting the presence of alternative and compensatory pathways (Supplementary Material S3, sec. ARNT).
Comparing TF-Prioritizer and diffTF
We compared TF-Prioritizer against the state-of-the-art tool diffTF that prioritizes and classifies TFs into repressors and activators given conditions (e.g., health and disease) [16]. diffTF does not allow multiple conditions or time-series data and distinct analysis of histone modification peak data in a single run and does not consider external data for validation. We point out that diffTF cannot use different sample sizes between ChIP-seq and RNA-seq data (i.e., diffTF requires that for each ChIP-seq sample, there is an RNA-seq sample and vice versa). diffTF does not use a biophysical model to predict TFBS but uses general, not tissue-specific, peaks of TF ChIP-seq data and considers all consensus peaks as TFBS [16]. For a comparison of features and technical details, see Supplementary Table S2 and Supplementary Table S3, respectively. Since the diffTF tool does not provide an aggregation approach to different conditions, we aggregate the prioritized TFs the same way as TF-Prioritizer does (i.e., the union of all prioritized TFs overall runs using diffTF's default q value cutoff of 0.1) to enhance the comparability of the overall conditions in the final results. In summary, diffTF prioritized 300 TFs compared to the 104 TFs (including combined TFs like Stat5a..Stat5b that count as 1 TF in TF-Prioritizer) that TF-Prioritizer reported (Fig. 4A). It thus seems that diffTF is less specific than TF-Prioritizer (see Supplementary Table S4 for a comparison of prioritized TFs). diffTF also finds known TFs that TF-Prioritizer captures (e.g., STAT5A, STAT5B, ELF5, and ESR1) but does not capture the well-known TF NFIB. diffTF also prioritizes CREB1 and ARNT, which, in our opinion, are strong candidates for experimental validation. By deploying 20 cores on a general computing cluster, TF-Prioritizer took roughly 7.5 hours to be fully executed, and diffTF took approximately 41 hours to be fully executed. Due to the high number of TFs that are prioritized by diffTF, we ranked the TFs after their P value (where a low P value indicates higher evidence that a TF is involved in the processes) provided by diffTF and cut off the exact same amount of TFs (104 TFs) that are prioritized by TF-Prioritizer to make the benchmarking more comparable and interpretable. We observe that the known TFs drop out (e.g., STAT5A, STAT5B, ELF5, NFIB, ESR1) (Fig. 4B). CREB1, which we suggest to be a good candidate for experimental validation, can still be found in diffTF’s prediction. Notably, only 22 TFs are prioritized by both TF-Prioritizer and diffTF by using this cutoff.

Venn diagram of prioritized TFs by TF-Prioritizer and diffTF. (A) diffTF and TF-Prioritizer found 62 (18.2%) common TFs. diffTF and TF-Prioritizer found known TFs (e.g., STAT5A, STAT5B, ELF5, and ESR1), but diffTF did not capture the well-known TF NFIB. diffTF and TF-Prioritizer both prioritized CREB1 and ARNT as candidates for experimental validation. (B) We ranked the diffTF results by P value and consider the top 104 (the same amount of TFs that the TF-Prioritizer predicted). Here only CREB1 is still predicted to be important by diffTF—other TFs such as STAT5A..STAT5B, ELF5, and NFIB drop out.
Limitations and considerations
TF-Prioritizer has several limitations. TF-Prioritizer is heavily dependent on the parameters of the state-of-the-art tools it is using (e.g., providing Hi-C data to TEPIC could have a significant impact on the search window while linking potential CREs to target genes). We also point out that we neither have any experimental evidence nor existing literature as proof that the default length of 500 bps of the dip model used in the extended TEPIC framework is the ideal cutoff.
We want to highlight the main disadvantage of using the TF-TG score as we significantly center the surveillance of TF-Prioritizer on genes showing a high fold change or high expression, which does not necessarily mean that those genes are the most relevant for a condition. Also, note that TF binding behavior is regulated by factors we do not observe here, such as phosphorylation. The results of the discounted cumulative gain ranking should be considered with care since the biologically most relevant TFs may manifest in only a subset of ChIP-seq data types.
The calculation of TP, TN, FP, and FN is only an approximation, as to the best of our knowledge, there is no known approach to determine if a CRE or TFBS is active in a condition or not. Sensitivity, specificity, precision, accuracy, and the harmonic mean of precision and sensitivity (F1) differ from TF to TF. We believe this is correlated with the prevalence of the binding sites or the motif specificity. We can also see a decline in the metrics if we look at cofactor regulation (Fig. 5A, AHR..ARNT, ARNT, and ARNT..HIF1A). We experience the highest performance of TF-Prioritizer by looking at TFs where no cofactor regulation is currently known or widely accepted (e.g., CREB1, ELF5, ESR1).

(A) Overview of performance metrics of prioritized TFs discussed in this article. (B) Contributions of individual components of the TF-TG score to the accumulated TF-TG score. We systematically considered different components of the TF-TG score (i.e., the score of TEPIC, LOG2FC, and DYNAMITE) as well as their combinations to determine their importance for the overall results. We find all important TFs exclusively using the TF-TG score. (C) Investigation of which TFs are reported in which assay. We can see that the most important TFs only manifest in a subset of HMs.
We further investigated the contribution of every single part of the TF-TG score to the number and quality of the prioritized TFs. To achieve this, we ran every combination of the components of the score (i.e., log2fc, TEPIC, DYNAMITE) with TF-Prioritizer. In Supplementary Table S5, we can see that the distribution analysis filters out about half of the TFs and only returns the most promising TFs. In Fig. 5B, we can see that ELF5, AHR..ARNT, and ARNT..HIF1A manifest in each of the scores independent of any combination. NFIB, CREB1, and ARNT manifest in any score that is related to TEPIC or DYNAMITE. ESR1 manifests in any score that is related to the LOG2FC. STAT5A..STAT5B only manifests in certain combinations of the scores or in the TF-TG score. The LOG2FC alone yields the most prioritized TFs, but at a closer look, the LOG2FC alone would miss NFIB, which is highly relevant in mammary gland development. Looking at these data, we believe that the TF-TG score that combines TEPIC, DYNAMITE, and LOG2FC results in the most promising TFs that are relevant.
In Figure 5C, we can see that STAT5A..STAT5B and ARNT only manifest in the HM H3K4me3. ELF5, CREB1, and NFIB only manifest in H3K27ac. ESR1, AHR..ARNT, and ARNT..HIF1A manifest in both HMs H3K4me3 and H3K27ac. As expected, most TFs only manifest in a subset of HMs, reflecting their association with certain chromatin states [64, 65].
Unraveling the specificity of TFs with respect to HM ChIP-seq, ATAC-seq, and DNase-seq
The ENCODE project generated a plethora of different assays for cell lines such as K562 and MCF-7, which we used here to determine to what extent different protocols (i.e., ATAC-seq, DNase-seq, and HM-ChIP-seq) are suited to reveal condition-specific TFs.
In total, we discovered 381 unique TFs (339 across 11 HM ChIP-seq experiments, 83 in ATAC-seq, and 96 in DNase-seq) if ATAC-seq and DNase-seq open chromatin peaks were processed with HINT to obtain footprints (Fig. 6, Supplementary Fig. S8A–C, Supplementary Fig. S9A–D). Interestingly, the efficacy of footprinting varies between the protocols significantly. Supplementary Fig. S9 shows differences in the number of footprints detected between both protocols. While the number of open chromatin peaks was nearly the same for both protocols, DNase-seq yields fewer footprints compared to ATAC-seq. In general, TF-Prioritizer reports more TFs when using footprinting compared to using open chromatin peaks. Many of these overlap with ChIP-seq TFs, confirming that footprinting is a meaningful strategy (Supplementary Fig. S8A, B, Fig. 6). We found TFs that can only be detected in a subset of the protocols (Fig. S6A, B, Supplementary Table 6). Using ChIP-seq data, we found the largest number of TFs, likely due to the combination of results from 10 different histone modifications and 1 histone variant, which together cover a wide variety of chromatin states. We found the largest number of detected TFs using the H2AFZ histone variant, possibly due to background peaks because of low antibody sensitivity in this histone variant. Of note, in Supplementary Fig. S10A, B, we investigated how the number of identified TFs differs when excluding H2AFZ. We can see a decrease in the total number of prioritized TFs in ChIP-seq from 339 to 301. We further examined how the number of identified TFs changes when only employing frequently studied HM ChIP-seq data from H3K27ac, H3K4me1, and H3K4me3 (Supplementary Fig. S10C, D). We can observe a decrease in identified TFs from 339 to 152, but again, the overlap with ATAC-seq and/or DNase-seq drops. H2AFZ is predominantly found in CREs and is also associated with cancer [66]. Since we have only investigated cancer cell lines, it remains unclear if this histone variant is generally highly informative of TF binding or if this is limited to cancer cells. Surprisingly, DNase-seq and ATAC-seq show a comparably small overlap even though both protocols are aimed at measuring chromatin accessibility. This corroborates earlier findings where it was observed that both protocols reveal assay-specific sites that contribute to predicting gene expression [67].

Guide to determine which experiments fit best by the usage of ATAC-seq, DNase-seq, or several histone modifications. (A) We combined all HM ChIP-seq data and investigated the overlap with ATAC-seq and DNase-seq. We found that ATAC-seq and ChIP-seq have a bigger overlap than ATAC-seq and DNase-seq. We found 26 TFs that are prioritized by all 3 protocols. (B) We separated the TFs of the HM ChIP-seq data in which HMs they were discovered. We can see huge differences between the HMs (e.g., while we can discover 137 TFs in H2AFZ, we can only discover 2 in H4K20me1).
Indeed, some TFs known to be important for both cancer cell lines were reported through several protocols, while others were reported by only 1 protocol. For instance, we found MYC, a key TF for cell proliferation in K562 and MCF-7 cells [68, 69], was highly ranked in ATAC-seq and HM ChIP-seq (H3K4me2, H3K79me2). Conversely, GATA1, another TF important for cell differentiation in K562 [70, 71], was prioritized only by DNase-seq. GATA1 regulates MYB, a key hematopoietic TF involved in stem cell self-renewal and lineage decisions that is prioritized in HM ChIP-seq (H2AFZ, H3K27ac, H3K4me2) [71, 72]. TF-Prioritizer found many members of the SP (SP1, SP2, SP3, SP4, SP8, and SP9) and KLF (KLF1, KLF2, KLF3, KLF4, KLF6, KLF7, KL8, KLF9, KLF10, KLF11, KLF12, KLF14, KLF15, and KLF16) family to be important for K562 cell differentiation in a plethora of HM ChIP-seq, ATAC-seq, and DNase-seq experiments. Notably, TF-Prioritizer uses an individual TF energy pattern during the calculation of TF affinity to potential binding (i.e., TRAP) for each TF of a TF family. The incorporation of TF expression data in our score further boosts this differentiation between TFs of the same family. We identified 6 of 9 TFs from the SP TF family and 14 of 16 TFs from the KLF TF family [73]. Hu et al. [74] found that the SP and KLF TF families are most important in erythroid differentiation in K562 cells and that SP1 and SP3 are involved in activating GATA1 [75].
We further investigated if TF-Prioritizer found biologically relevant TFs for the MCF-7 cell line. We found ELF5, an important TF in breast cancer, to be prioritized in ATAC-seq, DNase-seq, and HM ChIP-seq (H2AFZ). This is of particular interest, as ELF5 is a strong biomarker in breast cancer, and TF-Prioritizer is capable of prioritizing ELF5 in the ATAC-seq, DNase-seq, and ChIP-seq [76–78]. Piggin et al. [78] also postulated that ELF5 modulates the estrogen receptor. TF-Prioritizer found certain estrogen receptors (e.g., ESR2, ESRRG) to be relevant for cell differentiation in MCF-7. Estrogen receptor proteins are highly relevant in breast cancer [79, 80]. The TF GATA3 was also predicted (ATAC-seq, H3K27ac, H3K9ac) to be important for cell differentiation in MCF-7. GATA3 is a key player when it comes to cell differentiation in the MCF-7 cell line [81, 82] and a regulator of estrogen receptor proteins [83]. FOXA1, predicted by TF-Prioritizer (ATAC-seq), is important in cell differentiation for MCF-7 cell lines, is a critical determinant of estrogen receptor function, and affects the proliferation activity of breast cancer [84, 85].
Conclusion and Outlook
TF-Prioritizer is a pipeline that combines RNA-seq and ChIP-seq data to identify condition-specific TF activity. It builds on several existing state-of-the-art tools for peak calling, TF-affinity analysis, differential gene expression analysis, and machine learning tools. TF-Prioritizer is the first tool to jointly consider multiple types of modalities (e.g., different histone marks and/or time-series data) and provide a summarized list of active TFs. A particular strength of TF-Prioritizer is its ability to integrate all of this in an automated pipeline that produces a feature-rich and user-friendly web report. It allows interpreting results in the light of experimental evidence (TF ChIP-seq data) either retrieved automatically from ChIP-Atlas or user-provided and processed into genome browser screenshots illustrating all relevant information for the target genes. Our approach was heavily inspired by DYNAMITE [25, 86], which follows the same goal but requires manually performing all necessary steps.
We show that TF-Prioritizer is capable of identifying already known and validated TFs (e.g., STAT5, ELF5, NFIB, ESR1) that are involved in the process of mammary gland development or lactation and their experimentally validated target genes (e.g., Socs2, Csn milk protein family, Rho GTPase associated proteins). Furthermore, we prioritized some not yet recognized TFs (e.g., CREB1, ARNT) that we suggest as potential candidates for further experimental validation. These results led us to hypothesize that the Rho GTPases undergo major changes in their tasks during the stages of pregnancy, mammary gland development, and lactation, which are regulated by TFs.
In conclusion, each protocol and histone modification can unravel unique transcription factor binding sites that provide insight into gene regulatory mechanisms. It is our opinion that employing TF-Prioritizer on as many protocols and HM ChIP-seq experiments as possible could improve our understanding of given conditions.
In the future, we plan to extend TF-Prioritizer to more closely explore the combined effects of enhancers, which are often nonadditive, as suggested by our current model [87]. We further plan to test the functionality of TF-Prioritizer on ATAC-seq data and to apply TF-Prioritizer in a single-cell context where histone ChIP-seq is currently hard to retrieve. Furthermore, we plan to include a more detailed ranking of the prioritized TFs. We plan to offer the user the ability to apply raw FASTQ files to TF-Prioritizer, where quality checks of the data will be performed. In summary, TF-Prioritizer is a powerful functional genomics tool that allows biomedical researchers to integrate large-scale ChIP-seq and RNA-seq data, prioritize TFs likely involved in condition-specific gene regulation, and interactively explore the evidence for the generated hypotheses in the light of independent data.
Availability of Source Code and Requirements
Project name: TF-Prioritizer
Project homepage: [89]
Operating system(s): Linux
Programming language: Java
Other requirements: Java version 11.0.14 or higher, Python version 3.8.5 or higher, R version 4.1.2 or higher, C++ version 9.4.0 or higher, CMAKE version 3.16 or higher, Angular CLI version 14.0.1 or higher, Node.js version 16.10.0 or higher, Docker version 20.10.12 or higher, and Docker-Compose version 1.29.2 or higher
Open source license: GNU GPL v. 3.0
Additional Files
Supplementary Figure 1: TF-Prioritizer uses nf-core ChIP-seq / ATAC-seq and nf-core RNA-seq preprocessed data as input files (see GitHub repository for detailed formatting instructions). More specifically, broad peaks and gene counts. (1) Once started, the pipeline downloads necessary data (gene lengths, gene symbols, and short descriptions of the genes) from bioMar . (2) The user can then decide to use a transcript per million (TPM) filter or a gene count filter to filter before DESeq2 usage. We also allow for batch correction in DESeq2. TF-Prioritizer uses a TPM filter of 1 as default. DESeq2 normalizes and calculates the log2 fold change (log2fc) from raw gene count data . If the user used ATAC-seq as an input, we use the footprint method HINT to process the peaks, for this process we additionally expect BAM files in the same directory format as the peaks from the user. In parallel, (3) TF-Prioritizer preprocesses the ChIP-seq broad peaks by filtering blacklisted regions . We recommend using the sample combination option to combine similar broad peak samples into one peak file, as the total runtime of the pipeline is reduced significantly without losing the quality of the data. (4) Optionally, the user can decide to use TGene to predict links between target genes and regulatory elements combining distance and histone/expression correlation. If the TGene option is not activated, TEPIC, executed in the next step of the pipeline, uses a window-based approach to link regulatory elements to target genes. (5) TEPIC uses TRAP, an approach that quantifies transcription factor affinity scores based on a biophysical model for regulatory regions . TEPIC “computes TF affinities and uses open-chromatin/HM signal intensity as quantitative measures of TF binding strength”. TEPIC uses “machine learning to find low-affinity binding sites to improve the ability to explain gene expression variability compared to the standard presence/absence classification of binding sites” . In addition, especially for histone modification ChIP-seq data, we extended the TEPIC framework so that it can also search for transcription factor binding sites (TFBS) between two peaks that have close (~500 bps) genomic positions (default: search between two peaks). (6) The pipeline then executes DYNAMITE an approach that uses a “sparse logistic regression classifier to infer TFs related to gene expression changes between samples” . (7) We added a distribution analysis to the pipeline to further prioritize TFs depending on their distribution compared to the global distribution using (8) a Mann-Whitney U test and the comparison of the means and the medians (for details see Materials and Methods Distribution Analysis Section). (9) We then use a discounted cumulative gain approach to retrieve a global ranking (overall histone modification data provided) of prioritized TFs (see Materials and Methods Discounted Cumulative Gain Section). (10) In the following, TF-Prioritizer generates condition-specific and histone-modification-specific heatmaps for prioritized TFs and their predicted target genes. (11) We then check if we can find publicly available tissue-specific TF ChIP-seq data from ChIP-ATLAS and (12) download the files. (13) Afterward, we take screenshots using the IGV. (14) In the last step, we conclude all analysis and plots in form of an easy-to-use HTML report that could also be used as a webpage.
Supplementary Figure 2: We showcase the discussed TFs and their statistical metrics. We can see the confusion matrix for each TF. We also provide sensitivity, specificity, precision, accuracy, and F1-score. We can see that the metrics differ vastly between the TFs. There is a drop in all metrics when it comes to co-factors. We believe that more research is necessary to obtain better predictions for co-factor TFs.
Supplementary Figure 3: We show the predicted peaks and experimental signals of DDR1. We can see higher Pol2 signals in the area of DDR1 due to higher expression during lactation. We furthermore can also observe a predicted peak of STAT5 in the DDR1 region. Past research showed that DDR1 is necessary to maintain STAT5 signaling during lactation.
Supplementary Figure 4: Validation of selected target genes for Elf5. (a) and (b) show heat maps of predicted target genes. We select Gli1, Lcp1, and Igfals (black arrows) as they are already known to be crucial in either mammary gland development or lactation. We further select the genes Arhgap9, Arhgef2, and Arhgap39 (black arrows) that are known to be essential for Rho GTPases due to their studied role in epithelial morphogenesis during mammary gland development [54,55]. In the heatmaps, we can observe a clear separation of these target genes between the time points p6-L1 and p6-L10. (c) and (d) show IGV screenshots of Arhgap9/ Gli1 and Lcp1 respectively. We included a predicted track in the IGV screenshot that indicates high-affinity binding regions for the TF that are represented by a tick and a black box surrounding it. In (c), we can see predicted Elf5 peaks near Arhgap9 and Gli1. ChIP-Atlas and the experimental TF ChIP-seq data substantiate the prediction near Arhgap9. Experimental data of Elf5 back up the predictions near Gli1. We can also observe upregulated Pol2 activity in L1 in this area. In (d) we can see multiple predictions of Elf5 bindings near Lcp1. ChIP-Atlas and the experimental TF ChIP-seq data corroborate the bindings of Elf5 in this area. We also observe an upregulated Pol2 activity in time points L1 and L10 in this area.
Supplementary Figure 5: We showcase three more examples of predicted peaks to experimental data for the transcription factor ELF5. With our predictions, we found two more genes associated with the Rho GTPase (ARHGAP39, ARHGEF2) and IGFALS that are known to play a role in mammary gland development and lactation (see Suppl. Material 2 for detailed discussion).
Supplementary Figure 6: We show the co-occurrence analysis of prioritized TFs. We can see that CREB1 has a high overlap of peaks with ELF5, NFIB, STAT5A, and STAT5b which are all key players in mammary gland development and lactation.
Supplementary Figure 7: We show the binding sites that co-occur between CREB1 and STAT5A..STAT5B, NFIB, or ELF5. We can see that there is a positive trend between the TFs. IF CREB1 has a higher binding affinity, the other TF that co-occurs on the same binding site also has a higher binding affinity on average.
Supplementary Figure 8: The above plots describe the common TFs across the different methods, ATAC-seq, DNase-seq, and ChIP-seq histone modifications, without correcting for the technical biases between the protocols using HINT. a) shows the overlapping TFs between ChIP-seq, ATAC-seq, and DNase-seq independent of single histone modifications. b) displays individual intersections of TFs between all possible combinations grouped by ATAC- and DNase-seq. c) represents ungrouped intersections between groups of the first X biggest overlaps.
Supplementary Figure 9: a) Analysis of overlaps between open-chromatin peaks in ATAC-seq and DNase-seq. b) Analysis of overlaps between open-chromatin peaks in ATAC-seq, DNase-seq, and ChIP-seq. c) Analysis of protocol bias-corrected footprints between ATAC-seq and DNase-seq. d) Analysis of protocol bias-corrected footprints between ATAC-seq, DNase-seq, and open-chromatin peaks of ChIP-seq.
Supplementary Figure 10: a) and b) Due to the high number of TFs found in H2AFZ, we excluded this histone variant. We can see that the number of identified TFs dropped from a total of 339 to 301. However, it also excluded some TFs that were identified by ATAC-seq and DNase-seq. c) and d) shows how the number of identified TFs behave if one only includes the frequently used HM ChIP-seq data H3K4me3, H3K4me1, and H3K27ac in comparison to DNase-seq and ATAC-seq. We can see a drop in totally identified TFs from 339 to 152 in ChIP-seq. However, also the number of overlaps between ATAC-seq and DNase-seq drops.
Supplementary Material 1: ENCODE file identifiers
Supplementary Material 2: Confusion matrices and the calculation of sensitivity, specificity, precision, accuracy, and F1-score
Supplementary Material 3: Biological findings
Supplementary Table 1: Feature comparison between TEPIC2 + DYNAMITE and TF-Prioritizer
Supplementary Table 2: Feature comparison between TF-Prioritizer and diffTF
Supplementary Table 3: Technical comparison between TF-Prioritizer and diffTF
Supplementary Table 4: Comparison of prioritized transcription factors between TF-Prioritizer and diffTF
Supplementary Table 5: Comparison of prioritized transcription factors before and after the filtering of the background distribution
Supplementary Table 6: Guide which TF was found in which protocol and HM
Abbreviations
Ahr: aryl hydrocarbon receptor; Arhgap9: Rho GTPase activating protein 9; Arhgap12: Rho GTPase activating protein 12; Arhgap39: Rho GTPase activating protein 39; Arhgef1: Rho guanine nucleotide exchange f
actor 1; Arhgef2: Rho/Rac guanine nucleotide exchange factor 2; Arhgef9: Cdc42 guanine nucleotide exchange factor 9; Arhgef18: Rho/Rac guanine nucleotide exchange factor 18; Arhgef40: Rho guanine nucleotide exchange factor 40; Arnt: aryl hydrocarbon receptor nuclear translocator; bp: base pair; ChIP: chromatin immunoprecipitation; CRE: cis-regulatory element; Creb1: CAMP responsive element binding protein 1; Csn: Casein proteins; Csn1s2a: Casein alpha S2 like A; Csn1s2b: Casein alpha S2 like B; Csn2: Casein beta; Csnk1e: Casein kinase 1 epsilon; Csnk2a2: Casein kinase 2 alpha 2; Csnk2b: Casein kinase 2 beta; Ddr1: discoidin domain receptor tyrosine kinase 1; Elf5: E74 like ETS transcription factor 5; Esr1: estrogen receptor 1; Ets2: ETS proto-oncogene 2, transcription factor; F1-score: harmonic mean between precision and sensitivity; FN: false negatives; FP: false positives; Gli1: GLI family zinc finger 1; HM: histone modification; Hif1a: hypoxia inducible factor 1 subunit alpha; Igfals: insulin-like growth factor binding protein acid labile subunit; IGV: Integrative Genome Viewer; L1: lactation day 1; L10: lactation day 10; Lcp1: lymphocyte cytosolic protein 1; mRNA: messenger RNA; MWU: Mann–Whitney U test; Nfib: nuclear factor I B; p6: pregnancy day 6; p13: pregnancy day 13; Socs2: suppressor of cytokine signaling 2; Stat5 (composition of Stat5a and Stat5b): signal transducer and activator of transcription 5A + signal transducer and activator of transcription 5B; TF: transcription factor; TFBS: transcription factor binding sites; TG: target gene; TF-Gene score: retrieved by TEPIC; TF-TG score: retrieved by the distribution analysis; TP: true positives; TPM: transcripts per million; Tp53: tumor protein p53; log2fc: log2 fold-change.
Authors’ Contributions
M.H., N.T., F.S., J.B., M.S., D.B.B., L.H., and M.L. drafted the concept for this pipeline. M.H., N.T., O.L., and K.Y. implemented the pipeline. M.H., L.S., and N.T. conceptualized and implemented the ATAC-seq and DNase-seq integration. J.J. and H.K.L. created the experimental data in the laboratory. J.J., H.K.L., L.-L.W., K.Y., and N.B. prepared the data and performed quality checks. M.H., N.T., D.B.B., L.H., and M.L. wrote the manuscript. All authors reviewed the manuscript and approved it.
Funding
This work was supported by the Technical University Munich—Institute for Advanced Study, funded by the German Excellence Initiative. This work was supported by the German Federal Ministry of Education and Research (BMBF) within the framework of the *e:Med* research and funding concept (*grants 01ZX1908A / 01ZX2208A* and *grants 01ZX1910D / 01ZX2210D*). This work was supported by the Deutsche Forschungsgemeinschaft SFB1531 (S03, project number 456687919). D.B.B. was supported by the German Federal Ministry of Education and Research (BMBF, grant no. 031L0309A). This work was supported in part by the Intramural Research Programs (IRPs) of National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK). This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 777111. Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – 422216132. This publication reflects only the authors' view and the European Commission is not responsible for any use that may be made of the information it contains.
Competing Interests
The authors declare no competing interests.
Data Availability
All supporting data and materials are available in the GigaScience GigaDB database [88].
Acknowledgement
We thank Christina Trummer for designing the TF-Prioritizer logo and Andreas Niekler for his help with the statistical analysis of the co-occurrence analysis.
Figures were created with https://www.biorender.com. Parts of the images were designed using resources from Flaticon.com.
References
Author notes
The first two authors should be considered shared first authors.
The last two authors should be considered shared last authors.