BloodChIP Xtra: an expanded database of comparative genome-wide transcription factor binding and gene-expression profiles in healthy human stem/progenitor subsets and leukemic cells

Abstract The BloodChIP Xtra database (http://bloodchipXtra.vafaeelab.com/) facilitates genome-wide exploration and visualization of transcription factor (TF) occupancy and chromatin configuration in rare primary human hematopoietic stem (HSC-MPP) and progenitor (CMP, GMP, MEP) cells and acute myeloid leukemia (AML) cell lines (KG-1, ME-1, Kasumi1, TSU-1621-MT), along with chromatin accessibility and gene expression data from these and primary patient AMLs. BloodChIP Xtra features significantly more datasets than our earlier database BloodChIP (two primary cell types and two cell lines). Improved methodologies for determining TF occupancy and chromatin accessibility have led to increased availability of data for rare primary cell types across the spectrum of healthy and AML hematopoiesis. However, there is a continuing need for these data to be integrated in an easily accessible manner for gene-based queries and use in downstream applications. Here, we provide a user-friendly database based around genome-wide binding profiles of key hematopoietic TFs and histone marks in healthy stem/progenitor cell types. These are compared with binding profiles and chromatin accessibility derived from primary and cell line AML and integrated with expression data from corresponding cell types. All queries can be exported to construct TF–gene and protein–protein networks and evaluate the association of genes with specific cellular processes.


Introduction
Transcription factors (TFs) and the cis -regulatory DNA sequences they bind are components of gene regulatory networks (GRNs) that control gene expression and ultimately cell identity (1)(2)(3).GRNs are comprised of genes, their associated regulatory elements (promoters and cis -regulatory elements (CREs) such as enhancers), and DNA binding proteins, including TFs, often as part of multi-factor regulatory complexes which are increasingly understood to contain RNA as well as proteins.The accessibility of regulatory elements is modulated by direct DNA modifications such as CpG methylation and post-translational modification of histone proteins, and even when CREs are in an open conformation the DNA sequence of such elements at least partially determines which TFs can bind ( 4 ,5 ).Interactions between promoters and their CREs, mediated by chromatin loops and complexes of transcriptional regulators, modulate transcriptional burst frequency and output and therefore cell identity.
Stem cells have the capacity to either self-renew or give rise to daughter cells with increasingly specialized function and restricted proliferative capacity, and changes in cell identity occurring over differentiation trajectories are directly related to altered transcriptional programs controlled by lineage specific GRNs.Hematopoiesis is one of the best characterized developmental systems in vertebrates, serving as a model system for the multi-lineage differentiation of adult stem cells.Hematopoietic stem cells (HSCs) maintain production of circulating blood cells via their capacity to either self-renew or differentiate to mature cell types.The most primitive HSCs have multi-lineage potential but give rise to progenitor cells with increasing lineage restriction ( 3 ,6 ).However, the components and hierarchy of the GRNs which control the identity of human adult HSCs and their progeny, and the mechanisms by which these are hijacked or reactivated in leukemic cells, remain incompletely understood.
TFs act in the context of multi-protein complexes that bind distal regulatory elements and interact with promoters to initiate, modulate, or suppress gene expression in conjunction with a plethora of chromatin modifiers and transcription initiation / elongation factors.A core set of seven TFs (heptad: ERG, FLI1, GA T A2, RUNX1, T AL1, LYL1 and LMO2), all known regulators of healthy and leukemic hematopoiesis (7)(8)(9)(10)(11)(12)(13)(14), work in combination to regulate gene expression in hematopoietic stem and progenitor cells (HSPCs).Heptad TFs also contribute to stem cell-like gene expression signatures in leukemic cells, where their presence correlates with poor patient survival ( 15 ).Knowledge of how these TFs cooperate and interact with other lineage-specific TFs as HSCs make lineage commitment decisions is vital to improving our understanding of both healthy and leukemic blood development.
To this end, we have recently generated high-quality ChIPmentation and ChIPseq data for heptad TFs, CTCF, STAG2, the master myeloid regulator PU.1, H3K27ac, H3K4me3, H3K27me3 and also H3K27ac HiChIP in rare hematopoietic subpopulations (stem / multipotent progenitors [HSC-MPP] plus common myeloid [CMP], granulocyte macrophage [GMP] and megakaryocyte erythrocyte [MEP] progenitors) spanning HSC commitment along the myeloid / erythroid lineages ( 5 ).To optimize the utility of these data, BloodChIP Xtra integrates our chromatin occupancy data with published chromatin occupancy and accessibility, plus expression data, from healthy HSPCs, primary AML and AML cell lines.These datasets significantly increase the volume of data that can be queried compared to our earlier database BloodChIP ( 16 ).All analyzed and filtered data sets can be exported for downstream analysis, and we also provide links to external resources such as the UCSC genome browser to visualize data and the Gene Expression Atlas to query expression in additional cell types.

Data sources and analysis strategy
Data sources are summarized in Supplementary Table S1 ( 5 ,17-29 ), and a schematic showing included datasets is shown in Figure 1 .A summary of data analysis and output formats is shown in Figure 2 .

A T ACseq data
A T ACseq peak coordinates (summit ±250 bp) for healthy human blood populations were downloaded from the gene expression omnibus (GEO) ( 30 ) accession GSE74912 ( 28 ) and converted from the hg19 to the hg38 reference genome using the UCSC LiftOver tool ( 31 ).The resulting 590650 peaks were assigned to genes using the Genomic Regions Enrichment of Annotations Tool (GREAT) v4.0.4 using the basal extension method with default parameters ( 32 ,33 ).
Sequence files downloaded from the Sequence Read Archive (SRA) ( 34 ) or ArrayExpress ( 35 ) (Supplementary Table S1), were converted to fastq format, and aligned to the GRCh38 genome using BWA v0.7.17 ( 36 ).Mitochondrial and duplicate reads were removed using picardMarkDuplicates, reads shifted to account for Tn5-mediated excision using alignmentSieve, replicate samples merged, peaks called with MACS2 v 2.2.7.1 ( 37 ) with a minimal threshold P -value of 1 × 10 −5 , and tracks for visualization in the UCSC browser generated using deepTools bamCoverage with RPKM normalization ( 38 ).Samples from E-MTAB-9021 were single end reads, and in this case, the alignmentSieve step was omitted from the processing pipeline.A synthetic HSC-MPP A T ACseq track for visualization on the UCSC browser was generated by merging all bam files from HSC and MPP replicates.A T ACseq tracks are available for visualization at http: // genome.ucsc.edu/s/ PimandaLab/ BloodChIP _ Xtra _ A T AC .

Peak overlaps
Intersecting regions between ChIPseq and A T ACseq peaks and the set of 590650 peaks were identified using bedtools intersect ( 40 ).An intersection was considered only if > 10% of the peak width of the ChIPseq peak overlapped with the A T ACseq peak.A T ACseq peaks with no corresponding ChIPseq or A T ACseq peak in any of the included datasets were then excluded from the final data set which included 387291 peaks.

Database implementation, web interface and availability
The BloodChIP Xtra database is stored and accessed in both Comma-Separated Values (CSV) and Tab-Separated Values (TSV) formats.Our user-friendly web interface, accessible at http:// bloodchipXtra.vafaeelab.com/, has been developed using JavaScript, simplifying data access and enabling the visualization of transcription factor (TF) binding and gene  expression through Python-based scripts.This interface leverages Plotly ( https://plot.ly ) for generating downloadable gene expression box plots.The complete dataset is also available for direct download http://bloodchipXtra.vafaeelab.com/downloads/.

General web interface
The default BloodChIP Xtra interface displays data filters, search boxes and a table listing all annotated genes in the dataset (Supplementary Figure S1A).The number of annotated A T ACseq peaks (Binding Profiles) associated with each gene is shown along with links to the UCSC genome browser to allow visualization of TF, epigenetic (Epi) and A T ACseq (A T AC) profiles at that gene locus.The binding status of any combination of TFs / histone and cell types can be displayed simultaneously; the specific binding profiles displayed can be customized using the display filter.A summary table of TF binding at regions associated with each gene or a plot showing expression of individual genes across primary healthy and leukemic cells, and AML-cell lines can also be displayed.
Clicking the arrow expands the gene row to allow exploration of peaks associated with that gene (Supplementary Figure S1B).The genome coordinates of each peak are shown, with red indicating binding of a given TF at that peak in the indicated cell type, and blue indicating no binding peak detected.Peak / cell type combinations with no data are shown in gray.

Querying the database
The database may be queried by either gene name or TF binding (Supplementary Figure S1A).For gene queries, single genes or multiple genes may be entered.For retrieval of genes po-tentially regulated by a particular TF in a particular cell type, users can construct a query with any combination of TFs and cell types to identify genes that are combinatorially bound by multiple TFs ('AND') or bound by any of the TFs ('OR').

Downstream analysis
To facilitate downstream analysis and data visualization, links are provided to the Expression Atlas ( 44 ) to display transcriptomic and proteomic data for the selected gene across ple datasets.The web interface also export from a queried set of genes; TF binding information can be exported for individual genes, and the normalized expression values of all genes retrieved by a query can be exported to facilitate external such as clustering or generating heatmaps.

Utility of BloodChIP Xtra: step by step examples
BloodChIP Xtra allows users to query 10 TFs, A T ACseq and three histone modifications in eight cell lines.Additional A T ACseq profiles for fractionated primary AML cells can be visualized via the UCSC genome browser, and gene expression plots for > 800 unique samples spanning normal hematopoiesis and primary and AML cell lines can be displayed for any gene.To start a query, the user first enters the gene / s of interest.For example, searching for the gene, ERG returns a single row in the data table, while searching [ ERG GAT A2 T AL1 ] returns a row for each of those genes (Supplementary Figure S2A).Search hits can then be explored using the Gene Database Rows the Gene Database table show genes that were returned with the search and the number of genomic regions which mapped to that gene using GREAT are indicated in the 'Binding Profiles' column (Supplementary Figure S2A).Links are provided to pre-compiled sessions in the UCSC browser, and the user may select to visualize TF binding D 1135 (TF), histone marks, H3K27ac HiChIP , CTCF , RNApolII and STAG2 (Epi) or ATACseq (ATAC) data for all available cell types.Users may also display and visualize gene expression across all cell types by clicking the View button in the Gene Expression column which launches a pop up graph (Supplementary Figure S2B, ERG gene).Samples are grouped by cell type, with colours corresponding to the cell types shown in Figure 1 .Samples from the BeatAML2 cohort are plotted as a single group, and also as subgroups of clinically relevant mutational subtypes (Supplementary Table S2) ( 45 ).Individual cell types can be toggled on and off by clicking on the plot legend.Users may further explore the gene using the link to the Expression Atlas, or view a summary of all binding events assigned to that gene by GREAT (Supplementary Figure S2C, GATA2 gene).
Additional detail is revealed by clicking the arrow to the left of a gene name which will reveal the binding profiles and cell types selected in the Display Filters check box (Supplementary Figure S2D).The expanded table shows the presence (red) or absence (blue) of a peak for each factor and cell type at all regions mapped to that gene by GREAT (Supplementary Figure S2E).In the example shown for TAL1 there is pronounced binding of GA T A2, RUNX1, TAL1, LYL1 and LMO2 at chr1:47 232 180-47 232 680 which corresponds to the TAL1 promoter (Supplementary Figure S2F).Grey blocks indicate that no data is available for that factor / cell type combination.Data for any combination of TFs / histones and cell types can be shown simultaneously allowing easy visualization of comparative binding profiles.Additionally, the coordinates of each region are shown and linked to three UCSC browser sessions.
Alternatively, a more complex query can be constructed to return genes bound by selected factors in specific cell types.The 'Add filter' button allows multiple factor / cell type combinations to be entered, with AND / OR available to combine queries.Searching, for example, genes bound by the heptad in MEPs retrieves 856 genes; these genes then populate the Gene Database table (Supplementary Figure S2G).For additional analyses the full data tables containing peak calls for 125 factor / cell type combinations and gene expression data for > 800 individual samples are readily available on the download page.

Discussion
Combinatorial interactions of TFs at enhancers and promoters of co-regulated genes are a core component of the gene regulatory networks which determine cell identity ( 46 ).We have recently generated high-quality ChIPmentation and ChIPseq data for ten key TFs and three histone modifications, along with H3K27ac HiChIP, in rare hematopoietic subpopulations spanning HSC commitment along the myeloid / erythroid lineages ( 5 ).Here, we have integrated these data with additional TF occupancy, chromatin accessibility and gene expression data from healthy and leukemic cells (Figure 1 , Supplementary Table S1) and created a user-friendly database that allows users to (i) query overlapping TF binding at a gene locus of interest, (ii) discover binding coordinates and associated gene targets for TFs of interest and (iii) compare binding profiles between healthy and leukemic cells.Additional features include data downloads and links to further visualize and explore gene expression.The overall aim of this database is to facilitate understanding of dynamic changes to the transcrip-tional network of HSPCs during lineage commitment along the myeloid / erythroid axis and leukemogenesis.
Understanding the regulatory networks which underlie cell fate decisions, and the ways in which such networks are perturbed in leukemic cells, have important translational implications.For example, bone marrow failure syndromes with underlying genetic causes may be amenable to gene therapy approaches targeting HSCs to produce corrected differentiated cells.However, extensive knowledge of spatiotemporal gene regulation during lineage commitment is needed to avoid lineage skewing due to inappropriate transgene expression.Lineage specific enhancers have been successfully used to overcome this problem and rescue erythroid differentiation in Diamond-Blackfan Anaemia ( 47 ).Conversely, gene regulatory networks are perturbed in multiple AML subtypes ( 1 ,48 ), often via recurrent translocation events (49)(50)(51)(52)(53)(54), which are increasingly understood to include small structural variants which can mediate enhancer hijacking and oncogenic transformation (55)(56)(57)(58)(59). BloodChIP Xtra provides an accessible tool for both discovering lineage specific regulatory elements that may be exploited for gene therapy approaches or understanding the potential clinical consequences of novel structural variants in leukemic cells.
BloodChIP Xtra currently includes healthy HSC-MPP, CMP , GMP and MEP , primary AML and four AML cell lines for which relevant TF binding chromatin and / or accessibility data are available plus gene expression data from > 800 healthy and leukemic samples.As additional high-quality datasets are published, the database will be expanded to accommodate additional TF binding profiles and other genome regulators.We believe that BloodChIP Xtra, which integrates combinatorial TF binding with gene expression in normal and malignant cells, will be a useful tool for biologists and bioinformaticians working in hematopoiesis and leukemia and more generally to those working on gene regulation and stem cell biology.

Figure 1 .
Figure 1 .Sc hematic sho wing the datasets held b y the BloodChIP Xtra database and features of the w eb interf ace.T he database integrates TF ChIPseq, Histone ChIPseq, A T ACseq, H3K27ac HiChIP and RNAseq expression data, while the web interface provides methods to query and visualize data.Cell types shown in white have expression data only.

Figure 2 .
Figure 2. Schematic showing the data analysis pipeline and direct data visualizations included in BloodChIP Xtra.