ChIP-Atlas 3.0: a data-mining suite to explore chromosome architecture together with large-scale regulome data

Abstract
ChIP-Atlas (https://chip-atlas.org/) presents a suite of data-mining tools for analyzing epigenomic landscapes, powered by the comprehensive integration of over 376 000 public ChIP-seq, ATAC-seq, DNase-seq and Bisulfite-seq experiments from six representative model organisms. To unravel the intricacies of chromatin architecture that mediates the regulome-initiated generation of transcriptional and phenotypic diversity within cells, we report ChIP-Atlas 3.0 that enhances clarity by incorporating additional tracks for genomic and epigenomic features within a newly consolidated ‘annotation track’ section. The tracks include chromosomal conformation (Hi-C and eQTL datasets), transcriptional regulatory elements (ChromHMM and FANTOM5 enhancers), and genomic variants associated with diseases and phenotypes (GWAS SNPs and ClinVar variants). These annotation tracks are easily accessible alongside other experimental tracks, facilitating better elucidation of chromatin architecture underlying the diversification of transcriptional and phenotypic traits. Furthermore, ‘Diff Analysis,’ a new online tool, compares the query epigenome data to identify differentially bound, accessible, and methylated regions using ChIP-seq, ATAC-seq and DNase-seq, and Bisulfite-seq datasets, respectively. The integration of annotation tracks and the Diff Analysis tool, coupled with continuous data expansion, renders ChIP-Atlas 3.0 a robust resource for mining the landscape of transcriptional regulatory mechanisms, thereby offering valuable perspectives, particularly for genetic disease research and drug discovery.


Introduction
Multicellular organisms comprise different cell types with distinct phenotypes despite sharing a common genome in all somatic cells. This phenotypic diversity arises from cell type-specific gene expression and chromatin states orchestrated by genomic accessibility, DNA methylation status, histone modifications and transcription factors (TFs) binding to transcriptional regulatory elements. Numerous omics experiments have

Data collection
The sample metadata described in the BioSample database for all experiments were downloaded from the FTP sites of the National Center for Biotechnology Information (NCBI; ftp://ftp.ncbi.nlm.nih.gov/sra/reports/Metadata and ftp://ftp.ncbi.nlm.nih.gov/biosample) along with the monthly update of the NCBI Sequence Read Archive (SRA). In the SRA, each experiment is assigned an ID prefixed with SRX, DRX, or ERX (hereafter referred to as SRX), which is also used in ChIP-Atlas for unified data management. ChIP-Atlas compiled the data from SRXs meeting the following criteria: LIBRARY_STRATEGY: 'ChIP-Seq', 'ATAC-Seq', 'DNase-Hypersensitivity', or 'Bisulfite-Seq'; LIBRARY_SOURCE: 'GENOMIC'; taxonomy_name: 'Homo sapiens', 'Mus musculus', 'Rattus norvegicus', 'Caenorhabditis elegans', 'Drosophila melanogaster' or 'Saccharomyces cerevisiae'; and INSTRUMENT_MODEL: 'Illumina', 'NextSeq' or 'HiSeq'. Sample metadata (ftp://ftp.ncbi.nlm.nih.gov/biosample/biosample_set.xml) were used to extract the attributes for ChIP antigens as well as cell or tissue types for each SRX. To structure the metadata, we resolved notational inconsistencies introduced by submitters by manually annotating the names of both ChIP antigens and cell types, a task completed by well-trained curators who hold PhDs in molecular and developmental biology. Additional details can be found in our previous papers (1,2).
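The inclusion criteria listed above can be illustrated with a minimal Python sketch. It is not the ChIP-Atlas pipeline itself: the `ExperimentRecord` container and the `is_eligible` function are hypothetical names, and treating the INSTRUMENT_MODEL values as keywords matched anywhere in the model string is an assumption, since the text does not state how the match is performed.

```python
from dataclasses import dataclass

# Accepted values, taken from the criteria in the text.
ACCEPTED_STRATEGIES = {"ChIP-Seq", "ATAC-Seq", "DNase-Hypersensitivity", "Bisulfite-Seq"}
ACCEPTED_SPECIES = {
    "Homo sapiens", "Mus musculus", "Rattus norvegicus",
    "Caenorhabditis elegans", "Drosophila melanogaster",
    "Saccharomyces cerevisiae",
}
# Assumption: these are keywords searched within the instrument model string.
INSTRUMENT_KEYWORDS = ("Illumina", "NextSeq", "HiSeq")
ACCESSION_PREFIXES = ("SRX", "DRX", "ERX")


@dataclass(frozen=True)
class ExperimentRecord:
    """Hypothetical container for the metadata fields named in the text."""
    accession: str
    library_strategy: str
    library_source: str
    taxonomy_name: str
    instrument_model: str


def is_eligible(rec: ExperimentRecord) -> bool:
    """Return True if the record satisfies every inclusion criterion."""
    return (
        rec.accession.startswith(ACCESSION_PREFIXES)
        and rec.library_strategy in ACCEPTED_STRATEGIES
        and rec.library_source == "GENOMIC"
        and rec.taxonomy_name in ACCEPTED_SPECIES
        and any(key in rec.instrument_model for key in INSTRUMENT_KEYWORDS)
    )
```

A record that fails any single criterion (for example, a transcriptomic library source) is excluded, which mirrors the conjunctive way the criteria are stated.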

Implementation of the Diff Analysis tool
The Diff Analysis tool for detecting differentially bound regions (DBRs) from ChIP-seq or differentially accessible regions (DARs) from ATAC-seq and DNase-seq datasets is inspired by the R package 'DiffBind' (20). However, since DiffBind requires BAM files as input to perform comparative analysis, and BAM files are not available on the ChIP-Atlas server, we partially modified the algorithm for counting aligned reads within given ChIP-seq, ATAC-seq or DNase-seq peaks. In particular, upon user request, alignment data in bigWig format showing reads per million (RPM) were first converted to bedGraph format for each query SRX. RPM values were then converted to integer values using the total number of mapped sequencing reads for the SRX. Next, the entire genome was fragmented based on the peak calling data from the query SRXs. We then aggregated the number of sequence reads aligning with each genome fragment and organized the result into an m × n matrix, with m representing the number of genome fragments and n representing the number of query SRXs. The matrix was then entered into the R package 'edgeR' (21), and the difference in read counts between the two sets of query SRXs was assessed for each genome fragment using the standard algorithm used for detecting differentially expressed genes in comparative transcriptome analysis. The outcomes were then further summarized and documented in BED format, which includes coordinates of the genome fragment in columns 1-3 and corresponding in[...]
To detect differentially methylated regions (DMRs) from Bisulfite-seq datasets, it was also necessary to convert bigWig to bedGraph, which includes methylation rates for each query SRX. We used the DMR detector metilene (22); in particular, 'metilene_input.pl' provided by metilene was used to aggregate methylation rates per genomic base for each query SRX, using the bedGraph of the query SRX generated in the prior step as input. The resulting TSV file was then used as the input into the main 'metilene' program, which returns DMRs along with statistics such as mean methylation differences and Q-values in a BED format. Default parameters were applied when executing both the 'metilene_input.pl' script and the 'metilene' command, in which the minimum mean methylation difference for calling DMRs was set to 0.1.
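The read-counting steps for DBRs and DARs described in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the ChIP-Atlas implementation: all function names are hypothetical; genome fragments are assumed to be the union of overlapping peaks across the query SRXs; and the per-fragment count is assumed to be the sum of the integer-converted values of overlapping bedGraph rows. The text does not specify either rule in this detail. The resulting m × n matrix is the kind of input that would then be passed to edgeR.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[str, int, int]            # (chrom, start, end), half-open
BedGraphRow = Tuple[str, int, int, float]  # (chrom, start, end, RPM value)


def rpm_to_count(rpm: float, total_mapped_reads: int) -> int:
    """Scale a reads-per-million value back to an integer read count."""
    return round(rpm * total_mapped_reads / 1_000_000)


def merge_peaks(peak_sets: Sequence[Sequence[Interval]]) -> List[Interval]:
    """Assumed fragmentation rule: merge overlapping peaks from all samples."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(p for peaks in peak_sets for p in peaks):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def count_matrix(
    fragments: Sequence[Interval],
    samples: Sequence[Tuple[Sequence[BedGraphRow], int]],
) -> List[List[int]]:
    """Build the m x n matrix: m fragments (rows) by n samples (columns).

    Each sample is a pair (bedGraph rows, total mapped reads).
    """
    matrix: List[List[int]] = []
    for chrom, start, end in fragments:
        row = []
        for bedgraph_rows, total_mapped in samples:
            # Assumed aggregation: sum converted values of overlapping rows.
            row.append(sum(
                rpm_to_count(rpm, total_mapped)
                for c, s, e, rpm in bedgraph_rows
                if c == chrom and s < end and e > start
            ))
        matrix.append(row)
    return matrix
```

Keeping the conversion to integers separate from the aggregation makes it easy to swap in a different counting rule without touching the matrix assembly.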

Overview of the ChIP-Atlas 3.0 update
The ChIP-Atlas project aims to collect and analyze ChIP-seq, ATAC-seq, DNase-seq and whole-genome Bisulfite-seq data, originally archived in the NCBI SRA, along with associated sample metadata that have been manually curated by experts. Since the initial public release of ChIP-Atlas in 2015, the volume of data has steadily increased at a rate of approximately 3000 entries per month, in line with the monthly updates to the NCBI SRA (Supplementary Figure S1). The number of SRXs in ChIP-Atlas 3.0 exceeds 376 000 for six representative model organisms (ChIP-seq, n = 228 495; ATAC-seq, n = 84 615; DNase-seq, n = 6386; Bisulfite-seq, n = 56 668), which corresponds to 83.5% of the total number of SRXs using these sequencing technologies in NCBI SRA for all organisms. The unified processing pipeline identified over 11 billion genomic intervals (protein binding sites for ChIP-seq: n = 2 090 307 752; accessible genomic regions for ATAC-seq and DNase-seq: n = 1 297 948 075; hyper-, hypo- and partially methylated regions for Bisulfite-seq: n = 7 885 144 497) (Table 2), which showed an approximately 30% increase compared with ChIP-Atlas 2.0 (2).
Another notable highlight in ChIP-Atlas 3.0 is the addition of a new section in the Peak Browser tool termed 'annotation tracks', which consolidates features such as chromosomal conformation, transcriptional regulatory elements, and disease- or phenotype-associated genome variants. The annotation tracks can be visualized in the IGV genome browser (23) upon user request to the ChIP-Atlas web server, together with other Peak Browser track types. This is expected to help users understand the whole picture of molecular profiles, including the regulome, epigenome, and transcriptome, along with 3D chromatin architecture and DNA polymorphisms. In addition, we implemented the Diff Analysis tool, which allows users to identify statistically significant DBRs, DARs, and DMRs that have the potential to define transcriptional and phenotypic diversity in cells and tissues from two sets of ChIP-seq, ATAC-seq and DNase-seq, and Bisulfite-seq data, respectively. These functional enhancements in ChIP-Atlas 3.0 establish it as an increasingly comprehensive platform for a panoramic view of the cell fate determination process.

Annotation tracks
Annotation tracks in ChIP-Atlas 3.0 can be graphically displayed with the IGV genome browser along with regulome data, including protein-genome interactions (ChIP-seq), chromatin accessibility (ATAC-seq and DNase-seq) and DNA methylation levels (Bisulfite-seq) within the queried genomic region of interest. To implement the annotation tracks, we collected and organized (a) chromosomal conformations such as ENCODE Hi-C and GTEx eQTL datasets (4,5), (b) transcriptional regulatory elements such as cell-specific promoters, enhancers, and heterochromatins from the ChromHMM and FANTOM5 projects, along with JASPAR TF-binding motifs (6-8), (c) disease- and phenotype-genome associations based on GWAS SNPs and ClinVar variants (9,10), (d) gene-disease/phenotype associations from Orphanet and MGI Phenotype (11,12), (e) conserved interspecies genetic sequences from PhastCons (13), (f) repeated sequences from RepeatMasker (14) and (g) RNA-seq-based transcriptome from the ENCODE, GTEx, FlyAtlas2 and modENCODE consortia (Table 1, Supplementary Table S1) (4,5,15,16). These features were then grouped into an 'annotation tracks' item arranged in the 'Track type class' option within the web interface of the ChIP-Atlas Peak Browser tool, which externally controls the IGV (23) preinstalled on the user's machine (tested on Mac, Windows, and Linux platforms). Users are recommended to run IGV on their own computer with at least 4 GB of RAM and an Internet speed of 100 Mbps.
Here, we show an example of exploring annotation tracks together with the regulome data using Peak Browser. First, ChIP-seq (TFs and others) data of human (hg38) blood cells were specified in the query page of Peak Browser (Supplementary Figure S2A, B), by which the corresponding track was automatically streamed into IGV (Figure 1). Other experimental data, such as ChIP-seq (Histone: H3K27ac), ATAC-seq, and Bisulfite-seq, as well as annotation tracks (ChromHMM, eQTL, and GWAS SNPs) data of the same cell type class, were subsequently loaded using IGV in the same manner as the ChIP-seq (TFs and others) data. Individual alignment data from ChIP-seq, ATAC-seq, and Bisulfite-seq experiments were also shown in single views. The genomic region in the vicinity of PELATON (also known as SMIM25), a long noncoding gene reported as a biomarker and potentially involved in the pathogenesis of inflammatory bowel disease (IBD), is shown in Figure 1 (24,25). The eQTL data indicate that PELATON transcription is primarily influenced by genetic variations located approximately 70 kb downstream of the PELATON locus (highlighted). The co-localization of strong enhancers (ChromHMM) in conjunction with ATAC-seq, Bisulfite-seq, and H3K27ac peaks indicates that the region is protein accessible, hypo-methylated and with an 'active' histone modification in blood cells. In addition, the binding of TFs such as SPI1 and JUND, which contribute to hematopoietic differentiation and the release of pro-inflammatory signals, also suggests inflammatory activity within the genomic region. The presence of SNPs associated with IBD and monocyte count in this particular region is also noteworthy. Together with the above observations, we can make a reasonable inference that these SNPs may alter the chromatin landscape and disrupt normal TF binding in this region, thereby leading to a loss of control in PELATON transcription and resulting in the phenotypic manifestation of IBD. Examples of browsing other annotation tracks (Hi-C, RNA-seq, FANTOM5 enhancers, and JASPAR TF motifs) are shown in Supplementary Figure S3. The IGV sessions are provided in Supplementary Materials S1 and S2.

Diff Analysis
The Diff Analysis tool was implemented in ChIP-Atlas 3.0 as a brand-new online function offering both a graphical user interface (GUI) and application programming interface (API) to detect statistical differences between two groups of sequencing data upon query of SRX or GEO ID(s). Guidance on identifying IDs of interest to users is provided in the 'Tips: IDs of experiments' section of the tutorial at https://chip-atlas.dbcls.jp/data/manual/Diff_Analysis/Diff_Analysis.pdf. The calculation algorithm for DBRs (ChIP-seq) and DARs (ATAC-seq and DNase-seq) is inspired by the R package 'DiffBind' (20), while the detection of DMRs (Bisulfite-seq) is supported by the preexisting 'metilene' tool (22) (see Materials and methods section for details). Calculation results are returned in BED format containing coordinates of genome regions with statistics such as Q-values.
For example, after submitting the ATAC-seq experiment IDs under the experiment series SRA1075867 (26) in mouse embryonic stem cells (SRX8347024 and SRX8347025) and myoblasts (SRX8347026, SRX8347027, SRX8347028 and SRX8347029), the Diff Analysis tool returned a clickable HTML link to load the DAR data into IGV, along with a ZIP file consisting of a plain BED file (.bed) for further analysis, a BED9 + GFF3 format file (.igv.bed) for visualization using IGV, event logs (.log), and an IGV session XML file (.igv.xml) containing alignment data for queried SRXs and DARs (Supplementary Figure S2C, D). By loading the XML session file into IGV, DARs were clearly shown around the gene loci of Pou5f1 (orange) and Myod1 (blue), which are required for pluripotency and myogenic differentiation, respectively (Figure 2A, Supplementary Materials S3). Meanwhile, no DARs were detected around the housekeeping Gapdh locus. In addition, we show an example of using Diff Analysis to detect DMRs between Bisulfite-seq data (SRA960814) (27) from human brain (SRX6831786, SRX6831787, SRX6831788, SRX6831789, SRX6831790 and SRX6831791) and T cells (SRX6831796, SRX6831797, SRX6831798 and SRX6831799). As a result, significant DMRs were detected in the vicinity of the transcription start sites for a subset of isoforms of MAP2 (orange) and CD4 (blue), encoding neuron-specific cytoskeletal proteins and T lymphocyte-specific surface glycoproteins, respectively (Figure 2B, Supplementary Materials S4). Meanwhile, no DMRs were detected around the housekeeping GAPDH locus. These cases suggest that Diff Analysis is capable of identifying key differences in the epigenomic landscape that define transcriptional and phenotypic diversity in cells.
In addition to the pre-analyzed data, Diff Analysis can accommodate user-generated data stored on custom web servers that are publicly accessible. The procedure for analyzing such data is outlined in the 'Diff Analysis' section of the documentation and the tutorial PDF (https://chip-atlas.dbcls.jp/data/manual/Diff_Analysis/Diff_Analysis.pdf). In brief, users must make a query by inputting the URLs of raw read coverage (bigWig format) and peak-call (BED format) data together with the total number of mapped reads for ChIP-, ATAC-, and DNase-seq, or the URLs of bigWig data of methylation rate (between 0 and 1) for Bisulfite-seq.
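Before submitting user-generated data, it can help to check locally that each sample provides the inputs listed above. The Python sketch below is only an illustration: `PeakBasedSample`, `is_remote_url` and `methylation_rates_valid` are hypothetical names, none of them part of the ChIP-Atlas API, and the accepted URL schemes (http, https, ftp) are an assumption, since the text only requires that the files be publicly accessible.

```python
from dataclasses import dataclass
from typing import Iterable, List
from urllib.parse import urlparse


def is_remote_url(url: str) -> bool:
    """True if the string parses as an http(s) or ftp URL with a host name."""
    parts = urlparse(url)
    return parts.scheme in {"http", "https", "ftp"} and bool(parts.netloc)


@dataclass(frozen=True)
class PeakBasedSample:
    """Hypothetical container for one ChIP-, ATAC- or DNase-seq sample."""
    coverage_bigwig_url: str   # raw read coverage, bigWig format
    peak_bed_url: str          # peak-call data, BED format
    total_mapped_reads: int    # total number of mapped reads

    def problems(self) -> List[str]:
        """Return a list of human-readable issues; empty if none were found."""
        issues = []
        if not is_remote_url(self.coverage_bigwig_url):
            issues.append("coverage bigWig is not a remote URL")
        if not is_remote_url(self.peak_bed_url):
            issues.append("peak BED is not a remote URL")
        if self.total_mapped_reads <= 0:
            issues.append("total mapped reads must be positive")
        return issues


def methylation_rates_valid(rates: Iterable[float]) -> bool:
    """Bisulfite-seq bigWig values are expected to lie between 0 and 1."""
    return all(0.0 <= rate <= 1.0 for rate in rates)
```

These checks only cover the form of the inputs; whether a URL is actually reachable from the ChIP-Atlas server still has to be confirmed separately.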

Discussion
In addition to an extensive increase in the number of SRXs, this paper presents a significant ChIP-Atlas update, specifically centering on offering annotation tracks and a Diff Analysis tool for users to attain a comprehensive understanding of the chromatin architecture associated with transcriptional regulatory mechanisms that have the potential to impact cell fate determination.
As with ChIP-Atlas, a number of other similar web services have also been made available, of which Cistrome DB (https://db3.cistrome.org/browser/) (28), ReMap (https://remap2022.univ-amu.fr/) (29), and GTRD (https://gtrd.biouml.org/) (30) offer an array of pre-analyzed ChIP-seq and chromatin accessibility data sets, numbering in the tens of thousands, while MethBank (https://ngdc.cncb.ac.cn/methbank/) (31) is known for assembling data from hundreds of methylome analysis projects (Table 3). The amount of experimental data in ChIP-Atlas surpasses that of all other services. Quality control filtering is not performed on data in the ChIP-Atlas project. Instead, expert-curated sample metadata is furnished for each SRX, enabling users to independently assess the robustness of their selected SRX if necessary. ChIP-Atlas only covers ChIP-seq, ATAC-seq, DNase-seq and whole-genome Bisulfite-seq data sets in six organisms, while certain additional experimental methods, such as ChIP-exo, MNase-seq, and FAIRE-seq, and organisms such as plants, are still to be supported. Differences between SRXs can be detected using ChIP-Atlas, which provides both a user-friendly GUI and programmable API for batch processing (refer to the ChIP-Atlas documentation). In contrast, MethBank only offers a DMR analysis tool that needs to be run locally in a command-line interface. The ChIP-Atlas website is free and open to all users without requiring login credentials. All data and analysis tools provided by ChIP-Atlas are available for unrestricted use in non-commercial and commercial contexts, given that proper citation is provided.
Since its public release, ChIP-Atlas has been utilized in diverse research areas, including genetics, etiology, developmental biology, and drug discovery, and has been cited in over 700 publications (see https://chip-atlas.org/publications for the full publication list). The updated ChIP-Atlas 3.0 is expected to provide valuable insights into, for instance, genetic disease research by incorporating chromosome architecture data, thus providing a comprehensive view of transcriptional regulatory mechanisms. Although numerous susceptibility SNPs for inherited diseases have been identified through GWAS, how these SNPs alter gene expression and contribute to disease development is not completely understood, as most of them are found in non-coding regions. To overcome this challenge, we previously conducted an enrichment analysis using large-scale ChIP-seq experimental data in ChIP-Atlas and were able to successfully identify TFs that exhibit enriched binding to SNPs associated with atrial fibrillation (32). By further utilizing the Hi-C and eQTL tracks from ChIP-Atlas 3.0, a systematic elucidation of a cascade of events should be possible, in which the presence of disease-associated SNPs induces anomalous TF binding, consequently resulting in gene expression abnormalities within certain chromosomal conformations. Apart from genetic disease research, we also analyzed TF binding that was enriched in DARs induced by chemical exposure to identify pivotal TFs involved in chemical action modes (33). Because the detection of DARs in the approach proposed in that study requires the use of external command-line-interface tools, the methodology may be challenging to implement for those without expertise in genome informatics. With the Diff Analysis tool implemented in ChIP-Atlas 3.0, however, the entire pipeline becomes readily accessible on the ChIP-Atlas website, thereby significantly contributing to drug discovery research.
The amount of experimental data available on ChIP-Atlas is steadily increasing with consistent monthly updates and expert curation, and the annotation data are to be updated periodically. Our future plan is to further expand ChIP-Atlas with additional experiment types, like CUT&Tag (34) and ChIL-seq (35), and organisms, including fish, plants and non-human primates. Furthermore, to address spatio-temporal gene expression in cells of multicellular organisms, the integration of data from spatial epigenetics technologies is also under active consideration.

Figure 1. Example for browsing annotation tracks in ChIP-Atlas 3.0. Peak-call data for TF and histone ChIP-seq (ChIP-Atlas 1.0), ATAC-seq and Bisulfite-seq experiments (ChIP-Atlas 2.0), along with ChromHMM, eQTL and GWAS SNPs tracks (ChIP-Atlas 3.0) in human (hg38) blood around the PELATON locus are shown in the IGV genome browser. Panels a, c, e and g show single views of individual alignment data from ChIP-seq, ATAC-seq and Bisulfite-seq experiments (a, IKZF1 ChIP-seq in GM12878 [SRX2424550]; c, H3K27ac ChIP-seq in lymphoblastoid cell line [SRX4288387]; e, ATAC-seq in CD8+ T cells [SRX16731555]; g, methylation level in B cells [SRX9500396]) (see the tutorial PDF of Peak Browser [https://chip-atlas.dbcls.jp/data/manual/Peak_Browser/Peak_Browser.pdf] for more information on loading individual data for single views). Panels b, d, f and h show integrative views of TF (b) and histone (d) ChIP-seq peaks, ATAC-seq peaks (f) and hypo-, hyper- and partially methylated regions (h) in the cells categorized as 'blood' cell type class. The highlighted region (in gray) indicates an accessible chromatin and hypomethylated region. Bars in panels b, d and f represent the peak regions, the color of which indicates MACS2 scores (-10 × log10[Q-value]); i.e. if MACS2 scores are 50, 500 or over 1000, then the colors are blue, green or red, respectively. Panel h is shown in squished mode, and black, pink and beige bars indicate hyper-, hypo- and partially methylated regions, respectively. Colored bars in panel i indicate various chromatin states. Arrows from short bars to long bars in panel j indicate the effect of gene polymorphisms (short bars) on gene expression (long bars). Bars in panel k represent SNPs associated with diseases, phenotypes, measurements and drug responses. See Supplementary Figure S2A and B for details on the procedures for visualizing these tracks.
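The MACS2 score used for peak coloring in the caption above is defined as -10 × log10(Q-value), so the quoted scores can be translated back into Q-values. The short Python sketch below works through that conversion; the function names are illustrative only.

```python
import math


def macs2_score(q_value: float) -> float:
    """MACS2 score as defined in the caption: -10 * log10(Q-value)."""
    return -10.0 * math.log10(q_value)


def q_value_from_score(score: float) -> float:
    """Invert the definition: Q-value = 10 ** (-score / 10)."""
    return 10.0 ** (-score / 10.0)
```

Under this definition, a score of 50 corresponds to a Q-value of about 1e-5, a score of 500 to about 1e-50, and a score of 1000 to about 1e-100.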

Table 1. Available annotation tracks for each genome assembly

Nucleic Acids Research, 2024, Vol. 52, Web Server issue

Table 3. Comparison of ChIP-Atlas with other similar services