HiChIPdb: a comprehensive database of HiChIP regulatory interactions

ABSTRACT
Elucidating the role of 3D architecture of DNA in gene regulation is crucial for understanding cell differentiation, tissue homeostasis and disease development. Among various chromatin conformation capture methods, HiChIP has received increasing attention for its significant improvement over other methods in profiling of regulatory (e.g. H3K27ac) and structural (e.g. cohesin) interactions. To facilitate the studies of 3D regulatory interactions, we developed a HiChIP interactions database, HiChIPdb (http://health.tsinghua.edu.cn/hichipdb/). The current version of HiChIPdb contains ∼262M annotated HiChIP interactions from 200 high-throughput HiChIP samples across 108 cell types. The functionalities of HiChIPdb include: (i) standardized categorization of HiChIP interactions in a hierarchical structure based on organ, tissue and cell line and (ii) comprehensive annotations of HiChIP interactions with regulatory genes and GWAS Catalog SNPs. To the best of our knowledge, HiChIPdb is the first comprehensive database that utilizes a unified pipeline to map the functional interactions across diverse cell types and tissues at different resolutions. We believe this database has the potential to advance cutting-edge research in regulatory mechanisms in development and disease by removing the barrier in data aggregation, preprocessing, and analysis.


INTRODUCTION
3D genome architecture is central to complex gene regulation but challenging to interrogate (1)(2)(3)(4). Rapid advances in next-generation sequencing technologies have provided an unprecedented opportunity to investigate 3D genome interactions on a genome-wide scale. Among various chromatin conformation capture (3C)-based methods, HiChIP (5) has received increasing attention for its significant improvement over ChIA-PET (6) and Hi-C (7) in direct profiling of regulatory (e.g. H3K27ac) and structural (e.g. cohesin) interactions (8). This protein-mediated high-throughput sequencing technique generates high-quality and high-resolution contact maps of genome spatial organization while lowering cellular input requirements. Recent years have witnessed a surge of HiChIP assays across diverse cell lines that characterize the functional relevance of chromatin interactions and provide extraordinary insights into diseases arising from mutations affecting epigenetic modifiers, transcription factor binding loci, and architectural proteins (9). However, most publicly available HiChIP data remain underutilized, as little effort has been made to assemble a comprehensive collection of the HiChIP samples scattered across the literature. Existing 3D databases, such as ENCODE (10), 4D Nucleome (11) and others (12)(13)(14), feature mostly Hi-C data and fail to incorporate HiChIP assays. The built-in Hi-C-specific preprocessing tools found in these databases are also not optimal for HiChIP data preprocessing. More importantly, most 3D databases focus heavily on interaction visualization, which is essential for Hi-C, but tend to overlook the functional implications of protein-directed HiChIP samples. Moreover, public HiChIP data are accumulating at an unprecedented rate: of the 247 human HiChIP datasets in the GEO repository, 85 (34.4%) were generated in the past 12 months.
The rapid growth of HiChIP data indicates the potential for a broad research community to leverage these 3D functional interactions to study regulatory mechanisms. Therefore, to facilitate the studies of 3D regulatory interactions and promote comprehensive analysis of HiChIP data, we developed HiChIPdb. In total, HiChIPdb currently contains ∼262M annotated HiChIP interactions from 200 high-throughput HiChIP samples across 108 cell types at multiple resolutions. Under 5 kb resolution, 21 422 946 loops spanning 496 763 anchors were observed among all samples (Figure 1). All HiChIP samples are preprocessed using a unified framework including FastQC basic read quality check, HiC-Pro alignment, quality control and replicate merge (15). Due to the low read-pair density in HiChIP data, multiple resolutions at 1/5/10/50 kb are offered through the computational tool FitHiChIP (16), and variable-length resolution through hichipper (17). Each HiChIP sample in HiChIPdb is annotated hierarchically with extensive details such as cell line, tissue and organ. Various statistical measures, including the length distribution of HiChIP anchors, the length distribution of HiChIP loops, and the density distribution of HiChIP interactions in different chromosomes, are provided on the Browse page for cross-referencing. Unlike Hi-C and other conventional chromatin conformation strategies, HiChIP signals indicate protein-centric in situ chromatin loops that carry significant functional implications. In addition to the above-mentioned visualization tools, HiChIPdb has a strong emphasis on functional annotations of regulatory genes and GWAS Catalog SNPs overlapping with HiChIP anchors. Annotated genes are listed on the Detail page of the respective interactions with links to external databases (e.g. GeneCards (18), UniProt (19) and NCBI (20,21)). Similarly, annotated SNPs are linked to dbSNP (22).
Other functionalities in HiChIPdb include advanced searching, hierarchical browsing, interactive visualization with custom tracks and data downloading with different options. A detailed preprocessing and statistical analysis of a HiChIP sample is presented as an exemplary usage of the database on the Tutorial page.
Overall, HiChIPdb presents a comprehensive compilation of published HiChIP samples with reliable annotations and user-friendly interfaces. In addition to its collection of informative data, HiChIPdb is developed to ultimately benefit prospective downstream applications. Before the development of this database, several studies had already started to leverage protein-centric HiChIP data to computationally model genomic architecture (23)(24)(25). Despite this, systematic efforts to fully leverage HiChIP data across numerous samples remain lacking. HiChIPdb aims to address this challenge and offers a promising avenue for researchers to investigate genome-wide chromatin interactions across different cell types in the context of various proteins of interest. Many advantages of HiChIPdb, such as a stringent preprocessing pipeline with multiple resolution options and extensive annotations of epigenomic profiles and GWAS Catalog SNPs (26), are designed to facilitate integrative analysis and/or modeling of various properties of 3D chromatin structure. We anticipate that this database will empower future computational approaches and deepen our understanding of gene regulatory networks and disease mechanisms.

Data collection
To collect HiChIP datasets, we followed a series of standardized procedures used by biological databases for consistent and reliable data collection. First, we searched the keyword 'HiChIP' in the Gene Expression Omnibus (GEO) database (20,21) (https://www.ncbi.nlm.nih.gov/geo/) and retrieved 989 results (as of May 2022), with each result corresponding to a candidate HiChIP dataset. Second, these candidate HiChIP datasets were filtered on the availability of raw SRA sequencing data, reducing the total number of datasets to 247. A dataset was considered valid if it provided at least one sample of a specific human tissue or cell line. Each dataset contains general information such as cell type, treatment, ChIP type and GEO accession. We manually extracted the tissue and organ information based on cell type to construct a hierarchical tree for the curated database (Figure 1).

Data processing and annotation
To make the HiChIP data comparable across different tissues and cell types, we propose a unified data processing framework that is applied to every HiChIP dataset. We processed each HiChIP dataset from the SRA raw sequencing data (e.g. FASTQ) and produced HiChIP interactions at various resolutions with detailed annotations through the unified data processing pipeline (Figure 1). Specifically, we first applied the HiC-Pro (15) software to process raw FASTQ files (paired-end Illumina data), including read mapping, detection of valid ligation products, quality control and generation of intra/inter-chromosomal contact maps. We chose GRCh37/hg19 as the reference genome for Homo sapiens, and applied FitHiChIP (27) and hichipper (17) for HiChIP loop calling. Note that FitHiChIP enables loop calling at specific resolutions (1, 5, 10 and 50 kb), which results in a fixed loop size across the whole genome. In contrast, hichipper produces variable-length loops, which makes it difficult to compare across cell types. HiChIP samples with fewer than 100 loops were filtered out for quality control. The current release (as of May 2022) of HiChIPdb contains 200 HiChIP samples from 64 GEO repositories. After standardizing the names of cell types against the standard list from ENCODE (28) and removing ill-formed names, we further categorized each HiChIP dataset into its respective cell type, tissue and organ to form a hierarchical structure.
Finally, to further facilitate the study of 3D genome regulatory mechanisms, each loop was annotated with a rich set of features, including the nearest genes to each loop anchor, the Genome-Wide Association Studies (GWAS) Catalog SNPs (26) within the loop anchors, the raw/normalized read count within the loop anchors and the P-value of the loop. Each loop was visualized using the IGV interactive tool (29), and the reference genes within the loop range were annotated on the illustration.
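The fixed-resolution idea and the loop-count quality filter described above can be sketched as follows. This is a minimal illustration and not the pipeline code: the function names, the sample identifiers and the dictionary layout are hypothetical, and the assumption that a fixed resolution corresponds to fixed-width genomic bins is ours.

```python
MIN_LOOPS = 100  # threshold from the text: samples with fewer loops are filtered out


def bin_anchor(position, resolution):
    """Map a genomic position to the fixed-width bin that contains it.

    Assumes bins are half-open intervals [start, start + resolution).
    """
    start = (position // resolution) * resolution
    return start, start + resolution


def filter_samples(loop_counts, min_loops=MIN_LOOPS):
    """Keep only the samples whose number of called loops is at least min_loops."""
    return {sample: n for sample, n in loop_counts.items() if n >= min_loops}


# Hypothetical sample names and loop counts, for illustration only.
example_counts = {"sample_A": 15230, "sample_B": 42, "sample_C": 100}
kept = filter_samples(example_counts)

print(bin_anchor(12_345, 5_000))   # bin at 5 kb resolution
print(sorted(kept))                # samples passing the filter
```

With these made-up counts, only `sample_B` is discarded, because it has fewer than 100 loops.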
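The anchor-level annotation can be illustrated with a short sketch. It is only one plausible reading of the description: the half-open coordinate convention, the definition of "nearest" as the smallest gap between intervals, and all gene names and positions are assumptions made for this example, and everything is assumed to lie on a single chromosome.

```python
from bisect import bisect_left


def snps_in_anchor(anchor_start, anchor_end, sorted_snp_positions):
    """Return SNP positions p with anchor_start <= p < anchor_end.

    sorted_snp_positions must be sorted in ascending order.
    """
    lo = bisect_left(sorted_snp_positions, anchor_start)
    hi = bisect_left(sorted_snp_positions, anchor_end)
    return sorted_snp_positions[lo:hi]


def nearest_gene(anchor_start, anchor_end, genes):
    """Return the name of the gene closest to the anchor.

    genes is a list of (name, start, end) tuples; a gene that overlaps
    the anchor has distance 0.
    """
    def distance(gene):
        _, gene_start, gene_end = gene
        if gene_end <= anchor_start:
            return anchor_start - gene_end
        if gene_start >= anchor_end:
            return gene_start - anchor_end
        return 0

    return min(genes, key=distance)[0]


# Hypothetical positions and gene names, for illustration only.
snp_positions = [9_000, 10_000, 12_500, 15_000, 20_000]
genes = [("GENE_A", 1_000, 5_000), ("GENE_B", 16_000, 30_000)]

print(snps_in_anchor(10_000, 15_000, snp_positions))
print(nearest_gene(10_000, 15_000, genes))
```

The binary search keeps each SNP lookup logarithmic in the number of SNPs, which matters when many anchors are annotated against a large SNP catalog.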

Database statistics
The current release of HiChIPdb contains 200 high-throughput HiChIP samples across 108 cell types, 19 organs and 28 tissues. Under 5 kb resolution, 21 422 946 loops spanning 496 763 anchors were observed among all HiChIP samples. Among the 12 different ChIP-associated proteins, H3K27ac is the dominant one, covering 17 612 271 (82.21%) of all HiChIP chromatin loops, as H3K27ac HiChIP directly identifies high-confidence functional loops focused around enhancer interactions (5). There are also 355 968 HiChIP loops (1.66%) associated with the cohesin protein, which reveal multi-scale genome architecture (5). The total number of HiChIP loops was approximately 6.5M, 29.5M and 28.4M when we set the resolution to 1, 10 and 50 kb, respectively. We also note that the number of HiChIP loops from hichipper is about 176M, which is much larger than that from the FitHiChIP-based pipeline. The H3K27ac-associated HiChIP loops are mainly from blood (31.58%), brain (17.01%) and skin (11.51%) tissues.
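The reported percentages can be checked directly from the loop counts. The short calculation below also adds up the per-resolution totals; treating the ∼262M figure as the sum over all FitHiChIP resolutions plus the hichipper loops is our assumption, since the text does not state how that total was obtained.

```python
# Loop counts at 5 kb resolution, as reported in the text.
total_5kb = 21_422_946
h3k27ac_loops = 17_612_271
cohesin_loops = 355_968

pct_h3k27ac = 100 * h3k27ac_loops / total_5kb
pct_cohesin = 100 * cohesin_loops / total_5kb

# Approximate totals in millions of loops: 1, 5, 10 and 50 kb, then hichipper.
totals_millions = [6.5, total_5kb / 1e6, 29.5, 28.4, 176.0]
grand_total_millions = sum(totals_millions)

print(f"H3K27ac share: {pct_h3k27ac:.2f}%")
print(f"cohesin share: {pct_cohesin:.2f}%")
print(f"sum over all pipelines: about {grand_total_millions:.0f}M loops")
```

Under that assumption, the rounded sum agrees with the ∼262M interactions quoted for the whole database.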

User-friendly browsing
We built an interactive web interface for researchers to explore the well-organized HiChIP data across diverse cellular contexts. An interactive image of human anatomy displayed on the Home page enables direct access to HiChIP loops related to a specific human organ (Figure 2A). Clicking one of the organ icons opens a pop-up window listing the relevant cell lines. Alternatively, users can explore the HiChIP datasets in a tree-based hierarchical order (ChIP-Organ-Tissue-Cell type) through the Browse page (Figure 2B). Users can select a specific ChIP (e.g. H3K27ac) in the tree structure in the left panel; by default, loops from the unified pipeline at 5 kb resolution are shown. Once selected, a set of comprehensive statistics is illustrated through pie plots or bar plots, which include: (i) distribution of the number of loops across organs; (ii) distribution of the number of loops across tissues; (iii) distribution of the number of loops across cell types; (iv) distribution of the number of loops across chromosomes; (v) distribution of loop length (within 2 Mb). In addition, an interactive table is generated, where each row represents a HiChIP loop and the columns give detailed information including anchor location, the gene nearest to each anchor, SNPs within the anchors, raw/normalized read count within the anchors, cell type, tissue, organ, and the GEO accession number (Figure 2C). Clicking on the 'Detail' button generates a visualization for each loop that includes the reference genes within the loop region (Figure 2D).
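Two of the summary statistics listed above, the per-chromosome loop counts and the loop-length distribution within 2 Mb, can be sketched in a few lines. The record layout, the field names and the definition of loop length as the distance between the two anchor start coordinates are assumptions made for this illustration; the loops are made up.

```python
from collections import Counter

MAX_LENGTH = 2_000_000  # only loops within 2 Mb enter the length distribution


def loop_length(loop):
    """Distance between the start coordinates of the two anchors."""
    return abs(loop["start2"] - loop["start1"])


def summarize(loops, bin_width=100_000):
    """Return (loops per chromosome, histogram of loop lengths).

    The histogram is keyed by the lower edge of each length bin.
    """
    per_chromosome = Counter(loop["chrom"] for loop in loops)
    length_histogram = Counter(
        (loop_length(loop) // bin_width) * bin_width
        for loop in loops
        if loop_length(loop) <= MAX_LENGTH
    )
    return per_chromosome, length_histogram


# Hypothetical loops, for illustration only.
example_loops = [
    {"chrom": "chr1", "start1": 10_000, "start2": 30_000},
    {"chrom": "chr1", "start1": 100_000, "start2": 350_000},
    {"chrom": "chr2", "start1": 5_000, "start2": 3_005_000},
]
per_chromosome, length_histogram = summarize(example_loops)
print(dict(per_chromosome))
print(dict(length_histogram))
```

The third example loop is counted for its chromosome but is left out of the length histogram because it spans more than 2 Mb.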

Advanced searching
On the Search page, various options in the drop-down menus allow for searching HiChIP loops that meet specific conditions (Figure 2C). Users can choose different conditions including pipeline type, ChIP type, organs, tissues and cell types. Note that multi-select and select-all are enabled for organs, tissues and cell types. An example button is provided for users to quickly select a combination of filtering conditions. Users can simply click the submit button to get the filtered HiChIP loops in the result table. The filtered result table can also be downloaded.
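The search semantics described above, single-choice filters for pipeline and ChIP type and multi-select filters for organ, tissue and cell type, can be mimicked on a downloaded loop table. The sketch below is not the server-side implementation; the field names, the pipeline labels and the tissue values are hypothetical.

```python
def search_loops(loops, pipeline=None, chip=None,
                 organs=None, tissues=None, cell_types=None):
    """Return the loops that satisfy every filter that is set.

    pipeline and chip are single values; organs, tissues and cell_types
    are sets of allowed values. A filter left as None is ignored, which
    plays the role of 'select all'.
    """
    def one_of(value, allowed):
        return allowed is None or value in allowed

    return [
        loop for loop in loops
        if (pipeline is None or loop["pipeline"] == pipeline)
        and (chip is None or loop["chip"] == chip)
        and one_of(loop["organ"], organs)
        and one_of(loop["tissue"], tissues)
        and one_of(loop["cell_type"], cell_types)
    ]


# Illustrative records only; labels and tissue values are made up.
loops = [
    {"pipeline": "FitHiChIP-5kb", "chip": "H3K27ac",
     "organ": "heart", "tissue": "heart", "cell_type": "HAEC"},
    {"pipeline": "FitHiChIP-5kb", "chip": "H3K27ac",
     "organ": "lung", "tissue": "lung", "cell_type": "PAEC"},
    {"pipeline": "hichipper", "chip": "H3K27ac",
     "organ": "lung", "tissue": "lung", "cell_type": "HARA"},
]

hits = search_loops(loops, pipeline="FitHiChIP-5kb", chip="H3K27ac",
                    organs={"lung", "heart"})
print([loop["cell_type"] for loop in hits])
```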

Detailed information
Detailed information on each HiChIP loop is available through the Browse or Search page, where each HiChIP loop is linked to the Detail page of the entry (Figure 2D). The Detail page presents various information about the chosen HiChIP loop, including an interactive visualization, anchor information, nearby genes, and SNPs within the loop region. The interactive visualization shows the loop, the SNPs, the P-values of the SNPs and the RefSeq genes, with a zoom in/out option. Users can also specify multiple samples of interest for further comparison and investigation. The Detail page also provides overviews of the annotated nearby genes with gene symbols, chromosomes, and start and end sites. More information about the genes is available through external links, including NCBI (20,21), GeneCards (30) and UniProt (19). As genetic variants may affect the chromatin contact propensity and thus potentially change gene expression (31-33), we also provide external EMBL-EBI (34) and NCBI (20,21) links for all the SNPs that fall into the HiChIP loop anchors.

Data download and submission
Various options are available for users to download HiChIP data by ChIP, chromosome, organ, tissue, cell type and sample through the Download page (Figure 2E). An MD5 checksum is included for each downloadable file so that its integrity can be verified. For each download table with a specific option (e.g. ChIP), users can use the search box to quickly locate a file of interest by name. To make HiChIPdb scalable, a submission window on the About page allows users to add their newly generated HiChIP data to HiChIPdb by submitting the GEO accession number, a URL for data access, and the contributing authors' information (Figure 2F). After approval, the dataset is automatically deployed to the HiChIPdb web server through a standardized pipeline, and the processed data are integrated with the existing HiChIP datasets.
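A downloaded file can be checked against its published MD5 checksum with the Python standard library. The sketch below is a generic integrity check; the file name in the usage comment is hypothetical, and how the checksum is delivered (a separate file or a value on the Download page) is not specified here.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's MD5 digest matches the expected checksum.

    Surrounding whitespace and letter case in expected_md5 are ignored.
    """
    return md5_of_file(path) == expected_md5.strip().lower()


# Usage (hypothetical file name and checksum):
# ok = verify_download("H3K27ac_loops.tsv.gz", "<checksum from the Download page>")
```

Reading in chunks keeps memory use constant, which is useful for large loop tables.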

The uniqueness of HiChIPdb compared to other 3D genome databases
There are currently only four existing 3D genome databases: 4DN (11), 3D Genome Browser (3DGB) (12), HUGIN2 (not published) and 3DIV (14). To the best of our knowledge, HiChIPdb is the first comprehensive 3D genome database that utilizes a unified pipeline to provide functional interaction data across diverse cell types and tissues. The major differences between HiChIPdb and other 3D genome databases lie in the following three aspects (Table 1). First, HiChIPdb provides the most comprehensive resource of HiChIP functional loops, while all of the other databases except HUGIN2 ignore HiChIP data. HUGIN2, however, contains only 12 HiChIP datasets without any functional annotation. The other databases mainly focus on Hi-C data and do not provide detailed information on uniformly processed loops, loop anchors and loop functional annotations, which are highlighted features of HiChIPdb. As a resource for computational biologists, HiChIPdb provides various export options on the Search and Download pages for downstream analysis. Second, HiChIPdb is the only resource that uses a unified pipeline with multiple optional resolutions, thus making it possible to compare HiChIP functional loops across cell types. For systematic analysis of the 3D functional genome, the unified strategy is important for calculating the cell type specificity of functional loops and investigating the interactions of functional regulatory elements across the whole genome. Third, HiChIPdb provides various functional annotations (e.g. regulatory genes and GWAS Catalog SNPs), which can facilitate diverse research related to functional genome studies.

A tutorial of HiChIPdb
To make HiChIPdb easy to use, a comprehensive tutorial on how to analyze processed HiChIP data and how to apply them in downstream applications is provided on the Tutorial page. Useful statistical information on the loops is shown for an example dataset. For instance, for the 827 288 HiChIP loops at 5 kb resolution from the GM12878 cell line, the length of loops ranges from 20 kb to 2 Mb, and the median distance from the left and right loop anchors to the nearest gene is 39 457 bp and 39 830 bp, respectively. The GC content of both the left and the right loop anchors ranges from 0.02 to 0.74. The tutorial also demonstrates the effectiveness of HiChIPdb through the application of annotating GWAS risk genes by using HiChIP loops. Given the GWAS summary statistics data, all HiChIP anchors that contain at least one GWAS SNP are collected. Here, $(-\log_{10} P) \times \beta$ is defined as the weighted effect size of each variant, where β and P denote the effect size and P-value in the original GWAS study, respectively. The weighted effect sizes of all genetic variants that fall in regions interacting with the target risk gene are then summed to give the HiChIP annotated gene score (HAGS). The HAGS describes the accumulated effect of all GWAS SNPs with an implicated interaction with the risk gene. Unlike merely summing the effects of SNPs located near the target gene, the HAGS provides a more comprehensive measurement of the overall effect of genetic variants on a specific target gene.
A QT interval duration study (35) is taken as an illustrative example. The HAGS is calculated for all genes across different cell types using H3K27ac HiChIP interactions. Interestingly, two risk genes (NOS1AP and KCNH2) reported in (35)(36)(37) were ranked 1st and 12th in the HAEC heart cell line, respectively. In contrast, NOS1AP ranked 3rd while KCNH2 ranked 11 729th in the PAEC lung cell line, and NOS1AP and KCNH2 both ranked 4179th in the HARA lung cell line because no HiChIP interactions overlapped them. The mean and median rankings of these two genes in heart-irrelevant cell lines are further calculated as background rankings. NOS1AP has a mean rank of 2229th and a median rank of 470th in heart-irrelevant cell lines, and KCNH2 has a mean rank of 2093rd and a median rank of 280th. Such results demonstrate that the HAGS is a useful statistic for discovering potential disease-associated genes.
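The HAGS definition can be made concrete with a small sketch. It is not the tutorial's code: the data layout and identifiers are hypothetical, and two points the text leaves open are settled by assumption here, namely that a gene is linked to a loop through an anchor annotated with that gene, so that SNPs in the partner anchor contribute, and that each SNP is counted once per gene even if several loops reach it.

```python
import math
from collections import defaultdict


def weighted_effect(p_value, beta):
    """Weighted effect size of one variant: (-log10 P) * beta."""
    return -math.log10(p_value) * beta


def compute_hags(loops, anchor_genes, anchor_snps, gwas):
    """Sum the weighted effect sizes of SNPs in anchors that loop to each gene.

    loops:        iterable of (anchor_id, anchor_id) pairs
    anchor_genes: dict anchor_id -> set of gene symbols annotated to the anchor
    anchor_snps:  dict anchor_id -> set of SNP ids located in the anchor
    gwas:         dict SNP id -> (P-value, beta) from the summary statistics
    """
    snps_per_gene = defaultdict(set)
    for left, right in loops:
        # Consider both orientations: gene at one anchor, SNPs at the other.
        for gene_anchor, partner in ((left, right), (right, left)):
            for gene in anchor_genes.get(gene_anchor, ()):
                snps_per_gene[gene].update(anchor_snps.get(partner, ()))
    return {
        gene: sum(weighted_effect(*gwas[snp]) for snp in snps if snp in gwas)
        for gene, snps in snps_per_gene.items()
    }


# Made-up anchors, SNP ids and statistics, for illustration only.
loops = [("a1", "a2"), ("a1", "a3")]
anchor_genes = {"a1": {"GENE_X"}}
anchor_snps = {"a2": {"snp_1"}, "a3": {"snp_2"}}
gwas = {"snp_1": (1e-8, 0.5), "snp_2": (1e-3, 0.2)}

hags = compute_hags(loops, anchor_genes, anchor_snps, gwas)
print(hags)  # GENE_X scores approximately 8 * 0.5 + 3 * 0.2 = 4.6
```

Genes can then be ranked by this score within each cell type; whether signed or absolute effect sizes should be used for ranking is not specified in the text.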

SYSTEM DESIGN AND IMPLEMENTATION
The HiChIPdb website is developed and maintained on a Linux Apache web server (https://www.apache.org

CONCLUSION
The investigation of transcriptional regulation is one of the most influential areas of life science research. Thanks to advances in both experimental and computational techniques, HiChIPdb allows investigators to systematically interpret the mechanisms underlying regulatory interactions. Prior analytical methods have been largely restricted to studying the 3D genome using 3C-based technologies such as Hi-C (38)(39)(40), with less emphasis given to functional interactions such as enhancer-promoter interactions (41). Recent breakthroughs in 3D functional interaction research, such as HiChIP, bring these functional interactions into the spotlight (24,25,42), allowing investigators to obtain additional insight into the spatial and temporal dynamics of gene regulation that was not possible before. However, prior databases do not provide uniformly processed and fully annotated HiChIP functional 3D interactions across diverse samples.
Nucleic Acids Research, 2023, Vol. 51, Database issue D165
To fill the gap, we developed HiChIPdb, a comprehensive database that focuses on HiChIP regulatory interactions. HiChIPdb uses systematic data pre-processing procedures and has a user-friendly web platform including advanced searching, interactive visualization and convenient download. HiChIPdb can help biologists and data scientists achieve a better understanding of the role of functional interactions in gene regulatory mechanisms and empower them to construct more comprehensive gene regulatory networks. For example, since GWAS-identified risk variants in non-coding regions of the genome exert phenotypic effects through perturbation of functional 3D interactions, HiChIPdb has the potential to give insights into a more complete interpretation of GWAS risk variants and aid in developing new approaches for disease prevention and treatment (43)(44)(45).
To make HiChIPdb more comprehensive and useful, we plan to incorporate the following features in future releases. First, we plan to collect mouse samples and integrate them into HiChIPdb, which can supplement the human samples for a better understanding of regulatory mechanisms. Second, we plan to incorporate more comprehensive epigenomic annotations, such as different types of transcription factor binding, histone modification and chromatin accessibility annotations for each HiChIP sample. Third, to improve the usefulness of our database, we would like to build a web server for fast annotation of functional HiChIP loops that takes multiple user-specified genomic regions as input. Last but not least, to facilitate the collection process and expand our database, we would also like to incorporate a web-based tracking and data entry system that carries out a monthly GEO search for regular HiChIPdb updates.

DATA AVAILABILITY
HiChIPdb is publicly accessible to users worldwide without any registration or login. Users can freely access all data hosted in HiChIPdb at http://health.tsinghua.edu.cn/hichipdb.