hipFG: high-throughput harmonization and integration pipeline for functional genomics data

Abstract
Summary: Preparing functional genomic (FG) data with diverse assay types and file formats for integration into analysis workflows that interpret genome-wide association and other studies is a significant and time-consuming challenge. Here we introduce hipFG (Harmonization and Integration Pipeline for Functional Genomics), an automatically customized pipeline for efficient and scalable normalization of heterogeneous FG data collections into standardized, indexed, rapidly searchable analysis-ready datasets while accounting for FG datatypes (e.g. chromatin interactions, genomic intervals, quantitative trait loci).
Availability and implementation: hipFG is freely available at https://bitbucket.org/wanglab-upenn/hipFG. A Docker container is available at https://hub.docker.com/r/wanglab/hipfg.


Introduction
Genome-wide association studies (GWAS) have successfully identified genome-wide significant variants and genes of interest with respect to the traits and diseases under study (Uffelmann et al. 2021). However, fully understanding what these genetic signals mean biologically is still challenging. Post-GWAS analyses have benefited from recent advances in high-throughput sequencing technologies designed to characterize biology beyond the genome sequence, including epigenetic profiling, chromatin interactions, and gene-variant associations (Cano-Gamez and Trynka 2020). Post-GWAS analyses, including variant fine-mapping or functional annotation, are often powered by genomic data querying and analysis-ready tools applied to these functional genomics (FG) data. However, even within a single assay or datatype of these FG data, there may be differences in file formats, included meta-information, and normalization standards. For example, for expression quantitative trait loci (eQTLs), the Genotype-Tissue Expression (GTEx) project (Aguet et al. 2020) published eQTLs in a 9-column format, while the eQTL catalogue (Kerimov et al. 2021) published the same reprocessed GTEx data with 19 columns. Integrating FG datasets into any analysis requires preprocessing steps specific to their source(s). Complicating this integration further, many data formats are not compatible with standard genomic querying tools such as bedtools (Quinlan and Hall 2010), tabix (Li 2011), and Giggle (Layer et al. 2018), therefore precluding them from immediate analysis following download. Some examples of common incompatibilities include:
• variable format and amount of provided meta-information across datasets and data sources,
• unconventional field names, unusual delimiters, or missing field name information,
• the use of different chromosome notations, such as chromosome numbers or chrN chr-prefixed chromosome names,
• use of 1-based genomic coordinates instead of 0-based indexing in the UCSC Genome Browser BED style (Karolchik et al. 2003), or
• use of variable file formats, such as text-based TSV formats not amenable to genomic querying.
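Two of the incompatibilities above, chromosome naming and 1- versus 0-based coordinates, can be illustrated with a minimal Python sketch (the function name and conversion are ours, for illustration only, not part of hipFG):

```python
def normalize_record(chrom, start_1based, end):
    """Convert a 1-based, fully-closed coordinate record to the
    0-based, half-open BED convention with chr-prefixed names."""
    # Standardize chromosome notation: "17" -> "chr17"
    chrom = chrom if chrom.startswith("chr") else "chr" + chrom
    # A 1-based inclusive interval [start, end] maps to the
    # 0-based half-open interval [start-1, end).
    return chrom, start_1based - 1, end

# The 1-based interval 1001-2000 on chromosome "17" becomes a BED record:
print(normalize_record("17", 1001, 2000))  # -> ('chr17', 1000, 2000)
```

The same record would then be tab-delimited and written out in BED order, making it directly usable by tools such as bedtools and tabix.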
Although workflows have been established to harmonize some individual data types, such as GWAS summary statistics data (Lyon et al. 2021, Murphy et al. 2021), no workflow yet exists to dynamically check, harmonize, and integrate diverse FG datasets and their metadata (Supplementary Table S1). While public databases and resources such as GTEx (Consortium et al. 2020), eQTL catalogue (Kerimov et al. 2021), and QTLbase (Huang et al. 2023) are useful for providing functional annotations including QTLs, they do not allow users to extend the workflow and add other datatypes.
Here, we introduce the Harmonization and Integration Pipeline for Functional Genomics (hipFG), a robust and scalable pipeline for harmonizing FG datasets of diverse assay types and formats. hipFG can quickly integrate FG datasets for use with high-throughput analytical workflows, e.g. for analyzing current population-level studies such as UK Biobank (Sudlow et al. 2015) (500 000 individuals with >2500 phenotypes). In addition, hipFG can be used to normalize users' own data, allowing for integration of user-generated FG data into an analysis or workflow, a functionality not provided by public FG-based resources. Supplementary Figure S1 provides an example workflow for integrating hipFG with custom genomic analyses.
Materials and methods
hipFG includes datatype-specific pipelines to process diverse types of FG data. These FG datatypes are categorized into three groups: annotated genomic intervals, quantitative trait loci (QTLs), and chromatin interactions (Fig. 1A). User-provided data descriptions (Fig. 1B) guide hipFG in generating customized pipelines to harmonize input data and metadata. These pipelines include type-dependent and type-independent steps, checks, and corrections (Fig. 1C, Supplementary Fig. S2).
Type-independent steps are executed before and after the type-specific steps (Supplementary Fig. S2) and include genomic coordinate-based sorting, tabix-based indexing, metadata generation (Fig. 1D), and consolidation of project-level metadata (Section 2.1). Type-dependent steps are highly variable and are described in Section 2.3. Lastly, the output files are indexed for rapid query (Fig. 1E) and organized alongside associated metadata (Fig. 1F).

Metadata for input track dataset
Data descriptions are provided via a minimal descriptions table (MDT) (Supplementary Tables S2 and S3) and an input file configuration file ("file config") (Fig. 1B). The MDT includes biological (e.g. cell types, tissues), data source (e.g. DOI, project name), and file (e.g. data file path, paths to config files) information describing the inputs. The file configs (Fig. 1B, Supplementary Fig. S3) associate input columns with standard output fields and describe other attributes of the input data, e.g. the existence of column names and 0-/1-based indexing. While the MDT (Supplementary Table S3) describes all the input data files in a project, the file configs describe input-to-output reformatting for each data file type in the project.
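Conceptually, a file config captures information along the following lines (this sketch is illustrative only; the keys shown here are not hipFG's actual schema, for which see Supplementary Fig. S3):

```
# Hypothetical file config sketch (illustrative field names)
has_header: true          # input file includes a column-name row
coordinate_base: 1        # input coordinates are 1-based
column_map:               # input column name -> standard output field
  chromosome: chrom
  position: start
  pval: pvalue
```

Given such a description, hipFG can generate a pipeline that renames and reorders columns, shifts coordinates, and emits a standardized BED-style output.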
To capture meta-information about the normalized output tracks, hipFG generates a structured, templated metadata table (Fig. 1F, Supplementary Tables S4 and S5). After collecting initial metadata via the MDT, hipFG provides additional context to the provided input descriptions. Using biological annotations provided by the user according to open vocabularies derived from the FILER resource (Kuksa et al. 2022), hipFG assigns standardized "tissue" and "system" categories for each output file. In addition, file information is calculated, including the number of records/intervals, genomic base-pair coverage, and md5sum hash. Any additional columns provided in the MDT are included in the "track description" metadata column in key-value pair format. This standardized metadata provides biological and source context to harmonized input data and any downstream genomic queries.

Standard FG data processing steps
The standard, type-independent steps are executed by the generated scripts regardless of output format and input datatype (Supplementary Fig. S2). One such standard step is column rearrangement according to the output BED format, if necessary. In addition, 1-indexed input data are adjusted to 0-indexing, and chromosome name formats are standardized. Lastly, the formatted output is sorted based on genomic coordinates, compressed via bgzip, indexed via tabix (Li 2011), and indexed via Giggle (Layer et al. 2018) in its destination folder together with other tracks in the same category (Fig. 1E).
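The coordinate sort that precedes compression and indexing can be sketched as follows (a minimal illustration; hipFG's generated Bash/AWK/Python scripts are more involved):

```python
# Records as (chrom, start, end, name) tuples, in arbitrary input order.
records = [
    ("chr2", 500, 600, "peak3"),
    ("chr1", 1000, 1500, "peak2"),
    ("chr1", 100, 900, "peak1"),
]

# Sort by chromosome, then numerically by start and end. tabix requires
# records grouped by chromosome and sorted by start position within
# each chromosome before bgzip compression and indexing.
records.sort(key=lambda r: (r[0], r[1], r[2]))

print([r[3] for r in records])  # -> ['peak1', 'peak2', 'peak3']
```

After sorting, the file would be bgzip-compressed and passed to `tabix` and Giggle for index construction.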
Each output file is matched with hipFG-standardized metadata (Fig. 1F).
Standardized file outputs are saved within a folder hierarchy in the following order, for applicable categories: association significance levels [by default, false discovery rate (FDR) < 0.05 for QTLs], variant types [e.g. SNPs, insertions and deletions (INDELs)], assay types (e.g. ChIP-seq, RNA-seq, DNase-seq), output format types [e.g. bed19, bedInteract (Haeussler et al. 2019)], and genome builds (GRCh37/hg19 or GRCh38/hg38). This allows the outputs of projects with many datatypes and assays to be stored in an organized fashion as data collections, keeping similar datatypes together.
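The resulting hierarchy can be pictured as a path built from those attributes in order (the literal directory names below are hypothetical, not hipFG's exact layout):

```python
from pathlib import PurePosixPath

def output_dir(significance, variant_type, assay, out_format, build):
    # Order: significance level / variant type / assay type /
    # output format / genome build (illustrative naming only).
    return PurePosixPath(significance, variant_type, assay, out_format, build)

p = output_dir("FDR0.05", "SNP", "RNA-seq", "bed19", "GRCh38")
print(p)  # -> FDR0.05/SNP/RNA-seq/bed19/GRCh38
```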

Type-dependent steps
hipFG processes inputs according to datatype-specific custom pipelines, with steps determined via file configs.The diverse input types allowed by hipFG necessitate separate consideration as described below.

Annotated genomic intervals
Input files in BED format that include interval names, scores, and other fields, but do not indicate a secondary target (as QTLs or chromatin interactions do), require no additional steps beyond the standard processing steps. The steps described in Section 2.2 are sufficient to standardize these types of data, which include narrow and broad peaks for epigenetic histone marks (ChIP-seq) or open chromatin regions (DNase-seq, ATAC-seq).

QTLs
Expression, protein, or other types of QTL summary statistics data are highly variable in their format, names of provided fields, and which fields are included. QTL files therefore require additional processing steps (Supplementary Fig. S2C, QTL normalization). Following the type-independent steps (Section 2.2), allele correction and verification are carried out to normalize test statistics, allele frequencies, and variant IDs (to rsIDs when possible) (Supplementary Fig. S4). While variant normalization requires a local copy of a dbSNP-based reference (60 GB for GRCh38/hg38), this reference is required only for the QTL data type, to enable normalization of QTL alleles and effect statistics. In addition, QTL targets (e.g. genes, proteins) are annotated with GENCODE and UniProt when possible (Consortium 2018, Frankish et al. 2019), providing gene symbol, strand, and distance from the transcription start site to the associated QTL. Lastly, if FDR is not provided, Benjamini-Hochberg P-value correction is carried out per target gene-tissue pair. The final QTL results are split by variant types (e.g. SNPs and INDELs) and, if specified, significant QTLs (FDR < 0.05 by default) are extracted and saved separately from the full QTL summary statistics (Fig. 1E).
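When an FDR column is absent, the Benjamini-Hochberg correction applied per gene-tissue pair can be sketched as follows (a minimal illustration of the standard procedure, not hipFG's actual implementation):

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values (FDR)."""
    m = len(pvalues)
    # Indices of P-values in ascending order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    fdr = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity:
    # FDR_i = min over j >= i of (p_j * m / rank_j).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        fdr[i] = running_min
    return fdr

print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

Significant associations (e.g. FDR < 0.05) would then be written to a separate output alongside the full summary statistics.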

Chromatin interactions
Chromatin interactions data [e.g. 3C, 4C, Hi-C (Dekker et al. 2002, van Berkum et al. 2010)] are processed such that both anchors (interacting sites) of each interaction are accessible by the genomic search tools mentioned above (Supplementary Fig. S2C, chromatin interactions normalization). With each interaction assigned a unique ID (numbered according to its original order of appearance), each interaction anchor is written on a separate line with its interaction ID plus an "A" or "B" to indicate whether it is the interaction source or target. The interaction "score" (an integer in the range [0-1000] indicating interaction strength) and "value" (a double calculated as the -log10 of the score), both expected in the UCSC "interact" format (Haeussler et al. 2019), are calculated or assigned null values when missing. In addition, each interaction is assigned a human-readable name combining the interaction's data source, unique ID, score, and value. This name provides context to interactions when viewed with tools such as the UCSC Genome Browser (Karolchik et al. 2003). Lastly, any provided but unused columns are preserved as additional columns following the standard UCSC Interact Track Format fields (Haeussler et al. 2019).
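The anchor-splitting step can be sketched as follows (a simplified illustration; the record layout and naming format below are ours, not hipFG's exact output):

```python
def split_interaction(uid, src, tgt, source_name):
    """Write both anchors of one interaction as separate, independently
    indexable records. src/tgt are (chrom, start, end) tuples; "A" marks
    the source anchor and "B" the target, sharing the same interaction ID."""
    return [
        (*src, f"{source_name}:{uid}A"),
        (*tgt, f"{source_name}:{uid}B"),
    ]

rows = split_interaction(7, ("chr1", 100, 500), ("chr1", 90000, 90400), "HiC_demo")
for r in rows:
    print(r)
```

Because each anchor occupies its own coordinate-sorted line, either end of an interaction can be hit by a tabix or Giggle interval query, with the shared ID linking the two anchors back together.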

hipFG usage
Following installation, the hipFG.ini system configuration must be updated with the system-specific paths for the required programs, tools, and resources. The input MDT and file config(s) (Section 2.1, Fig. 1B, and Supplementary Figs S2 and S3) are then sufficient for hipFG to process the local input data (Fig. 1A). hipFG is executed with the Bash driver script (hipFG.sh), which generates customized pipeline scripts (Bash/AWK/Python) for each of the input files. hipFG can be executed with, at minimum, the MDT as its only argument. We note that currently any input files must be downloaded locally and available on the user's system (processing via file URLs will be included in future versions of hipFG).
In addition, Jupyter Bash notebooks (available in the hipFG repository) provide both tutorials for processing each of the main FG datatypes and an interactive workspace to apply hipFG to new data.Users may run hipFG in these notebooks to generate and execute pipelines tailored to their input FG data.
Lastly, the input files may be processed sequentially (on a single CPU or with multi-threading), or in parallel by chromosome-splitting or across input files.For full documentation, visit https://bitbucket.org/wanglab-upenn/hipFG.

Results
To demonstrate hipFG for FG data preparation/standardization and integration with high-throughput analysis workflows, we used hipFG to process and harmonize a heterogeneous, large-scale FG data collection including 109 eQTL catalogue eQTL datasets (Kerimov et al. 2021), 48 3DGenome chromatin interaction datasets (Wang et al. 2018), and 831 EpiMap samples with intervals annotated for 1 of 18 epigenetic states (Boix et al. 2021). These sources correspond to the three FG data types handled by hipFG, described in Sections 2.3.1-2.3.3, and include 17 billion variant-gene association records, 5 million genome-wide interactions, and 98 million annotated genomic intervals, respectively.
We found that variants were detected in a wide range of tissues within and across the three assay types (Supplementary Fig. S5). Following 1-Mb binning based on tag variants for each chromosome, the genomic region chr17:45476979-46476979 contained 471 variant-tissue-assay combinations, the most of any such bin. This region includes known genes of interest, including MAPT and KANSL1. These genes and this locus have been identified as sites of variants relevant to Alzheimer's disease (Liu et al. 2015).
In addition, hipFG's processing speed was tested on five eQTL datasets from the eQTL catalogue (Kerimov et al. 2021), demonstrating linear run-time scalability with respect to input size (Supplementary Fig. S6). At the same time, all SNP reference alleles successfully matched to at least one genome reference, with an average of 99.98% found in the dbSNP b156 reference (Sherry et al. 2001) and the remainder resolved by matching against the GRCh38.p14 reference genome (Schneider et al. 2017). The eQTL catalogue is a multi-study effort including various tissues and cell types, demonstrating hipFG's applicability to FG data in any biological context or disease of interest. These tissues/cell types, biosample types, and life stages were recorded using hipFG's controlled vocabulary to ensure consistent and accurate reporting of metadata annotations. In addition, hipFG will be expanded to support other data types, such as gene expression, variant-calling data, and protein-protein interactions, to allow for integration of valuable multi-omics and disease-specific resources such as AlzGPS (Zhou et al. 2021) and The Alzheimer's Cell Atlas (Zhou et al. 2022).
In conclusion, hipFG enables scalable harmonization and integration of diverse, heterogeneous FG data. Harmonized data and metadata allow for rapid querying and integration with high-throughput genetics workflows. Compared to resources that store FG data (Kuksa et al. 2022, Huang et al. 2023) or use a fixed set of FG data for annotation purposes (Zhou et al. 2023), hipFG allows users to take normalization into their own hands and more quickly expand, harmonize, and integrate FG data acquired from various sources into high-throughput analysis workflows. By simplifying this integration, we hope to allow researchers to focus on the development of innovative workflows that utilize FG annotations.

Figure 1. hipFG pipeline. Thin and thick lines indicate the flow of metadata and data, respectively. (A) Input FG data can belong to any of these three groups. Nonstandard file formats are acceptable. (B) The input minimal descriptions table (MDT) provides biological, source, and file information describing all input FG data files. This includes paths to the file configs, which map provided columns to standard BED-format output fields. (C) Data normalization includes steps dependent on and independent of input data types (for details, see Supplementary Fig. S2). (D) Metadata normalization provides additional context to information provided in the MDT by providing additional biological context such as tissue and system categories, recording provided publication information, calculating file attributes, and saving any supplemental information. (E) The hipFG data outputs have standardized formats and are stored in an organized file hierarchy according to metadata attributes. (F) The output metadata contains the final output of the metadata normalization, integrated for all datatypes, and stored at the project level. Giggle-based indexing and search allows for efficient annotation of genomic regions of interest, scalable up to millions of intervals as shown in Layer et al. (2018) (e.g. Fig. 1B and C therein) or Kuksa et al. (2022) (Fig. 4 therein).