Integrative computational epigenomics to build data-driven gene regulation hypotheses

Abstract

Background
Diseases are complex phenotypes that often arise as an emergent property of a non-linear network of genetic and epigenetic interactions. To translate this resulting state into a causal relationship with a subset of regulatory features, many experiments deploy an array of laboratory assays from multiple modalities. Each of the resulting datasets is often large, heterogeneous, and noisy, so unifying them into an interpretable phenotype is non-trivial. Although recent methods address this problem with varying degrees of success, they remain constrained in scope. An important gap in the field is therefore the lack of a universal data harmonizer capable of integrating arbitrary multi-modal datasets.

Results
In this review, we perform a critical analysis of methods with the explicit aim of harmonizing data, as opposed to case-specific integration. This analysis reveals that matrix factorization, latent variable analysis, and deep learning are potent strategies. Finally, we describe the properties of an ideal universal data harmonization framework.

Conclusions
A sufficiently advanced universal harmonizer has major medical implications: (1) identifying the dysregulated biological pathways responsible for a disease provides a powerful diagnostic tool; (2) investigating these pathways further allows the biological community to better understand a disease’s mechanisms; and (3) precision medicine also benefits from developments in this area, particularly in the context of the growing field of selective epigenome editing, which can suppress or induce a desired phenotype.
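To make the matrix factorization strategy named above concrete, the following is a minimal sketch rather than any reviewed method's implementation. It assumes two non-negative data blocks measured on the same samples, concatenates them along the feature axis, and fits a shared low-rank factorization with standard multiplicative updates for the squared-error objective. The function name `joint_nmf`, the block names, and the synthetic data are all hypothetical.

```python
import numpy as np


def joint_nmf(blocks, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorize column-concatenated non-negative blocks as X ~ W @ H.

    blocks : list of (n_samples, n_features_k) arrays sharing the sample axis.
    Returns W (shared sample factors) and one loading matrix per block.
    """
    rng = np.random.default_rng(seed)
    X = np.hstack(blocks)                      # samples x all features
    n_samples, n_features = X.shape
    W = rng.random((n_samples, rank)) + eps    # shared latent sample factors
    H = rng.random((rank, n_features)) + eps   # feature loadings, all blocks
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    # Split the loadings back into one matrix per input block.
    boundaries = np.cumsum([b.shape[1] for b in blocks])[:-1]
    return W, np.split(H, boundaries, axis=1)


# Synthetic illustration only: two hypothetical modalities driven by the
# same three latent factors across 20 samples.
rng = np.random.default_rng(1)
true_factors = rng.random((20, 3))
expression = true_factors @ rng.random((3, 50))
accessibility = true_factors @ rng.random((3, 30))

W, (H_expression, H_accessibility) = joint_nmf(
    [expression, accessibility], rank=3
)
```

Here the rows of `W` give one shared low-dimensional description of each sample, while `H_expression` and `H_accessibility` indicate how each latent factor loads onto the features of each modality. Simple concatenation implicitly weights blocks by their size and scale, so in practice each block would likely need normalization first.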

Silencers have the opposite effect, where looping prevents such events from occurring.
Insulators can act similarly to silencers, but can also repel chromatin formation to maintain a permissive transcriptional environment.
The remaining histone-bound regions of DNA are then identified through high-throughput sequencing.
Sites are known to exist in promoter as well as enhancer regions.
It is common for multiple transcription factors to work in tandem.
ATAC-Seq (Assay for Transposase-Accessible Chromatin) [10]
ChIP-Seq (Chromatin Immunoprecipitation sequencing) [17]
ChIP-chip (Chromatin immunoprecipitation on chip) [18]
Assays used to probe chromatin occupancy are also used to identify non-histone proteins.
It is hypothesised that they selectively recruit other elements of the epigenetic regulatory machinery [19].
Other studies suggest that their binding to enhancers is sufficient to induce transcription [20], or even chromatin condensation [21].
Conversely, some RNA transcribed at enhancers (eRNA) has been associated with phenotype [22].
Little is known about their mechanism of action in all scenarios.
ChIRP-Seq (Chromatin Isolation by RNA Purification sequencing) [24]
This variant of ChIP-Seq applies the same concept of immunoprecipitation to target RNA bound to DNA, identifying DNA-RNA binding events.
It is primarily used to survey long non-coding RNA.

RNA binding proteins (RBP)
Non-coding RNA is capable of interacting with proteins to perform gene regulatory functions.
These proteins can mediate gene activity both during and after transcription by regulating transcript levels and localisation [23].
Additionally, long non-coding RNA is implicated in genome organisation by acting as a scaffold for DNA and proteins.
CLiP-Seq (Cross-linking immunoprecipitation sequencing) [17,25,26]
PAR-CLiP (photoactivatable ribonucleoside-enhanced cross-linking and immunoprecipitation) [27]
RIP-Seq (RNA immunoprecipitation sequencing) [28]
These variants of ChIP-Seq exploit immunoprecipitation to target RNA bound to protein, identifying RNA-protein binding events.

Protein mediator complexes
Proteins and other regulatory elements rarely act in isolation.
Large regulatory protein complexes are frequently formed, and the resulting ensemble is often the main functional catalyst.
An example is the DICER complex, which catalyses small RNA synthesis [29,30].
ChIP + MS (Chromatin immunoprecipitation with mass spectrometry)
These are interesting due to their target specificity to DNA.
CAGE is a special assay which identifies alternative transcription start sites.

Proteomics
Quantitative protein and isoform identification
Many proteins have direct or indirect regulatory roles in the cell.
While histones, transcription factors and other molecules are often interpreted in the context of binding to another biological molecule, abundance levels of these proteins can be informative depending on the system under study.

GC-MS (Gas Chromatography Mass Spectrometry)
Proteins in a sample are identified by mass spectrometry.
While GC-MS workflows are robust and replicable, they are most effective on small molecules.

Metabolic pathway of interest
Metabolites are indicative of active or inactive metabolic pathways in the cell.
In most contexts, the relative abundances of metabolites within or across pathways yield information pertaining to the experiment.
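As a small worked illustration of relative abundances within a pathway, the sketch below normalizes raw peak intensities to fractions of the pathway total and then compares two conditions as log2 ratios. The metabolite names and intensity values are hypothetical, and the sketch assumes every metabolite has a non-zero intensity in both conditions.

```python
import math


def relative_abundance(intensities):
    """Convert raw intensities to fractions of the pathway total."""
    total = sum(intensities.values())
    if total <= 0:
        raise ValueError("total intensity must be positive")
    return {name: value / total for name, value in intensities.items()}


def log2_fold_change(case, control):
    """log2(case / control) for metabolites present in both conditions.

    Assumes non-zero abundances; zeros would need a pseudocount.
    """
    shared = case.keys() & control.keys()
    return {name: math.log2(case[name] / control[name]) for name in sorted(shared)}


# Hypothetical peak intensities for three metabolites of one pathway.
control = relative_abundance({"citrate": 30.0, "succinate": 10.0, "malate": 60.0})
case = relative_abundance({"citrate": 15.0, "succinate": 40.0, "malate": 45.0})
changes = log2_fold_change(case, control)
```

With these made-up values, succinate rises from one tenth to four tenths of the pathway total, a log2 ratio of about 2, while citrate halves, a log2 ratio of about -1.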

LC-MS (Liquid Chromatography Mass Spectrometry)
GC-MS (Gas Chromatography Mass Spectrometry)
Metabolites in a sample are identified by mass spectrometry.
While GC-MS workflows are robust and replicable, they are most effective on small molecules.

[36,37,38], which include SRA (Sequence Read Archive), EBI (European Bioinformatics Institute) and DDBJ (DNA Database of Japan). Single-cell data is not included in this. The publication associated with each database is provided in Table S2 for easy reference.

Figure S1: Gene expression is the result of a combination of regulatory feature interactions. Clockwise from top: insulators maintain an accessible chromatin state, permitting transcription within a specific region of the genome; protein-miRNA interactions modulate gene expression; DNA methylation silences a gene; silencers alter chromosome conformation to prevent assembly of the transcription machinery; chromatin accessibility for transcription is lowered by histone binding; DNA-lncRNA interactions regulate gene expression; in DNA-protein interactions, a transcription factor binds to DNA, triggering a signal cascade leading to transcription; enhancers reconfigure chromosome structure to increase the likelihood of gene expression.