Abstract

In this research, we elucidate the presence of around 11,000 housekeeping cis-regulatory elements (HK-CREs) and describe their main characteristics. Besides the trivial promoters of housekeeping genes, most HK-CREs reside in promoter regions and are involved in a broader role beyond housekeeping gene regulation. HK-CREs are conserved regions rich in unmethylated CpG sites. Their distribution highly correlates with that of protein-coding genes, and they interact with many genes over long distances. We observed reduced activity of a subset of HK-CREs in diverse cancer subtypes due to aberrant methylation, particularly those located in chromosome 19 and associated with zinc finger genes. Further analysis of samples from 17 cancer subtypes showed a significantly increased survival probability of patients with higher expression of these genes, suggesting them as housekeeping tumor suppressor genes. Overall, our work unravels the presence of housekeeping CREs indispensable for the maintenance and stability of cells.

Introduction

Characterizing the different components within the complex transcriptional regulatory mechanism has become a long-standing topic of research (1,2). Thanks to the advances in sequencing techniques and experimental protocols, it is possible to investigate different cis-regulatory elements (CREs), e.g. enhancers and promoters, and their differences and similitudes in numerous cell types with unprecedented resolution (3,4). These advances have also made clear that classical discrete definitions of CREs do not necessarily fit the intricate interplay between them. For example, several studies have described CREs that may have both enhancer and promoter functions with similar epigenetic features to either classical marks of enhancers or promoters (2,5). In this way, promoters with enhancer capabilities could drive the activation of neighboring genes through promoter–promoter interactions (5–7). In a recent study, contrary to the classical enhancer–promoter interactions on the regulation of cell type-specific genes, it was demonstrated that housekeeping genes (HKG) are mainly regulated through promoter–promoter interactions (8), delineating the importance of this kind of interplay on the maintenance of cells. These new insights into the regulation of HKG and the intrinsic specificity of the core promoters of HKG to certain gene regulators (9) suggest the existence of ubiquitously active CREs besides the promoters of HKG. However, to our knowledge, the existence of these CREs that are active in all cell types has not been described. In this study, we leveraged the activity by contact (ABC) dataset (10), which provides a list of active CREs and their predicted target genes from 131 cell types and tissues, to investigate ubiquitously active CREs in healthy cell types. We elucidate the existence of such housekeeping CREs (HK-CREs), e.g. CREs active in at least 90% of the healthy cell types, and describe their main features. Through several bioinformatic analyses of multiomics data, we show that HK-CREs resided in highly conserved regions that are rich in unmethylated CpG sites, and, similar to HKG (8), they reside in specific loci within the genome as compared with cell type-specific CREs. In conformity with current knowledge on the regulation of HKG (8,9), we show that most of these ubiquitously active CREs localize in core promoter regions and interact with neighboring genes in complex long-distance promoter–promoter interactions. However, contrary to expectations, our analysis revealed that the core promoters of HKG, which are inherently assumed to be active in every cell type, were not the only ubiquitously active CREs; the core promoters of HKG only accounted for around 18% of the total number of HK-CREs. We then compared the biological processes of HKG and the rest of the genes with ubiquitously active core promoters revealing an intrinsic association between these sets of genes, e.g. both groups are enriched in similar or identical housekeeping functions, which suggests their complementarity to ensure the proper functioning and maintenance of cells. Finally, we investigated the HK-CREs in diverse cancer cell lines, discovering a reduction in the activity of a subset of these elements. Interestingly, we found that the most affected HK-CREs in cancer cells reside close to the telomere region of chromosome 19 and are inherently related to zinc finger genes. Moreover, we show that while most of these HK-CREs with reduced activity were cancer-specific, there was a set of core promoters found to be inactive in numerous cancer cell lines, suggesting the existence of housekeeping tumor suppressor genes. We leveraged public data from diverse popular cancer projects and found a decrease in the expression of putative housekeeping tumor suppressor genes (ZNF154, ZNF135, ZNF667, and ZNF667-AS1) due to aberrant methylation of their core promoters in several cancer subtypes. Despite potential cancer-specific regulation of these genes during cancer progression, a joint analysis of more than 5,000 patients from 17 cancer subtypes showed a significant increase in the survival probability of patients with high expression of these genes, confirming, to some degree, the existence of housekeeping tumor suppressor genes in the human genome.

Overall, our work unveils key cis-regulatory elements active in multiple healthy cell types that, if lost, could strongly affect the housekeeping biological processes of cells and possibly culminate in cancer development.

Materials and methods

Quality control and pre-processing of regions

First, we filtered out not significant regions with an ABC score lower than 0.015 as in the original publication (11). Then, we lifted up the regions coordinates to the hg38 human genome using the CrossMap assembly converter (12) implemented in the Ensemble website (13).

We manually filtered out cancer cell lines or stimulated cell types from the original 131 cells in the ABC dataset (10), obtaining a healthy dataset of 71 cell types and tissues (Supplementary Table S1). To investigate the dependency of the housekeeping (HK) regions on the number of included cell types and on how strictly we define a region as to be HK, we merge the regions (see Mergingprocess) with an increasing number of randomly selected cell types using a 70, 80, 90 or 100% definition of HK, e.g. a given region is an HK region if it is found in at least 70, 80, 90 or 100% of the cell types. We observed that the number and length of the merged HK regions stabilized after we included around 30–40 cell types with a 90 or 100% definition (Supplementary Figure S1). Therefore, we randomly selected 50 healthy cell types (Supplementary Table S1) and merged their regions for further analyses. The regions from the remaining 21 healthy cells were kept for validation.

After annotating the main histone marks of the merged regions (see Annotation of histone marks), as a quality control step, we filtered out regions with histone marks similar to a set of 10,000 negative samples (see Definition of negative samples and Filtering process). For this, we clustered regions using the Louvain algorithm (14) implemented in the igraph R package (ver. 1.5.0.1) (15) using the histone marks and the distance to the nearest TSS as features (seeFiltering process). Finally, we retained 135,917 regions as putative cis-regulatory elements (see Table 1).

Table 1.

Characterization of clusters

GroupH3K27acH3K4me3H3K4me1Distance to TSS (median)CountsPercentage of total
Core promoters (CP)++Very short (56 bp)28,96821.20%
Promoters (P)++++Short (1,125 bp)24,23317.73%
Enhancers short distance (ESD)++/−++Intermediate (3,823 bp)11,4458.37%
Enhancers long distance (ELD)+/−+Long (11,845 bp)32,56423.83%
Enhancers super long distance (ESLD)++Very long (19,983 bp)8,0635.90%
Enhancer short distance inactive (ESDi)+Intermediate (2,979 bp)5,9754.37%
Other medium distance (OMD)Intermediate (4,163 bp)5,1093.73%
Other long distance (OLD)Long (10,540 bp)19,56314.31%
H3K4me1 negative (H3K4me1_negative)+/−+/−ZeroShort (1,017 bp)7450.56%
GroupH3K27acH3K4me3H3K4me1Distance to TSS (median)CountsPercentage of total
Core promoters (CP)++Very short (56 bp)28,96821.20%
Promoters (P)++++Short (1,125 bp)24,23317.73%
Enhancers short distance (ESD)++/−++Intermediate (3,823 bp)11,4458.37%
Enhancers long distance (ELD)+/−+Long (11,845 bp)32,56423.83%
Enhancers super long distance (ESLD)++Very long (19,983 bp)8,0635.90%
Enhancer short distance inactive (ESDi)+Intermediate (2,979 bp)5,9754.37%
Other medium distance (OMD)Intermediate (4,163 bp)5,1093.73%
Other long distance (OLD)Long (10,540 bp)19,56314.31%
H3K4me1 negative (H3K4me1_negative)+/−+/−ZeroShort (1,017 bp)7450.56%

The annotations of clusters were manually assigned using a combination of their epigenetics signals and their distance to the nearest TSS.

Table 1.

Characterization of clusters

GroupH3K27acH3K4me3H3K4me1Distance to TSS (median)CountsPercentage of total
Core promoters (CP)++Very short (56 bp)28,96821.20%
Promoters (P)++++Short (1,125 bp)24,23317.73%
Enhancers short distance (ESD)++/−++Intermediate (3,823 bp)11,4458.37%
Enhancers long distance (ELD)+/−+Long (11,845 bp)32,56423.83%
Enhancers super long distance (ESLD)++Very long (19,983 bp)8,0635.90%
Enhancer short distance inactive (ESDi)+Intermediate (2,979 bp)5,9754.37%
Other medium distance (OMD)Intermediate (4,163 bp)5,1093.73%
Other long distance (OLD)Long (10,540 bp)19,56314.31%
H3K4me1 negative (H3K4me1_negative)+/−+/−ZeroShort (1,017 bp)7450.56%
GroupH3K27acH3K4me3H3K4me1Distance to TSS (median)CountsPercentage of total
Core promoters (CP)++Very short (56 bp)28,96821.20%
Promoters (P)++++Short (1,125 bp)24,23317.73%
Enhancers short distance (ESD)++/−++Intermediate (3,823 bp)11,4458.37%
Enhancers long distance (ELD)+/−+Long (11,845 bp)32,56423.83%
Enhancers super long distance (ESLD)++Very long (19,983 bp)8,0635.90%
Enhancer short distance inactive (ESDi)+Intermediate (2,979 bp)5,9754.37%
Other medium distance (OMD)Intermediate (4,163 bp)5,1093.73%
Other long distance (OLD)Long (10,540 bp)19,56314.31%
H3K4me1 negative (H3K4me1_negative)+/−+/−ZeroShort (1,017 bp)7450.56%

The annotations of clusters were manually assigned using a combination of their epigenetics signals and their distance to the nearest TSS.

Merging process

For a given set of cell types, we merged overlapping regions following the next iterative process until no regions were left in the raw dataset:

  • Step 1: Merge raw overlapping regions across cell types. We used the function bedr.merge.region from the bedr R package (ver 1.0.7) (16) with a distance of 20bps.

  • Step 2. Select merged regions that overlap on the maximum number of cell types. In the case of housekeeping regions, we merged and selected regions overlapping with a defined 90% percentage of the total number of cells, e.g. regions that overlap in at least 90% of the cell types.

  • Step 3. Filter out raw regions overlapping with the selected merged regions. We used the findOverlapsfunction from the GenomicRanges R package (ver 1.46.1) (17).

  • Step 4. Repeat Step 1 using the filtered raw regions.

Filtering process

We filtered out suspicious regions with similar features, e.g. histone signals (HS) and distance to nearest TSS (dnTSS), to the set of negative samples (NS, see Definition of negative samples). Therefore, we performed a joint unsupervised analysis of regions and NS in an iterative way:

  • Step 1. Normalize and z-transform the histone signals of regions and NS.

  • Step 2. Perform principal components analysis (PCA) of negative samples (NS) and manually select the most informative PCs for further steps.

  • Step 4. Perform clustering analysis of regions and NS using the Louvain algorithm.

  • Step 5. Calculate the percentage of each type of region, e.g. cell type-specific, housekeeping, NS, etc., across clusters.

  • Step 6. For each suspicious cluster which more than 20% of its regions are NS, perform a sub-clustering analysis as follows:

    • Subset the regions and NS in the suspicious cluster and perform steps 1–5.

    • Filter out regions in sub-clusters with a high percentage of NS, e.g. >50% of the regions are NS.

  • Repeat steps 1–6 until there are not suspicious clusters.

At each step, relevant parameters such as the resolution of clustering and the number of PCs were defined with the help of visualizations such as the elbow plot and the embedding coordinates from the uniform manifold approximation and projection (UMAP) method (18).

Definition of negative samples

We defined a set of 10,000 negative samples (NSs) as non-overlapping regions with cis-regulatory elements (CREs) from the ABC database (131 cell types) nor the SCREEN project (4) as follows:

  • Merge regions from the ABC and SCREEN datasets using the reducefunction from the GenomicRanges R package (ver 1.46.1) (17) with a minimum gap width of 100 bp.

  • Calculate the gaps between the merged regions using the gapsfunction from the GenomicRanges R package.

  • Filter out gaps located in centromeres or at the end of telomeres.

  • Filter out gaps less than 3Kbp long. This length intends to keep regions of 1.5Kbp which are at least 500bp away from any extended CRE (see Annotation of histone marks).

  • Select 10,000 random gaps and define NS as 500bp regions located at the center of the selected gaps.

Annotation of histone marks

We annotated the H3K27ac, H3K4me1 and H3K4me3 histone marks of an extended version of the cis-regulatory elements and negative samples. The extended version includes the original regions plus 250 bp on each flank to account for adjacent histones.

We downloaded ChIP-seq histone marks datasets from the ENCODE project (3) (Supplementary Table S2) of samples selected as being similar to the main 50 healthy cells/tissues (see Quality control and pre-processing of regions). Following the pre-processing used on the SCREEN project (4), we performed independent log(signal + 1) normalization and scaling of the samples. Finally, we annotated the histone marks of the extended regions as the average signal of overlapping peaks across pre-processed samples; overlapping peaks were selected as those with at least 250 bp overlap with the extended regions.

Annotation of the distance to the nearest TSS

We annotated the distance to the nearest TSS of putative cis-regulatory elements and negative samples using the ENSEMBL transcripts annotation version GRCh38.107. For this, we calculated the absolute distance from the center of the regions to every TSS in the ENSEMBL transcripts dataset. Then, for each region, we selected the TSS with the minimum absolute distance as the nearest TSS. We also annotated the nearest gene to each region as the gene related to the nearest TSS. This gene annotation was especially important for further analyses of cis-regulatory elements labeled as core promoters.

Annotation of the conservation scores

We annotated the mean conservation scores on a fixed-length version of the cis-regulatory elements (CREs). The fixed-length CREs were defined to be 500 bp at both flanking loci from the center of each CRE. Then, we overlapped each fixed-length CRE with the conservation scores of 241 mammals from the Zoonomia project (19) using the findOverlaps function from the GenomicRanges R package (ver. 1.46.1), and calculated the mean conservation score as the mean score among overlapping regions.

Gene ontology analysis

Gene ontology analysis of genes was performed using the DAVID web tool (20). Significant processes were selected to have FDR ≤ 0.05. The differential enrichment analysis was performed by calculating the fold change of −log(FDR) between housekeeping and non-housekeeping genes. Then, differentially enriched terms were selected as those with at least and absolute 20-fold change (Supplementary Figure S8 and S9). Finally, the top 10 terms were selected to produce plots in Figure 3.

Methylation atlas

We downloaded the methylation scores from 34 randomly picked samples from the DNA methylation atlas of normal human cell types (21) (Supplementary Table S2). Then, similar to the annotation of the conservation scores, we obtained overlapping regions between the methylation dataset and the set of fixed-length CREs and NS using the findOverlaps function from the GenomicRanges R package (ver. 1.46.1); the fixed-length regions were defined as regions of 1Kbp concentric with every CRE and NS. Then, for each CRE and NS, we counted the number of overlapping CpG sites and calculated the mean methylation score across cell types in the methylation dataset. To produce the methylation tracks from Figure 1E and Supplementary Figure S6, we used the track_plot function from the trackplot R package (ver. 1.3.10) (22).

Housekeeping cis-regulatory elements in the human genome. (A) After merging overlapping regions from 50 cell types, we observed a significant number of housekeeping cis-regulatory elements (HK-CREs) found to be active in at least 90% of the cells. (B) HK-CREs interacts with a larger number of surrounding genes in longer distance than CTS-CREs. (C) The density of HK-CREs highly correlates with the density of protein-coding genes across chromosomes. (D) HK-CREs reside in conserved regions rich in unmethylated CpG sites. (E) Conservation scores and methylation levels of three randomly sampled HK-CREs.
Figure 1.

Housekeeping cis-regulatory elements in the human genome. (A) After merging overlapping regions from 50 cell types, we observed a significant number of housekeeping cis-regulatory elements (HK-CREs) found to be active in at least 90% of the cells. (B) HK-CREs interacts with a larger number of surrounding genes in longer distance than CTS-CREs. (C) The density of HK-CREs highly correlates with the density of protein-coding genes across chromosomes. (D) HK-CREs reside in conserved regions rich in unmethylated CpG sites. (E) Conservation scores and methylation levels of three randomly sampled HK-CREs.

ENCODE RNA-seq datasets

All the healthy and cancer RNA-seq data were obtained from the ENCODE project (3) (Supplementary Table S2). For pre-processing and quality control of samples, we followed the next pipeline:

  • Load the TPM data using the tximport Bioconductor package (ver. 1.22.0) (23).

  • Filter out genes with zero TPM in all the samples.

  • Map the gene labels to match the gene names used in the ABC dataset. We used the org.Hs.eg.db (ver. 3.14.0) and AnnotationDbi Bioconductor packages (ver 1.56.2) (24,25) to match the gene names with a combination of ALIAS, ENSEMBL and ENTREZID.

  • Natural log-normalize the TPM data.

DepMap methylation and RNA-seq data

Methylation of promoter regions and expression data of genes ZNF667, ZNF667-AS1, ZNF154 and ZNF135 were downloaded directly from the DepMap website (26); by the time we performed this research, no expression data was available for the ZNF667-AS1 gene. From the secondary_lineage column in the metadata of downloaded datasets, we re-labeled ‘Non-Cancerous’ samples as ‘Normal’ and all the other samples as ‘Cancer’. These new labels were used to compare the methylation and expression levels in Figure 4.

TCGA clinical and RNA-seq data

TCGA data were downloaded and analyzed using the TCGAbiolinks R package (ver. 2.29.6) (27). In this study, we included only projects related to a single primary site (see Supplementary Table S4).

For each selected TCGA project (see Supplementary Table S4), the clinical data was obtained by using the GDCquery_clinic function and setup as follows:

  • Samples were classified as deceased = TRUE | FALSE, based on their ‘Alive’ status from the vital_statuscolumn in the metadata.

  • The overall survival of samples was assigned as: if vital_status== ‘Alive’ then overall_survival = days_to_last_follow_up, else overall_survivaldays_to_death, where days_to_last_follow_up and overall_survival are columns from the clinical metadata.

For each selected TCGA, a query for downloading publicly available RNA-seq data was created using the GDCquery function; data.category = ‘Transcriptome Profiling’, experimental.strategy = ‘RNA-seq’, workflow.type = ‘STAR – Counts’, data.type = ‘Gene Expression Quantification’, sample.type = ‘Primary Tumor’, access = ‘open’.

The RNA-seq data was downloaded using their corresponding query and the GDCdownload function. The TPM of genes was used in all the analyses.

To calculate the mean expression of each sample used for the survival analyses (Figures 4F-G, Figure 4I, and Supplementary Figures S17–S19) we z-transformed the TPM expression levels of the ZNF154, ZNF135, ZNF667 and ZNF667-AS1 genes on each project. Then, a mean_expression was calculated for each sample as the mean value of the transformed expressions.

To perform the survival analyses, we stratified the samples based on their mean_expression as follows: if mean_expression ≥ 3rd quantile(mean_expression of all samples), then strata = ‘HIGH’, else strata = ‘LOW’. Using the strata label, we fit a survival model using the survfit function from the survival R package (ver. 3.5–5) (28); Surv(overall_survival, deceased) ∼ strata. Finally, we created the survival plots using the ggsurvplot function from the survminer R package (ver. 0.4.9) (29).

Statistical information

For the statistical comparison used in the manuscript, we used the stat_compare_means function from the ggpbur R package (ver 0.6.0) (30) with the next convention for symbols indicating statistical significance:

  • ns: P > 0.05

  • *: P ≤ 0.05

  • **: P ≤ 0.01

  • ***: P ≤ 0.001

  • ****: P ≤ 0.0001

In Figure 1D and Figure 3B, we used an unpaired Wilcoxon test with a two-sided alternative hypothesis. In Supplementary Figure S12, we used a paired Wilcoxon test with a two-sided alternative hypothesis.

Other analyses and plots

To analyze the density of regulatory regions, protein-coding genes, and transcripts of genes, we input the center of CREs and the TSS of transcripts and genes from ENSEMBL to the geom_density_ridges function from the ggridges R package (ver. 0.5.4) (31); bandwidth = 1e6.

Results

Section 1. Active housekeeping cis-regulatory elements in the human genome

We investigated the cis-regulatory elements (CREs) from the ABC public dataset (10). This dataset contains a genome-wide list of active CREs from 131 cell types (Supplementary Table S1), identified from the combination of DNase I hypersensitivity (DNase) or assay for transposase-accessible chromatin sequencing (ATAC-seq) data, with chromatin immunoprecipitation sequencing (ChIP-seq) data for the acetylation of the lysine residue at the N-terminal position 27 of the histone H3 protein (H3K27ac). One asset of this dataset is that it provides a list of predicted interactions between CREs and their surrounding genes within 5Mbp; interactions are quantified from 3D chromatin conformation (Hi-C) data. After selecting significant CRE-gene interactions and filtering out CREs from cell types or tissues related to cancer (n = 26) or various stimuli (n = 34), we obtained around 8.5 million predicted interactions from around 2 million CREs in 71 healthy cell types (Supplementary Table S1). After merging overlapping regions from different cell types (see Materials and Methods), we noticed a set of housekeeping CREs (HK-CREs) active in all cell types, which number was stable provided that a sufficient number of cell types were included (see Supplementary Figure S1 and Materials and Methods). Then, we randomly selected 50 healthy cell types for further analysis and left the remaining 21 healthy cell types for validation (see Section 4 and Supplementary Table S1). We then merged overlapping regions across the selected cell types and filtered out suspicious CREs with epigenetic marks similar to a set of 10,000 randomly selected negative sample (NS) regions; NS regions were defined as regions not overlapping with any CRE from the ABC and the SCREEN projects (4,10) (see Materials and methods). Finally, we obtained a merged dataset with around 846,000 predicted interactions from 136,365 CREs. Figure 1A shows the distribution of the resulting CREs across the number of cell types they are active in. While most of the CREs are active in a low number of cell types, e.g. cell type-specific (CTS-CREs), we found 10,965 HK-CREs active in at least 90% of the 50 cell types. Interestingly, HK-CREs are predicted to interact with a larger number of target genes over longer distances (median number of targets = 22, median distance = 553,817bp; Figure 1B) as compared with CTS-CREs (mean number of targets = 1, median distance = 63,411bp). Surprisingly, the HK-CREs presented a considerably high correlation with protein-coding genes in number (r = 0.979; Supplementary Figure S2) and density (r = 0.882; Figure 1C) as compared with CTS-CREs (r = 0.754 and r = 0.601, respectively; Supplementary Figure S3), implying their importance in the gene regulation process. As expected, because of their presumed housekeeping capacity (32,33), the loci of HK-CREs are significantly conserved and enriched in unmethylated CpG sites as compared with CTS-CREs and negative sample (NS) regions (Figure 1D and Supplementary Figure S4 and S5). Figure 1E shows an example of the conservation and methylation scores of three HK-CREs across nine tissues (tracks across 30 cell types and tissues can be seen in Supplementary Figure S6), where it can be observed that the loci of the HK-CREs overlap with highly conserved and unmethylated regions within a locus that is rich in CpG sites across samples. In contrast, a randomly selected CTS-CRE from T cells presents different methylation patterns across cell types within a poor CpG locus (Supplementary Figure S6).

Section 2. Most of the active housekeeping cis-regulatory elements reside in core promoters

We performed an unsupervised classification analysis to further characterize and investigate the functions of housekeeping cis-regulatory elements (HK-CREs). We annotated the CREs with their main epigenetic marks (H3K27ac, H3K4me1 and H3K4me3) and their distance to the nearest transcription start site (dnTSS), using a set of ChIP-seq datasets from the ENCODE project (3) (Supplementary Table S2) and the TSS locations from the ENSEMBL project (13), respectively. We used these annotations in an unsupervised classification and visualization process, where we obtained the principal components (PCs) (34) coordinates of CREs and used them as the input data to the Louvain clustering (14) and to the Uniform Manifold Approximation and Projection (UMAP) method (18). We noticed that the HK-CREs were located at specific parts of the UMAP plot and that they didn’t present a substantial overlap with cell type-specific CREs (CTS-CREs), suggesting their distinctive epigenetic and dnTSS features (Figure 2A, B). The results from clustering confirmed these differences (Figure 2C, D), where CTS-CREs appeared enriched in groups where the number of HK-CREs is scarce and vice versa. Table 1 and Supplementary Figure S7 show the comparison of the feature levels used to manually label nine groups of CREs found by clustering; five groups of active CREs: core promoters (CP), promoters (P), enhancer short distance (ESD), enhancers long distance (ELD), and enhancers super long distance (ESLD); one group of inactive enhancers: enhancer short distance inactive (ESDi); two groups of CREs with low signals of epigenetic histone marks: other medium distance (OMD) and other long distance (OLD); and one small group of CREs negative for H3K4me1 (H3K4me1_negative). Interestingly, most HK-CREs were classified as CPs (Figures 2D and E) with a ‘very short’ mean dnTSS of 56 bp.

HK-CREs reside in core promoter regions. CREs were annotated with signal levels from H3K27ac, H3K4me3, H3K4me1, and their distance to the nearest TSS (dnTSS). These annotations were used as features in an unsupervised analysis including clustering and visualization. (A, B) UMAP plot shows a distinct distribution of HK-CREs and cell type-specific-CREs (CTS-CREs) (A) with distinct signal levels of histone marks and dnTSS (B). (C, D) Clustering analysis showed similar results where CTS and HK-CREs were assigned to different clusters. (E) Annotation of the clusters using histone signal levels and dnTSS showed that HK-CREs reside primarily in core promoter regions.
Figure 2.

HK-CREs reside in core promoter regions. CREs were annotated with signal levels from H3K27ac, H3K4me3, H3K4me1, and their distance to the nearest TSS (dnTSS). These annotations were used as features in an unsupervised analysis including clustering and visualization. (A, B) UMAP plot shows a distinct distribution of HK-CREs and cell type-specific-CREs (CTS-CREs) (A) with distinct signal levels of histone marks and dnTSS (B). (C, D) Clustering analysis showed similar results where CTS and HK-CREs were assigned to different clusters. (E) Annotation of the clusters using histone signal levels and dnTSS showed that HK-CREs reside primarily in core promoter regions.

Section 3. Genes whose core promoter is a HK-CRE complement the biological processes of housekeeping genes

Housekeeping genes (HKGs) have been widely investigated as constitutively expressed genes in any cell type regardless of their function or developmental stage (35). In contrast, to our knowledge, ubiquitously active cis-regulatory elements (CREs) are not clearly defined. We hypothesize that these two elements are closely related, especially for the housekeeping CREs identified as core promoters, e.g. HK-CREs active in >90% of the 50 cell types and very close to a transcription start site (TSS). To get an insight into the relationship between our identified housekeeping core promoters (HK-CPs) and housekeeping genes, we obtained the list of genes whose TSS is within 300 bps from the center of an HK-CP and compared them with a set of known 2,131 HKGs from a public database (35). As expected, most of the core promoters of HKGs were included in our set of HK-CPs (Figure 3A). However, our list included the core promoters of a considerable number of other genes from diverse classes (Table 2) that were not included in the database of HKGs. We labeled them as non-housekeeping genes (nonHKG, n = 8,681) and compared them with genes in the HKGs database. Here, we would like to highlight that our identification of these genes, HKGs and nonHKGs, is based on the constitutively epigenetic activity of their core promoters, e.g. an open region with high levels of H3K27ac signal and predicted to interact with surrounding genes by the ABC method, and not on their constitutive expression; as the regular definition used in the database of HKGs (35). This suggests that, regardless of the activity of their core promoter, the gene expression levels of nonHKG could be different from those of the HKGs. To verify this, we collected RNA sequencing (RNA-seq) data from 37 cell types from the ENCODE project (Supplementary Table S2) and compared the expression levels of HKGs and nonHKGs. Figures 3B and C show the mean gene expression of these sets of genes across cell types and the number of cell types expressing them. As expected, the mean expression and the number of cell types expressing nonHKGs were significantly lower than those of HKGs. However, we argued that nonHKGs, whose core promoter is ubiquitously active, could still play an essential role in cell housekeeping processes regardless of their low RNA expression levels. Therefore, we analyzed the biological process (BP) of HKGs and nonHKG from the Gene Ontology project (36). Interestingly, the analysis showed a high overlap between the BPs of HKGs and nonHKGs with different significance and enrichment levels (Supplementary Figure S8 and S9), suggesting the complementarity of these two sets of ubiquitously active genes. We then performed a differential analysis of the significance and enrichment levels between overlapping BPs of HKGs and nonHKG (max FDR = 0.05). Figures 3D and E show the top 10 resulting BPs by significance and enrichment, respectively. Interestingly, the significance analysis (Figure 3D and Supplementary Figure S8) showed that, while HKGs were highly associated with general metabolic, localization, translation, and transport BPs within the cell, nonHKG were related to more specific metabolic and transcription processes, supporting the hypothesized complementarity between these two sets of genes. Surprisingly, the enrichment analysis showed that BPs related to tRNA and regulation of RNA polymerase II, important BPs within the transcription process, were enriched almost exclusively for nonHKGs (Figure 3E and Supplementary Figure S10). Overall, these results confirm the importance of housekeeping core promoters in the gene regulation process and could explain, to some degree, the high correlation between these cis-regulatory elements and the protein-coding genes (Figure 1C and Supplementary Figure S2 and S10).

Table 2.

Genes associated with a housekeeping core promoter

Gene classHKGs (n = 2,019) (number of associated HK-CPs = 2,151)nonHKGs (n = 8,681) (number of associated HK-CPs = 8,033)
Protein coding2,0197,700
Long noncoding RNA0857
Pseudogene0116
Others08
Gene classHKGs (n = 2,019) (number of associated HK-CPs = 2,151)nonHKGs (n = 8,681) (number of associated HK-CPs = 8,033)
Protein coding2,0197,700
Long noncoding RNA0857
Pseudogene0116
Others08

Description of the biotype of genes associated with a HK-CPs.

Table 2.

Genes associated with a housekeeping core promoter

Gene classHKGs (n = 2,019) (number of associated HK-CPs = 2,151)nonHKGs (n = 8,681) (number of associated HK-CPs = 8,033)
Protein coding2,0197,700
Long noncoding RNA0857
Pseudogene0116
Others08
Gene classHKGs (n = 2,019) (number of associated HK-CPs = 2,151)nonHKGs (n = 8,681) (number of associated HK-CPs = 8,033)
Protein coding2,0197,700
Long noncoding RNA0857
Pseudogene0116
Others08

Description of the biotype of genes associated with a HK-CPs.

Genes whose core promoter is an HK-CP are significantly enriched in housekeeping biological processes. The genes whose TSS is <300 bp from a housekeeping core promoter (HK-CP) were listed and compared with housekeeping genes (HKG). (A) As expected, most of the core promoters of HKG are HK-CP, however, the core promoters of another 8681 genes were listed as HK-CPs, e.g. non-housekeeping genes (non-HKG). (B) Mean gene expression levels of HKGs and nonHKGs over a set of 37 healthy cell types; HKGs present a significantly higher expression than nonHKG. (C) Mean gene expression levels and the number of cells expressing HKGs and nonHKG; HKGs are ubiquitously expressed in most of the cells, in contrast, nonHKGs present a wider distribution. Besides the differences in gene expression levels and the number of cells expressing HKG and non HKGs, the top biological processes (BPs) sorted by significancy (D), e.g. enrichment of genes in general BPs, and by enrichment (E), e.g. enrichment of genes in specific BPs, suggested the complementarity of these two sets of genes in important housekeeping biological processes.
Figure 3.

Genes whose core promoter is an HK-CP are significantly enriched in housekeeping biological processes. The genes whose TSS is <300 bp from a housekeeping core promoter (HK-CP) were listed and compared with housekeeping genes (HKG). (A) As expected, most of the core promoters of HKG are HK-CP, however, the core promoters of another 8681 genes were listed as HK-CPs, e.g. non-housekeeping genes (non-HKG). (B) Mean gene expression levels of HKGs and nonHKGs over a set of 37 healthy cell types; HKGs present a significantly higher expression than nonHKG. (C) Mean gene expression levels and the number of cells expressing HKGs and nonHKG; HKGs are ubiquitously expressed in most of the cells, in contrast, nonHKGs present a wider distribution. Besides the differences in gene expression levels and the number of cells expressing HKG and non HKGs, the top biological processes (BPs) sorted by significancy (D), e.g. enrichment of genes in general BPs, and by enrichment (E), e.g. enrichment of genes in specific BPs, suggested the complementarity of these two sets of genes in important housekeeping biological processes.

Section 4. Inactive housekeeping core promoters in cancer

In our definition of housekeeping cis-regulatory elements (HK-CREs) in Section 1, we labeled putative HK regions as those loci active in at least 90% of 50 randomly selected healthy cell types from the ABC dataset. To further validate our list of HK-CREs, we evaluated them using the rest of the 21 healthy cell types and 26 cancer cell lines from the same ABC dataset (Supplementary Table S1). Thus, for each HK-CRE, we calculated the percentage of cells in which it was found to be active in the healthy-validation and cancer datasets. Figure 4A shows the results from this analysis for the set of HK-CREs annotated as core promoters (HK-CPs; similar results for other annotated HK-CREs are shown in Supplementary Figure S11), where the gray dashed line represents our 90% threshold used to define putative housekeeping elements in Section 1. Most HK-CPs fulfilled this threshold in healthy cell types (96.94% of HK-CPs, n = 9,411), validating them as housekeeping elements. However, we observed a significant reduction in the distribution of HK-CPs in cancer cell lines (86.05% of HK-CPs, n = 8,354), where 1,354 HK-CPs did not fulfill the given housekeeping threshold; Supplementary Table S3 contains the list of HK-CPs that did not satisfy the threshold ranked by their ascending percentage of cancer cells they are found active on. To further explore these findings, similar to the analysis performed in Section 3, we obtained the list of genes whose TSS is within 300 bps from the center of a HK-CP, and compared the gene expression levels of these genes between a set of 13 cancer cell lines from the ENCODE project (3) and the healthy set used in Section 3 (n = 37) (Supplementary Table S2). While most of the genes showed a significant correlation between their expression levels in cancer and healthy cells (Figure 4B), the genes whose core promoter was found active in <35% of cancer cells (core promoters, n = 19; genes, n = 25) presented a considerably lower median gene expression in cancer cells as compared with healthy cell types (red points in Figure 4B; Supplementary Figure S12). Chromosome 19 was the most affected chromosome with nine HK-CPs active in <35% of cancer cells (Supplementary Figure S13 and Table 3). Interestingly, most of the HK-CPs with reduced activity in this chromosome tend to localize towards telomere regions, with the top 3 inactive HK-CPs residing in the region less than 2 Mbp away from the end of this chromosome, e.g. near the telomere region, suggesting the importance of this locus for maintaining the integrity of cells.

Table 3.

Top inactive HK-CPs in cancer cell lines

HK-CP regionGene nameTSS% Cancer cell lines% Healthy cell typesDistance to the end of the chromosome
chr19:58058989–58059488ZNF13558,059,23911.5395.23558,377
chr19:57935138–57935637ZNF41857,935,38730.76100.00682,229
chr19:57708962–57709461ZNF15457,709,20411.5395.23908,412
chr19:57583726–57584309ZIK157,584,14026.92100.001,033,476
chr19:56507593–56508092ZNF47156,507,85030.7695.232,109,766
chr19:56477152–56478177ZNF667-AS156,477,59015.3895.232,140,026
chr19:56477152–56478177ZNF66756,477,37415.3895.232,140,242
chr19:53254646–53255153ZNF67753,254,89823.0795.235,362,718
chr19:51887687–51888274ZNF649-AS151,888,02526.9295.236,729,591
chr19:51887687–51888274ZNF57751,887,98026.9295.236,729,636
chr19:36605027–36605554ZNF38236,605,31334.6190.4722,012,303
chr19:36605027–36605554ZNF52936,605,27634.6190.4722,012,340
HK-CP regionGene nameTSS% Cancer cell lines% Healthy cell typesDistance to the end of the chromosome
chr19:58058989–58059488ZNF13558,059,23911.5395.23558,377
chr19:57935138–57935637ZNF41857,935,38730.76100.00682,229
chr19:57708962–57709461ZNF15457,709,20411.5395.23908,412
chr19:57583726–57584309ZIK157,584,14026.92100.001,033,476
chr19:56507593–56508092ZNF47156,507,85030.7695.232,109,766
chr19:56477152–56478177ZNF667-AS156,477,59015.3895.232,140,026
chr19:56477152–56478177ZNF66756,477,37415.3895.232,140,242
chr19:53254646–53255153ZNF67753,254,89823.0795.235,362,718
chr19:51887687–51888274ZNF649-AS151,888,02526.9295.236,729,591
chr19:51887687–51888274ZNF57751,887,98026.9295.236,729,636
chr19:36605027–36605554ZNF38236,605,31334.6190.4722,012,303
chr19:36605027–36605554ZNF52936,605,27634.6190.4722,012,340

Inactive HK-CPs found in <35% of cancer cells ranked by their distance to the end of the chromosome they reside in, e.g. telomere regions

Table 3.

Top inactive HK-CPs in cancer cell lines

HK-CP regionGene nameTSS% Cancer cell lines% Healthy cell typesDistance to the end of the chromosome
chr19:58058989–58059488ZNF13558,059,23911.5395.23558,377
chr19:57935138–57935637ZNF41857,935,38730.76100.00682,229
chr19:57708962–57709461ZNF15457,709,20411.5395.23908,412
chr19:57583726–57584309ZIK157,584,14026.92100.001,033,476
chr19:56507593–56508092ZNF47156,507,85030.7695.232,109,766
chr19:56477152–56478177ZNF667-AS156,477,59015.3895.232,140,026
chr19:56477152–56478177ZNF66756,477,37415.3895.232,140,242
chr19:53254646–53255153ZNF67753,254,89823.0795.235,362,718
chr19:51887687–51888274ZNF649-AS151,888,02526.9295.236,729,591
chr19:51887687–51888274ZNF57751,887,98026.9295.236,729,636
chr19:36605027–36605554ZNF38236,605,31334.6190.4722,012,303
chr19:36605027–36605554ZNF52936,605,27634.6190.4722,012,340
HK-CP regionGene nameTSS% Cancer cell lines% Healthy cell typesDistance to the end of the chromosome
chr19:58058989–58059488ZNF13558,059,23911.5395.23558,377
chr19:57935138–57935637ZNF41857,935,38730.76100.00682,229
chr19:57708962–57709461ZNF15457,709,20411.5395.23908,412
chr19:57583726–57584309ZIK157,584,14026.92100.001,033,476
chr19:56507593–56508092ZNF47156,507,85030.7695.232,109,766
chr19:56477152–56478177ZNF667-AS156,477,59015.3895.232,140,026
chr19:56477152–56478177ZNF66756,477,37415.3895.232,140,242
chr19:53254646–53255153ZNF67753,254,89823.0795.235,362,718
chr19:51887687–51888274ZNF649-AS151,888,02526.9295.236,729,591
chr19:51887687–51888274ZNF57751,887,98026.9295.236,729,636
chr19:36605027–36605554ZNF38236,605,31334.6190.4722,012,303
chr19:36605027–36605554ZNF52936,605,27634.6190.4722,012,340

Inactive HK-CPs found in <35% of cancer cells ranked by their distance to the end of the chromosome they reside in, e.g. telomere regions

Inactive HK-CPs in cancer cell lines. HK-CP regions were compared with a set of 21 healthy cell types (healthy-validate) and a set of 26 cancer cell lines. (A) While most of the HK-CPs were found active in >90% of the healthy cell types (indicated by the gray dashed line), cancer cells presented a considerable number of inactive HK-CPs; 1354 HK-CPs were found in <90% of the cancer cells. (B) The expression levels of genes which core promoter is a HK-CP were compared between cancer and healthy samples. In general, there is a highly significant correlation between expression levels in healthy and cancer samples. However, the set of genes which HK-CPs were found active in <35% of cancer cells (red-colored dots) presented lower levels of expression on cancer samples. Because of their inactivity in cancer, the genes related to the top 3 inactive HK-CRs are investigated as putative housekeeping tumor suppressor genes: ZNF154, ZNF135, ZNF667 and ZNF667-AS1. (C) An increased methylation fraction of the promoter regions of putative housekeeping tumor suppressor genes is observed in cancer cells (left), which is related to a decrease in their gene expression levels (right). Further analysis of these genes in cancer samples from TCGA revealed significant changes among deceased conditions in diverse cancer subtypes such as pancreas adenocarcinoma (D) and uveal melanoma (E). (F, G) Independent survival analyses on these projects revealed a significantly higher survival probability of samples with high mean expression of these genes. (H) Different patterns of expression levels of samples from distinct projects might suggest cancer specific regulation of these genes. (I) After normalizing the differences between samples from different projects, a joint survival analysis including >5000 samples from 17 different cancer subtypes is performed. A relatively significant increased survival probability is observed in samples with a high expression of the putative tumor suppressor genes.
Figure 4.

Inactive HK-CPs in cancer cell lines. HK-CP regions were compared with a set of 21 healthy cell types (healthy-validate) and a set of 26 cancer cell lines. (A) While most of the HK-CPs were found active in >90% of the healthy cell types (indicated by the gray dashed line), cancer cells presented a considerable number of inactive HK-CPs; 1354 HK-CPs were found in <90% of the cancer cells. (B) The expression levels of genes which core promoter is a HK-CP were compared between cancer and healthy samples. In general, there is a highly significant correlation between expression levels in healthy and cancer samples. However, the set of genes which HK-CPs were found active in <35% of cancer cells (red-colored dots) presented lower levels of expression on cancer samples. Because of their inactivity in cancer, the genes related to the top 3 inactive HK-CRs are investigated as putative housekeeping tumor suppressor genes: ZNF154, ZNF135, ZNF667 and ZNF667-AS1. (C) An increased methylation fraction of the promoter regions of putative housekeeping tumor suppressor genes is observed in cancer cells (left), which is related to a decrease in their gene expression levels (right). Further analysis of these genes in cancer samples from TCGA revealed significant changes among deceased conditions in diverse cancer subtypes such as pancreas adenocarcinoma (D) and uveal melanoma (E). (F, G) Independent survival analyses on these projects revealed a significantly higher survival probability of samples with high mean expression of these genes. (H) Different patterns of expression levels of samples from distinct projects might suggest cancer specific regulation of these genes. (I) After normalizing the differences between samples from different projects, a joint survival analysis including >5000 samples from 17 different cancer subtypes is performed. A relatively significant increased survival probability is observed in samples with a high expression of the putative tumor suppressor genes.

Section 5. Potential housekeeping tumor suppressor genes

We further investigated the genes with the top 3 inactive core promoters: ZNF154, ZNF135, ZNF667-AS1 and ZNF667, using public data from commonly used cancer databases. We performed a pan-cancer analysis, e.g. joint analysis of diverse tumor samples, using the TNMplot web tool (37) for differential gene expression analysis of tumor versus normal samples. We observed similar results to those presented in the previous section (Figure 4B), where the expression levels of these four genes were lower in tumor samples in various cancer subtypes (Supplementary Figures S14 and S15). To confirm the effect on the core promoter regions of these genes in cancer, we compared the methylation fraction around the promoter regions (Figure 4C, left) and the expression levels of these genes (Figure 4C, right) for diverse cancer subtypes from the DepMap portal (26); expression levels of ZNF667-AS1 were not available at the time of the analysis. Surprisingly, the promoter regions were significantly more methylated in cancer samples, whereas normal samples presented almost no methylation. The aberrant methylation of these HK-CPs explains, to some degree, their observed inactivity (see Section 4 and Figure 4A) and the low expression levels of the genes they regulate in cancer samples (Figure 4B, C right, and Supplementary Figures S14 and S15).

Because of the ubiquitous activity of their core promoters in healthy cell types and their inactivity in most of the investigated cancer cell lines (due to aberrant methylation), we hypothesized that these four genes: ZNF154, ZNF135, ZNF667-AS1 and ZNF667, could be related to housekeeping tumor suppression functions. Surprisingly, the tumor suppression capacity of these genes has already been observed for different cancer subtypes in diverse tissues: non-small cell lung cancer, renal cancer carcinoma, thyroid cancer, nasopharyngeal carcinoma, squamous cell carcinoma, uveal carcinoma, etc. (38–62). Moreover, the attenuation of the anti-tumor capacity of these genes has been related to the aberrant methylation of their core promoters in individual (53,56,62) and pan-cancer analyses (57,60,61).

To further confirm our hypothesis of the existence of housekeeping tumor suppressors, we performed a comprehensive data analysis of 17 cancer projects from The Cancer Genome Atlas Program (TCGA) (Supplementary Table S4). We first compared the gene expression levels of samples, e.g. samples from patients diagnosed with cancer, from different deceased conditions across projects (Supplementary Figure S16). While most of the projects presented non-significant variations, the UVM (eye and adnexa; uveal melanoma) and PAAD (pancreas; pancreatic adenocarcinoma) projects presented significant distinctions (Figures 4D, E and Supplementary Figure S16), where the expression levels of the putative tumor suppressor genes were higher in non-deceased samples, suggesting a higher survival probability of patients from these samples. Then, for each project, we performed a joint survival analysis (63) of the four genes with samples stratified into two groups: strata = HIGH, samples with a mean expression level (see Methods) higher than 75% of the samples; strata = LOW, others. As expected, the survival analyses of PAAD and UVM (Figures 4F and G, respectively) showed a significantly higher survival probability of HIGH samples. Interestingly, even though the p-values were not significant, we observed HIGH samples to present a higher survival probability in several projects, e.g. BRCA (breast; breast invasive carcinoma), LUAD (bronchus and lung; lung adenocarcinoma), and ACC (adrenal gland; adrenocortical carcinoma; Supplementary Figures S17 and S18). However, some projects, such as the STAD (stomach; stomach adenocarcinoma) and SKCM (skin; skin cutaneous melanoma), presented an inverse effect, where LOW samples showed a higher survival probability (Supplementary Figure S19). Because of the significantly lower expressions of ZNF135, ZNF667 and ZNF667-AS1 genes in samples from stomach cancer as compared with healthy samples (Supplementary Figures S14 and S15), we argue that this inverse effect in STAD and SKCM projects does not entirely contradict our hypothesis of potential housekeeping tumor suppressor genes; instead, it might suggest cancer-specific changes on these genes driven by other latent variables (See Discussion and Supplementary Figure S20). To exemplify this cancer-specific effect, we compared the expression levels of samples from PAAD and UVM (projects that agree with our hypothesis) with samples from SKCM and STAD (projects that disagree with our hypothesis), where for most of the comparisons (Figure 4H) we observed a significantly lower expression of samples from the latter projects. Given the potential cancer-specific changes of putative housekeeping tumor suppressor genes, we normalized the expression levels of each gene across projects (see Materials and methods) to perform a joint survival analysis of the 17 TCGA projects. Figure 4I shows the results of the analysis where HIGH samples (n = 1,332), e.g. samples with a mean expression level (see Methods) higher than 75% of the samples, presented a significantly higher survival probability as compared with LOW samples (n = 3,964; independent analyses for each gene are shown in Supplementary Figure S19). Together, these results suggest the existence of active housekeeping core promoters related to housekeeping tumor-suppressor genes that, when methylated, may culminate in the development of cancer regardless of their malignancy phenotype variant.

Discussion

Our analysis unveils a key housekeeping element of the gene regulatory mechanism that, to our knowledge, hasn’t been described before. Through several bioinformatics analyses leveraging diverse public epigenetic and gene expression datasets, we reveal the existence of housekeeping cis-regulatory elements (HK-CREs) that are mainly located within core promoter regions and are not necessarily related to housekeeping genes (HKG), e.g. they are not the core promoters or promoters of HKG. We defined HK-CREs by randomly selecting a substantial number (n = 50) of healthy cell types to minimize biases in our analysis, ensuring a representative sample. Moreover, besides using a validation dataset with 21 healthy cell types and a set of 10,000 negative samples to assess our findings, we proved in distinct ways the robust location within the genome of the HK-CREs. For instance, we argue that the probability that most of the overlapping regions from 50 cell types intersect within the loci of core promoters (median distance to TSS = 56 bp) is exceptionally low. In addition, our analysis using a methylation atlas from healthy cells (21) showed that HK-CREs are located in dense unmethylated CpG loci for multiple cell types, which can’t just occur by chance. Therefore, we feel confident about the existence and robust location of the HK-CREs.

When we first observed that most of the HK-CREs overlapped with core promoter regions, we first thought that we captured the core promoters of HKG. However, the HKG only represented <20% of the genes in our list, e.g. genes which core promoter is a HK-CRE. Our immediate conjecture was that the remaining genes in our list would be genes that have not been correctly annotated as HKG. However, our comparison of gene expression levels showed significant differences between both groups. These results do not contradict our finding of ubiquitously active CREs, as their definition is not based on their transcription levels, rather, they might suggest that a housekeeping core promoter of a lowly transcribed gene could still play a regulatory role for promoter regions of actively transcribing genes regardless of whether they are housekeeping or cell type-specific genes. Nonetheless, we acknowledge the existence of potential biases in our datasets, including the list of housekeeping genes (HKGs), and other technical challenges inherent to the annotation processes (refer to Supplementary Note S1 for more details). Further experimental validations with higher sequencing depth may clarify the distinction between these two sets of genes that seem to complement each other in housekeeping biological processes in the cell.

In this manuscript, we focused on exploring and characterizing HK-CREs instead of the interactions predicted by the ABC method. We observed intricate cooperative interactions between HK-CPs, where various HK-CPs exhibit interactions with one another and with the genes in their proximity. This suggests the presence of cis-regulatory hubs with promoter–promoter interactions in their core forming complex regulation networks. We are aware of the importance of these intriguing promoter–promoter interactions not only for regulating HKGs but also for cell type-specific genes. Nonetheless, these interactions come from predictions and must be verified experimentally, but we believe these two results can be treated separately without conflict.

Regarding transcription factor binding motifs found in HK-CREs, we observed an enrichment of common GC-rich motifs, e.g. SP1 and SP2. As this information somehow overlaps with the conservation and enrichment of CpG sites shown in Figure 1D, we did not explore the enriched motifs further. However, together with appropriate experimental validations, differential analyses of enriched motifs across annotations of HK and CTS-CREs, e.g. core promoters vs. promoter regions, could reveal important motifs for the complex enhancer grammar at different levels.

In our analysis of inactive housekeeping core promoters (HK-CPs) in cancer cell lines (see Section 4), we tried to avoid bias toward the 26 cell lines in the ABC dataset by focusing on the top inactive HK-CPs from a list of 1,354 elements. Moreover, our investigation revealed that the most affected elements group together in the telomere of chromosome 19, which proves, to some degree, the robustness of our results. In addition, for the top 3 inactive HK-CPs in cancer cells, we confirmed their aberrant methylation and the decreased expression levels of the genes they regulate through comprehensive analysis using data from diverse public sources as ENCODE, DepMap, and the web tool TNMplot, obtaining a general agreement among results.

Finally, for the set of putative housekeeping tumor suppressor genes (ZNF135, ZNF154, ZNF667 and ZNF667-AS1), e.g. genes whose HK-CPs are active in less than 20% of the cancer cells, we performed a comprehensive analysis using cancer samples of 17 projects from TCGA. Even though we observed an increased survival probability of samples with high expression levels of these genes in diverse projects, only two were significant: PAAD (pancreas adenocarcinoma) and UVM (uveal melanoma). However, we argue the existence of latent cancer-specific mechanisms that affect these genes which must be considered. For instance, we observed interesting cancer-specific dynamics in the expression levels of these genes across samples from different cancer stages (Supplementary Figure S20). For example, the expression levels from PAAD samples showed a decline from the early to intermediate stages, with a final increased expression in the late stages. These dynamic changes might be related to other biological processes of these genes and the inherent cancer-specific necessity to down/up-regulate them during different stages. Further analysis including samples from the same patient at different stages might reveal the intrinsic mechanism behind these intriguing changes.

Together, even though our results should be verified experimentally, we are confident that they continue to hold significant relevance for pan-cancer studies concerning the characterization of housekeeping cis-regulatory elements involved in tumor suppression.

Data availability

The data underlying this article were derived from sources in the public domain:

The code used in this research is publicly available as a public repository on GitHub; https://github.com/MartinLoza/Housekeeping-Core-Promoters. An R data frame of the annotated CREs is publicly available through a figshare repository; https://doi.org/10.6084/m9.figshare.24589137. Any additional information generated in this paper is available from the corresponding author upon reasonable request.

Supplementary data

Supplementary Data are available at NAR Online.

Acknowledgements

The authors thank Prof. Katsuhiko Shirahige and Dr Takashi Fukaya for their fantastic discussion and invaluable comments.

Author contributions: Martin Loza: Conceptualization, Data Curation, Formal analysis, Investigation, Methodology, Validation, Software, Writing—original draft, Visualization. Alexis Vandenbon: Conceptualization, Methodology, Supervision, Writing—review & editing. Kenta Nakai: Conceptualization, Methodology, Supervision, Writing—review & editing.

Funding

Grants-in-Aid for Scientific Research of the Japan Society for the Promotion of Science [22K06189 to K.N., JP22K21301 to M.L.]. Funding for open access charge: Lab budget from University of Tokyo.

Conflict of interest statement. None declared.

References

1.

Nakai
K.
,
Vandenbon
A.
Epigenetics in Organ Specific Disorders
.
2023
;
Elsevier
11
32
.

2.

Andersson
R.
,
Sandelin
A.
Determinants of enhancer and promoter activities of regulatory elements
.
Nat. Rev. Genet.
2020
;
21
:
71
87
.

3.

Encode Project Consortium
An integrated encyclopedia of DNA elements in the human genome
.
Nature
.
2012
;
489
:
57
74
.

4.

Encode Project Consortium
Moore
J.E.
,
Purcaro
M.J.
,
Pratt
H.E.
,
Epstein
C.B.
,
Shoresh
N.
,
Adrian
J.
,
Kawli
T.
,
Davis
C.A.
,
Dobin
A
et al. .
Expanded encyclopaedias of DNA elements in the human and mouse genomes
.
Nature
.
2020
;
583
:
699
710
.

5.

Dao
L.T.M.
,
Galindo-Albarran
A.O.
,
Castro-Mondragon
J.A.
,
Andrieu-Soler
C.
,
Medina-Rivera
A.
,
Souaid
C.
,
Charbonnier
G.
,
Griffon
A.
,
Vanhille
L.
,
Stephen
T.
et al. .
Genome-wide characterization of mammalian promoters with distal enhancer functions
.
Nat. Genet.
2017
;
49
:
1073
1081
.

6.

Li
G.
,
Ruan
X.
,
Auerbach
R.K.
,
Sandhu
K.S.
,
Zheng
M.
,
Wang
P.
,
Poh
H.M.
,
Goh
Y.
,
Lim
J.
,
Zhang
J.
et al. .
Extensive promoter-centered chromatin interactions provide a topological basis for transcription regulation
.
Cell
.
2012
;
148
:
84
98
.

7.

Santiago-Algarra
D.
,
Souaid
C.
,
Singh
H.
,
Dao
L.T.M.
,
Hussain
S.
,
Medina-Rivera
A.
,
Ramirez-Navarro
L.
,
Castro-Mondragon
J.A.
,
Sadouni
N.
,
Charbonnier
G.
et al. .
Epromoters function as a hub to recruit key transcription factors required for the inflammatory response
.
Nat. Commun.
2021
;
12
:
6660
.

8.

Dejosez
M.
,
Dall’Agnese
A.
,
Ramamoorthy
M.
,
Platt
J.
,
Yin
X.
,
Hogan
M.
,
Brosh
R.
,
Weintraub
A.S.
,
Hnisz
D.
,
Abraham
B.J.
et al. .
Regulatory architecture of housekeeping genes is driven by promoter assemblies
.
Cell Rep.
2023
;
42
:
112505
.

9.

Zabidi
M.A.
,
Arnold
C.D.
,
Schernhuber
K.
,
Pagani
M.
,
Rath
M.
,
Frank
O.
,
Stark
A.
Enhancer-core-promoter specificity separates developmental and housekeeping gene regulation
.
Nature
.
2015
;
518
:
556
559
.

10.

Nasser
J.
,
Bergman
D.T.
,
Fulco
C.P.
,
Guckelberger
P.
,
Doughty
B.R.
,
Patwardhan
T.A.
,
Jones
T.R.
,
Nguyen
T.H.
,
Ulirsch
J.C.
,
Lekschas
F.
et al. .
Genome-wide enhancer maps link risk variants to disease genes
.
Nature
.
2021
;
593
:
238
243
.

11.

Fulco
C.P.
,
Nasser
J.
,
Jones
T.R.
,
Munson
G.
,
Bergman
D.T.
,
Subramanian
V.
,
Grossman
S.R.
,
Anyoha
R.
,
Doughty
B.R.
,
Patwardhan
T.A.
et al. .
Activity-by-contact model of enhancer–promoter regulation from thousands of CRISPR perturbations
.
Nat. Genet.
2019
;
51
:
1664
1669
.

12.

Zhao
H.
,
Sun
Z.
,
Wang
J.
,
Huang
H.
,
Kocher
J.P.
,
Wang
L.
CrossMap: a versatile tool for coordinate conversion between genome assemblies
.
Bioinformatics
.
2014
;
30
:
1006
1007
.

13.

Cunningham
F.
,
Allen
J.E.
,
Allen
J.
,
Alvarez-Jarreta
J.
,
Amode
M.R.
,
Armean
I.M.
,
Austine-Orimoloye
O.
,
Azov
A.G.
,
Barnes
I.
,
Bennett
R.
et al. .
Ensembl 2022
.
Nucleic Acids Res.
2022
;
50
:
D988
D995
.

14.

Subelj
L.
,
Bajec
M.
Unfolding communities in large complex networks: combining defensive and offensive label propagation for core extraction
.
Phys. Rev. E Stat. Nonlin. Soft Matter Phys.
2011
;
83
:
036103
.

15.

Csardi
G.
,
Nepusz
T.
The igraph software package for complex network research
.
InterJournal
.
2006
;
Complex Systems
:
1695
.

16.

Haider
S.
,
Waggott
D.
,
Lalonde
E.
,
Fung
C.
,
Liu
F.F.
,
Boutros
P.C.
A bedr way of genomic interval processing
.
Source Code Biol. Med.
2016
;
11
:
14
.

17.

Lawrence
M.
,
Huber
W.
,
Pages
H.
,
Aboyoun
P.
,
Carlson
M.
,
Gentleman
R.
,
Morgan
M.T.
,
Carey
V.J.
Software for computing and annotating genomic ranges
.
PLoS Comput. Biol.
2013
;
9
:
e1003118
.

18.

McInnes
L.
,
Healy
J.
,
Melville
J.
Umap: uniform manifold approximation and projection for dimension reduction
.
2018
;
arXiv doi:
18 September 2020, preprint: not peer reviewed
https://arxiv.org/abs/1802.03426.

19.

Zoonomia
C.
A comparative genomics multitool for scientific discovery and conservation
.
Nature
.
2020
;
587
:
240
245
.

20.

Sherman
B.T.
,
Hao
M.
,
Qiu
J.
,
Jiao
X.
,
Baseler
M.W.
,
Lane
H.C.
,
Imamichi
T.
,
Chang
W.
DAVID: a web server for functional enrichment analysis and functional annotation of gene lists (2021 update)
.
Nucleic Acids Res.
2022
;
50
:
W216
W221
.

21.

Loyfer
N.
,
Magenheim
J.
,
Peretz
A.
,
Cann
G.
,
Bredno
J.
,
Klochendler
A.
,
Fox-Fisher
I.
,
Shabi-Porat
S.
,
Hecht
M.
,
Pelet
T.
et al. .
A DNA methylation atlas of normal human cell types
.
Nature
.
2023
;
613
:
355
364
.

22.

Pohl
A.
,
Beato
M.
bwtool: a tool for bigWig files
.
Bioinformatics
.
2014
;
30
:
1618
1619
.

23.

Soneson
C.
,
Love
M.I.
,
Robinson
M.D.
Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences
.
F1000Res
.
2015
;
4
:
1521
.

24.

Carlson
M.
,
Falcon
S.
,
Pages
H.
,
Li
N.
org. Hs. eg. db: genome wide annotation for Human
.
R Package Version
.
2019
;
3
:
3
.

25.

Pages
H.
,
Carlson
M.
,
Falcon
S.
,
Li
N.
,
Maintainer
M.B.P.
AnnotationDbi: Manipulation of SQLite-based annotations in Bioconductor
.
2023
;
Bioconductor Packag. Maint.
https://doi.org/10.18129/B9.bioc.AnnotationDb.

26.

Barretina
J.
,
Caponigro
G.
,
Stransky
N.
,
Venkatesan
K.
,
Margolin
A.A.
,
Kim
S.
,
Wilson
C.J.
,
Lehar
J.
,
Kryukov
G.V.
,
Sonkin
D.
et al. .
The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity
.
Nature
.
2012
;
483
:
603
607
.

27.

Colaprico
A.
,
Silva
T.C.
,
Olsen
C.
,
Garofano
L.
,
Cava
C.
,
Garolini
D.
,
Sabedot
T.S.
,
Malta
T.M.
,
Pagnotta
S.M.
,
Castiglioni
I.
et al. .
TCGAbiolinks: an R/bioconductor package for integrative analysis of TCGA data
.
Nucleic Acids Res.
2016
;
44
:
e71
.

28.

Therneau
T.
A package for survival analysis in S
.
2015
;
R Package Version 2
.

29.

Kassambara
A.
,
Kosinski
M.
,
Biecek
P.
,
Fabian
S.
survminer: drawing survival curves using ‘ggplot2’
.
2017
;
R Package Version 0.3, 1
.

30.

Kassambara
A.
ggpubr:“ggplot2” based publication ready plots
.
2020
;
R Package Version 0.4 0, 438
.

31.

Wilke
C.O.
Ggridges: ridgeline plots in’ggplot2’
.
2018
;
R Package Version 0.5, 1, 483
.

32.

Zhu
J.
,
He
F.
,
Hu
S.
,
Yu
J.
On the nature of human housekeeping genes
.
Trends Genet.
2008
;
24
:
481
484
.

33.

Deaton
A.M.
,
Bird
A.
CpG islands and the regulation of transcription
.
Genes Dev.
2011
;
25
:
1010
1022
.

34.

Pearson
K.
LIII. On lines and planes of closest fit to systems of points in space
.
London Edinburgh Dublin Philos. Mag. J. Sci.
1901
;
2
:
559
572
.

35.

Hounkpe
B.W.
,
Chenou
F.
,
de Lima
F.
,
De Paula
E.V.
HRT Atlas v1.0 database: redefining human and mouse housekeeping genes and candidate reference transcripts by mining massive RNA-seq datasets
.
Nucleic Acids Res.
2021
;
49
:
D947
D955
.

36.

Ashburner
M.
,
Ball
C.A.
,
Blake
J.A.
,
Botstein
D.
,
Butler
H.
,
Cherry
J.M.
,
Davis
A.P.
,
Dolinski
K.
,
Dwight
S.S.
,
Eppig
J.T.
et al. .
Gene ontology: tool for the unification of biology. The Gene Ontology Consortium
.
Nat. Genet.
2000
;
25
:
25
29
.

37.

Bartha
A.
,
Gyorffy
B.
TNMplot.Com: a web tool for the comparison of gene expression in normal, tumor and metastatic tissues
.
Int. J. Mol. Sci.
2021
;
22
:
2622
2634
.

38.

Ma
X.
,
Yu
S.
,
Zhao
B.
,
Bai
W.
,
Cui
Y.
,
Ni
J.
,
Lyu
Q.
,
Zhao
J.
Development and validation of a novel ferroptosis-related LncRNA signature for predicting prognosis and the immune landscape features in uveal melanoma
.
Front. Immunol.
2022
;
13
:
922315
.

39.

Wei
L.
,
Gu
W.
,
Hu
L.
,
Wang
K.
,
Huang
H.
,
Shen
Y.
Regulation of IncRNA ZNF667-AS1 in proliferation and invasion of esophageal squamous cell carcinoma cells via mediating ceRNA network
.
Crit. Rev. Eukaryot. Gene Expr.
2022
;
32
:
57
68
.

40.

Yu
C.
,
Chen
W.
,
Cai
Y.
,
Du
M.
,
Zong
D.
,
Qian
L.
,
Jiang
X.
,
Zhu
H.
The lncRNA ZNF667-AS1 inhibits propagation, invasion, and angiogenesis of gastric cancer by silencing the expression of N-cadherin and VEGFA
.
J. Oncol.
2022
;
2022
:
3579547
.

41.

Zheng
Y.J.
,
Liang
T.S.
,
Wang
J.
,
Zhao
J.Y.
,
Zhai
S.N.
,
Yang
D.K.
,
Wang
L.D.
Long non-coding RNA ZNF667-AS1 retards the development of esophageal squamous cell carcinoma via modulation of microRNA-1290-mediated PRUNE2
.
Transl. Oncol.
2022
;
21
:
101371
.

42.

Yang
H.
,
Cai
M.Y.
,
Rong
H.
,
Ma
L.R.
,
Xu
Y.L.
ZNF667-AS1, a positively regulating MEGF10, inhibits the progression of uveal melanoma by modulating cellular aggressiveness
.
J. Biochem. Mol. Toxicol.
2021
;
35
:
e22732
.

43.

Zhao
J.
,
Cui
Z.
,
Dong
Z.
,
Niu
Y.
,
Yang
K.
,
Han
H.
IncRNA ZNF667-AS1 suppresses epithelial mesenchymal transformation by targeting TGF-β1 in oral squamous cell carcinoma
.
Clin. Lab.
2021
;
67
:
1680
1690
.

44.

Zhuang
L.
,
Ding
W.
,
Zhang
Q.
,
Xu
X.
,
Xi
D
lncRNA ZNF667-AS1 (NR_036521.1) inhibits the progression of colorectal cancer via regulating ANK2/JAK2 expression
.
J. Cell. Physiol.
2021
;
236
:
2178
2193
.

45.

Wang
N.
,
Feng
Y.
,
Xie
J.
,
Han
H.
,
Dong
Q.
,
Wang
W.
Long non-coding RNA ZNF667-AS1 knockdown curbs liver metastasis in acute myeloid leukemia by regulating the microRNA-206/AKAP13 axis
.
Cancer Manag. Res.
2020
;
12
:
13285
13300
.

46.

Yuan
Q.
,
Gao
C.
,
Lai
X.D.
,
Chen
L.Y.
,
Lai
T.B.
Analysis of long noncoding RNA ZNF667-AS1 as a potential biomarker for diagnosis and prognosis of glioma patients
.
Dis. Markers
.
2020
;
2020
:
8895968
.

47.

Dong
Z.
,
Li
S.
,
Wu
X.
,
Niu
Y.
,
Liang
X.
,
Yang
L.
,
Guo
Y.
,
Shen
S.
,
Liang
J.
,
Guo
W.
Aberrant hypermethylation-mediated downregulation of antisense lncRNA ZNF667-AS1 and its sense gene ZNF667 correlate with progression and prognosis of esophageal squamous cell carcinoma
.
Cell Death. Dis.
2019
;
10
:
930
.

48.

Li
Y.J.
,
Yang
Z.
,
Wang
Y.Y.
,
Wang
Y.
Long noncoding RNA ZNF667-AS1 reduces tumor invasion and metastasis in cervical cancer by counteracting microRNA-93-3p-dependent PEG3 downregulation
.
Mol. Oncol.
2019
;
13
:
2375
2392
.

49.

Meng
W.
,
Cui
W.
,
Zhao
L.
,
Chi
W.
,
Cao
H.
,
Wang
B.
Aberrant methylation and downregulation of ZNF667-AS1 and ZNF667 promote the malignant progression of laryngeal squamous cell carcinoma
.
J. Biomed. Sci.
2019
;
26
:
13
.

50.

Zhao
L.P.
,
Li
R.H.
,
Han
D.M.
,
Zhang
X.Q.
,
Nian
G.X.
,
Wu
M.X.
,
Feng
Y.
,
Zhang
L.
,
Sun
Z.G.
Independent prognostic factor of low-expressed LncRNA ZNF667-AS1 for cervical cancer and inhibitory function on the proliferation of cervical cancer
.
Eur. Rev. Med. Pharmacol. Sci.
2017
;
21
:
5353
5360
.

51.

Muthamilselvan
S.
,
Raghavendran
A.
,
Palaniappan
A.
Stage-differentiated ensemble modeling of DNA methylation landscapes uncovers salient biomarkers and prognostic signatures in colorectal cancer progression
.
PLoS One
.
2022
;
17
:
e0249151
.

52.

Xi
T.
,
Zhang
G.
Epigenetic regulation on the gene expression signature in esophagus adenocarcinoma
.
Pathol. Res. Pract.
2017
;
213
:
83
88
.

53.

Ito
Y.
,
Usui
G.
,
Seki
M.
,
Fukuyo
M.
,
Matsusaka
K.
,
Hoshii
T.
,
Sata
Y.
,
Morimoto
J.
,
Hata
A.
,
Nakajima
T.
et al. .
Association of frequent hypermethylation with high grade histological subtype in lung adenocarcinoma
.
Cancer Sci.
2023
;
114
:
3003
3013
.

54.

Pearson
P.
,
Smith
K.
,
Sood
N.
,
Chia
E.
,
Follett
A.
,
Prystowsky
M.B.
,
Kirby
S.
,
Belbin
T.J.
Kruppel-family zinc finger proteins as emerging epigenetic biomarkers in head and neck squamous cell carcinoma
.
J. Otolaryngol. Head Neck Surg.
2023
;
52
:
41
.

55.

He
J.
,
Huang
J.
,
Tang
G.
,
Wang
P.
,
He
M.
,
Wei
S.
Low expression of ZNF154 is related to poor prognosis in gastric cancer
.
Cancer Manag Res.
2022
;
14
:
659
672
.

56.

He
W.
,
Lin
S.
,
Guo
Y.
,
Wu
Y.
,
Zhang
L.L.
,
Deng
Q.
,
Du
Z.M.
,
Wei
M.
,
Zhu
W.
,
Chen
W.J.
et al. .
Targeted demethylation at ZNF154 promotor upregulates ZNF154 expression and inhibits the proliferation and migration of esophageal squamous carcinoma cells
.
Oncogene
.
2022
;
41
:
4537
4546
.

57.

Miller
B.F.
,
Petrykowska
H.M.
,
Elnitski
L.
Assessing ZNF154 methylation in patient plasma as a multicancer marker in liquid biopsies from colon, liver, ovarian and pancreatic cancer patients
.
Sci. Rep.
2021
;
11
:
221
.

58.

Noruzinia
M.
,
Tavakkoly-Bazzaz
J.
,
Ahmadvand
M.
,
Azimi
F.
,
Dehghanifard
A.
,
Khakpour
G.
Young breast cancer: novel gene methylation in WBC
.
Asian Pac. J. Cancer Prev.
2021
;
22
:
2371
2375
.

59.

Hu
Y.
,
Qi
M.F.
,
Xu
Q.L.
,
Kong
X.Y.
,
Cai
R.
,
Chen
Q.Q.
,
Tang
H.Y.
,
Jiang
W.
Candidate tumor suppressor ZNF154 suppresses invasion and metastasis in NPC by inhibiting the EMT via wnt/β-catenin signalling
.
Oncotarget
.
2017
;
8
:
85749
85758
.

60.

Margolin
G.
,
Petrykowska
H.M.
,
Jameel
N.
,
Bell
D.W.
,
Young
A.C.
,
Elnitski
L.
Robust detection of DNA hypermethylation of ZNF154 as a pan-cancer locus with in silico modeling for blood-based diagnostic development
.
J. Mol. Diagn.
2016
;
18
:
283
298
.

61.

Sánchez-Vega
F.
,
Gotea
V.
,
Petrykowska
H.M.
,
Margolin
G.
,
Krivak
T.C.
,
DeLoia
J.A.
,
Bell
D.W.
,
Elnitski
L.
Recurrent patterns of DNA methylation in the ZNF154, CASP8, and VHL promoters across a wide spectrum of human solid epithelial tumors and cancer cell lines
.
Epigenetics
.
2013
;
8
:
1355
1372
.

62.

Reinert
T.
,
Borre
M.
,
Christiansen
A.
,
Hermann
G.G.
,
Ørntoft
T.F.
,
Dyrskjøt
L.
Diagnosis of bladder cancer recurrence based on urinary levels of EOMES, HOXA9, POU4F2, TWIST1, VIM, and ZNF154 hypermethylation
.
PLoS One
.
2012
;
7
:
e46297
.

63.

Dinse
G.E.
,
Lagakos
S.W.
Nonparametric estimation of lifetime and disease onset distributions from incomplete observations
.
Biometrics
.
1982
;
38
:
921
932
.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Supplementary data

Comments

0 Comments
Submit a comment
You have entered an invalid code
Thank you for submitting a comment on this article. Your comment will be reviewed and published at the journal's discretion. Please check for further notifications by email.