Epigenetic characterization of housekeeping core promoters and their importance in tumor suppression

Abstract In this research, we elucidate the presence of around 11,000 housekeeping cis-regulatory elements (HK-CREs) and describe their main characteristics. Besides the trivial promoters of housekeeping genes, most HK-CREs reside in promoter regions and are involved in a broader role beyond housekeeping gene regulation. HK-CREs are conserved regions rich in unmethylated CpG sites. Their distribution highly correlates with that of protein-coding genes, and they interact with many genes over long distances. We observed reduced activity of a subset of HK-CREs in diverse cancer subtypes due to aberrant methylation, particularly those located in chromosome 19 and associated with zinc finger genes. Further analysis of samples from 17 cancer subtypes showed a significantly increased survival probability of patients with higher expression of these genes, suggesting them as housekeeping tumor suppressor genes. Overall, our work unravels the presence of housekeeping CREs indispensable for the maintenance and stability of cells.


Introduction
Characterizing the different components within the complex transcriptional regulatory mechanism has become a longstanding topic of research ( 1 ,2 ).Thanks to the advances in sequencing techniques and experimental protocols, it is possible to investigate different cis-regulatory elements (CREs), e.g.enhancers and promoters, and their differences and similitudes in numerous cell types with unprecedented resolution ( 3 ,4 ).These advances have also made clear that classical discrete definitions of CREs do not necessarily fit the intricate interplay between them.For example, several studies have described CREs that may have both enhancer and promoter functions with similar epigenetic features to either classical marks of enhancers or promoters ( 2 ,5 ).In this way, promoters with enhancer capabilities could drive the activation of neighboring genes through promoter-promoter interactions (5)(6)(7).
In a recent study, contrary to the classical enhancer-promoter interactions on the regulation of cell type-specific genes, it was demonstrated that housekeeping genes (HKG) are mainly regulated through promoter-promoter interactions ( 8 ), delineating the importance of this kind of interplay on the maintenance of cells.These new insights into the regulation of HKG and the intrinsic specificity of the core promoters of HKG to certain gene regulators ( 9 ) suggest the existence of ubiquitously active CREs besides the promoters of HKG.However, to our knowledge, the existence of these CREs that are active in all cell types has not been described.In this study, we leveraged the activity by contact (ABC) dataset ( 10 ), which provides a list of active CREs and their predicted target genes from 131 cell types and tissues, to investigate ubiquitously active CREs in healthy cell types.We elucidate the existence of such housekeeping CREs (HK-CREs), e.g.CREs active in at least 90% of the healthy cell types, and describe their main features.Through several bioinformatic analyses of multiomics data, we show that HK-CREs resided in highly conserved regions that are rich in unmethylated CpG sites, and, similar to HKG ( 8 ), they reside in specific loci within the genome as compared with cell type-specific CREs.In conformity with current knowledge on the regulation of HKG ( 8 ,9 ), we show that most of these ubiquitously active CREs localize in core promoter regions and interact with neighboring genes in complex longdistance promoter-promoter interactions.However, contrary to expectations, our analysis revealed that the core promoters of HKG, which are inherently assumed to be active in every cell type, were not the only ubiquitously active CREs; the core promoters of HKG only accounted for around 18% of the total number of HK-CREs.We then compared the biological processes of HKG and the rest of the genes with ubiquitously active core promoters revealing an intrinsic association between these sets of genes, e.g. both groups are enriched in similar or identical housekeeping functions, which suggests their complementarity to ensure the proper functioning and maintenance of cells.Finally, we investigated the HK-CREs in diverse cancer cell lines, discovering a reduction in the activity of a subset of these elements.Interestingly, we found that the most affected HK-CREs in cancer cells reside close to the telomere region of chromosome 19 and are inherently related to zinc finger genes.Moreover, we show that while most of these HK-CREs with reduced activity were cancer-specific, there was a set of core promoters found to be inactive in numerous cancer cell lines, suggesting the existence of housekeeping tumor suppressor genes.We leveraged public data from diverse popular cancer projects and found a decrease in the expression of putative housekeeping tumor suppressor genes (ZNF154, ZNF135, ZNF667, and ZNF667-AS1) due to aberrant methylation of their core promoters in several cancer subtypes.Despite potential cancer-specific regulation of these genes during cancer progression, a joint analysis of more than 5,000 patients from 17 cancer subtypes showed a significant increase in the survival probability of patients with high expression of these genes, confirming, to some degree, the existence of housekeeping tumor suppressor genes in the human genome.
Overall, our work unveils key cis-regulatory elements active in multiple healthy cell types that, if lost, could strongly affect the housekeeping biological processes of cells and possibly culminate in cancer development.

Quality control and pre-processing of regions
First, we filtered out not significant regions with an ABC score lower than 0.015 as in the original publication ( 11 ).Then, we lifted up the regions coordinates to the hg38 human genome using the CrossMap assembly converter ( 12 ) implemented in the Ensemble website ( 13 ).
We manually filtered out cancer cell lines or stimulated cell types from the original 131 cells in the ABC dataset ( 10 ), obtaining a healthy dataset of 71 cell types and tissues ( Supplementary Table S1 ).To investigate the dependency of the housekeeping (HK) regions on the number of included cell types and on how strictly we define a region as to be HK, we merge the regions (see Merging process) with an increasing number of randomly selected cell types using a 70, 80, 90 or 100% definition of HK, e.g. a given region is an HK region if it is found in at least 70, 80, 90 or 100% of the cell types.We ob-served that the number and length of the merged HK regions stabilized after we included around 30-40 cell types with a 90 or 100% definition ( Supplementary Figure S1 ).Therefore, we randomly selected 50 healthy cell types ( Supplementary Table S1 ) and merged their regions for further analyses.The regions from the remaining 21 healthy cells were kept for validation.
After annotating the main histone marks of the merged regions (see Annotation of histone marks ), as a quality control step, we filtered out regions with histone marks similar to a set of 10,000 negative samples (see Definition of neg ativ e samples and Filtering process ).For this, we clustered regions using the Louvain algorithm ( 14 ) implemented in the igraph R package (ver.1.5.0.1) ( 15 ) using the histone marks and the distance to the nearest TSS as features (see Filtering process ).Finally, we retained 135,917 regions as putative cis-regulatory elements (see Table 1 ).

Merging process
For a given set of cell types, we merged overlapping regions following the next iterative process until no regions were left in the raw dataset: • Step 1: Merge raw overlapping regions across cell types.
We used the function bedr.merge.regionfrom the bedr R package (ver 1.0.7)( 16) with a distance of 20bps.• Step 2. Select merged regions that overlap on the maximum number of cell types.In the case of housekeeping regions, we merged and selected regions overlapping with a defined 90% percentage of the total number of cells, e.g.regions that overlap in at least 90% of the cell types.• Step 3. Filter out raw regions overlapping with the selected merged regions.We used the findOverlaps function from the GenomicRanges R package (ver 1.46.1) ( 17 ).• Step 4. Repeat Step 1 using the filtered raw regions.

Filtering process
We filtered out suspicious regions with similar features, e.g.histone signals (HS) and distance to nearest TSS (dnTSS), to the set of negative samples (NS, see Definition of neg ativ e samples ).Therefore, we performed a joint unsupervised analysis of regions and NS in an iterative way: a. Subset the regions and NS in the suspicious cluster and perform steps 1-5.b.Filter out regions in sub-clusters with a high percentage of NS, e.g.> 50% of the regions are NS.
• Repeat steps 1-6 until there are not suspicious clusters.The annotations of clusters were manually assigned using a combination of their epigenetics signals and their distance to the nearest TSS.
At each step, relevant parameters such as the resolution of clustering and the number of PCs were defined with the help of visualizations such as the elbow plot and the embedding coordinates from the uniform manifold approximation and projection (UMAP) method ( 18 ).

Definition of negative samples
We defined a set of 10,000 negative samples (NSs) as nonoverlapping regions with cis-regulatory elements (CREs) from the ABC database (131 cell types) nor the SCREEN project ( 4 ) as follows: • Merge regions from the ABC and SCREEN datasets using the reduce function from the GenomicRanges R package (ver 1.46.1) ( 17 ) with a minimum gap width of 100 bp.• Calculate the gaps between the merged regions using the gaps function from the GenomicRanges R package.• Filter out gaps located in centromeres or at the end of telomeres.• Filter out gaps less than 3Kbp long.This length intends to keep regions of 1.5Kbp which are at least 500bp away from any extended CRE (see Annotation of histone marks ). • Select 10,000 random gaps and define NS as 500bp regions located at the center of the selected gaps.

Annotation of histone marks
We annotated the H3K27ac, H3K4me1 and H3K4me3 histone marks of an extended version of the cis-regulatory elements and negative samples.The extended version includes the original regions plus 250 bp on each flank to account for adjacent histones.We downloaded ChIP-seq histone marks datasets from the ENCODE project ( 3 ) ( Supplementary Table S2 ) of samples selected as being similar to the main 50 healthy cells / tissues (see Quality control and pre-processing of regions ).Following the pre-processing used on the SCREEN project ( 4 ), we performed independent log(signal + 1) normalization and scaling of the samples.Finally, we annotated the histone marks of the extended regions as the average signal of overlapping peaks across pre-processed samples; overlapping peaks were selected as those with at least 250 bp overlap with the extended regions.

Annotation of the distance to the nearest TSS
We annotated the distance to the nearest TSS of putative cis-regulatory elements and negative samples using the EN-SEMBL transcripts annotation version GRCh38.107.For this, we calculated the absolute distance from the center of the regions to every TSS in the ENSEMBL transcripts dataset.Then, for each region, we selected the TSS with the minimum absolute distance as the nearest TSS.We also annotated the nearest gene to each region as the gene related to the nearest TSS.This gene annotation was especially important for further analyses of cis-regulatory elements labeled as core promoters.

Annotation of the conservation scores
We annotated the mean conservation scores on a fixed-length version of the cis-regulatory elements (CREs).The fixedlength CREs were defined to be 500 bp at both flanking loci from the center of each CRE.Then, we overlapped each fixedlength CRE with the conservation scores of 241 mammals from the Zoonomia project ( 19 ) using the findOverlaps function from the GenomicRanges R package (ver.1.46.1), and calculated the mean conservation score as the mean score among overlapping regions.

Gene ontology analysis
Gene ontology analysis of genes was performed using the DAVID web tool ( 20 ).Significant processes were selected to have FDR ≤ 0.05.The differential enrichment analysis was performed by calculating the fold change of −log(FDR) between housekeeping and non-housekeeping genes.Then, differentially enriched terms were selected as those with at least and absolute 20-fold change ( Supplementary Figure S8 and  S9 ).Finally, the top 10 terms were selected to produce plots in Figure 3 .

Methylation atlas
We downloaded the methylation scores from 34 randomly picked samples from the DNA methylation atlas of normal human cell types ( 21 ) ( Supplementary Table S2 ).Then, similar to the annotation of the conservation scores, we obtained overlapping regions between the methylation dataset and the set of fixed-length CREs and NS using the findOverlaps function from the GenomicRanges R package (ver.1.46.1); the fixed-length regions were defined as regions of 1Kbp concentric with every CRE and NS.Then, for each CRE and NS, we

ENCODE RNA-seq datasets
All the healthy and cancer RNA-seq data were obtained from the ENCODE project ( 3 ) ( Supplementary Table S2 ).For pre-processing and quality control of samples, we followed the next pipeline: • Load the TPM data using the tximport Bioconductor package (ver.1.22.0)( 23 ).• Filter out genes with zero TPM in all the samples.• Map the gene labels to match the gene names used in the ABC dataset.We used the org.Hs.eg.db (ver.3.14.0)and AnnotationDbi Bioconductor packages (ver 1.56.2) ( 24 ,25 ) to match the gene names with a combination of ALIAS, ENSEMBL and ENTREZID.• Natural log-normalize the TPM data.

DepMap methylation and RNA-seq data
Methylation of promoter regions and expression data of genes ZNF667, ZNF667-AS1, ZNF154 and ZNF135 were downloaded directly from the DepMap website ( 26 ); by the time we performed this research, no expression data was available for the ZNF667-AS1 gene.From the secondary_lineage column in the metadata of downloaded datasets, we re-labeled 'Non-Cancerous' samples as 'Normal' and all the other samples as 'Cancer'.These new labels were used to compare the methylation and expression levels in Figure 4 .

TCGA clinical and RNA-seq data
TCGA data were downloaded and analyzed using the TC-GAbiolinks R package (ver.2.29.6) ( 27 ).In this study, we included only projects related to a single primary site (see Supplementary Table S4 ).
For each selected TCGA project (see Supplementary Table S4 ), the clinical data was obtained by using the GDC-query_clinic function and setup as follows: • Samples were classified as deceased = TRUE | FALSE, based on their 'Alive' status from the vital_status column in the metadata.• The overall survival of samples was assigned as: if vital_status == 'Alive' then overall_survival = days_to_last_follow_up , else overall_survival = days_to_death , where days_to_last_follow_up and overall_survival are columns from the clinical metadata.
The RNA-seq data was downloaded using their corresponding query and the GDCdownload function.The TPM of genes was used in all the analyses.
To calculate the mean expression of each sample used for the survival analyses (Figures 4 F-G, Figure 4 I, and Supplementary Figures S17 -S19 ) we z -transformed the TPM expression levels of the ZNF154, ZNF135, ZNF667 and ZNF667-AS1 genes on each project.Then, a mean_expression was calculated for each sample as the mean value of the transformed expressions.

Statistical information
For the statistical comparison used in the manuscript, we used the stat_compare_means function from the ggpbur R package (ver 0.6.0)( 30 ) with the next convention for symbols indicating statistical significance:

Other analyses and plots
To analyze the density of regulatory regions, protein-coding genes, and transcripts of genes, we input the center of CREs and the TSS of transcripts and genes from ENSEMBL to the geom_density_ridges function from the ggridges R package (ver.0.5.4) ( 31 ); bandwidth = 1e6.

Section 1. Active housekeeping cis-regulatory elements in the human genome
We investigated the cis-regulatory elements (CREs) from the ABC public dataset ( 10 ).This dataset contains a genome-wide list of active CREs from 131 cell types ( Supplementary Table S1 ), identified from the combination of DNase I hypersensitivity (DNase) or assay for transposase-accessible chromatin sequencing (A T AC-seq) data, with chromatin immunoprecipitation sequencing (ChIP-seq) data for the acetylation of the lysine residue at the N-terminal position 27 of the histone H3 protein (H3K27ac).One asset of this dataset is that it provides a list of predicted interactions between CREs and their surrounding genes within 5Mbp; interactions are quantified from 3D chromatin conformation (Hi-C) data.After selecting significant CRE-gene interactions and filtering out CREs from cell types or tissues related to cancer ( n = 26) or various stimuli ( n = 34), we obtained around 8.5 million predicted interactions from around 2 million CREs in 71 healthy cell types ( Supplementary Table S1 ).After merging overlapping regions from different cell types (see Materials and Methods ), we noticed a set of housekeeping CREs (HK-CREs) active in all cell types, which number was stable provided that a sufficient number of cell types were included (see Supplementary Figure S1 and Materials and Methods ).Then, we randomly selected 50 healthy cell types for further analysis and left the remaining 21 healthy cell types for validation (see Section 4 and Supplementary Table S1 ).We then merged overlapping regions across the selected cell types and filtered out suspicious CREs with epigenetic marks similar to a set of 10,000 randomly selected negative sample (NS) regions; NS regions were defined as regions not overlapping with any CRE from the ABC and the SCREEN projects ( 4 ,10 ) (see Materials and methods ).Finally, we obtained a merged dataset with around 846,000 predicted interactions from 136,365 CREs. Figure 1 A shows the distribution of the resulting CREs across the number of cell types they are active in.While most of the CREs are active in a low number of cell types, e.g.cell typespecific (CTS-CREs), we found 10,965 HK-CREs active in at least 90% of the 50 cell types.Interestingly, HK-CREs are predicted to interact with a larger number of target genes over longer distances (median number of targets = 22, median dis- tance = 553,817bp; Figure 1 B) as compared with CTS-CREs (mean number of targets = 1, median distance = 63,411bp).Surprisingly, the HK-CREs presented a considerably high correlation with protein-coding genes in number ( r = 0.979; Supplementary Figure S2 ) and density ( r = 0.882; Figure 1 C) as compared with CTS-CREs ( r = 0.754 and r = 0.601, respectively; Supplementary Figure S3 ), implying their importance in the gene regulation process.As expected, because of their presumed housekeeping capacity ( 32 ,33 ), the loci of HK-CREs are significantly conserved and enriched in unmethylated CpG sites as compared with CTS-CREs and negative sample (NS) regions (Figure 1 D and Supplementary Figure S4  and S5 ). Figure 1 E shows an example of the conservation and methylation scores of three HK-CREs across nine tissues (tracks across 30 cell types and tissues can be seen in Supplementary Figure S6 ), where it can be observed that the loci of the HK-CREs overlap with highly conserved and unmethylated regions within a locus that is rich in CpG sites across samples.In contrast, a randomly selected CTS-CRE from T cells presents different methylation patterns across cell types within a poor CpG locus ( Supplementary Figure S6 ).

Section 2. Most of the active housekeeping cis-regulatory elements reside in core promoters
We performed an unsupervised classification analysis to further characterize and investigate the functions of house-keeping cis-regulatory elements (HK-CREs).We annotated the CREs with their main epigenetic marks (H3K27ac, H3K4me1 and H3K4me3) and their distance to the nearest transcription start site (dnTSS), using a set of ChIPseq datasets from the ENCODE project ( 3 ) ( Supplementary Table S2 ) and the TSS locations from the ENSEMBL project ( 13 ), respectively.We used these annotations in an unsupervised classification and visualization process, where we obtained the principal components (PCs) ( 34 ) coordinates of CREs and used them as the input data to the Louvain clustering ( 14 ) and to the Uniform Manifold Approximation and Projection (UMAP) method ( 18 ).We noticed that the HK-CREs were located at specific parts of the UMAP plot and that they didn't present a substantial overlap with cell type-specific CREs (CTS-CREs), suggesting their distinctive epigenetic and dnTSS features (Figure 2 A, B).The results from clustering confirmed these differences (Figure 2 C, D), where CTS-CREs appeared enriched in groups where the number of HK-CREs is scarce and vice versa.Table 1 and Supplementary Figure S7 show the comparison of the feature levels used to manually label nine groups of CREs found by clustering; five groups of active CREs: core promoters (CP), promoters (P), enhancer short distance (ESD), enhancers long distance (ELD), and enhancers super long distance (ESLD); one group of inactive enhancers: enhancer short distance inactive (ESDi); two groups of CREs with low signals of epigenetic histone marks: other medium distance (OMD) and other Section 3. Genes whose core promoter is a HK-CRE complement the biological processes of housekeeping genes Housekeeping genes (HKGs) have been widely investigated as constitutively expressed genes in any cell type regardless of their function or developmental stage ( 35 ).In contrast, to our knowledge, ubiquitously active cis-regulatory elements (CREs) are not clearly defined.We hypothesize that these two elements are closely related, especially for the housekeeping CREs identified as core promoters, e.g.HK-CREs active in > 90% of the 50 cell types and very close to a transcription start site (TSS).To get an insight into the relationship between our identified housekeeping core promoters (HK-CPs) and housekeeping genes, we obtained the list of genes whose TSS is within 300 bps from the center of an HK-CP and compared them with a set of known 2,131 HKGs from a public database ( 35 ).As expected, most of the core promoters of HKGs were included in our set of HK-CPs (Figure 3 A).However, our list included the core promoters of a considerable number of other genes from diverse classes (Table 2 ) that were not included in the database of HKGs.We labeled them as non-housekeeping genes (nonHKG, n = 8,681) and compared them with genes in the HKGs database.Here, we would like to highlight that our identification of these genes, HKGs and nonHKGs, is based on the constitutively epigenetic activity of their core promoters, e.g. an open region with high levels of H3K27ac signal and predicted to interact with surrounding genes by the ABC method, and not on their constitutive expression; as the regular definition used in the database of HKGs ( 35 ).This suggests that, regardless of the activity of their core promoter, the gene expression levels of nonHKG could be different from those of the HKGs.To verify this, we collected RNA sequencing (RNA-seq) data from 37 cell types from the ENCODE project ( Supplementary Table S2 ) and compared the expression levels of HKGs and nonHKGs.Figures 3 B and C show the mean gene expression of these sets of genes across cell types and the number of cell types expressing them.As expected, the mean expression and the number of cell types expressing nonHKGs were significantly lower than those of HKGs.However, we argued that nonHKGs, whose core promoter is ubiquitously active, could still play an essential role in cell housekeeping processes regardless of their low RNA expression levels.Therefore, we analyzed the biological process (BP) of HKGs and nonHKG from the Gene Ontology project ( 36 ).Interestingly, the analysis showed a high overlap between the BPs of HKGs and nonHKGs with different significance and enrichment levels ( Supplementary Figure S8 and S9 ), suggesting the complementarity of these two sets of ubiquitously active genes.We then performed a differential analysis of the significance and enrichment levels between overlapping BPs of HKGs and nonHKG (max FDR = 0.05).Figures 3 D and E show the top 10 resulting BPs by significance and enrichment, respectively .Interestingly , the significance analysis (Figure 3 D and Supplementary Figure S8 ) showed that, while HKGs were highly associated with general metabolic, localization, translation, and transport BPs within the cell, nonHKG were related to more specific metabolic and transcription processes, supporting the hypothesized complementarity between these two sets of genes.Surprisingly, the enrichment analysis showed that BPs related to tRNA and regulation of RNA polymerase II, important BPs within the transcription process, were enriched almost exclusively for nonHKGs (Figure 3 E and Supplementary Figure S10 ).Overall, these results confirm the importance of housekeeping core promoters in the gene regulation process and could explain, to some degree, the high correlation between these cis-regulatory elements and the protein-coding genes (Figure 1 C and Supplementary Figure S2  and S10 ).

Section 4. Inactive housekeeping core promoters in cancer
In our definition of housekeeping cis-regulatory elements (HK-CREs) in Section 1 , we labeled putative HK regions as those loci active in at least 90% of 50 randomly selected healthy cell types from the ABC dataset.To further validate our list of HK-CREs, we evaluated them using the rest of the 21 healthy cell types and 26 cancer cell lines from the same ABC dataset ( Supplementary Table S1 ).Thus, for each HK-CRE, we calculated the percentage of cells in which it was found to be active in the healthy-validation and cancer datasets.Figure 4 A shows the results from this analysis for the set of HK-CREs annotated as core promoters (HK-CPs; similar results for other annotated HK-CREs are shown in Supplementary Figure S11 ), where the gray dashed line represents our 90% threshold used to define putative housekeeping elements in Section 1 .Most HK-CPs fulfilled this threshold in healthy cell types (96.94% of HK-CPs, n = 9,411), validating them as housekeeping elements.However, we observed a significant reduction in the distribution of HK-CPs in cancer cell lines (86.05% of HK-CPs, n = 8,354), where 1,354 HK-CPs did not fulfill the given housekeeping threshold; Supplementary Table S3 contains the list of HK-CPs that did not satisfy the threshold ranked by their ascending percentage of cancer cells they are found active on.To further explore these findings, similar to the analysis performed in Section 3, we obtained the list of genes whose TSS is within 300 bps from the center of a HK-CP, and compared the gene expression levels of these genes between a set of 13 cancer cell lines from the ENCODE project ( 3 ) and the healthy set used in Section 3 ( n = 37) ( Supplementary Table S2 ).While most of the genes showed a significant correlation between their expression levels in cancer and healthy cells (Figure 4 B), the genes whose core promoter was found active in < 35% of cancer cells (core promoters, n = 19; genes, n = 25) presented a considerably lower median gene expression in cancer cells as compared with healthy cell types (red points in Figure 4 B; Supplementary Figure S12 ).Chromosome 19 was the most  Inactive HK-CPs found in < 35% of cancer cells ranked by their distance to the end of the chromosome they reside in, e.g.telomere regions The expression levels of genes which core promoter is a HK-CP were compared between cancer healthy samples.In general, there is a highly significant correlation between expression levels in healthy and cancer samples.Ho w e v er, the set of genes which HK-CPs were found active in < 35% of cancer cells (red-colored dots) presented lower levels of expression on cancer samples.Because of their inactivity in cancer, the genes related to the top 3 inactive HK-CRs are investigated as putative housekeeping tumor suppressor genes: ZNF154, ZNF135, ZNF667 and ZNF667-AS1.( C ) An increased methylation fraction of the promoter regions of putativ e housek eeping tumor suppressor genes is observ ed in cancer cells (lef t), whic h is related to a decrease in their gene e xpression le v els (right).Further analysis of these genes in cancer samples from TCG A re v ealed significant changes among deceased conditions in diverse cancer subtypes such as pancreas adenocarcinoma ( D ) and u v eal melanoma ( E ). ( F , G ) Independent survival analyses on these projects revealed a significantly higher survival probability of samples with high mean expression of these genes.( H ) Different patterns of expression levels of samples from distinct projects might suggest cancer specific regulation of these genes.( I ) After normalizing the differences between samples from different projects, a joint survival analysis including > 50 0 0 samples from 17 different cancer subt ypes is perf ormed.A relativ ely significant increased surviv al probability is observ ed in samples with a high expression of the putative tumor suppressor genes.

16
Nucleic Acids Research , 2024, Vol.52, No. 3 affected chromosome with nine HK-CPs active in < 35% of cancer cells ( Supplementary Figure S13 and Table 3 ).Interestingly, most of the HK-CPs with reduced activity in this chromosome tend to localize towards telomere regions, with the top 3 inactive HK-CPs residing in the region less than 2 Mbp away from the end of this chromosome, e.g.near the telomere region, suggesting the importance of this locus for maintaining the integrity of cells.
Section 5. Potential housekeeping tumor suppressor genes We further investigated the genes with the top 3 inactive core promoters: ZNF154, ZNF135, ZNF667-AS1 and ZNF667, using public data from commonly used cancer databases.We performed a pan-cancer analysis, e.g.joint analysis of diverse tumor samples, using the TNMplot web tool ( 37 ) for differential gene expression analysis of tumor versus normal samples.We observed similar results to those presented in the previous section (Figure 4 B), where the expression levels of these four genes were lower in tumor samples in various cancer subtypes ( Supplementary Figures S14 and S15 ).To confirm the effect on the core promoter regions of these genes in cancer, we compared the methylation fraction around the promoter regions (Figure 4 C, left) and the expression levels of these genes (Figure 4 C, right) for diverse cancer subtypes from the DepMap portal ( 26 ); expression levels of ZNF667-AS1 were not available at the time of the analysis.Surprisingly, the promoter regions were significantly more methylated in cancer samples, whereas normal samples presented almost no methylation.The aberrant methylation of these HK-CPs explains, to some degree, their observed inactivity (see Section 4 and Figure 4 A) and the low expression levels of the genes they regulate in cancer samples (Figure 4 B, C right, and Supplementary Figures S14 and S15 ).
Because of the ubiquitous activity of their core promoters in healthy cell types and their inactivity in most of the investigated cancer cell lines (due to aberrant methylation), we hypothesized that these four genes: ZNF154, ZNF135, ZNF667-AS1 and ZNF667, could be related to housekeeping tumor suppression functions.Surprisingly, the tumor suppression capacity of these genes has already been observed for different cancer subtypes in diverse tissues: non-small cell lung cancer, renal cancer carcinoma, thyroid cancer, nasopharyngeal carcinoma, squamous cell carcinoma, uveal carcinoma, etc. .Moreover, the attenuation of the anti-tumor capacity of these genes has been related to the aberrant methylation of their core promoters in individual ( 53 , 56 , 62 ) and pancancer analyses ( 57 , 60 , 61 ).
To further confirm our hypothesis of the existence of housekeeping tumor suppressors, we performed a comprehensive data analysis of 17 cancer projects from The Cancer Genome Atlas Program (TCGA) ( Supplementary Table S4 ).We first compared the gene expression levels of samples, e.g.samples from patients diagnosed with cancer, from different deceased conditions across projects ( Supplementary Figure S16 ).While most of the projects presented non-significant variations, the UVM (eye and adnexa; uveal melanoma) and PAAD (pancreas; pancreatic adenocarcinoma) projects presented significant distinctions (Figures 4 D, E and Supplementary Figure S16 ), where the expression levels of the putative tumor suppressor genes were higher in non-deceased samples, suggesting a higher survival probability of patients from these sam-ples.Then, for each project, we performed a joint survival analysis ( 63 ) of the four genes with samples stratified into two groups: strata = HIGH, samples with a mean expression level (see Methods ) higher than 75% of the samples; strata = LOW, others.As expected, the survival analyses of PAAD and UVM (Figures 4 F and G, respectively) showed a significantly higher survival probability of HIGH samples.Interestingly, even though the p-values were not significant, we observed HIGH samples to present a higher survival probability in several projects, e.g.BRCA (breast; breast invasive carcinoma), LUAD (bronchus and lung; lung adenocarcinoma), and ACC (adrenal gland; adrenocortical carcinoma; Supplementary Figures S17 and S18 ).However, some projects, such as the STAD (stomach; stomach adenocarcinoma) and SKCM (skin; skin cutaneous melanoma), presented an inverse effect, where LOW samples showed a higher survival probability ( Supplementary Figure S19 ).Because of the significantly lower expressions of ZNF135, ZNF667 and ZNF667-AS1 genes in samples from stomach cancer as compared with healthy samples ( Supplementary Figures S14 and S15 ), we argue that this inverse effect in STAD and SKCM projects does not entirely contradict our hypothesis of potential housekeeping tumor suppressor genes; instead, it might suggest cancerspecific changes on these genes driven by other latent variables (See Discussion and Supplementary Figure S20 ).To exemplify this cancer-specific effect, we compared the expression levels of samples from PAAD and UVM (projects that agree with our hypothesis) with samples from SKCM and STAD (projects that disagree with our hypothesis), where for most of the comparisons (Figure 4 H) we observed a significantly lower expression of samples from the latter projects.Given the potential cancer-specific changes of putative housekeeping tumor suppressor genes, we normalized the expression levels of each gene across projects (see Materials and methods) to perform a joint survival analysis of the 17 TCGA projects.Figure 4 I shows the results of the analysis where HIGH samples ( n = 1,332), e.g.samples with a mean expression level (see Methods ) higher than 75% of the samples, presented a significantly higher survival probability as compared with LOW samples ( n = 3,964; independent analyses for each gene are shown in Supplementary Figure S19 ).Together, these results suggest the existence of active housekeeping core promoters related to housekeeping tumor-suppressor genes that, when methylated, may culminate in the development of cancer regardless of their malignancy phenotype variant.

Discussion
Our analysis unveils a key housekeeping element of the gene regulatory mechanism that, to our knowledge, hasn't been described before.Through several bioinformatics analyses leveraging diverse public epigenetic and gene expression datasets, we reveal the existence of housekeeping cis-regulatory elements (HK-CREs) that are mainly located within core promoter regions and are not necessarily related to housekeeping genes (HKG), e.g. they are not the core promoters or promoters of HKG.We defined HK-CREs by randomly selecting a substantial number ( n = 50) of healthy cell types to minimize biases in our analysis, ensuring a representative sample.Moreover, besides using a validation dataset with 21 healthy cell types and a set of 10,000 negative samples to assess our findings, we proved in distinct ways the robust location within the genome of the HK-CREs.For instance, we argue that the probability that most of the overlapping regions from 50 cell types intersect within the loci of core promoters (median distance to TSS = 56 bp) is exceptionally low.In addition, our analysis using a methylation atlas from healthy cells ( 21 ) showed that HK-CREs are located in dense unmethylated CpG loci for multiple cell types, which can't just occur by chance.Therefore, we feel confident about the existence and robust location of the HK-CREs.
When we first observed that most of the HK-CREs overlapped with core promoter regions, we first thought that we captured the core promoters of HKG.However, HKG only represented < 20% of the genes in our list, e.g.genes which core promoter is a HK-CRE.Our immediate conjecture was that the remaining genes in our list would be genes that have not been correctly annotated as HKG.However, our comparison of gene expression levels showed significant differences between both groups.These results do not contradict our finding of ubiquitously active CREs, as their definition is not based on their transcription levels, rather, they might suggest that a housekeeping core promoter of a lowly transcribed gene could still play a regulatory role for promoter regions of actively transcribing genes regardless of whether they are housekeeping or cell type-specific genes.Nonetheless, we acknowledge the existence of potential biases in our datasets, including the list of housekeeping genes (HKGs), and other technical challenges inherent to the annotation processes (refer to Supplementary Note S1 for more details).Further experimental validations with higher sequencing depth may clarify the distinction between these two sets of genes that seem to complement each other in housekeeping biological processes in the cell.
In this manuscript, we focused on exploring and characterizing HK-CREs instead of the interactions predicted by the ABC method.We observed intricate cooperative interactions between HK-CPs, where various HK-CPs exhibit interactions with one another and with the genes in their proximity.This suggests the presence of cis-regulatory hubs with promoterpromoter interactions in their core forming complex regulation networks.We are aware of the importance of these intriguing promoter-promoter interactions not only for regulating HKGs but also for cell type-specific genes.Nonetheless, these interactions come from predictions and must be verified experimentally, but we believe these two results can be treated separately without conflict.
Regarding transcription factor binding motifs found in HK-CREs, we observed an enrichment of common GC-rich motifs, e.g.SP1 and SP2.As this information somehow overlaps with the conservation and enrichment of CpG sites shown in Figure 1 D, we did not explore the enriched motifs further .However , together with appropriate experimental validations, differential analyses of enriched motifs across annotations of HK and CTS-CREs, e.g.core promoters vs. promoter regions, could reveal important motifs for the complex enhancer grammar at different levels.
In our analysis of inactive housekeeping core promoters (HK-CPs) in cancer cell lines (see Section 4), we tried to avoid bias toward the 26 cell lines in the ABC dataset by focusing on the top inactive HK-CPs from a list of 1,354 elements.Moreover, our investigation revealed that the most affected elements group together in the telomere of chromosome 19, which proves, to some degree, the robustness of our results.In addition, for the top 3 inactive HK-CPs in cancer cells, we confirmed their aberrant methylation and the decreased ex-pression levels of the genes they regulate through comprehensive analysis using data from diverse public sources as EN-CODE, DepMap, and the web tool TNMplot, obtaining a general agreement among results.
Finally, for the set of putative housekeeping tumor suppressor genes (ZNF135, ZNF154, ZNF667 and ZNF667-AS1), e.g.genes whose HK-CPs are active in less than 20% of the cancer cells, we performed a comprehensive analysis using cancer samples of 17 projects from TCGA.Even though we observed an increased survival probability of samples with high expression levels of these genes in diverse projects, only two were significant: PAAD (pancreas adenocarcinoma) and UVM (uveal melanoma).However, we argue the existence of latent cancer-specific mechanisms that affect these genes which must be considered.For instance, we observed interesting cancer-specific dynamics in the expression levels of these genes across samples from different cancer stages ( Supplementary Figure S20 ).For example, the expression levels from PAAD samples showed a decline from the early to intermediate stages, with a final increased expression in the late stages.These dynamic changes might be related to other biological processes of these genes and the inherent cancerspecific necessity to down / up-regulate them during different stages.Further analysis including samples from the same patient at different stages might reveal the intrinsic mechanism behind these intriguing changes.
Together, even though our results should be verified experimentally, we are confident that they continue to hold significant relevance for pan-cancer studies concerning the characterization of housekeeping cis-regulatory elements involved in tumor suppression.

Figure 1 .
Figure 1.Housekeeping cis-regulatory elements in the human genome.( A ) After merging o v erlapping regions from 50 cell types, we observed a significant number of housekeeping cis-regulatory elements (HK-CREs) found to be active in at least 90% of the cells.( B ) HK-CREs interacts with a larger number of surrounding genes in longer distance than CTS-CREs.( C ) The density of HK-CREs highly correlates with the density of protein-coding genes across chromosomes.( D ) HK-CREs reside in conserved regions rich in unmethylated CpG sites.( E ) Conservation scores and methylation levels of three randomly sampled HK-CREs.

Figure 2 .
Figure 2. HK-CREs reside in core promoter regions.CREs were annotated with signal levels from H3K27ac, H3K4me3, H3K4me1, and their distance to the nearest TSS (dnTSS).These annotations were used as features in an unsupervised analysis including clustering and visualization.( A, B ) UMAP plot shows a distinct distribution of HK-CREs and cell type-specific-CREs (CTS-CREs) ( A ) with distinct signal levels of histone marks and dnTSS ( B ). ( C , D ) Clustering analysis showed similar results where CTS and HK-CREs were assigned to different clusters.( E ) Annotation of the clusters using histone signal le v els and dnTSS sho w ed that HK-CREs reside primarily in core promoter regions.

Figure 3 .
Figure 3.Genes whose core promoter is an HK-CP are significantly enriched in housekeeping biological processes.The genes whose TSS is < 300 bp from a housekeeping core promoter (HK-CP) were listed and compared with housekeeping genes (HKG).( A ) As expected, most of the core promoters of HKG are HK-CP, ho w e v er, the core promoters of another 8681 genes were listed as HK-CPs, e.g.non-housekeeping genes (non-HKG).( B ) Mean gene e xpression le v els of HKGs and nonHKGs o v er a set of 37 health y cell types; HKGs present a significantly higher e xpression than nonHKG.( C ) Mean gene expression levels and the number of cells expressing HKGs and nonHKG; HKGs are ubiquitously expressed in most of the cells, in contrast, nonHKGs present a wider distribution.Besides the differences in gene expression levels and the number of cells expressing HKG and non HKGs, the top biological processes (BPs) sorted by significancy ( D ), e.g.enrichment of genes in general BPs, and by enrichment ( E ), e.g.enrichment of genes in specific BPs, suggested the complement arit y of these two sets of genes in important housekeeping biological processes.

1 1 15 Figure 4 .
Figure 4. Inactive HK-CPs in cancer cell lines.HK-CP regions were compared with a set of 21 healthy cell types (healthy-validate) and a set of 26 cancer cell lines.( A ) While most of the HK-CPs were found active in > 90% of the healthy cell types (indicated by the gray dashed line), cancer cells presented a considerable number of inactive HK-CPs; 1354 HK-CPs were found in < 90% of the cancer cells.(B ) The expression levels of genes which core promoter is a HK-CP were compared between cancer healthy samples.In general, there is a highly significant correlation between expression levels in healthy and cancer samples.Ho w e v er, the set of genes which HK-CPs were found active in < 35% of cancer cells (red-colored dots) presented lower levels of expression on cancer samples.Because of their inactivity in cancer, the genes related to the top 3 inactive HK-CRs are investigated as putative housekeeping tumor suppressor genes: ZNF154, ZNF135, ZNF667 and ZNF667-AS1.( C ) An increased methylation fraction of the promoter regions of putativ e housek eeping tumor suppressor genes is observ ed in cancer cells (lef t), whic h is related to a decrease in their gene e xpression le v els (right).Further analysis of these genes in cancer samples from TCG A re v ealed significant changes among deceased conditions in diverse cancer subtypes such as pancreas adenocarcinoma ( D ) and u v eal melanoma ( E ). ( F , G ) Independent survival analyses on these projects revealed a significantly higher survival probability of samples with high mean expression of these genes.( H ) Different patterns of expression levels of samples from distinct projects might suggest cancer specific regulation of these genes.( I ) After normalizing the differences between samples from different projects, a joint survival analysis including > 50 0 0 samples from 17 different cancer subt ypes is perf ormed.A relativ ely significant increased surviv al probability is observ ed in samples with a high expression of the putative tumor suppressor genes.

Table 2 .
Genes associated with a housekeeping core promoter

Table 3 .
Top inactive HK-CPs in cancer cell lines