Assessing genome-wide dynamic changes in enhancer activity during early mESC differentiation by FAIRE-STARR-seq

Abstract Embryonic stem cells (ESCs) can differentiate into any given cell type and therefore represent a versatile model to study the link between gene regulation and differentiation. To quantitatively assess the dynamics of enhancer activity during the early stages of murine ESC differentiation, we analyzed accessible genomic regions using STARR-seq, a massively parallel reporter assay. This resulted in a genome-wide quantitative map of active mESC enhancers, in pluripotency and during the early stages of differentiation. We find that only a minority of accessible regions is active and that such regions are enriched near promoters, characterized by specific chromatin marks, enriched for distinct sequence motifs, and modeling shows that active regions can be predicted from sequence alone. Regions that change their activity upon retinoic acid-induced differentiation are more prevalent at distal intergenic regions when compared to constitutively active enhancers. Further, analysis of differentially active enhancers verified the contribution of individual TF motifs toward activity and inducibility as well as their role in regulating endogenous genes. Notably, the activity of retinoic acid receptor alpha (RARα) occupied regions can either increase or decrease upon the addition of its ligand, retinoic acid, with the direction of the change correlating with spacing and orientation of the RARα consensus motif and the co-occurrence of additional sequence motifs. Together, our genome-wide enhancer activity map elucidates features associated with enhancer activity levels, identifies regulatory regions disregarded by computational prediction tools, and provides a resource for future studies into regulatory elements in mESCs.


INTRODUCTION
Gene expression in eukaryotic cells is a tightly regulated process which is a prerequisite for cellular identity. Regulation of transcription is controlled by transcription factors (TF) and the regulatory genomic elements (enhancers, promoters) they target (1,2). The selective and combinatorial activation of enhancers in a spatiotemporal manner allows for the complexity of higher eukaryotic organisms, which consist of a large number of different highly specialized cells although they all possess the same genome (3,4). Traditionally, enhancers are defined as the genomic elements that can control the activity of promoters whereas promoters are the regions where transcription of genes is initiated. Further, promoters and enhancer regions can be distinguished and predicted based on distinct patterns of histone modifications (HMs) (5). However, recent research indicates that the function of enhancers and promoters may not always be distinct as studies have demonstrated that promoters can act as enhancers for other genes (6)(7)(8) and enhancers frequently give rise to transcripts (9), a feature traditionally associated with promoter function. The assignment of enhancers to their target promoters is an important step in elucidating gene regulation and has been addressed in recent years with rapidly evolving high-throughput chromatin interaction assays (10)(11)(12). However, the functional relevance of identified enhancer-promoter pairs was mainly investigated for individual genes or loci (13)(14)(15)(16)(17)(18)(19) and remains a largely unsolved problem at the genome-wide level.
The identification and prediction of enhancers is often based on indirect measures of activity, such as correlating HMs and chromatin accessibility (1,(20)(21)(22)(23). Notably, some enhancer prediction tools discard promoter regions as potential enhancers even though there is evidence showing that promoters can act as enhancers of other genes. Moreover, enhancer prediction based on these marks gives rise to myriads of putative enhancers but does not provide quantitative information regarding their activity. This is of particular interest, since gene expression is not subject to an on/off switch type of regulation, but rather the result of a complex interplay between multiple enhancers, TFs, and coactivators which can fine-tune gene expression levels to meet the cell's current needs. Consequently, it remains largely unclear which of the thousands of predicted enhancers are actually functional, how enhancer usage changes during differentiation and what features are conferring distinct activity levels.
Embryonic stem cells (ESCs) are characterized by their ability to differentiate into any given cell type and therefore represent a versatile system to study the link between gene regulation, differentiation and cellular identity (24,25). Murine ESCs (mESCs) in the pluripotent state exhibit relatively permissive chromatin, with many accessible regions which are thought to comprise active mESC enhancers but also primed enhancers that can be activated at later stages during differentiation (26,27). The expression of genes in mESCs is also controlled by transposable elements, for example from the ERVK family, that can act as enhancers that control the expression of associated genes (28)(29)(30). The pluripotency of mESCs and their ability to self-renew critically depend on the actions of specific TFs including OCT4 and SOX2, NANOG, KLF4 and ESRRB (25,31). All these TFs can bind and activate promoters as well as enhancers of pluripotency-associated genes in ESCs (27). mESCs can be cultivated in the pluripotent state when leukemia inhibitory factor (LIF) is added to the media to activate the STAT3 pathway (32), which in turn promotes c-MYC expression and transcriptional programs important for selfrenewal (33).
Differentiation of ESCs can be used to study the molecular mechanisms that underly cellular commitment decisions with potential therapeutic relevance. A highly efficient, yet simple, protocol to induce cellular differentiation is to treat mESCs with all-trans retinoic acid (RA) (34). Treatment with RA induces exit from pluripotency, marks a phase of increased susceptibility to lineage-defining signals (35) and ultimately pushes mESCs towards the neuronal lineage (36). RA is the ligand of retinoic acid receptors (RARs), which together with retinoid X receptors (RXRs) bind to genomic response elements and drive expression of differentiationassociated genes but also repression of genes involved in pluripotency (37).
In recent years, several studies applying massive parallel reporter assays based on STARR-seq (38) have been conducted to assess enhancer function of candidate regions in different species and cell types (6,(39)(40)(41)(42)(43)(44)(45). In STARR-seq, the regulatory activity of candidate sequences is measured by placing them downstream of a minimal promoter to assess their ability to activate their own expression. Here, we developed a modified quantitative STARR-seq protocol to assess the activity of accessible chromatin isolated by FAIRE (formaldehyde assisted isolation of regulatory elements). By focusing on accessible chromatin, we could reduce the complexity of the investigated fragment library while retaining regions of interest, specifically promoter and enhancer regions, as candidate enhancers.
Overall, this allowed us not only to identify active enhancers genome-wide in mESCs, but also to quantify enhancer activity and thus to identify features, such as sequence motifs and their quantities, that correlate with enhancer activity. We demonstrate the versatility of our data by characterizing enhancer subsets with atypical HM patterns and by showing how the underlying sequence of enhancers can be used to build an enhancer classifier. Moreover, we used our quantitative approach to identify enhancers that change their activity during the early stages of differentiation. Additionally, we used our data set to characterize 'functional' TF binding sites by intersecting RAR␣ binding sites with RA-induced changes in enhancer activity. This resulted in the identification of sequence features associated with RAR␣ binding events with distinct changes in enhancer activity. Together, these examples illustrate how our enhancer activity data can be used as a resource to study various aspects of gene regulation and mESC differentiation.

mESC culture and differentiation
E14 mESCs were cultured under feeder-free conditions and routinely passaged every two days in ES-medium: Glasgow Minimum Essential Medium (Sigma-Aldrich) supplemented with 17% FBS (Hyclone™, SV30160.03, GE Healthcare), 2 mM GlutaMAX™ (Gibco), 100 U/ml Penicillin-Streptomycin (Gibco), 1× MEM Non-Essential Amino Acid Solution (Gibco), 1 mM Sodium Pyruvate (Gibco), 0.5 mM 2-Mercaptoethanol (Gibco) and recombinantly expressed leukemia inhibitory factor (LIF). To exit from pluripotency and induce differentiation, LIF was withdrawn and retinoic acid (RA, Sigma, R2625) was added to the medium to a final concentration of 1 M. For all experiments, 4 h prior to harvest, cell culture medium was removed, cells washed twice with PBS and fresh medium containing either LIF or RA was added.

FAIRE-STARR-seq
As input material for the reporter screen, accessible chromatin from E14 mESCs treated with RA (Sigma, #R2625) for 4 h was isolated by formaldehyde-assisted isolation of regulatory elements (FAIRE, (46)) and subsequently cloned into the STARR-seq screening vector (Addgene #71509) following the protocol described in Arnold et al. (38).
The details of our library preparation are provided in the supplementary information.

STARR-qPCR
Putative enhancer sequences (Supplementary Table S1) were amplified by nested PCR from genomic DNA derived from E14 cells using standard PCR procedures. Primers (Supplementary Table S1) were designed to generate the same overhangs as used for Illumina sequencing. The negative (nc1 and nc2, GR responsive elements) and positive (CMV enhancer) control regions as well as RAR␣ motif variants were ordered as gBlocks (IDT) and are listed in Supplementary Table S1. DNA fragments were subsequently cloned into the STARR-seq screening vector (pSTARR-seq human, Addgene plasmid #71509) using the In-Fusion ® HD Cloning Kit (Takara/Clonetech). For transfection of reporter plasmids, E14 mouse ESCs were plated at a density of 1.4 × 10 4 cells/cm 2 of a 24-well plate with ES medium supplemented with 17% FBS and LIF. The next day, cells were washed with PBS and fresh medium was added. Subsequently, cells were transfected with individual reporter plasmid using Lipofectamin 2000 (Invitrogen) according to the manufacturer's instructions. Twenty hours after transfection, cells were washed twice with PBS and fresh ES medium, containing LIF, 1 M RA or no additional reagent, was added. After another 4 h of incubation, cells were harvested, RNA extracted (RNeasy Mini Kit, Qiagen), followed by cDNA synthesis (PrimeScript RT Reagent Kit, Takara, using oligodT and random hexamer primers). Reporter transcript levels were quantified by qPCR with primers specific for GFP and normalized to the expression of two housekeeping genes (Rpl19 and Actb). Primers are listed in Supplementary Table S1.

ATAC-, ChIP-and RNA-seq experiments
ATAC-, HM ChIP-and RNA-seq experiments from our laboratory have been published previously (23). RAR␣ ChIP was performed for this study. In short: ATAC-seq. 75 000 low passage (<10) E14 cells were cultivated for 48 h in ES medium prior to subjecting them to an improved ATAC-seq protocol as described in Corces et al. (47). The resulting transposase-fragmented and PCRamplified DNA was cleaned up using AMPure XP beads (Agencourt). High-throughput sequencing was performed generating ∼50 million 50 bp paired-end reads per sample using the HiSeq 4000 (Illumina) device.
ChIP-seq. HM ChIP experiments were performed essentially as described in the standard BLUEPRINT protocol (www.blueprint-epigenome.eu, details in supplementary information), while the RAR␣ ChIP was performed as described elsewhere (48), with small modifications. A more detailed description can be found in the supplementary information.
RNA-seq. 2 × 10 5 low passage (< 10) E14 cells were plated per 10 cm dish and cultivated for 48 h in regular ES medium. Next, medium was exchanged for fresh ES medium containing either LIF or 1 M retinoic acid (Sigma, R2625). After 4 h, cells were harvested and RNA extracted using the RNeasy Mini Kit (Qiagen) according to the manufacturer's instructions. The experiment was performed in biological triplicates. Sequencing libraries were generated using the TruSeq ® Stranded mRNA Kit (Illumina) and highthroughput sequencing was performed on a HiSeq 2500 (Illumina) device generating approx. 100 million 50 bp pairedend reads per sample.

Generation of clonal cell lines with CRISPR/Cas9-mediated genomic deletions and mutations
sgRNAs targeting regions of interest were designed using the CRISPOR tool (http://crispor.org/, 49), ordered as complementary DNA oligonucleotides (Sigma) with overhangs for BbsI, and cloned into the pSpCas9(BB)-2A-Puro (PX459) V2.0 plasmid (Addgene plasmid #62988) as described in Ran et al. (50). All sgRNA sequences are listed in Supplementary Table S1. To delete regions of interest, two million E14 cells were transfected with a pair of sgRNA plasmids (as indicated), 1 g per plasmid, using a Nucle-ofector™ 2b device and the Mouse ES Cell Nucleofector Kit (Lonza, VAPH-1001) according to the manufacturer's instructions and plated into two 10 cm dishes. Twentyfour hours after transfection, medium was exchanged for fresh ES medium, and after another 24 h the medium was exchanged for fresh ES medium containing 2.5 g/ml Puromycin. The next day, medium was exchanged again for ES medium without selection. Subsequently, medium was exchanged every 2 days until round colonies formed (7-10 days post transfection). Colonies were picked by pipetting and individually transferred into 48-well plates. E14 clonal lines were expanded, genomic DNA was extracted (QI-Aamp DNA Mini Kit, Qiagen) and lines were genotyped using primers listed in Supplementary Table S1 and Phusion High-Fidelity PCR Master Mix (with GC Buffer) (Thermo Scientific, F532). PCR products of candidate clonal lines showing predicted PCR band sizes in agarose gel electrophoresis, were send for validation by sanger sequencing (Eurofins). To probe for biallelic alterations, PCR products were cloned into the Zero Blunt™ vector (PCR Cloning Kit, Thermo, K270020), transformed into Escherichia coli, four to eight individual bacterial colonies were picked, plasmid DNA isolated (QIAprep Spin Miniprep Kit, Qiagen) and send for Sanger sequencing. Genomic deletions and mutations of E14 clonal lines are listed in Supplementary  Table S3.

RT-qPCR
RNA from E14 or clonal cell lines, treated as indicated, was extracted using the RNeasy Mini Kit (Qiagen) according to the manufacturer's instructions including a DNase treatment. 1 g total RNA was subjected to cDNA synthesis applying the ProtoScript ® First Strand cDNA Synthesis Kit (NEB, E6300S) with the included Oligo d(T)23 VN primer according to the manufacturer's instructions. cDNA was diluted 1:12.5-1:20 prior to qPCR which was performed as described in Thormann et al. (51). Transcript levels were normalized to the expression of housekeeping genes (Rpl19 and Actb). Genes quantified to asses the interferon response were Ifna4, Ifnb1, Ifti1 and Isg15. Primers are listed in Supplementary Table S1.

NGS data analyses
All NGS data analyses are described in the supplementary information.

Generation of a quantitative enhancer-activity map of the mESC genome
To assess the enhancer activity of putative regulatory regions in mESCs, we performed a massively parallel enhancer reporter assay. To limit the complexity of the library, we prioritized regions that are likely to act as enhancers (52) by focusing on accessible chromatin isolated by FAIRE as input material for our STARR-seq (self-transcribing active regulatory regions) (38) library ( Figure 1A-C). Although this idea was new at the time of conception, a similar approach has by now also been described by others that isolated putative regulatory elements by either FAIRE (44) or ATAC (43).
To get quantitative information regarding enhancer activity, we developed a modified version of the STARRseq assay, which introduces unique molecular identifiers (UMIs) during the reverse transcription step ( Figure 1A, Supplementary Figure S1A). A similar approach has been described, however in that case UMIs are introduced in a first PCR cycle after the reverse transcription step (53). The introduction of UMIs allows one to distinguish between independent transcript replicates and PCR duplicates that can dramatically distort the relative quantities of individual fragments within a library (54). Genome-wide correlation analyses of read distributions of individual FAIRE-STARR-seq samples revealed that outlier regions with extremely high read coverage in only one replicate were efficiently removed during the UMI filtering step (Supplementary Figure S1D) indicating that such regions are PCR amplification artefacts.
Next, active FAIRE-STARR enhancers were identified by performing peak calling for significantly enriched regions using the input library as background. This was done for each of the three biological replicates individually and for the merged replicates. We focused only on highconfidence regions by filtering for enhancers which were called in at least two replicates and were captured by at least three different fragments. Using these criteria, we identified 4765 active enhancers with assigned quantitative STARRscores, while the majority of the input regions showed no enhancer activity ( Figure 1D, E, and Supplementary Figure S1H). To determine what distinguishes active enhancers from their inactive counterparts (182 194 accessible regions without STARR activity covered by our library), we analyzed sequence composition, TF occupancy and enrichment of a panel of histone modifications (HMs) linked to enhancers in mESCs (2,55). In order to reduce the redundancy inherent in many motif databases that contain multiple highly similar motifs for related TFs, we used the JASPAR 2018 clustered vertebrate motif matrices which group related motifs into non-redundant clusters (56,57, Supplementary Table S4). As expected, we found that active enhancers are enriched for sequence motifs of pluripotency TFs such as POU5F1::SOX2 (cluster 18), SOX2 (cluster 33), MYC (cluster 4) and STAT3 (cluster 32) ( Figure  1G). Furthermore, CG-rich motif clusters (28: SP/KLF, 54: ZNF263 and 72: NRF1) were enriched for these regions. We also compared the quantity of enriched motifs between active and inactive regions and found that the number of significant motif hits is higher for active enhancers when compared to their inactive counterparts ( Figure 1G). On the other hand, inactive regions are characterized by an enrichment for motif clusters NEUROD2 (cluster 8) and RUNX3 (cluster 38), which contain TFs associated with differentiation and cell-type specific TFs, most of which are not expressed in mESCs (Supplementary Figure S1G) suggesting that these regions might be primed for activation when ESCs differentiate towards different cell types. Interestingly, the motif for CTCF, a master regulator of the genomic architecture (58), is enriched for both groups, but this enrichment is more pronounced for inactive regions ( Figure 1G). Consistent with the observed motif enrichments, we found that ChIP-seq data for a panel of TFs involved in pluripotency showed higher levels of genomic occupancy at active enhancers than at their inactive counterparts whereas CTCF occupancy was slightly higher for inactive regions ( Figure 1H). To compare the chromatin landscape at the endogenous genomic loci between active enhancers and inactive regions, we performed ChIP-seq for eight HMs in mESCs. Intersection of the HM data showed that all three investigated HMs associated with active enhancers, H3K4me1, H3K27ac, and H3K122ac, as well as the promoter mark H3K4me3 are highly enriched at activecompared to inactive input regions. For HMs associated with transcription, H3K36me3 and H3K79me2, we also find an enrichment however not directly at the active enhancer but rather in the regions flanking it ( Figure 1H). In contrast, repressive marks H3K27me3 and H3K9me3 are depleted at active regions when compared to inactive input regions. Consistent with elevated H3K4me3 levels, we found that almost half of the active FAIRE-STARR enhancers are promoter-proximal regions. This percentage is much higher than in our library for which less than 10% of the regions map near promoters ( Figure 1F). These findings are consistent with a published study showing that promoters can act as enhancers that control the expression of other genes (6). Taken together, we established a quantitative approach to determine the enhancer activity of accessible genomic regions resulting in a genome-wide catalog of putative regulatory regions in mESCs. The enhancer activity map we generated can serve as a resource and several examples of how it can be used and intersected with other types of data to study various aspects related to enhancer activity and gene regulation in mESCs can be found in the next paragraphs.

Quantitative FAIRE-STARR-seq identifies transcription factors associated with distinct enhancer activity levels
In addition to identifying which regions are active, the UMIs added during the reverse transcription step facilitate a quantitative assessment of enhancer activity based on the FAIRE-STARR data. This allowed us to rank the identified active FAIRE-STARR regions by their activity and, for example, to screen for features associated with different activity levels. To determine if enhancer activity correlates with expression of nearby genes, we first grouped the active regions into five consecutive quantiles of ascending activity ( Figure 2A). Next, for each group the individual regions were associated to neighboring genes by distance. Using this approach, we found that the expression levels for  were ranked for their activity (log STARR score) and divided into five groups of ascending enhancer activity (highlighted by increasing background coloring). Dashed lines depict the 10th and 90th percentiles of STARR activity. (B) Expression of genes paired with FAIRE-STARR enhancers, for each of the five activity groups as depicted in (A). Genes were paired with FAIRE-STARR enhancers by distance using GREAT (84) and TPM values of RNA-seq data are shown. Boxplots depict the distribution of expression of all genes per group, whiskers extend to 1.5 IQR. TPM values of individual genes are shown as dots. P-values for unpaired Wilcoxon tests comparing neighboring groups are indicated. (C) TF sequence motifs enriched at active FAIRE-STARR enhancers, comparing the most active 10% (high) and least active 10% (low) of the active enhancers. Enriched motifs (E ≤ 1e−3) with a minimum 1-fold -log enrichment ratio between the two groups are shown. The JASPAR 2018 vertebrate clustered motif database was used as reference and a representative TF for each cluster is listed (57). Boxplots depicting (D) the number of significantly enriched motifs and (E) length of low-or high-ranking enhancers. Means are indicated as well as P-values for unpaired Wilcoxon tests comparing the two groups. the genes of each category correlate with the enhancer activity scores with significant differences between the neighboring groups ( Figure 2B). These findings suggest that our quantitative FAIRE-STARR scores recapitulate the activity of enhancers in their endogenous genomic setting where they influence the expression level of nearby genes. Similarly, H3K27ac levels, a mark that is used as a proxy for enhancer activity (59), correlate positively with our STARR score (Supplementary Figure S2A and B). This further indicates that our STARR activity scores capture the activity of enhancers in their endogenous genomic setting.
To investigate the role of DNA sequence in directing different levels of enhancer activity, we performed TF motif enrichment analysis comparing active enhancers that are ranked either at the top or bottom ten percent by STARR activity score ('high' and 'low', Figure 2C). Interestingly, we found that the motifs for pluripotency TFs OCT4, SOX2 and NANOG (cluster 18) are enriched for high-as well as low-ranking enhancers indicating that high activity levels are not dictated by the presence of sequence motifs for these TFs. Rather, the top-ranking enhancers are characterized by motifs of the SP/KLF (cluster 28) and ETS (cluster 7) TF families. These factors are ubiquitously enriched at promoters, irrespective of the cell type and are accompanied by motifs for cell-type specific TFs (60). For the low-activity enhancers, we found enrichment of motifs of cell type defining TFs, such as MYOG (cluster 9) and POUF4 TFs (cluster 30), suggesting a priming of enhancers that might play a role in later developmental stages when these TFs become expressed. Low activity enhancers were also enriched for the motif of p53, which was recently found to bind 'dormant' enhancers in mESCs that are located in inaccessible chromatin and become activated upon cellular stress or during reprogramming (45). Another plausible explanation for increased activity levels of enhancers could be the absolute number of motifs per enhancer as well as on the diversity of these motifs (61). Accordingly, we found that the highactivity enhancers, on average, contain more motifs (10.6 compared to 9.8 average motifs/enhancer, Figure 2D) and were 14% larger than enhancers with low activity (385 bp versus 338 bp for low-activity enhancers, Figure 2E). Together, these findings indicate, that the nature of the sequence motifs present as well as the absolute number of motifs are critical drivers of enhancer activity.

Epigenetic features and transcription factor occupancy define distinct enhancer subsets
Enhancers can be predicted based on the patterns of HMs present, with active enhancers harboring a high H3K4me1 to H3K4me3 ratio as well as high H3K27ac levels at flanking nucleosomes (2,23,55). However, alternative histone marks for active enhancers have been described (62)(63)(64). Moreover, although HMs correlate with enhancer activity it is largely unclear if this reflects a causative link (reviewed in 65). To gain insight into the epigenetic landscape present at the active STARR enhancers, we clustered the active enhancers based on ChIP-seq signal for a panel of eight different HMs ( Figure 3A). Consistent with our finding that active enhancers are enriched in promoter regions ( Figure 1F), we found that about half of the active FAIRE-STARR enhancers show HMs characteristic for active promoters (cluster A: high H3K4me3, low H3K4me1, and high H3K27ac signal). An overlay with annotated promoter regions confirmed that the enhancers of cluster A (from here onwards, we will refer to these as 'enhancer-promoters' or in short 'E-promoters') map to annotated promoters (Figure 3A). Moreover, RNA-seq as well as H3K36me3 and H3K79me2 levels show that the genes at these promoters are actively transcribed in mECSs. Notably, enhancers of this cluster do not display the classical enhancer signature of high H3K4me1 over H3K4me3 levels, and consequently are not recognized by the CRUP enhancer prediction tool (23), which like many prediction tools prioritizes high H3K4me1 over H3K4me3 levels to call enhancers. This is different for enhancer clusters B to E which display high H3K4me1 levels and varying levels of H3K27ac, thus displaying a typical epigenetic signature of active enhancers and accordingly a larger overlap with enhancers predicted by CRUP. Clusters B and C, which display higher H3K27ac signals than clusters D and E, are highly active enhancers, showing high STARR activity as well as higher CRUP prediction scores. Cluster F, on the other hand, displays a typical H3K4 methylation pattern of enhancers but is lacking high H3K27ac levels, indicative of enhancers poised for activity (55,66). Accordingly, these enhancers have quite low enhancer prediction scores, but still can be identified as active in our FAIRE-STARR-seq assay. This indicates, that FAIRE-STARR-seq is able to pick up enhancers with a poised HM signature, while CRUP discards those regions by design.
Cluster G, shows a rather uncommon HM pattern of enriched H3K36me3, a mark for active transcription, together with elevated H3K9me3, a hallmark of heterochromatin. The combination of these two marks has previously been reported to occur at the same nucleosome to mark lowly expressed genes and weak enhancers (67). Interestingly, alignment with the RepeatMasker database (68) revealed that 93% of cluster G enhancers map to repeat elements. The majority of these repeats are from the endogenous retrovirus-K (ERVK) family (Supplementary Figure  S3F), a rather young family of mouse-specific endogenous retroviruses, which can act as enhancers in mESCs (28). Finally, cluster H, which exhibits the lowest STARR signal, also shows the lowest accessibility based on our ATAC-seq data and lowest enrichment of each HM except H3K9me3. This indicates that these regions are not very accessible, nor active, in the endogenous genomic setting and may only be able to unleash their activity in the episomal STARR-seq setting where such sequences are taken out of their repressive endogenous chromatin context.
Motivated by a study claiming that H3K122ac marks a unique class of active enhancers lacking H3K27ac (62), we included this mark in our ChIP-seq experiments. However, contrary to the published study, we did not observe a convincing cluster with H3K122ac but lacking H3K27ac. Rather, we found that, in general, the signal for H3K27ac and H3K122ac is essentially the same at enhancers. This is different for promoters, where we found H3K122ac enriched at H3K4me3-marked gene promoters, irrespective of the H3K27ac state (Supplementary Figure S3H).
To study the link between enhancer clusters and nearby gene expression and to test if they are associated with different categories of genes, we paired the clustered enhancers with genes by distance and analyzed gene expression levels ( Figure 3B) as well as gene ontologies ( Figure 3C). Overall, we found that the expression of enhancer cluster-associated genes correlates well with epigenetic signatures of individual clusters. For example, we find the highest mean expression level for genes associated with clusters B and C, that show the most prominent signatures of active enhancers ( Figure  3B). In contrast, we find the lowest expression levels for genes associated with clusters G and H, two enhancer clusters with low levels of active enhancer marks. Interestingly, the function of the genes associated with different clusters also diverges. For instance, E-promoter cluster A-associated genes are involved in more general cellular processes, such as nucleic acid, nitrogen compound, and organic substance metabolic processes, whereas genes associated with clusters B and C play a role in early embryonic stages ( Figure  3C). Enhancer clusters D to F are associated with genes of medium expression levels, which are enriched for genes typically expressed at later time points during embryonal development. Cluster G enhancers correspond to genes with rather low expression levels which are enriched for zinc finger and KRAB-domain containing genes. The genes of this family of TFs have been described as marked by H3K36me3 and H3K9me3 (69), the combination we now identify at the cluster of associated enhancers as well. Finally, enhancer cluster H is associated with genes with the lowest mean expression levels of all investigated clusters and no significantly enriched GO terms could be identified. This is in line with our hypothesis that these enhancers are a heterogenous group, which are repressed in mESCs but can be activated at different points during differentiation.
Next, we compared the sequence composition of the enhancer clusters, and found that they display characteristic patterns of motif enrichment ( Figure 3D). One example is the E-promoter cluster A, which is enriched for many motifs including CG-rich motifs like SP/KLF family (cluster 28) and NRF1 (cluster 72), as well as motifs of the ETS family (cluster 7). Similarly, analysis of published TF ChIPseq data showed different binding patterns for individual enhancer clusters (Supplementary Figure S3A). Consistent with a selective enrichment of their motifs at E-promoters, we found that c-MYC and Ronin preferentially occupy enhancers of E-promoter cluster A (Supplementary Figure  S3A). The situation is different for KLF4 (a member of the SP/KLF family) which, as expected, binds E-promoter cluster A, but also to other enhancer clusters that are enriched for the SP/KLF motif (Supplementary Figure S3A).
Of note, apart from the transcriptionally active Epromoters, we also find many actively transcribed promoters without FAIRE-STARR-seq activity (Supplementary Figure S3B) showing that enhancer activity is not a general feature associated with active promoters. Comparison of these promoters with E-promoters shows stronger enrichment of c-MYC and RONIN at E-promoters, but similar KLF4 occupancy at both groups (Supplementary Figure  S3C and D). Further, motif enrichment analysis comparing E-promoters and promoters lacking STARR-seq activity revealed differences in sequence composition (Supplementary Figure S3E). For example, consistent with the observed selective enrichment by ChIP, the motif for MYC (cluster 4) was more highly enriched for E-promoters than for regular promoters, while motifs for pluripotency factors OCT4, SOX2, and NANOG (cluster 18) were more enriched at regular promoters that at E-promoters.
A global comparison between enhancers predicted based on chromatin features using CRUP and active enhancers identified by FAIRE-STARR revealed that only 2,437 (52.1%) of the active STARR enhancers were also predicted by CRUP ( Figure 3E) while CRUP predicted 22,833 regions that were not identified by FAIRE-STARR. The majority of FAIRE-STARR enhancers which were not predicted by CRUP fall into cluster A, the E-promoters, and thus display a chromatin signature which is filtered-out by CRUP. The second largest group of STARR enhancers not detected by CRUP are cluster H regions, repressed enhancers, which are not marked by the classical enhancer signature recognized by CRUP. On the other hand, enhancers which were only predicted by CRUP but not picked-up by FAIRE-STARR-seq showed very low chromatin accessibility by ATAC, and overall lower activation marks (Supplementary Figure S3G). This indicates, that these regions were not included in the FAIRE-STARR input library and thus could not be identified as active. Additionally, it is possible that a subset of the enhancers predicted only by CRUP and not by FAIRE-STARR are not active.
Together, we find that enhancers can be grouped based on different HM patterns. These enhancer clusters have different activities and are associated with genes with different functions. The partial overlap with CRUP enhancer predictions highlights that HM-based enhancer predictions and functional assays such as FAIRE-STARR-seq can provide complementary information. Additionally, our data is in line with other studies suggesting that a formal distinction between promoters and enhancers, and the exclusion of promoter signatures from HM-based predictions, might not make sense given that promoters can exert both functions (6,8,70).

Sequence-based prediction of active enhancers in mESC
As shown above, enhancer prediction based on HMs often excludes E-promoters and depends on several ChIPseq data sets which are not always available. Here, we set out to build an enhancer prediction model solely based on DNA sequence using their activity scores from FAIRE-STARR-seq to predict if a given DNA sequence shows enhancer activity in mESCs or not. Given the different sequence composition and accordingly motif enrichment of promoter and enhancer regions ( Figure 3D), we decided to build two distinct classifiers: one for 'traditional' enhancers and one for E-promoters. Specifically, we took all accessible regions from our input library and divided them into two groups: Those overlapping with annotated promoters and those that do not overlap. Next, within each group, we ranked the regions by their STARR-score and used the 1% (enhancers, Figure 4A-C, 1% of 170 190 putative enhancer regions: 1702 regions) or 10% (E-promoters, Supplementary Figure S4A, 10% of the 16 769 input regions overlapping with promoters: 1677) of regions with the highest or the lowest STARR-score to train two regularized logistic regression (elastic net) models to classify active and inactive DNA sequences. To this end, a nested cross-validation approach was performed, where 75% of regions were used as the training data to optimize the model, while the remaining regions were used to assess the predictive performance on unseen data ( Figure 4C). As features for each model, we used the width of the region and enrichment of clustered JASPAR motif matrices (56). Each elastic net regression classifier was fitted to maximize the cross-validated mean AUC which yielded 0.75 for enhancer regions ( Figure  4D) and 0.87 for E-promoters (Supplementary Figure S4B). Thus, the classifier for both enhancers and E-promoters performed quite well, suggesting that enrichment patterns of TF motifs alone can be used to distinguish between active and inactive regions for both promoter and enhancer regions with reasonable accuracy.
To determine which features were most important in predicting if a region is active, we analyzed the ranked model coefficients ( Figure 4E and Supplementary Figure S4C) which reflect the importance of individual features for each optimized activity-prediction model. Interestingly, the two features with the highest coefficient for both the E-promoter and the enhancer-prediction model are enhancer width and the motif for ETS TFs (cluster 7). For classification of enhancers, the motif for pluripotency TFs SOX2, OCT4 and NANOG (cluster 18) was among the top features associated with active regions ( Figure 4E), while it showed a negative coefficient in the E-promoter classification (Supplementary Figure S4C), indicating that pluripotency factors contribute to enhancer but not E-promoter activity. Similarly, the motif for CTCF scored a positive coefficient for the Epromoter classification (Supplementary Figure S4C), while it was slightly negative for active enhancers (not in figure, coeff. = −0.045).
Together, our modeling demonstrates that the activity state of a sequence (active versus inactive) can be predicted from DNA sequence for both enhancers and E-promoters using a rather small feature set of 79 clustered TF motifs and enhancer width as input.

FAIRE-STARR-seq identifies enhancers that change their activity upon exiting pluripotency
Next we set out to make use of the quantitative nature of our assay, by investigating the changes of enhancer activity following differentiation. To identify enhancers that change their activity upon exiting pluripotency, we analyzed enhancer activity during the early stages of differentiation using the FAIRE-STARR-seq approach. Specifically, we compared cells treated with LIF to maintain pluripotency with cells from the same transfection-batch that were treated for 4 hours with retinoic acid (RA) to initiate differentiation towards the neuronal lineage ( Figure  5A) (34). We then focused on enhancer regions which either lost (LIF-dependent) or gained (RA-inducible) activity upon differentiation. This resulted in the identification of 616 LIF-dependent and 386 RA-inducible enhancers, which show varying degrees of loss or gain of STARR activity ( Figure 5B and Supplementary Figure S5D). The activity of these enhancers correlated with changes in the expression of nearby genes, with genes near RA-inducible enhancers showing, on average, an increase in gene expression (and more associated upregulated than downregulated genes) whereas the expression genes near LIFdependent enhancers showed, on average, a slight decrease in expression upon differentiation (and more repressed than upregulated genes, Figure 5C and Supplementary Figure  S5B, C). Notably, when compared to all active mESC en-   Table S1) for enhancers as indicated were deleted by site-directed mutagenesis and activity was analyzed as described in (E). (G and H) Genomic loci encompassing STARR enhancers selected for genomic deletion using CRISPR/Cas9. Normalized FAIRE-STARR-input, -seq, and RNA-seq signals are shown. RefSeq genes are shown in either black (protein coding genes) or red (non-coding genes). (I and J) Expression of genes (RT-qPCR) near the deleted enhancer and of control genes for clonal wild type (wt) and enhancer heterozygous (E+/−) and homozygous (E−/−) deletion clones. Bar plots represent mean gene expression ± SE of three biological replicates (dots) and 1-3 clonal cell lines (number indicated in brackets) after 4 h of LIF or RA treatment. Data points for individual clonal lines are shown as dots with matching shapes. hancers ( Figure 1F) that are typically promoter-proximal, the differentially active enhancers are most frequently found at distal intergenic regions (Supplementary Figure S5A) and display distinct TF motif enrichments ( Figure 5D). For instance, RA-inducible enhancers are more enriched for RAR␣::RXR␣ and ETS-family motifs than the LIFdependent enhancers consistent with the activation of RAR upon treatment of cells with its cognate ligand. Accordingly, when we performed ChIP-seq targeting RAR␣ from RA-treated cells, we found that RAR␣ is enriched at RAinducible enhancers whereas no such enrichment was found for the LIF-dependent enhancer regions (Supplementary Figure S5E). Similarly, consistent with STAT3 activation by LIF, STAT3 binding is more enriched at the LIF-inducible enhancers than at RA-inducible enhancers (Supplementary Figure S5E). Moreover, LIF-dependent enhancers are more enriched for OCT4:SOX2, SP1-like family, SOX2, NFY, TEAD and NRF1 motifs than the RA-inducible enhancers. The enrichment of these sequence motifs is reflected in enriched binding of OCT4, SOX2, NANOG and KLF4 based on published ChIP-seq data (Chen et al. (71), Supplementary Figure S5E). Interestingly, binding of OCT4, SOX2 and NANOG was not only enriched at LIFdependent but also at RA-inducible enhancers when compared to all active mESC enhancers. This indicates that pluripotency TFs play a supportive role at RA-inducible enhancers during the early stages of differentiation.
Next, the regulatory behavior of several LIF-dependent and RA-inducible enhancers identified in our screen was tested by cloning individual active regions into the STARRseq vector and analyzing their activity by qPCR ( Figure  5E). One exemplary LIF-dependent enhancer we tested is located distal to the Socs3 (suppressor of cytokine signaling-3) gene, which is activated by LIF-mediated STAT3 signaling (72). We confirmed that the Socs3 enhancer is LIF-inducible and this induction is blunted when the two identified STAT3 motifs were mutated ( Figure 5F). Furthermore, mutation of the two SOX2 motifs of the Socs3 enhancer leads to a marked loss of basal activity. Finally, the combined mutation of all SOX2 and STAT3 motifs abolished enhancer activity completely. This finding illustrates the importance of these TFs in facilitating LIFdependent activity as suggested by the enrichment of sequence motifs for these TFs at LIF-dependent enhancers ( Figure 5D). We also characterized an RA-inducible enhancer upstream of the RA target gene Cyp26a1, which is pivotal for proper differentiation (73). The Cyp26a1 enhancer is inactive during pluripotency but is massively upregulated upon differentiation ( Figure 5E and F). Consistent with a role of RAR in activating this enhancer, mutation of both of the identified RAR␣::RXR␣ motifs resulted in a complete loss of induction upon RA treatment (Figure 5F). As a final enhancer, we analyzed the RA-inducible Hoxa enhancer ( Figure 5E and F), which is located over 70 kb upstream of the RA-responsive Hoxa gene cluster (74). The Hoxa enhancer shows basal activity in pluripotency which increases upon differentiation ( Figure 5F). Interestingly, mutation of the two identified RAR␣::RXR␣ motifs reduced basal activity during pluripotency but did not impair the RA-induced activation, suggesting that other motifs mediate the activation. As expected, mutation of the only identified OCT4:SOX2 motif of the Hoxa enhancer reduced activity in pluripotency whereas the combined mutation of both RAR␣::RXR␣ and the OCT4:SOX2 motifs completely abolished basal as well as RA-induced activation of the Hoxa enhancer.
To test the role of two RA-inducible enhancers in the regulation of nearby genes in the endogenous genomic context, we generated CRISPR/Cas9-mediated genomic deletions of these enhancers in mESCs. The first enhancer we targeted was the Cyp26a1 enhancer (E) described above ( Figure 5G). We were able to generate a single homo-and three heterozygous clonal lines for the Cyp26a1 E deletion (Supplementary Figure S5F). Analysis of the clonal lines revealed that heterozygous deletion of the Cyp26a1 E did not lead to an apparent impairment in upregulation of Cyp26a1 whereas the homozygous enhancer knock-out led to a complete loss of inducibility ( Figure 5I). Interestingly, our RNA-seq data showed RA-inducible expression of two unannotated, spliced, and poly-adenylated transcripts, located anti-sense and only a few hundred basepairs upstream of Cyp26a1 (Cyp26a1 as trx1&2). For these enhancer-proximal transcripts we found that RA-induction was impaired in the heterozygous lines whereas activation was completely lost in the homozygous Cyp26a1 E deletion clone. Inducibility of Hoxa1 by RA and expression of OCT4 (Pou5f1), two genes located on other chromosomes than Cyp26a1 E, was not affected by the Cyp26a1 E deletion, indicating that RA-signaling is still functional in the homozygous enhancer knockout and that the effects observed are specific for transcripts that are enhancer-proximal. For the other investigated enhancer, upstream of the Hoxa gene cluster (Hoxa E Figure 5H), we were able to generate only heterozygous deletion mutants (Supplementary Figure S5F). These mutants were still able to activate Hoxa1 and Hoxa4 upon RA treatment to induce differentiation, however with slightly reduced levels compared to wildtype indicating that this enhancer might contribute to the RA-dependent upregulation of the Hoxa gene cluster ( Figure 5J). The impact of the Hoxa E deletion was more prominent for the non-coding RNA Halr1, which is located upstream of the Hoxa genes and thus closer to the investigated enhancer. Specifically, the heterozygous deletion of the Hoxa E resulted in a marked decrease in Halr1 expression during pluripotency and much lower levels when cells were treated with RA to stimulate differentiation. In contrast, expression of Skap2, a gene upstream of the Hoxa gene cluster, and mESC marker Pou5f1 are not impacted by the Hoxa enhancer deletion, which indicates specificity of the observed effects. Thus, consistent with the data for the episomal reporter ( Figure 5F), this indicates a dual function of this enhancer to facilitate basal Halr1 expression during pluripotency as well as induced expression upon differentiation.
Altogether, these results show, that the FAIRE-STARRseq assay can be used to trace the dynamics of enhancer activity and can be used to identify enhancers which gain, but also those that loose enhancer activity upon induced differentiation. These enhancer subsets are characterized by distinct motif enrichments and the binding of specific TFs and are associated with regulation of nearby genes.

Sequence features associated with RAR␣-occupied enhancers that change activity upon ligand binding
The quantitative nature of our FAIRE-STARR-seq data can be used as a resource to study the coupling of transcription factors binding sites and other features to enhancer activity. For example, our ChIP-seq experiments targeting RAR␣, the receptor of RA and key effector in RA-induced differentiation, uncovered thousands of binding sites. However, only a subset of these RAR␣-occupied sites show a change in activity upon RA-treatment in our STARR-seq experiments ( Figure 6A). Moreover, depending on the RAR␣-occupied region examined, we found that enhancer activity can either stay the same, go up, or go down upon addition of RAR's cognate ligand RA (Figure 6A and B). To identify sequence features that may play a role in determining if an RAR␣-occupied site changes its activity upon RA treatment, we first determined the effect of RA treatment on enhancer activity for all RAR␣occupied sites covered in our STARR-seq library (Supplementary Figure S6A). Next, we defined three categories: Active RAR␣-occupied enhancers which (i) do not change activity upon RA treatment ('non-responding)', (ii) become more active upon RA treatment ('induced') and finally, (iii) RAR-occupied enhancers whose activity decreases upon treatment ('repressed', Figure 6B). Comparison of the motif composition of these three categories of RAR␣-occupied enhancers showed several differences that could play a role in determining if RA treatment induces a change in enhancer activity. For example, RAR␣::RXR␥ heterodimer motifs (cluster 25) are more enriched for RAR␣-occupied regions that are activated in response to RA than either regions that are not regulated or those with repressed enhancer activity ( Figure 6C). Furthermore, clustered TF motifs of nuclear receptors, such as RXRA::VDR (cluster 2) and PPARG (cluster 41), and motifs for pluripotency factors (OCT4, SOX2, and NANOG, cluster 18) and SP/KLF (cluster 28) were more significantly enriched for induced RAR␣ binding sites. On the other hand, RAR␣-occupied sites that lose activity upon RA treatment had the lowest enrichment of the canonical RAR␣::RXR␥ heterodimer motif and were characterized by a relatively high occurrence of motif clusters ZNF384 (cluster 55), HOXA10 (cluster 22), and CTCF (cluster 48), which could thus play a role in RAR␣-dependent repression of enhancer activity. Sideby-side comparison of inducible, non-responding, and repressed RAR␣ sites showed that RAR␣ occupancy was comparable for inducible and non-responding regions with only slightly lower occupancy at repressed sites ( Figure 6F). Interestingly, chromatin accessibility assessed by ATAC was comparable for induced and repressed RAR␣-occupied sites but higher for non-responding sites in pluripotency and only increased marginally at induced sites upon RA treatment ( Figure 6F). Similarly, H3K27ac enrichment in pluripotency was lower at inducible and repressed than at non-responding RAR␣ sites and inducible regions only reached comparable levels to repressed sites upon RA treatment ( Figure 6F). Accordingly, the enrichment of RAR␣ and H3K27ac at RAR␣ sites did not correlate positively with RA-inducibility (Supplementary Figure S6E), indicating that enhancer inducibility of an RAR␣ site cannot sim-ply be inferred from ChIP enrichment. This further indicates that sequence composition acts as an additional regulatory layer to control not only if an enhancer changes its activity but also the direction in which enhancer activity is modulated upon RA treatment.
RAR␣ typically binds as a heterodimer together with retinoic X receptor (RXR) to retinoic acid response elements (RAREs) that can have distinct motif architectures, depending on cellular background and differentiation stage (37,75). These motifs share the same consensus hexameric direct repeat (DR) however they differ in terms of orientation and the spacing between the repeat elements (76). To elucidate the possible role of different spacings of the RAR␣::RXR␣ consensus motif (MA0159.1) on RAinduced changes in enhancer activity, we constructed different repeat orientations and spacings in silico (DR, everted (ER) and inverted repeats (IR) of the consensus motif with spacing from 0 to 8 nucleotides) and compared inducible, non-responsive, and repressed RAR␣-occupied regions for enrichment of these motifs ( Figure 6D). As previously described (76), we find DR0 to be the most enriched spacing for RAR␣-occupied sites for all three enhancer subgroups (data not shown). Moreover, induced RAR␣occupied regions are more enriched for each of the investigated motif architectures than their repressed counterparts and display higher abundance of the consensus repeat halfsite (Supplementary Figure S6B) indicating that activation might be driven by direct RAR␣ binding to its response element whereas repression is not. To determine how DR spacing influences RA-dependent regulation of enhancer activity in mESCs, we cloned single DRs with different spacings, but the same DR sequence into the STARR vector. Consistent with previous studies (76), we found that activation was most prominent for the DR5 spacing, indicating that the ability of RAR␣ to activate enhancers depends on the spacing of the DRs ( Figure 6E). When we flanked the DR5 element by either an ETS binding site, which was highly enriched for RA-inducible mESC enhancers ( Figure  5D), or a SoxOct motif, we found no clear change in enhancer activity for the ETS binding site. In contrast, when the DR5 element was flanked by the SoxOct motif, we observed an increase in basal enhancer activity and also in RA-induced activation. This finding indicates a supportive role of pluripotency factors in the RA-induced enhancer activation by RAR␣ and aligns well with the motif enrichment ( Figure 5D) and mutation analyses ( Figure 5F) for differentially active enhancers showing that the SoxOct motif is important for both basal and RA-dependent activation of RAR␣-occupied enhancers.

DISCUSSION
This study provides a comprehensive genome-wide enhancer activity map in mESCs assessed by FAIRE-STARRseq, that can be used as a resource for further dissection of enhancer function in mESCs, and identifies various sequence features associated with enhancer activity.
To understand what discriminates active enhancers from inactive accessible regions covered by our FAIRE-STARR library, we compared these two groups and found that active regions are characterized by the presence of specific TF Nucleic Acids Research, 2021, Vol. 49, No. 21  motifs as well as the presence of an overall higher quantity of enriched motifs ( Figure 1G). As expected, among the most prominently enriched motifs for active enhancers were the motifs of pluripotency factors OCT4, SOX2 and NANOG (cluster 18) but also SP/KLF and ETS family TFs (cluster 28 and 7), which are TFs almost ubiquitously expressed across cell types. Inactive accessible regions showed fewer enriched motifs. Moreover, these enriched motifs typically belong to cell-type specific TFs that are not expressed in mESCs (Supplementary Figure S1F) and are associated with differentiation. Thus, it might be possible that these open but inactive regions become active in upon differentiation. Consistent with our findings, a recent study which systematically analyzed the quantity and composition of TF motifs for mESC enhancers described a threshold for the minimal number of TF motifs required for enhancer activity (61). Additionally, this finding of motifs adding synergistically to enhancer activity, once passing a threshold of minimal number of motifs, could explain the nature of superenhancers, regulatory elements that are much larger in size (> 12 kb) and show much higher activity than regular enhancers. We did not identify such super-enhancers in our data, but rather individual regions within a super-enhancer that show high activity, which is in line with previous findings in human ESCs (41). The introduction of UMIs during the reverse transcription step allowed us to efficiently distinguish between independent transcript replicates and PCR duplicates of reporter-derived reads and to analyze enhancer activity quantitatively. When we started our project, UMIs were not part of the STARR-seq protocol. However, in the meantime a similar approach has been proposed for low complexity libraries where UMIs are introduced in the first PCR cycle (53). By applying UMIs, we could not only identify active enhancers, but also show that specific sequence features are associated with activity levels ( Figure 2C). For example, motifs for SP/KLF (cluster 28) and ETS TFs (cluster 7), which are essentially ubiquitously expressed across cell types, but also CG-rich motif clusters ZBTB33 (cluster 50), ZNF143 (cluster 63), ZNF263 (cluster 54), and SPIB (cluster 49) are specifically enriched at highly active enhancers and could contribute to high enhancer activity in mESCs. In contrast, motifs for pluripotency TFs (cluster 18) are similarly enriched for both highly-and lowly active enhancers and thus seem required for an enhancer to be active yet not for specifying its activity level.
Since many of the TFs associated with low enhancer activity are not expressed in mESCs, we speculate that these enhancers are primed for high activity once the specific TFs are expressed, e.g. at later stages during differentiation to exert their cell-type specific enhancer functions. The quantitative nature of our assay also allowed us to assess changes in enhancer activity during the early stages of RA-induced differentiation and to identify enhancers that gain or lose activity as well as associated and required TF motifs. As expected, we found that pluripotency TFs and STAT3 are associated with LIF-dependent enhancer function, however they are also found at RA-inducible enhancers ( Figure 5D, Supplementary Figure S5D). Mutation experiments of individual reporter constructs ( Figure 5F) highlighted the crucial contribution of OCT4 and SOX2 motifs to enhancer activity in pluripotency, RAR␣::RXR␣ motif importance for RA-inducible activation, but also the cooperation of pluripotency and RAR␣::RXR␣ motifs in maintaining enhancer activity. Genomic deletion of selected RA-inducible enhancers ( Figure 5G-J) further validated the impact of the identified enhancers on expression of neighboring genes. Additionally, the quantitative analysis of RAR␣ binding sites revealed a possible synergistic activation of RA-inducible RAR␣-bound enhancers by pluripotency TFs and RAR␣ ( Figure 6E). A role of OCT4 in recruiting RAR::RXR to enhancers of differentiationassociated genes has been demonstrated (77) and together with our data points towards an additional role of OCT4, and possibly SOX2, in facilitating increased enhancer activity during differentiation.
While our FAIRE-STARR-seq approach has the advantage to quantify enhancer activity on a fragment basis and to reduce the complexity of an investigated input library, it also has its limitations. For example, the resolution of our assay is lower when compared to another type of STARR-seq using accessible chromatin enriched by ATAC (43). Further, enhancer activity is measured for populations of cells, which might mask activity that is restricted to a subset of individual cells. Finally, the episomal nature of the assay may not capture various levels of regulation that occur in the endogenous genomic context.
In our study, only a minor fraction of the probed accessible regions (4765 of 186 959) showed reproducible significant enhancer activity. A recent enhancer identification study based on STARR-seq, assessing activatory potential of the whole genome, found over 18 500 active enhancers in mESCs (45). However, the mentioned study assessed a larger input library and did not remove PCR duplicates from their analyses, and thus applied less stringent cut-offs that would lead to similar quantities of active enhancers in our assay (e.g. we would call 21 thousand active enhancers without UMI-aware deduplication). A comparison of enhancers identified in these two studies reveals that 25.1% of the active enhancers called for our data in serum/LIF conditions without UMI-deduplicated enhancers coincide with enhancers that overlap with accessible regions from Peng et al. Vice versa, 33.9% of their enhancers overlap with our dataset. The difference between these studies might also be related to the different promoters of the reporters that were used in these studies, since differences in promoter-enhancer compatibility can influence whether an enhancer can activate a promoter or not (78). Given the different experimental set-up of these studies, these two enhancer catalogs in mESCs could complement each other.
When we analyzed the active FAIRE-STARR enhancers, which we identified by an episomal assay, for the chromatin signatures present at their endogenous genomic loci, we identified that many show the expected signature of active enhancers (high H3K4me1/H3K4me3 ratio and high H3K27ac). In addition, we identified enhancer clusters which can be classified as poised (no H3K27ac), repressed (elevated H3K27me3 or H3K9me3) or promoters (higher H3K4me3 than H3K4me1) based on their chromatin context. Strikingly, we found that almost half of the active enhancers are located at gene promoters (defined as the region up to 1 kb upstream of a TSS, Figure 1F). The identification of such E-promoters by FAIRE-STARR-seq is in line with previous reports of promoters that act as enhancers for other genes (6)(7)(8) and the high percentages of promoterproximal enhancers identified by STARR-seq based assays in other cell types (43,44). Poised enhancers (cluster F) display enrichment of TF motif cluster 24, which encompasses TFs ZIC1, ZIC3 and ZIC4. ZIC3 is a critical TF for the transition from naïve to primed pluripotency (79) and was shown to activate chromatin-masked enhancers in mESCs, when taken out of the endogenous context (45). Based on our data, we expect that that ZIC3, or another TF from the ZIC family, activates cluster F enhancers during differentiation. Furthermore, we identify a subset of enhancers (cluster G) which display enrichment of two contradictory marks: H3K36me3 associated with active transcription and H3K9me3 which marks repressed chromatin. This combination of HMs can occur on the same nucleosome (67), to demark 3 exons of zinc finger (ZNF) genes which consist of repetitive sequences (Zinc finger (ZNF) domains) (80,81), and is possibly the result of two independent mech-anisms, active transcription (H3K36me3) and ATRX chromatin remodeler-mediated preservation of genomic stability by repressing recombination between ZNFs (H3K9me3) (69). We now find ZNF genes to be associated with distal cluster G enhancers ( Figure 3C), which are marked with the same chromatin signature ( Figure 3A). Interestingly, the vast majority of cluster G enhancers map to endogenous retrovirus-K (ERVK) family repeats (Supplementary Figure S3F), which possess endogenous enhancer function in mESCs (28,29). Thus, a similar mechanism that ensures ZNF gene stability might also play a role in preventing recombination between repetitive ERVK elements that serve as enhancers of these genes.
Motivated by a study claiming that H3K122ac marks a novel class of enhancers in mESCs (62), we added this HM to our panel of modifications assayed. However, contrary to the published study, we did not find an enhancer cluster demarked by H3K122ac while lacking the H3K27ac mark, even when we forced k-means clustering to search for more clusters (data not shown). Rather, we found that H3K122ac and H3K27ac essentially always co-occur at enhancers. This also fits with the fact that the same enzymes, histone acetyltransferases p300 and CBP, deposit both the H3K27ac and H3K122ac marks (82,83). The situation is different for a subset of H3K4me3-marked promoter regions, which have high H3K122ac levels while H3K27ac levels are relatively low (Supplementary Figure S3H). Here a possible explanation is that the selective methylation of H3K27 but not H3K122 at promoter regions by polycomb repressive complex 2 prevents acetylation of H3K27 whereas H3K122 can still be modified. Accordingly, we find that H3K27me3 levels are higher at H3K27ac low, H3K122ac high regions. Taken together, these findings highlight the ability of STARR-seq to identify enhancers that are most likely poised for activation or even repressed, when taken out of the genomic context. These regions, and also promoters, are frequently excluded from enhancer prediction tools. Conversely, prediction tools like CRUP identify more putative enhancer regions that lack accessibility (Supplementary Figure S3G) but could be activated once bound by pioneering TFs. Thus, STARR-seq and enhancer prediction display complementary information about enhancers.
In summary, we generated a genome-wide enhancer activity map by FAIRE-STARR-seq which catalogs active regulatory regions in mESCs, in pluripotency and after induced differentiation. We identified features associated with enhancer activity and regulatory elements which are omitted by standard prediction tools. Our findings can serve as a reference for future functional studies of the regulatory network of genomic elements in mESCs and contribute to the refinement of computational methods to predict regulatory elements.

DATA AVAILABILITY
All NGS data generated in this study were deposited at GEO (GSE171771) and published NGS data that were reanalyzed in this study are listed in Supplementary Table S2.