Embryonic tissue differentiation is characterized by transitions in cell cycle dynamic-associated core promoter regulation

Abstract The core-promoter, a stretch of DNA surrounding the transcription start site (TSS), is a major integration-point for regulatory-signals controlling gene-transcription. Cellular differentiation is marked by divergence in transcriptional repertoire and cell-cycling behaviour between cells of different fates. The role promoter-associated gene-regulatory-networks play in development-associated transitions in cell-cycle-dynamics is poorly understood. This study demonstrates in a vertebrate embryo, how core-promoter variations define transcriptional output in cells transitioning from a proliferative to cell-lineage specifying phenotype. Assessment of cell proliferation across zebrafish embryo segmentation, using the FUCCI transgenic cell-cycle-phase marker, revealed a spatial and lineage-specific separation in cell-cycling behaviour. To investigate the role differential promoter usage plays in this process, cap-analysis-of-gene-expression (CAGE) was performed on cells segregated by cycling dynamics. This analysis revealed a dramatic increase in tissue-specific gene expression, concurrent with slowed cycling behaviour. We revealed a distinct sharpening in TSS utilization in genes upregulated in slowly cycling, differentiating tissues, associated with enhanced utilization of the TATA-box, in addition to Sp1 binding-sites. In contrast, genes upregulated in rapidly cycling cells carry broad distribution of TSS utilization, coupled with enrichment for the CCAAT-box. These promoter features appear to correspond to cell-cycle-dynamic rather than tissue/cell-lineage origin. Moreover, we observed genes with cell-cycle-dynamic-associated transitioning in TSS distribution and differential utilization of alternative promoters. These results demonstrate the regulatory role of core-promoters in cell-cycle-dependent transcription regulation, during embryo-development.


INTRODUCTION
Understanding transcription regulation in development represents a challenge due to the dynamic nature of multiple cellular lineages continuously evolving into an increasingly complex organism. Increasing evidence suggest that cell cycle control affects transcription regulation in development (1)(2)(3)(4), while the rules and regulatory 'grammar' on the level of cis-regulatory elements, associated with cell cycle regulation is little understood. A particularly difficult question is, how transcription is controlled in dynamically interacting embryonic cells of the embryo, which follow distinct cell proliferation fates independently from their lineage commitment?
Over the last decade, the core promoter, a stretch of DNA surrounding the transcription start site (TSS) of all genes, has emerged as a key site of transcriptional regulation, integrating signals received from multiple cis-regulatory elements (5,6). The advent of 5 end transcript sequencing (e.g. Cap analysis of gene expression [CAGE]), has greatly enhanced our ability to interrogate the role the core promoter of a gene plays in transducing regulatory signals into gene transcription (7)(8)(9)(10). Single base pair resolution analysis of TSS location (using CAGE) has revealed an immense diversity in the pattern of transcription initiation on the core promoter, from a narrow distribution of TSSs, with a single base pair dominant site (termed sharp promoters), to a dispersed pattern of TSS usage across the promoter (broad), with a spectrum of different promoter architectures between these two extremes (7)(8)(9)(10)(11). Investigation of divergent core promoter architectures have revealed these to be a strong indicator of distinct regulatory networks, acting on the core promoter, modulating cell behavior through transcriptional profile changes (10,(12)(13)(14). This is of importance in understanding how key transitions in cellular behaviour, during embryonic development, are regulated at the level of transcription initiation. The role that promoter-associated gene regulatory networks play in development associated transitions in cell cycle dynamics (e.g. during differentiation) however, is poorly understood.
Embryonic development is marked by several dramatic transitions in the regulatory make up of cells, to permit changes and limitations in their potency, leading to the formation of an organized hierarchical body map. These transitions are often associated with changes in cell cycle dynamics, alongside shifts in transcriptional repertoire (1)(2)(3)(4). This process commences with the fusion of two gamete cells into a single fertilized embryo. In many eukaryotes, including zebrafish, this is followed by a number of rapid, synchronous cell cycles, with embryonic behaviour exclusively controlled by maternally deposited factors. At the midblastula transition (MBT) the zygotic genome activates and this process is marked by a slowing of the cell cycle and a loss of synchrony (reviewed in (15,16)). We have previously shown that the transition in cell behaviour from the rapidly cycling synchronous divisions before MBT, to slower, asynchronous, divisions after MBT, accompanied by the activation of the zygotic genome, is marked by a switch in transcription initiation grammar from defined, W-box mediated transcriptional output, to a broader unrestricted initiation grammar, but confined by nucleosome positioning (17). Extensive regulatory reprograming is seen in other model organisms, during this period too, with the first stages of mouse embryo development marked by extensive chromatin remodelling, with lineage-specific expression of several chromatin modifiers, underscoring the potential role of gene regulatory networks in controlling cell fate decisions (18,19).
The Fluorescent Ubiquitination-based Cell Cycle Indicator (FUCCI) is a system that allows the visualization of cell cycle progression in living cells, through the fusion of fluorescent markers onto cell cycle phase restricted factors (often Cdt1 and Geminin) (20) as detailed in Figure 1A. Studies using this system in developing zebrafish have revealed that in subsequent stages of development, the process of cell differentiation is marked by a further slowing in cell cycle dynamics as tissue-lineages are specified (21). This is in agreement with in vitro studies of human and murine embryonic stem cells, showing that a key indicator of cell differentiation from pluripotency, are transitions in cell cycle dynamics from rapid cycling to a slower cycling identity, characterized by an elongated G1 phase (1)(2)(3)(4). In addition to this, studies, investigating transitions from lineage defined multipotent stem cells to terminally differentiated cells, in both muscle (22) and liver development (5), have shown that this process is marked by wholesale depletion of RNA polymerase II regulatory cofactors. Additionally, various cell lineages on the path towards differentiation, appear to be regulated by distinct general transcription factors, forming preinitiation complexes (23)(24)(25). These find-ings suggest that the core promoter and associated basal transcription factors are an important yet unexplored component in the regulation of gene expression, controlling differentiation transitions.
We hypothesized that analysis of core promoter architecture during dynamic changes in cell cycle behaviour, in the developing embryo, will reveal the function of promoter regulation in key transitions of cellular behaviour during embryonic development. In this study, we aimed to explore, in an in vivo model of development, the role the core promoter plays in defining transcriptional output in cells undergoing differentiation coupled changes in cell cycle dynamics, through both promoter-level regulatory and behavioural changes. The zebrafish FUCCI transgenic system differentially marks cells in G1 and S/G2/M phases of the cell cycle and can therefore be used to separate rapidly and slowly cycling cells in vivo, by virtue of the cell cycle stage they primarily inhabit. This system was used to ask how cell cycle dynamics affect promoter usage during differentiation stages. We segregated cells as they go through differentiation coupled changes in cell cycle dynamics in segmentation stage embryos, which are characterized by marked spatial separation of differentiating cells with distinct cell cycle dynamics (21). CAGE 5 end transcript analysis was used to interrogate the promoter features of these cell populations, to identify changes in the usage of transcriptional regulatory machinery. This investigation identified that the FUCCI system can successfully be used to segregate cells on the basis of cell cycle stage and cycling behaviour, as well as differentiation status, when this is marked by changes in cell cycle dynamics. Interrogation of promoter behaviour between populations revealed a distinct sharpening (condensation of TSS usage to a narrower region of the promoter) in promoter usage, in genes upregulated in slowly cycling, differentiating tissues. This event was associated with enhanced utilization of the TATA-box, in addition to Sp1 binding sites. A concurrent enhancement in CCAAT-box utilization in genes upregulated in rapidly cycling cells, was also observed, in genes with a broad range of TSS utilization in particular. A greater utilization of TATA-like and W-box motifs was also identified in rapidly cycling cells. A similar pattern of regulatory motif utilization was observed in genes transitioning in their TSS distribution, with differentially used alternative promoters and with temporal expression dynamics across the segmentation period. These changes were shown to be primarily associated with differential cell cycle dynamics, rather than tissue/lineage. This demonstrates a switch in core promoter associated transcriptional regulatory machinery utilization, leading to changes in promoter behaviour, as cells go through differentiation coupled changes in cell cycle dynamics. This investigation explores, for the first time, the regulation of tissuelineage-specification on the promoter-level, in a whole organism, in vivo context.

Fluorescent imaging and image processing
Zebrafish embryos were decorionated using 1 mg/ml Protease from Streptomyces griseus (Sigma). For Lightsheet imaging (Zeiss), embryos were screened for fluorescence using a fluorescent microscope (Zeiss axio zoom V16), and mounted in a column of 1% low melting point agarose (Sigma). Zebrafish embryos were imaged using dual side illumination, with maximum intensity projection (MIP) processed images generated using the Zen Black imaging software (Zeiss).
For assessment of the relative proportions of red and green fluorescent cell populations, 10 embryos from each embryonic stage were imaged on a fluorescent microscope (Zeiss axio zoom V16) and pixel count for each fluorescent marker extracted using the ImageJ software package (27).

Cell sorting and RNA preparation
Zebrafish were decorionated and selected as previously described at the 14-somite stage. Roughly 200 embryos were dissociated into a single cell suspension using enzymefree cell dissociation buffer, PBS based (Gibco). Dissociated cells were pelleted and resuspended in Hanks balanced salt solution without calcium chloride or magnesium sulphate (Sigma), for sorting. Cells were fluorescenceassociated cell sorted (FACS) into populations displaying red and green fluorescence. Validation of correct sorting was performed by propidium iodide DNA content analysis following manufacturer's conditions (Invitrogen). RNA was extracted from isolated cells using the miRNeasy kit (Qiagen). RNA quality was analysed by capillary electrophoresis (Bioanalyzer 2100, Agilent). All samples had an RNA integrity number (RIN) >9.

CAGE library preparation and sequencing
NanoCAGE libraries were generated following a protocol described in (28). Pooled libraries, representing FACS sorted populations, in triplicate, were sequenced with the Illumina HiSeq 2500 50 cycles single-read run operation program, following manufacturer's protocol (Illumina).

CAGE mapping and CTSS calling
The zebrafish genome assembly (Zv9) was downloaded from UCSC Genome Browser (29). Nano-CAGE reads were trimmed (15 bp from 5 end) to remove the linker and unique molecular identifier (UMI) region. For previously published data reanalysed in this study, raw CAGE sequencing data was downloaded from the repositories detailed in the respective publications (17,30,31). Reads were mapped using Bowtie (32), allowing a maximum of two mismatches and only uniquely mapping tags with MAPQ of 20. R/Bioconductor package CAGEr was used to remove the additional G nucleotide, due to the CAGE protocol, where it did not map to the genome (33). All unique 5 ends of reads were defined as CAGE defined TSS's (CTSS) and reads were counted at each CTSS per sample. These raw read counts were subsequently normalized based on a power-law distribution based on 10 6 reads (34) and defined as normalized tags per million (tpm). After quality control per sample, a high level of inter-replicate correlation was observed and the biological replicates were merged for downstream analyses. One library of a biological replicate of 3 was excluded in the S/G2/M-phase merger, based on low complexity of the library.

Calling transcriptional clusters
CTSS that were supported by at least 0.5 tpm in one of the samples were clustered based on a maximum allowed distance of 20 bp between two neighbouring CTSS. These transcriptional clusters (TCs) were then trimmed on the edges to obtain more robust boundaries of TCs, by obtaining the positions of the 10th and 90th percentiles of expression per TC. Only TCs with higher than 5 tpm expression were considered. Finally, TCs across all samples were aggregated if within 100 bp of each other to form consensus clusters (CC) for downstream analyses.

Annotation
The CCs were annotated to the nearest reported TSS from Ensembl (danRer7) using the R/Bioconductor package ChIPseeker (35). We selected only CCs that mapped within 1 kb upstream of the reported TSS as well as CCs mapping to 5 UTR.

Differential gene expression analysis
The raw read counts were extracted for the CCs across triplicates described earlier and collapsed into total count per CC. DESeq2 R/Bioconductor package was used to define differential expression and the threshold of differential expression was set at adjusted P-value of <0.05. These results were cross referenced to the CC information of the merged samples. In cases of more than one CC mapping to the region, the CC with the highest expression was chosen to represent the region.

Gene ontology
CCs were annotated with entrez gene IDs and analysed with GOstats R/Bioconductor package for overrepresentation of GO terms for biological processes. The up and down regulated genes were tested separately against all genes expressed amongst the samples.

Tissue-specificity and cell cycle enrichment
The top 600 ranked human genes were downloaded from the Cyclebase database version 3.0. For each category of differential expression, the human orthologs (one to one & one to many) of zebrafish genes were determined using 'mar2017.archive.ensembl.org' archive of hg19. The peak time phenotype was determined by Cyclebase. Enrichment of cell cycle genes was determined with a permutation test (n = 10 000). Tau and entropy scores for tissue-specificity were determined by RNA-seq expression of eight adult zebrafish tissues from DanioCode series 391: brain, gill, heart, intestine, kidney, liver, muscle and spleen. Analysis of the relative representation of tissue-specific terms, was performed by cross-referencing differentially expressed genes with mRNA in situ expression data extracted from the ZFIN database, for wild-type zygote to 20 days post fertilization larval stages (https://zfin.org/downloads/wildtypeexpression fish.txt). This data was used to generate an anatomical specificity score that for each gene defines the pattern of spatial expression across development (Vucenovic and Lenhard, unpublished data). Data was then used to compare restrictedness of spatial localization of gene expression for two groups of genes, stratified across developmental time.

Core promoter motif enrichment analysis
Position weight matrices (PWMs) for TATA-box, CCAATbox, Sp1 binding site, INR, and YY1 binding site motifs were obtained from converting frequency matrices from JASPAR (7th release; 2018 (36)). Each CC was centred on the most expressed CTSS (the dominant TSS) and each sequence was scanned from 120 bp up and 50 bp downstream. A hit was reported if the scanned region contained a sequence with a 90% match to the PWM. For each group of differentially expressed genes, occurrence was counted and compared to the non-significant set of CCs per sample. Significance was assessed using Fishers' exact test. Obtained P-values were considered if <0.01. The log 2 odds ratios were visualized as a heatmap. For W-box sequence analysis, all possible variations of a poly-W pentamer were identified and their relative frequency in the region 20-40 bp upstream of dominant TSS was measured. For each of the two groups, occurrences were counted and compared to the non-significant set of CCs per sample. Significance was assessed using Fishers' exact test.

Promoter shape classification
Width of CCs (interquantile-width, IQW) were defined as the distance between the positions of the 10th and 90th percentiles of expression per CC. Sharp promoters were characterized by a width <10 bp, peaked broad promoters as ≥10 bp with one CTSS expressing more than 60% of expression of entire CC, and broad promoters as the remaining set.

Alternative promoter utilization
The genomic location of 1612 twinned canonical and alternative promoters identified in (31), were extracted and intersected with the G1 and S/G2/M CAGE data generated for this paper. TPM values within these regions were calculated and the expression patterns of genes with a TPM >5 in either the canonical or alternative promoter region, in both the G1 and S/G2/M populations identified. Genes with a 2-fold change in the expression of the alternative promoter normalized to canonical promoter expression, were selected for further investigation (n = 79). These genes were segregated for the relative behaviours of alternative and canonical promoters as follows. Canonical down (expression of the canonical promoter in S/G2/M cells is >50% decreased vs.

Segregation of embryonic cells with distinct cell cycle dynamics
The FUCCI transgenic system, differentially marks cells in G1 and S/G2/M phases of the cell cycle ( Figure 1A) and can therefore be used to separate rapidly and slowly cycling cells in vivo, by virtue of the cell cycle stage they primarily inhabit. Sugiyama et al. (21) demonstrated a switch in the ratio G1 versus S/G2/M marked cells in zebrafish undergoing somitogenesis, during the segmentation period, from primarily S/G2/M in early stages, to principally G1 in later stages, associated with cell differentiation. In order to further investigate this observation, longitudinal assessment of cell proliferation rate during zebrafish embryo development, using this system, was performed and revealed a spatial and tissue-specific separation in cell cycling behaviour across post-gastrulation embryos ( Figure 1B, Supplementary Figure S1A, Movie S1). Cells, principally in the developing somites, had slowed cell cycling, displayed by an elongated period in G1-phase and therefore an accumulation of red fluorescent signal in the somites. In contrast, green (S/G2/M) cells marked primarily neuroectoderm derived lineages, such as optic cup, neural tube and notably, clearly identifiable cells in the notochord and circulating cells over the yolk ball ( Figure 1B, Supplementary Figure  S1A, Movie S1). 14-somite embryos showed the clearest spatial and tissue-specific segregation of cells on the basis of cell cycling dynamics ( Figure 1B, Supplementary movie 2 and 3). In order to investigate the role of promoter associated transcriptional regulatory machinery in defining this transition, the 14-somite stage was selected for further investigation.
To this end, 14-somite FUCCI embryos were dissociated to a single cell suspension and segregated by fluorescence associated cell sorting (FACS) into cells in G1 (red) and S/G2/M (green) (Supplementary Figure S1B), with correct sorting confirmed by fluorescent imaging of the cells ( Figure 1C). In order to validate that this process successfully segregates cells on the basis of cell cycle stage, segregated cells were subjected to DNA content analysis ( Figure  1D). This analysis showed a marked enrichment for diploid (2N/G1) cells in the red, G1 segregated population, over the total background population (81 versus 39%) and enrichment for both 2-4N (S-phase) and tetraploid (4N/G2/Mphase) in the green, S/G2/M segregated population over total (27 versus  successful segregation of cells on the basis of cell cycle stage using the FUCCI-transgenic system.

Global transcription initiation patterns at known promoters
Next, we asked about the state of the mRNA transcriptome in fast and slow dividing cells of the embryo, with particular focus on the mRNA 5 end, in order to identify features of promoter utilization. To achieve this goal, we chose a small cell number optimized protocol for detection of mRNA 5 ends (nanoCAGE (28)), which reports steady state mRNAs quantitatively, and simultaneously informs about TSS usage and core promoter architecture (37). NanoCAGE was performed on three biological replicates of G1 and S/G2/M-phase segregated cells from the 14-somite stage zebrafish embryo, together with unsegregated cells (Total). CAGE reads were mapped to the zebrafish genome assembly (Zv9) and CTSSs assigned with a high level of inter-replicate correlation observed, with the exception of the S/G2/M (green) replicate 2, which was consequently excluded from further analysis (Supplementary Figure S2, Table S2). Based on this the biological replicates were merged for downstream analysis.
In order to validate that this approach successfully identifies the transcription start sites, the distribution of mapped CTSSs was compared with previously published CAGE data (generated using the tagging-CAGE version of the protocol (38)), from the 14-somite stage (31). Identified TSSs from each nanoCAGE sample, along with the tagging-CAGE, were grouped into consensus clusters (CCs) between samples and the distribution of TSSs (interquantile width [IQW]) within well expressed clusters (TPM ≥ 5) compared. This analysis revealed very similar interquantile width distributions between the nanoCAGE data and previously published data (31) ( Table S3). This is additionally exemplified by the very similar TSS distribution between samples of si:ch211-113a14. 29 (an orthologue of human histone 2B) and mmp30 (matrix metallopeptidase 30) ( Figure 2B). The determination of TSS distribution interquantile width is an established method for determining promoter shape, an important comparator of TSS utilization (7)(8)(9)(10).
A high degree of overlap was additionally observed for cluster position between nanoCAGE and previously published data, with 80% the same genes represented in all samples ( Figure 2C, Supplementary Tables S4&S5). Identified CTSSs were additionally mapped to genomic features ( Figure 2D, Supplementary Figure S2C). The majority of CCs should fall within the core promoter region, <1 kb upstream of the annotated 5 end of genes, as this is known to be the major site of transcriptional initiation and accordingly in this analysis ∼70% of TSSs mapped to this region (Supplementary Table S2), a similar level to previously published CAGE ( Figure 2D). Taken together, these results indicate efficient isolation of gene promoter activities in cycling cells of differentiating embryos.

Transcriptomes of G1 and S/G2/M cells reflect differential cell cycle and tissue-specific identities
CAGE is comparable to RNA-seq as a robust tool for quantitative transcriptomic analysis (31,37,39). Therefore, in or-der to profile the identity of cells segregated by cell cycle dynamics, and to identify differentially regulated genes, the expression of the promoter-associated consensus clusters was compared between samples. 12,865 consensus clusters were found to be shared between G1 (red) and S/G2/M (green), and clusters with a significant (P  Table S6). This result indicates a large degree of overlap in the transcriptomes of the two populations, while differential regulation of a subset of transcripts opens the way to address their promoter regulation.
Fluorescence microscopy analysis of cell cycle dynamics in embryo development, revealed significant lineage-specific segregation of cells on the basis of their cell cycle dynamics ( Figure 1B). Notably, the FUCCI system indicates highly dynamic variation in the distribution of cell populations segregating into fast and slow dividing groups and domains in the embryo (Supplementary movies 2 and 3). Therefore, differentially expressed genes in cells segregated by cell cycle stage, in this context, may represent both cell cycle and lineage-specification related differences. In order to determine the contribution of each of these elements to the populations of differentially expressed genes, they were crossreferenced with a databases of cell cycle periodic genes (Cy-cleBase) ( Figure 3B) and tissue-lineage-specific genes, as determined by expression patterns in both adult and embryonic tissues ( Figure 3C and D respectively).
Analysis of differentially expressed genes revealed a significant enrichment for cell cycle periodically expressed genes upregulated in the S/G2/M (green) population (46%, P < 0.0001), with a peak of expression between G2 and M phase of the cell cycle ( Figure 3B, Supplementary Table  S6). Conversely only 3.2% of genes upregulated in the G1 (red) population were cell cycle periodic. A significant majority of these however had peak expression in G1. This is in contrast to the group of genes with unchanged expression, where the periodicity of genes is more evenly distributed between cell cycle stages ( Figure 3B). We hypothesized that differential expression in cells with distinct cell cycle dynamics is a result, at least in part, to distinct cell cycle behaviour of cell-lineages, which are distinct in nature, as well as represent varying levels of cell differentiation state. To test this, we asked about the contribution of lineage differentiation to selective expression of genes.
Analysis of the tissue-lineage-specificity of differentially expressed genes revealed a clear enrichment in those upregulated in the G1 (red) population over the other populations (demonstrated by a higher average Tau and lower average entropy score) ( Figure 3C). Analysis of the anatomical specificity of G1 upregulated genes across embryo development, revealed this enrichment to be highly specific to the segmentation stage of development, from which the samples were collected ( Figure 3D). These results suggest that a contributing factor, leading to differential expression between the populations, is cellular replication on the part of the S/G2/M (green) population and tissue-specification on the part of the G1 (red) population. This observation is supported by the gene ontology of the differentially expressed genes, revealing in the S/G2/M (green) population, a clear enrichment for genes involved in DNA and chromatin pro-  cessing, key for rapidly cycling cells, and in the G1 (red) population, enrichment for muscle development associated gene expression ( Figure 3E).
In order to further dissect the contribution of different cell types and developing lineages to the slowly versus rapidly cycling (G1 and S/G2/M) populations, differentially expressed gene sets were cross-referenced with the ZFIN database of gene with known tissue and spatial specific expression at the 14-19 somite stage (Supplementary Figure S3). The majority of genes did not match tissuespecific terms (72% [S/G2/M], 61% [G1] and 81% [Unchanged]) representing the fact that a minority of genes are tissue-specific and many tissues have not specified at the stage. This data does demonstrate however that the differentially expressed genes between the G1 and S/G2/M populations contain a disproportionate number of tissue-specific terms, particularly in the G1 population, in agreement with previous findings. Analysis of the relative contribution of genes with tissue-specific expression revealed a strikingly divergent expression pattern between populations (Supplementary Figure S3), with somite and muscle specific terms highly enriched in the G1 population, in agreement with gene ontology analysis and previously discussed fluorescence imaging ( Figure 1B). Representation of terms related to the viscera, peripheral tissues (such as the periderm) and extra-embryonic tissues was also enriched in the G1 population, representing a population of cells starting to specify tissues within the embryo. In agreement with gene ontology analysis ( Figure 3E), representation of proliferative and pluripotent cell types (such as the germ layers and proliferative region) was greatly enriched in the S/G2/M population. Interestingly, fluorescence imaging analysis ( Figure  1B), suggest spatial distribution bias in the S/G2/M population compared to the G1 population, thereby S/G2/M phase green cells are particularly enriched in the eye cup and neural tissues. This apparent tissue bias is not borne out on the transcriptional level however, with neural and sensory tissue terms fairly evenly represented between populations (Supplementary Figure S3). A closer inspection of the tissue distribution of red and green cells however, suggest that neural lineages share both fast and slowly dividing cells ( Figure 1B, surrounding panels). Global transcriptomic analysis will reveal most enriched tissues, but will not reflect tissue-specificity of cell cycle regulation. Nevertheless, this analysis shows that the dynamics of cell cycle regulation follows certain trends, which manifests as an enrichment for the transition to a tissue-specific expression profile, marked by cell cycle dynamic changes, as seen in the slowly cycling, G1 population.

Genes differentially expressed between G1 and S/G2/M cells utilize different core-promoter classes and regulatory elements
As revealed by the analyses detailed above, a clear transition in cell population identities is occurring during the segmentation period in these embryos, marked by a dramatic phenotypic change (speed of cell cycle) and a divergence in transcriptional output. In order to determine whether differential gene expression, associated with cell cycle dynamics, is also associated with changes in core promoter regulatory element distribution, the frequency of known, regulatory motifs 120 bp upstream and 50 bp downstream of the dominant TSSs of each promoter was determined. Both groups of differentially expressed genes showed marked changes in motif utilization compared to genes with unchanged expression ( Figure 4A, Supplementary Table S7). In the S/G2/M population there was a statistically significant enrichment for the NF-Y factor associated CCAAT-box, upstream of the TSS (P-value = 4.22 × 10 −12 ). In the G1 population there was also a strong enrichment for a canonical TATAbox (P-value = 2.04 × 10 −7 ), enrichment for Sp1 binding sites (P-value = 0.002), and a depletion for a YY1 binding site motif (P-value = 0.002). Strikingly, CCAAT-box and TATA-box relative strength of enrichment were inverse for S/G2/M and G1 enriched genes, relative to those with unchanged expression ( Figure 4A). Motif enrichment in each case was also specific, being positionally restrained relative to the TSS (Supplementary Figure S4A). Examples of differentially expressed gene promoters, containing CCAAT/TATA-box motifs, are shown in ( Figure 2B).
Besides altered utilization of specific regulatory motifs, previous analyses of development linked changes in promoter utilization, such as during zygotic genome activation (17), have found it to be associated with changes in the utilization of the W-box, an A/T (WW) rich stretch with similar positional constraint as the TATA-box to the TSS. In order to investigate whether this motif is differentially used, the relative frequency of WW-dinucleotides and TATA-box (>90% PWM match), 120 bp upstream and 50 bp downstream of the dominant site of transcription, in differentially expressed genes was calculated (Supplementary Figure S5). TATA-box frequency analysis was done alongside to differentiate TATA and W-box frequency, as a canonical TATA-box will appear as a WW dinucleotide enrichment in this analysis. This analysis confirmed TATAbox enrichment in genes upregulated in G1 (∼30 bp upstream of the TSS), but no enrichment in WW dinucleotide frequency between the G1 and S/G2/M differentially expressed gene sets (Supplementary Figure S5). This suggests that TATA and A/T rich motifs (such as W-box and TATAlike) may be differentially used between G1 and S/G2/M populations. W-box and TATA-like motifs are distinct from the canonical TATA-box by virtue of a looser motif specificity (17,40). In order to investigate their relative utilization, the approximate location of the TATA-box (20-40bp upstream of the TSS) was analysed for TATA-box position weight matrix (PWM) match ( Figure 4B, Supplementary Figure S4B). This analysis revealed a shift from canonical TATA (>90% match) utilization enriched in G1, to 75-90% match (previously identified as distinct TATA-like and W-box motifs (17,40)  ble S8). Two pentamers (TATAA, ATAAA) were significantly enriched (P < 0.001) in gene promoters upregulated in G1, however in 90% [21/23] (TATAA) to 100% [21/21] (ATAAA) of these cases the sequence formed part of a canonical TATA-box rather than a W-box. Overall this suggests a change in the degree of utilization of canonical vs. non-canonical TATA regulatory elements in genes differentially expressed between G1 and S/G2/M populations.
Canonical TATA-box utilization is strongly associated with sharp promoters, where a single or condensed cluster of TSSs are used in the promoter (7-9) and is associated with high level of expression often associated with structural genes (10). In order to determine whether enhanced TATA utilization is associated with a difference in the shape of promoter utilization, the distribution of TSSs (interquantile width [IQW]) within clusters was determined and compared between samples ( Figure 4D). Additionally, consensus clusters were classified into classes on the basis of the pattern of TSS utilization ( Figure 4E). Classifications were as follows; sharp (at least 90% of the expression from the promoter emanating from TSSs within 10 bp of one another (IQW<10) and ≥60% of expression emanating from a single dominant TSS); Peaked broad (≥60% of expression emanating from a single dominant TSS, but IQW ≥ 10) and Broad (<60% of expression emanating from a single dominant TSS and IQW≥10) (7). This analysis identified a distinct enrichment in sharp promoter utilization in G1 versus S/G2/M and non-significantly differentially expressed genes (43.7% versus 29.0% and 29.2% respectively) ( Figure  4E). In order to determine whether this relative sharpening of promoter utilization in G1 differentially expressed genes was due to increased TATA utilization, relative TATA motif frequency (>90% PWM match) was determined in the promoter proximal region (120 bp upstream and 50 bp downstream of the dominant site of transcription) in G1 versus S/G2/M differentially expressed genes, as previously described, but segregated by promoter shape (sharp, peaked broad and broad) (Supplementary Figure S5). This analysis showed that indeed TATA-box utilization is highly enriched in sharp promoters over other behaviours, interestingly however this is only true in the gene set upregulated in G1. In the S/G2/M upregulated gene set promoter shape was only weakly associated with TATA utilization (Supplementary Figure S5). WW-dinucleotide frequency analysis done in an identical manner also showed no association with promoter shape. Collectively these findings suggest that enhanced TATA utilization and a greater proportion of sharp promoters are associated in genes upregulated in slowly cycling G1 cells ( Figure 4D and E, Supplementary Figure S5).

Differential core-promoter class and regulatory element utilization between S/G2/M and G1 populations is tissueindependent
As illustrated in Figure 3 and Supplementary Figure S3, differential tissue representation contributes alongside cell cycle dynamics in defining differential expression of promoters in the G1 and S/G2/M populations. In order to dissect whether the reported changes in promoter motif utilization and TSS distribution are linked to differential cell cycle dy-namics, or is rather the product of distinct tissue identities, promoter behaviour analysis was repeated on tissuespecific genes alone, non-tissue specific genes, and a tissue independent gene set (Supplementary Figure S6A (i), (ii) and (iii) respectively). Tissue and non-tissue specific gene sets were segregated as described in Supplementary Figure  S3 Fisher's exact test). This analysis also revealed a highly significant enrichment for TATA-box utilization in the G1 population of each gene set (P < 0.0001, Fisher's exact test), albeit larger in the tissue specific population, alongside an enrichment in Sp1 binding site utilization in the tissue independent gene sets (Supplementary Figure S6A and B).
This represents a similar pattern of motif utilization to the complete gene set ( Figure 4A versus Supplementary Figure S6B). Analysis of promoter interquantile width (IQW) distribution in the tissue independent gene set (iii), showed a distinct narrowing and broadening of TSS distribution in G1 and S/G2/M enriched genes respectively, relative to genes with unchanged expression (Supplementary Figure S6B). The G1 sharpening however was less pronounced than in the analysis on the complete gene set (Figure 4D versus Supplementary Figure S6B). The tissue specific gene set was too small to accurately perform IQW analysis. Overall these analyses revealed that the reported pattern of promoter motif utilization and TSS distribution in S/G2/M versus G1 cells is characteristic of these populations, independent of the tissue specificity of the gene sets. However the differential tissue component between populations does somewhat contribute to G1 TATA enrichment and narrow TSS distribution.

Gene promoters with marked change in TSS distribution between populations display differential regulatory element utilization
Analysis of genes differentially expressed between the G1 and S/G2/M population (Figures 3 and 4) show distinct differences in promoter behaviour (sharper promoter utilization in slowly cycling G1 cells) and the regulatory networks associated with their expression (often involving the differential usage of TATA or CCAAT-boxes). As described in Figure 2A, the global distribution of promoter usage is unchanged between populations, however analysis of TSS distribution, in promoters highly expressed (TPM > 10) in both populations, revealed a significant proportion of genes transitioning in promoter shape (sharp, peaked broad and broad) between populations (394/4774, 8.3%) ( Figure  5A). This event is potentially representative of a transition in dominant promoter regulatory network (7-9). Mi- tochondrial fission regulator 2 (mtfr2) for example, a gene associated with cell proliferation in human (41), displays a shape change transition between populations, broadening TSS distribution in the S/G2/M population versus G1. This gene additionally contains a CCAAT-box within its promoter, proximal to the shape change event ( Figure 5B). This population of promoters show context dependent changes in their utilization. They therefore could represent an important population defining alterations in the promoterome (and therefore the transcriptome) of cells undergoing differentiation coupled changes in cell cycle dynamics. In order to determine whether, like differentially expressed promoters, these transitioning promoters are marked by different regulatory machinery to each other, promoter proximal regulatory motif analysis, was performed, as before ( Figure 5C). This analysis revealed, in line with previous studies (10,(12)(13)(14), significant enrichment for the TATA-box motif in genes that remained sharp in both populations and a depletion of TATA in genes where broad promoter utilization is retained ( Figure 5C). Interestingly CCAAT-box and Sp1 binding site motif frequency were also enriched in the population of genes with shared sharp initiation (though to a 17-fold lesser degree than the TATA-box), despite both being associated with broad promoters in vertebrates (6,11,42). Of note YY1 bindingsite motif occurrence is enriched in genes with a condensed TSS distribution in the S/G2/M population relative to G1 (G1 peaked/S/G2/M sharp, G1 broad/S/G2/M sharp, G1 broad/S/G2/M peaked), the latter significantly (n<0.01). Of greatest interest however are gene promoters where there is a significant shift in TSS distribution (from sharp to broad) between populations. Direct comparison of these populations (G1 sharp/S/G2/M broad versus G1 broad/S/G2/M sharp) revealed a significant (P < 0.05) enrichment for CCAAT-box occurrence in the former population ( Figure 5C), further supporting the association of the CCAAT-box motif with differential promoter usage between populations. Gene ontology analysis of these broad-to-sharp shifting populations revealed a mixed set of terms with both proliferation and differentiation genes represented ( Figure 5D). This suggests that transitions in the TSS distribution on the promoter has a subtler effect on transcriptional output, than on differentially expressed genes, however both processes are driven by the utilization of similar regulatory motifs and argue for distinct activity of general transcription factor complexes on promoters, in fast and slow dividing cells.

Differential utilization of alternative promoters between cell cycle dynamic divergent populations
A major source of divergence in the transcriptome of cell populations, particularly during differentiation, is through the use of alternative promoters (31). In order to investigate whether the regulatory differences, observed with differentially expressed and promoter shape transitioning genes, also impact on the utilization of alternative promoters, the relative expression of previously identified alternative promoter containing genes in zebrafish (31), was determined between populations. TPM values were calculated for genomic regions identified to correspond to canonical and alternative promoters, associated with the same genes (n = 1612). Genes with significant expression (TPM > 5 in either the canonical or alternative promoter region, in both the G1 and S/G2/M populations) were selected for further analysis (n = 231). In order to identify genes with differential utilization of alternative promoters, alternative promoter TPM values were normalized to canonical ('alternative promoter relative expression') and compared between populations ( Figure 6A). This analysis identified 79 genes (34% of significantly expressed candidates) with a 2-fold change in the relative expression of the alternative promoter.
In order to determine which of these relative changes in expression were due to an upregulation in the canonical or alternative promoter in either the G1 or S/G2/M populations, these genes were subdivided into four groups as detailed in Figure 6B. They are as follows; relative to the G1 population; Canonical down: S/G2/M alternative promoter expression is unchanged, but canonical expression is depleted (n = 12); Alternative down: S/G2/M alternative promoter expression is depleted, but canonical expression is unchanged (n = 22); Canonical up: S/G2/M alternative promoter expression is unchanged, but canonical expression is enhanced (n = 10); Alternative up: S/G2/M alternative promoter expression is enhanced, but canonical expression is unchanged. Situations falling outside of these criteria were discarded (n = 9). Full selection criteria are detailed in the materials and methods. In order to determine whether, like differentially expressed and shape changing promoters, these alternative promoter utilization events are marked by differential regulatory machinery to each other, promoter proximal regulatory motif analysis was performed as before on the promoter, where the change in expression is occurring ( Figure 6C). This analysis revealed a significant enrichment for TATA utilization in genes where the canonical promoter is depleted in S/G2/M (P < 0.05) and significant CCAAT utilization enrichment where the alternative promoter is enriched in S/G2/M (P < 0.01). An example of the latter is shown in Figure 6D. The utilization of the canonical promoter of inositol polyphosphate-5-phosphatase B (inpp5b) is unchanged between population, however expression from an alternative, CCAAT-box containing, promoter is enhanced in S/G2/M. Publicly available RNA-seq data from (17,31) suggests that this promoter drives an alternative segmentation period specific transcript from the inpp5b gene ( Figure 6D). Further analysis of the canonical promoter of inpp5b also reveals that despite overall expression being unchanged, it harbors an S/G2/M specific isoform proximal to two CCAAT-boxes ( Supplementary Figure S7). The RNA-seq data suggest this may also be a segmentation period specific isoform of inpp5b. This analysis further supports a role for differential TATA-and CCAATbox core-promoter utilization in defining disparate transcriptional output, between populations segregated by cell cycle dynamic behaviour, during the segmentation period of embryo development.

Genes with temporal expression dynamics during zebrafish embryonic segmentation are marked by differential promoter motif utilization and TSS distribution
As previously discussed, the zebrafish FUCCI system revealed a clear transition from a predominantly rapidly cycling cell population in early segmentation stages, to  (17,31) show that this alternative promoter drives a somitogenesis specific transcript. an increasing predominance of slowly cycling cells in late segmentation/early pharyngula stages ( Supplementary Figure S1A, movie 1, (21)). We therefore wanted to determine whether the cell cycle associated transitions in promoter utilization, demonstrated in this study, are specific to the 14-somite stage of zebrafish development, or rather reflect a transition in promoter behaviour over the segmentation period, marked by cell cycle dynamic changes. In order to investigate this, we used previously published CAGE data from the 4-somite, 14-somite and Prim-5 stages of zebrafish development (30,31). This data was processed and analysed in the same manner described previously, to generate a dataset where promoter behaviour can be compared between segmentation stages.
In order to confirm the selected stages have suitably distinct cell cycle dynamic behaviour, FUCCI transgenic embryos from the selected stages were imaged and the ratio of green (rapid cycling) to red (slow cycling) fluorescence pixel density was measured ( Figure 7A  Analysis of the multi-segmentation-stage CAGE revealed that 8130 consensus clusters were shared between the 4somite, 14-somite and Prim-5 stages. Comparative analysis of TPM values in each sample revealed two populations of genes with temporal expression dynamics over the segmentation period of embryo development. These are those with expression lost over time (1.5-fold expression level change between each stage, 4-somite > 14-somite > Prim-5, n = 93) and those with expression gained over time (4-somite < 14-somite < Prim-5, n = 434) ( Figure 7A). Analysis of promoter motif utilization revealed a significant enrichment in the utilization of the CCAAT-box in genes with expression lost over time (P < 0.01), as the embryo transitions from a predominance of rapid to slowly cycling cells, in agreement with previous analyses in this study ( Figure 7C). Conversely, genes with expression gained over time were significantly enriched for TATA-box utilization (P < 0.001), again in agreement with this study's data on motif association with cell cycle dynamics. Interestingly Sp1 motif utilization, previously shown to be associated with slowly cycling cells, was enriched in both gene sets relative to genes with unchanged temporal expression, suggesting some divergence in the gene sets selected by temporal versus cell cycle dynamics. Assessment of promoter interquantile widths in these gene sets revealed a subtle, but measurable broadening of TSS distribution in genes lost versus gained over time (average IQW = 27.5 versus 23) ( Figure 7D). It is not as decisive as in the gene set selected by cell cycle dynamics, but follows the same pattern.
Comparison of gene ontology between the two gene sets revealed a remarkable overlap in terms associated with slowly cycling cells and genes with expression gained over time, both being strongly associated with differentiation, particularly of muscle tissues. There was considerably more divergence between terms associated with rapidly cycling cells and those with expression lost over the segmentation period, the latter being associated with the regulation of morphogenesis and gastrulation, which has just finished at the 4-somite stage, rather than the regulation of cell proliferation marking rapidly cycling cells ( Figure 7E). In agreement with this, Chi-square analysis of temporal and cell cycle dynamic gene set overlap revealed a highly significant association (chi-square statistic: 135.96, P < 0.00001) between genes enriched in slowly cycling cells and those with expression gained over the segmentation period of embryo development ( Figure 7F). The promoter behaviour of both these gene sets was defined by enhanced TATA utilization and narrow TSS distribution. There was no significant overlap between the gene sets enriched in rapidly cycling cells and with expression lost over the segmentation period. Strikingly however both are characterized by enhanced CCAATbox utilization and broadening of TSS distribution. These finding suggest that the divergence in promoter behaviour described in this study does not only reflect differentiation associated changes in cell cycle dynamics occurring at the mid-segmentation stage (14-somite), but also marks a temporal switch in gene expression behaviour occurring over this period of embryonic segmentation.

DISCUSSION
The developing zebrafish embryo displays extensive transitioning in cell cycle dynamics associated with cell differentiation. Previous studies have described how this process is accompanied by changes in transcriptional output, but this study is the first to show the role the core-promoter plays in defining this output through local regulatory changes. In this investigation, cells from segmentation period FUCCI transgenic embryos were dissociated and segregated by cell cycle stage (G1 versus S/G2/M). Differential expression analysis on these populations revealed a striking separation in the identity of these cells. Beside differences in cell cycle stage, slowly cycling (G1) cells showed extensive specification in terms of gene expression to different tissues, associated with increased representation of tissue-specific regulatory motifs in their promoters. Rapidly cycling (S/G2/M) cells, on the other hand, were far less specified, with overwhelming representation of ubiquitously expressed genes, associated with DNA and chromatin processing required for rapid proliferation. This divergence in cellular identity was found to be concurrent with changes in promoter behaviour. Differentially expressed genes were sharper in the G1 population and associated with an increase in TATAbox utilization, alongside Sp1 binding sites. In contrast, the broader promoter usage in S/G2/M upregulated genes was accompanied by a far greater utilization of CCAATbox general transcription factor binding sites and a lesser dependence on TATA. TATA-like and W-box motif frequency was found to be enriched in rapidly cycling cells however. These changes in promoter behaviour were found to be predominantly associated with differential cell cycle behaviour, rather than tissue lineage identity. In addition, global analysis of changes in promoter associated TSS distribution (shape change) and usage of alternative promoters, revealed these again to be associated with divergent utilization of TATA and CCAAT-box TF binding sites. Additionally, this pattern of promoter behaviour was found to be replicated in genes with temporal changes in expression over the segmentation period of zebrafish embryo development, Figure 7. Genes with opposing temporal expression dynamics are marked by differential promoter motif utilization and TSS distribution. Published CAGE data from (30,31) was processed to generate a differential expression and promoter behaviour matrix for 4-somite, 14-somite and prim-5, three stages of zebrafish segmentation with differing ratios of rapid to slowly cycling cells, as demonstrated by Fucci imaging (A) and fluorophore pixel ratios Collectively this data suggests a strong divergence in the utilization of general transcription factor binding sites, TATA and CCAAT, as well as Sp1 binding sites, between cells undergoing differentiation coupled changes in cell cycle dynamics. This divergence impacts on the transcriptome and promoterome of these cells, resulting in differential gene expression as well as differential canonical and alternative promoter utilization.
Promoter proximal regulatory elements have previously been associated with marking genes with cell cycle periodic expression. The promoter-associated cell cycle-dependent element (CDE) and the cell cycle genes homology region (CHR) have both been found to regulate genes with maximum expression in G2-M phase, through cell cycle stage dependent binding of transcription factors (reviewed in (43)). CCAAT-boxes have also been found to play a role, in association with these factors. Three CCAAT-boxes, along with a single cell cycle gene homology region (CHR), were found to be major regulatory sites for the transcription of human cyclin B2, through NF-Y binding (44). Additionally CCAAT/enhancer binding protein ␤ (C/EBP␤) was found to be a key driver of stromal cell progress through the cell cycle (45). This suggests that promoter based regulatory signalling is key to controlling gene expression cell cycle periodicity. This study extends this by displaying a role for CCAAT-boxes in genes differentially regulated during differentiation coupled changes in cell cycle dynamics, associated with broad TSS distribution across the promoter.
This study shows that cells going through the striking changes in cell cycle dynamics, occurring embryo wide, during the segmentation period, have differential tissuespecification. This may suggest a marked transition in multiple tissue progenitors during this process. Rapidly cycling cells, show weak tissue-specificity despite their clear segregation to distinct tissue domains such as the eye cup, notochord and brain tissues. Additionally, they show a transcriptional profile dominated by ubiquitously expressed DNA processing genes and cell cycle regulators (46). This population of genes has been found to be downregulated at the onset of organogenesis in mouse and shortly after gastrulation in drosophila (47), highlighting the modulation of this class of genes as a key marker of tissue-specification. Slowly cycling cells on the other hand show a far more tissue-specified transcriptional program, linked with a greater utilization of the TATA-box associated gene regulatory machinery.
Single cell RNA-seq analysis of early stage embryo development (from high stage to 6-somite) does reveal an early specification point, differentiating notochord and brain tissues (both rapidly cycling) from somites and cardiac tissues (both slowly cycling) (48). This data supports the idea that the spatial segregation of cells based on cell cycle dynamics is associated with lineage-specification. As stated however, this analysis only extended to the 6-somite stage, when the vast majority of cells are rapidly cycling (Supplementary Figure S1A). In combination with the findings of this investigation, this data suggests that while lineages are specified early in development by small transcriptional changes (but with cell cycle and DNA processing genes predominant and no discernable cell cycle dynamic changes), it is at the point of tissue differentiation that a major shift in transcrip-tional output occurs, associated with an elongation of the G1 phase. In the future, combining the approaches used in both investigations, possibly using newly developed single cell CAGE protocols (such as C1-CAGE) (49) will permit this process to be further explored.
Investigations into differentiation associated changes in gene regulatory programs, from terminally differentiated gamete, to pluripotent stem cell (17) and lineage defined multipotent stem cell, to terminally differentiated tissue cell (5,22), have also found dramatic changes in gene regulatory program. These have been associated with changes in the utilization of TATA, TATA-like and W-box machinery (5,17,22). Terminally differentiated gametes reprogram to pluripotent cells in the early embryo, marked by W-box restricted programs being replaced with open CpG island associated promoter utilization, restricted only by +1 nucleosome positioning. This potentially primes pluripotent cells for a more diversified repertoire of behaviour as their lineage is specified (17). This process is also marked by a transition from rapid synchronous cell cycles, to slower asynchronous cycling, in zebrafish embryos. In this paper we show that cells then slow their cycling to differentiate and defining tissues as the body map starts to form, and this process is marked by an upregulation of TATA driven expression of tissue-specific genes and down regulation of ubiquitously expressed DNA and chromatin processing machinery. Studies, investigating transitions from lineage defined multipotent stem cells to terminally differentiated cells in both muscle (22) and liver development (5) have shown that this process is marked by wholesale depletion of RNA polymerase II regulatory cofactors, in particular TATAassociated factors. This is accompanied by a restriction to a small cohort of functional regulatory elements, leading to a limitation of the transcriptional repertoire, to be highly specialized and tissue-specific, in terminally differentiated cells. Collectively, this suggests that differentiation transitions in embryonic development are intimately associated with promoter level changes in the gene regulatory program, with the TATA-box a key component.
The TATA-box has previously been characterized as playing a crucial role in defining the tissue-specificity of associated genes, with precise TATA-TSS spacing a vital factor (8). Additionally, tissue-specific genes in D. melanogaster and mammalian systems are often associated with the presence of a promoter proximal, spatially constrained TATA-boxes, alongside sharp transcription initiation, differentiating this class of promoter (type I) from non-TATA dependent housekeeping (type II) or developmentally regulated (type III) gene associated promoters (12,(50)(51)(52). This paper extends the tissue-specification/TATA-box association story by showing that an increase in TATA-box utilization is one of the major defining factors in specifying differentiating cells at the earliest stages of tissue-lineagespecification.
Interestingly, in contrast to TATA-box utilization, TATA-like and W-box motifs were found to be enriched proximal to the promoters of genes upregulated in rapidly cycling cells. TATA-like elements have previously been associated with more dispersed TSS distribution on promoters, in particular multi-modal promoters (two or more dominant TSSs in the promoter) (8). Additionally, transcription mediated by TATA-like motifs had been found to predomi-