Assembly of a parts list of the human mitotic cell cycle machinery

Abstract The set of proteins required for mitotic division remains poorly characterized. Here, an extensive series of correlation analyses of human and mouse transcriptomics data were performed to identify genes strongly and reproducibly associated with cells undergoing S/G2-M phases of the cell cycle. In so doing, 701 cell cycle-associated genes were defined and while it was shown that many are only expressed during these phases, the expression of others is also driven by alternative promoters. Of this list, 496 genes have known cell cycle functions, whereas 205 were assigned as putative cell cycle genes, 53 of which are functionally uncharacterized. Among these, 27 were screened for subcellular localization revealing many to be nuclear localized and at least three to be novel centrosomal proteins. Furthermore, 10 others inhibited cell proliferation upon siRNA knockdown. This study presents the first comprehensive list of human cell cycle proteins, identifying many new candidate proteins.


Introduction
Mitotic cell division is a process common to all eukaryotic organisms and achieved through a highly orchestrated series of events classified into four sequential phases: G1 (gap phase), S (DNA replication), G2, and M (mitosis). The concerted action of hundreds of proteins is required to drive the process through to a successful conclusion, many of which are expressed in a phase-specific manner. They mediate processes such as DNA replication and repair, chromosome condensation, centrosome duplication, and cytokinesis. Dysregulation or mutation of genes encoding proteins essential for high-fidelity DNA replication is often associated with disease, in particular cancer (Vermeulen et al., 2003;Delaval and Birnbaum, 2007). Accordingly, known components of this system are important therapeutic targets (Wiman and Zhivotovsky, 2017) and novel components might present new therapeutic opportunities.
Many of the key proteins required for mitotic division are known from studies in model organisms including yeast, as well as mammalian cells (Nurse and Thuriaux, 1980;Evans et al., 1983). With the aim of identifying all the components of the system, high-content analysis platforms have been utilized. For example, RNAi screens (Kittler and Buchholz, 2005;Lents and Baldassare, 2006;Neumann et al., 2010), CRISPR/Cas9 (McKinley and Cheeseman, 2017), proteomics (Dephoure et al., 2008;Pagliuca et al., 2011;Ly et al., 2014) studies have all proposed lists of cell cycle genes/proteins but a consensus between studies has not emerged. In particular, genome-wide transcriptomics studies (Spellman et al., 1998;Ishida et al., 2001;Whitfield et al., 2002;Bar-Joseph et al., 2008;Peña-Diaz et al., 2013;Grant et al., 2013;Dominguez et al., 2016) have identified the sets of transcripts sequentially regulated during the cell cycle phases in multiple species but comparison of results from four studies of different human cell lines identified only 96 genes in common (Grant et al., 2013). Our reanalysis of these data, taking account of some of the technical variables, suggested that the true concordance of the cell cycle transcriptional network across cell types is much greater (Giotti et al., 2017). Furthermore, our analyses of large collections of tissue and cell transcriptomics data commonly identify a large cluster of cell cycle transcripts whose expression is elevated in cells or tissues with a high mitotic index (Balakrishnan et al., 2013;Doig et al., 2013;Mabbott et al., 2010Mabbott et al., , 2013. We report here on a data-driven curation exercise aimed at identifying the cohort of genes upregulated in all human cell types from the G1/S boundary through to the completion of M phase, hereafter referred to as S/G2-M genes. There are, of course, many growth-associated transcriptional regulatory events including activation of cyclins, cyclin-dependent kinases, and E2F transcription factors, that occur during G1 and are a precondition for entry into S phase (Bertoli et al., 2013), but these are not the focus of this study. We monitored genomewide gene expression in primary human dermal fibroblasts (NHDF) cells as they synchronously enter the cell cycle from a resting state (G0). Using network co-expression analysis and clustering of the data, we identified a cell cycle-enriched cluster associated with the S/G2-M phases. To refine this initial list, we identified those genes that were robustly co-expressed when their transcription was examined across multiple different human primary cell types in the promoter-based FANTOM5 transcriptional dataset (Forrest et al., 2014) and in synchronized murine fibroblasts. Manual curation of these data resulted in a list of 701 genes strongly associated with the S/G2-M phase transcriptional network. Of these, 496 encode proteins with known functions within cell division, 145 of which were not identified in any of the previous human cell cycle transcriptomics studies. Of the remaining 205 genes, 53 encode functionally uncharacterized proteins. To further validate this discovery set, we examined their expression across a range of human tissues and in mouse embryonic tissues during development. We also performed functional assays including protein localization by overexpression of GFP-tagged proteins and knockdown by RNAi.

Identification of cell cycle-regulated genes in primary human fibroblasts
Two time-course microarray experiments were performed on populations of normal human dermal fibroblasts (NHDF) synchronized by serum starvation, as used previously for such studies (Iyer et al., 1999;Bar-Joseph et al., 2008). Partial cell synchronization was confirmed by flow cytometric analysis of propidium iodidestained cells. Following serum starvation approximately 40% fewer cells were in the DNA replication phase (S) than in unsynchronized populations, and 24 h after the re-addition of serum the number of cells undergoing division had increased by 3-4 folds (S and G2-M phases) relative to the starved state ( Figure 1A). Data derived from two transcriptomics experiments, one monitoring the cells every 6 h for 48 h following release from starvation, the second every 2 h over a period of 24 h, were subjected to quality control and corrected for batch variation. The datasets were combined and analysed together using Graphia Professional, a tool designed to analyse numerical data matrices into correlation networks (Freeman et al., 2007). A sample-tosample correlation network confirming the correspondence between the two experiments and time-dependent transcriptional changes is shown in Figure 1B.
A gene correlation network (GCN) was then generated using a threshold of r ≥0.88. This value is well above the distribution of random correlations (Supplementary Figure S1A) and set to minimize the number of edges, while retaining a large number of nodes (Supplementary Figure S1B). After manual removal of non-cell cycle-related expression modules, MCL clustering (Enright et al., 2002) of the network was used to define the main phases of transcription associated with the cell cycle, generating 23 gene clusters. The three largest clusters accounted for 96% of the genes (probesets): NHDF_C1 (G0; 1270 nodes, 1176 unique Entrez IDs), NHDF_C2 (G1; 1793 nodes, 1739 unique IDs), and NHDF_C3 (S/G2-M; 1052 nodes, 963 unique IDs) ( Figure 1C). The average gene expression profile of the three clusters over the first 24 h following the re-addition of serum is shown in Figure 1D. NHDF_C1 comprised of genes induced during the starvation period (0 h), but downregulated soon after the re-addition of serum to the growth medium. The average expression of genes within NHDF_C2 peaked around 6 h postrefeeding, consistent with gap (growth) phase (G1) (Campisi et al., 1984). NHDF_C3 included genes that were induced between 12 h and 20 h post refeeding, many of which remained elevated in their expression at later time points. Enrichment analysis performed on each gene cluster reported highly significant GO_BP term enrichments for all three clusters, the most significant of which are shown in Figure 1E. NHDF_C1 (G0-associated) was highly enriched with genes involved in lipid metabolism, such as 'lipid metabolic process' and 'sterol biosynthetic process', supporting the evidence that these pathways are activated to adjust cellular metabolism during the starvation period (Chang et al., 2002) ( Figure 1E). NHDF_C2 was enriched in biological processes associated with cell growth, such as 'ribosome biogenesis', 'macromolecule metabolic process', 'cellular component organization or biogenesis', and 'intracellular transport' and included many of the known regulators of G1 including E2F3 and CDK6 (Meyerson and Harlow, 1994;Leone et al., 1998). Finally, NHDF_C3 was highly enriched with terms such as 'chromosome organization', 'DNA repair', 'centrosome organization', 'telomere organization', 'DNA strand elongation', 'spindle assembly', and 'cytokinesis' ( Figure 1E). A more granular cluster analysis of this co-expression network (MCLi 2.2) is presented in Supplementary Figure S1C and Table S1. We then fragmented NHDF_C3 by applying the MCL algorithm with more stringent clustering and highlighted sub-clusters representative of G1/S transition, S and G2-M phases (Supplementary Figure S2A), as they contained many of the cell cycle genes known to be induced at those phases. Accordingly, their pattern of expression showed a sequential induction beginning at 12-14 h for G1/S-related clusters (NHDF_C3c/d), at 18 h for the S cluster (NHDF_C3a) and at 20-22 h for the G2-M phase cluster (NHDF_C3b) (Supplementary Figure S2B)  Samples of starved cells (0 h) and early proliferative populations (1-12 h), form distinct sub-groupings in the network, with a clear progression from early to late time points. Synchrony is lost at later time points, and samples group with unsynchronized populations. (C) Correlation graph of the transcriptional network of synchronized fibroblasts from a quiescence through to mitosis. The graph divides in three large clusters: NHDF_C1 (yellow) corresponds to genes whose expression is greatest in quiescent fibroblasts and decreases during the entry into mitosis (G0); NHDF_C2 genes (green) expression is associated with G1, their expression peaking between 1 h and 8 h after the addition of serum; and the expression of genes in NHDF_C3 (red) start to rise from the beginning of the G1/S transition to mitosis. Nodes represent individual probesets. (D) Corresponding average expression profiles of genes in NHDF_C1-3. (E) GO enrichment analysis on the gene content of the three clusters.
were many known cell cycle checkpoint genes (as annotated by gene ontology (GO)), which, expectedly, were found overall enriched in the corresponding clusters (Supplementary Figure S2C and Table S1). Many of these genes were annotated as being associated with multiple checkpoint pathways, such as CHK1, activated in DNA damage pathways both during S phase and at mitosis onset. Transcripts within NHDF_C3 plus all nodes immediately adjacent to them (n + 1), representing 1207 unique Entrez IDs, were then selected for further analysis. A more granular cluster analysis of this co-expression network (MCLi 2.2) is presented in Supplementary Figure S1C and Table S2.
Refinement of the core cell cycle gene signature To refine the candidate list of NHDF S/G2-M phase-associated genes and eliminate genes that may be specific to differentiated fibroblast function, we examined their expression using the FANTOM5 consortium promoter level CAGE data (HCF5), derived from more than 100 different primary human cell types (495 samples) (Forrest et al., 2014). The 1207 genes identified in the NHDF data were mapped to the FANTOM5 dataset returning 3145 promoters with expression >5 TPM (tags per million) in at least one sample. These data were subjected to network analysis (r > 0.5). The graph contained 2889 promoters (nodes) and 175516 edges from which the MCL cluster algorithm (MCLi = 1.7) also generated 23 clusters ( Figure 2A). HCF5_C1 (1230 promoters of 745 genes) was enriched for cell cycle-associated genes ( Figure 2B). Of the other clusters, only HCF5_C2 exhibited any enrichment for the GO_BP term 'cell cycle' but at a much lower significance ( Figure 2B). The average expression of HCF5_C1 gene promoters was greatest in highly proliferative cell populations such as embryonic stem cells, epithelial cells, and a population of CD34 + hematopoietic stem cells. In contrast, monocytes displayed minimal expression of these genes, reflecting the low rate of proliferation in these populations (Swirski et al., 2014) ( Figure 2C, top profile). The remaining HCF5 clusters contained promoters with a diverse range of expression profiles ( Figure 2C). Many genes within HCF5_C1 also included alternative promoters (254 genes, 526 promoters) with distinct expression profiles that clustered independently ( Figure 2D). Highlighted are six genes with known functions in the cell cycle, three of which, RFC2, MCM5, and MCM7 encode proteins known to be required for DNA replication (Fragkos et al., 2015). The alternative promoters were most highly expressed in immune cell types ( Figure 2E).
Based upon the merge of the two datasets, the initial list of 963 genes (1052 probesets) generated from the analysis of NHDF cells, was reduced to a list of 745 genes with promoters in HCF5_C1 from the FANTOM5 data where the expression was tightly correlated across diverse human cell populations.

Manual curation of the S/G2-M cell cycle list
The 745 genes identified above were individually curated. We removed 198 genes that were induced late in G1 and in advance of the likely onset of S phase. Conversely, we restored 132 genes. The literature or other data (see below) indicated that they function in the cell cycle, and individual examination of the FANTOM5 data indicated that they were, indeed, relatively more expressed in proliferating cells, albeit not included in HCF5_C1.
To examine the inter-species conservation of the S/G2-M transcriptional network, an additional transcriptomics experiment was performed on synchronized mouse embryonic fibroblasts (MEF). The majority of mouse/human orthologues showed a conserved expression pattern across the cell cycle ( Figure 3A). The transcriptional network of the mouse fibroblast data was similar in topology to the NHDF data (Supplementary Figure S3A and B) and an additional 22 known cell cycle genes were observed to co-cluster with the S/G2-M phase genes in these cells.
The merged outcomes of the comparative analysis and manual curation of these data produced a set of 701 cell cycleregulated genes (see Supplementary Table S3). The genes were then assigned to either 'S' or 'G2-M' phases by correlating them with the expression of known cell cycle phase-specific factors: CDC25A and BRCA1 (S phase), and CDK1 and CCNB1 (G2-M phase) (Blomberg and Hoffmann, 1999;Xu et al., 2001;Stark and Taylor, 2006). Accordingly, 380 genes were assigned to S phase and 321 to G2-M phase (Supplementary Figure S4A). The two sets of phase-associated genes were analysed for enrichment of known binding motifs. Both sets were significantly enriched with cell cycle transcription factor binding sites. Among others, S phase genes were shown to be highly enriched for E2F sites (P-value = 1E−59) and the G2-M genes for CHR (P-value = 1E−15) and NFY (P-value = 1E−24) sites (for detailed results see Supplementary Figure S4B). Phase annotation was also consistent overall with those of previous cell cycle studies (Supplementary Figure S4C).
After a systematic database and literature-based curation of the gene list, the majority (496) were found to be functionally associated with a cell cycle-related process ( Figure 3Bi). For example, 'DNA damage' and 'DNA replication' linked predominantly with S phase annotated genes, and 'Chromosome partition' and 'Spindle assembly and regulation' being associated mainly with G2-M phase ( Figure 3Bii). Other categories included a similar number of genes induced at either phase, such as 'Cell cycle regulation'. For 205 genes little or no direct evidence of a direct involvement in the cell cycle could be found, although in some cases there was circumstantial evidence to support this relationship, e.g. publications showing their expression to be elevated in cancer. These genes are classified as 'putative' cell cycle genes. Others in this category encode proteins that function within pathways that potentially relate to cell division, e.g. apoptosis, whilst the association of yet others would appear more tenuous, e.g. RNA processing and immunity. For 53 genes, no functional information was found.
There is a significant overlap between our S/G2-M gene list and the cell cycle gene lists generated by previous studies (Figure 3Biii). However, the current list showed a greater enrichment of genes with the GO_BP term cell cycle when compared representing the promoters of the S/G2-M phase-associated genes identified in the NHDF data and their correlated expression in the context of the FANTOM5 primary cell atlas. Nodes represent individual promoters, their colour representing membership to co-expression clusters. (B) GO enrichment analysis for the GO_BP term 'cell cycle' on each of the 23 clusters identified, Cluster 1 to be highly enriched in cell cycle genes. (C) The expression profile of the HCF5_C1 promoters showed them to be transcribed in a wide variety of primary cells with highest expression in embryonic cells and a number of epithelial cells, but a relatively low expression in monocytes (top). In contrast other clusters, not enriched in cell cycle gene promoters, exhibited a different pattern of expression. The average expression of clusters HCF5_C2, 3, and 4 promoters was greatest in immune cell populations (middle). Others (bottom) exhibited cell type-specific expression, e.g. hepatocytes (HCF5_C7), adipocytes (HCF5_C9), whole blood (HCF5_C12) and melanocytes (HCF5_C22). (D) Nodes in the graph shown in A were colourcoded to show differential promoter expression. HCF5_C1 (green nodes) is comprised of 1230 promoters corresponding to 745 genes, the red nodes represent an additional 526 promoters associated with 254 of the HCF5_C1 genes. (E) Promoter expression profiles of six genes being driven by promoters associated with the cell cycle (green profile) and expression of their alternative promoters (red profile).

Validation of the S/G2-M transcriptional network using independent data
To further validate the conservation of co-expression of the S/G2-M gene list, we explored two additional datasets. The first was a human tissue expression atlas (HTA) of RNA-seq data derived from 95 samples of 27 human tissues (Fagerberg et al., 2014). Of the 701 S/G2-M genes defined here, 655 were identified in these data and their co-expression examined. At a correlation of r ≥ 0.5, 641 genes were present in the graph, which divided into two MCL-defined clusters encompassing 549 genes ( Figure 4A). The genes in these clusters were expressed widely, with an elevated expression level associated with proliferative tissues ( Figure 4B). HTA_C1 was comprised of genes whose expression in the testis was higher ( Figure 4B, top) as compared to the expression of HTA_C2 genes, which showed highest expression in bone marrow and lymph node ( Figure 4B, bottom). The majority of known S/G2-M genes clustered together and significantly, so did the putative cell cycle genes ( Figure 4C and D), supporting their association with this system. A second promoter level dataset produced by the FANTOM consortium (MDF5) (Forrest et al., 2014), comprised of 17 mouse tissues sampled at multiple intervals during embryogenesis and post-neonatal development. Again the data for only the S/G2-M genes (2141 promoters mapping to 658 genes) was examined. Similarly, the majority of the promoters/genes co-clustered, with the exception of a few small clusters ( Figure 4E). In general, the promoters in MDF5_C1 exhibited the highest expression in embryonic tissues, their expression markedly decreasing with developmental age, a pattern reflecting the reducing rate of proliferation during development ( Figure 4F, top profile). A notable exception to this was in the case of the spleen and thymus, where expression peaked around birth. In line with observations in the human tissue atlas dataset, a portion of S/G2-M genes (MDF5_C2) were predominately expressed in adult testis ( Figure 4F, middle profile). Multiple smaller clusters, the majority of which were associated with alternative promoters of cell cycle-associated genes, exhibited tissue-specific promoter expression ( Figure 4F, bottom profile). Again putative S/G2-M genes were co-expressed with the known cell cycle genes ( Figure 4G and H). Co-expression of the S/G2-M genes within the context of all genes is shown in Supplementary Figure S6A and B for the HTA dataset and in Supplementary Figure S6C and D for the MDF5 dataset. Promoter analysis of the 205 putative cell cycle genes alone showed that they were enriched in known S/G2-M transcription factor binding sites, i.e. for E2Fs and NFY, further supporting their associated with cell division (Supplementary Figure S6E).

Experimental corroboration of the uncharacterized S/G2-M genes
Many gene products required for S/G2-M phase are localized to specialized cell cycle-associated organelles or structures. For example, chromosome segregation during mitosis requires the formation of kinetochores at centromeres and the correct attachment of kinetochores to spindle microtubules emanating from microtubule organizing centres, e.g. centrioles and centrosomes. Accordingly, we tested the subcellular localization of 28 candidate genes by cDNA transfection in HEK293T cells. As positive controls, we included CENPA, TACC3, DONSON, and MGME1 (Piekorz et al., 2002;Foltz et al., 2006;Fuchs et al., 2010;Kornblum et al., 2013), the localization of which were confirmed by these assays (Supplementary Data S1). Each ORF was tagged with GFP at both the C-and N-terminals (Simpson et al., 2000;The ORFeome Collaboration, 2016). After inspection of the expression of the 56 protein constructs (two per clone), their subcellular localization was analysed (Supplementary Data S1). As summarized in Figure 5A, nuclear localization was the most frequently observed (15/28) followed by centrosomal-like localization (11/28) and cytosol (9/28). In some instances, localizations were congruent with organelles such as the ER, Golgi apparatus, vesicles, and mitochondria, possibly representing non-specific protein deposits. In around half of cases, the C-and N-terminal tagged proteins produced the same localization ( Figure 5A). No cases of completely discrepant localizations between the two constructs were observed. For the 11 constructs showing centrosomal-like staining, we examined their localization along with centrosomal marker γ-tubulin. Of these, C18orf54, C3orf14, and CCDC150 clearly co-localized with γtubulin during different mitotic stages, i.e. prophase, prometfaphase, and metaphase ( Figure 5B-D). For the other constructs, no clear co-localization with γ-tubulin was demonstrated (not shown) which may be explained by peri-centrosomal localizations (e.g. in the case of C9orf40, see Supplementary Data S1). Confocal images of all 28 proteins screened can be found in Supplementary Data S1.
To examine whether reducing the expression of the novel S/G2-M phase genes affected cell proliferation, we tested the effect of mRNA knockdown in human fibroblasts using esiRNAs. We achieved around 80% knockdown efficiency in all cases examined ( Figure 5E). As a positive control, knockdown of cyclin B1 (CCNB1) produced a strong inhibition of cell proliferation compared to control esiRNAs ( Figure 5F) and transfection had no effect on cell viability ( Figure 5G). Of the 39 knockdowns of known cell cycle-regulated genes tested, 12 (ARHGAP11A, CCNB1, CCNE1, CENPA, CEP85, ESPL1, FAM111A, FIGNL1, FOXM1, KIF11, MAD2L1, and REEP4) had a significant impact on the rate of cell proliferation. Similarly, of the 22 uncharacterized cell cycles genes, 10 (C17orf53, CCDC77, DEPDC1B, FAM72B, GSTCD, NEMP1, RIBC2, RPL39L, UBR7, and ZNF367) significantly inhibited proliferation ( Figure 5E; results of these analyses in Supplementary Data S2). The Mitocheck database is a resource listing the cellular phenotypes from a genome-wide RNAi-screen of human proteins, recorded by high-throughput live cell imaging (Neumann et al., 2010). Of the known gene components listed for which results were available, 136/490 (27.8%) resulted in one or multiple cell phenotypes pointing to cell cycle defects. Among the candidate cell cycle genes, 12.8% were associated with a cell cycle phenotype. For example, ZNF85 silencing led to abnormal chromosome segregation and mitotic metaphase plate congression, knockdown of ZNF90, UBALD2 and CCDC34 led to cell death and ZNF738, CCDC150, and ZNF788 knockdowns resulted in abnormalities in the size and shape of nuclei. Mitocheck results have been added to the gene list presented in Supplementary Table S3. Finally, ToppGene (Chen et al., 2009) was used to search for phenotypes significantly associated with mutations in the S/G2-M genes. In man, 79 phenotypes were identified, the most significant of which included embryonic growth defects, e.g. microcephaly, growth retardation, and various cancers. In mouse, 242 phenotypes were recorded as enriched, the most significant were abnormal cell cycle, embryonic lethality, and abnormal nuclear morphology, again supporting a strong association with cell cycle defects. A full list of the phenotypes enriched for this list and the genes associated with them are provided in Supplementary Table S4.

Discussion
The cell cycle is perhaps the most fundamental of all biological processes and functional orthologues of many of the core components are conserved across species. Curated databases list and classify the function of cell cycle components (Ashburner et al., 2000) and place them into pathways (Kanehisa, 2002;Fabregat et al., 2016). In every system studied, from yeast to man, there are numerous genes required for cell division that are transcriptionally regulated and associated with a given phase of the cell cycle (Spellman et al., 1998;Iyer et al., 1999). It could be argued that all genes involved in anabolic processes are cell cycle-related, since an increase in cell size is usually a precondition for cell division. Similarly, genes regulating entry into the cycle, e.g. growth factors, are often considered to be cell cycle proteins. In the context of this work, we use the term to refer only to the set of proteins that are required when a cell commits to undergo mitosis (Giotti et al., 2017). Accordingly, we have sought to define the core set of cell cycle genes expressed during mammalian S/G2-M, demonstrating them to form a highly correlated transcriptional network across tissues and cell types. As a gene signature, S/G2-M genes effectively define the mitotic index of a cell population.
The gene expression patterns observed here in fibroblasts were broadly consistent with previous studies using the same cell type and synchronization method (Iyer et al., 1999;Bar-Joseph et al., 2008). However, a wound-healing response, triggered by the serum, may confuse efforts to identify cell cycle-related transcripts in fibroblasts (Whitfield et al., 2002). To circumvent this issue, we complemented our analysis by examining the co-expression of the fibroblast-derived S/G2-M associated genes using the FANTOM5 primary cell atlas (Forrest et al., 2014) to remove genes that showed evidence of cell-specificity in their expression. These analyses were further refined by comparison to expression studies in synchronized mouse fibroblasts and detailed examination of the primary data. The result is a list of 701 S/G2-M-regulated genes, which are highly enriched in relevant GO terms and transcription factor binding sites. Based on manual curation of published reports, 496 of these genes encode 'known' cell cycle proteins, many listed as such in annotation databases, e.g. GO and UniProtKB. This list partially overlaps with the findings of previous transcriptomics studies on human cells but interestingly, transcriptional regulation of 145 of the known S/G2-M associated genes was not detected in any of the earlier reports (Whitfield et al., 2002;Bar-Joseph et al., 2008;Peña-Diaz et al., 2013;Grant et al., 2013;Dominguez et al., 2016). The majority of previous studies sought to define cell cycle genes as having a wave-like expression profile over multiple rounds of cell division, using Fourier transform-based methods to identify them. However, cell division in populations of cells rapidly becomes asynchronous, and a fraction of them do not commit to a second cycle (Bar-Joseph et al., 2008). In the current study, this fact was reflected in the loss of synchrony in the cell cycle gene expression signature after 30 h, consistent with flow cytometric analyses (data not shown). Not only did previous studies exclude many bona fide cell cycle genes, the different criteria and analytical methods used produced a poor consensus (Grant et al., 2013). The correlation-based network approach used in this study is a more efficient way to identify phase-specific cell cycle genes (Giotti et al., 2017).
The strong association of the many putative cell cycle genes identified here was further demonstrated by their conserved co-expression across adult human and developing mouse tissues. Some of the S/G2-M phase genes we identified have only been validated relatively recently. For example, PRR11 (proline rich 11), mutations in which have been associated with cancer, was shown to regulate S to G2-M phase transition (Zhang et al., 2015). Links to cancer biology also suggest the function for two of the three lncRNAs identified by this study, DLEU1&2 (Morenos et al., 2014;Wang et al., 2017). There are nine genes annotated as being involved in apoptosis, a process that can be initiated if a cell fails a mitotic checkpoint. CASP2, long considered to be an orphan caspase (Forsberg et al., 2017), is recognized as a key factor in driving cell apoptosis (mitotic catastrophe) triggered by mitotic abnormalities, such as defects in chromosomes, mitotic spindles, or the cytokinesis apparatus (Dawar et al., 2017;Vitale et al., 2017). Other genes within the list await functional validation.
Among the 205 putative cell cycle genes, there are 53 complete functional orphans. Fourteen of the 27 we tested localized wholly or partially to the nucleus and 11 showed evidence of centrosomal localization, an organelle vital for cell cycle progression (Doxsey et al., 2005). Another three, CCDC150, C3of14, and C18orf54 co-localized with γ-tubulin (a centrosomal marker). A recent study confirmed this localization for C3orf14 (Gupta et al., 2015). The subcellular localizations reported here were in many cases also supported by IHC results reported by the Human Protein Atlas database (Uhlen et al., 2010(Uhlen et al., , 2010) (data not shown). RNAi knockdown assays were also performed on a range of known and uncharacterized genes from the list. In these assays, 10 of the 22 uncharacterized proteins showed differences in the rate of cell proliferation following gene knockdown, suggesting non-redundant functions in cell proliferation, with a hit rate slightly higher than the known cell cycle genes tested ( Figure 5H). Taken together, these validation data suggest that the large majority of the novel cell cycle-regulated genes identified here will be found to function in some aspect of S/G2-M biology.
The FANTOM5 human and mouse promoterome data provide definitive locations for the transcription start sites of genes. Of the 701 genes identified here, in the primary cell atlas data at least 254 use alternative promoters that drive their expression outside of the context of the cell cycle. Among them, three are involved in the assembly of the replisome, namely: MCM5, MCM7, and RFC2 (Fragkos et al., 2015), and had significant expression from alternative promoters in certain immunerelated cell populations. This observation is in accordance with a previous report showing that factors of the minichromosome maintenance complex (MCMs), including MCM5 and MCM7, were found to be present on the IRF1 promoter in STAT1mediated transcriptional activation, when cells were treated with IFN-γ (Snyder et al., 2005). MCM5, in particular, was shown to directly interact with STAT1 and to be necessary for transcriptional activation (DaFonseca et al., 2001). Similar observations were made in the analysis of the mouse development timecourse data, where many bona fide cell cycle proteins are strongly expressed in the testis, where they may play a role in meiosis or be part of the centrosomal biology associated with flagella. The 'moonlighting' of cell cycle genes in other scenarios also likely breaks up the transcriptional signature in co-expression analyses across datasets comparing tissues or cells (Mabbott et al., 2010(Mabbott et al., , 2013Balakrishnan et al., 2013;Doig et al., 2013). These alternative transcripts may be regulated in a unique manner to support DNA-dependent processes such as recombination, somatic hypermutation, and class switching that are unique to leukocytes, or other distinct functions.
In summary, this study set out to define the transcriptional network associated with the final stages of the human cell cycle, between entry into S phase through to the completion of mitosis. The aim was not only to summarize the known components of this system but to identify new ones. Through detailed analyses of multiple human and mouse datasets, we have defined 701 genes as being upregulated during the S/G2-M phase of the cell cycle, many of which are conserved across species. Based on promoter expression some proteins would appear to function exclusively within the context of cell division, others would appear to have additional roles outside of this system. Functional assays performed on a number of the uncharacterized genes strongly suggests that many are indeed novel components of the cell cycle machinery. The gene list provided represents the first comprehensive list of experimentally derived and validated S/G2-M phase-associated genes. As such this work provides a valuable resource of both the known and potentially novel components that make up the many pathways and processes associated with mitotic cell division.

Cell culture and synchronization
Primary human dermal fibroblasts (NHDF) isolated from neonate foreskins (gifted by Dr Finn Grey, University of Edinburgh, UK) were plated on 175-cm 2 tissue culture flasks (Thermo Fisher) at density of 6 × 10 3 cells/cm 2 . Cells were cultured in DMEM (Sigma-Aldrich) with 10% (v/v) foetal calf serum (FCS) (GE Healthcare) and antibiotics (25 U/ml penicillin and 25 μg/ml streptomycin, Life Technologies). Starvation-induced synchronization was achieved by replacing full medium with DMEM containing 0.5% FCS for 48 h in accordance with published methods (Brooks, 1976). After this time, medium was replaced with DMEM containing 10% FCS promoting the synchronized entry of the NHDF back into the cell cycle. Similarly, MEF were cultured in 175-cm 2 tissue culture flasks (Thermo Fisher) at a density of 6000 cells/cm 2 in DMEM with 10% FCS and the same protocol followed as for NHDF synchronization.
In both cases, cell synchronization was assessed after 48 h of serum starvation and at various time points following the readdition of complete medium using a BD LSR Fortessa X-20 flow cytometer (BD Biosciences) with propidium iodide staining. Unsynchronized populations were evaluated to assess the degree of synchronization achieved. For protein localization assays, 1 × 10 5 HEK293T cells were grown in DMEM (Sigma-Aldrich) medium plus 10% FCS, 1% Glutamax (Gibco), 1% non-essential amino acids (Gibco) and 25 U/ml penicillin/streptomycin on 13-mm glass coverslips previously coated with poly-L-lysine (0.1 mg/ml) in each well of a 24-well plate. Cells were grown until coverage of approximately 70% was obtained. To increase the percentage of cells undergoing mitosis, HEK293T were reversibly blocked at the G2/M boundary with RO3306, as described previously (Vassilev, 2006).

Microarray preparation
Two human microarray datasets were generated using NHDF. For the first microarray experiment, duplicate samples were taken at 6 h intervals over a 48 h period (24 samples in total including unsynchronized control cells cultured in parallel and harvested at 0 and 24 h). In a second independent experiment, samples were collected at 1 and 2 h following readdition of complete medium, and then every 2 h for a 24-h period (16 samples in total including two control samples). In a third microarray experiment using mouse fibroblasts, MEF samples were collected at 0, 0.5, 1, 2 h following re-addition of complete medium and then every 2 h for a 30-h period (24 samples in total including replicates for 0 h, 12 h, 18 h, 24 h samples and unsynchronized control samples). For all experiments described above total RNA was isolated using the RNeasy Mini Kit (QIAGEN) according to manufacturer's instructions. cDNA was generated by the reverse transcription of total RNA (500 ng) using the Ambion WT Expression Kit (Life technologies), fragmented and then labelled by TdT DNA labelling reagent using GeneChip® WT Terminal Labelling Kit (Affymetrix) according to manufacturer's instructions. Samples were hybridized to Affymetrix Human Gene 1.1-ST Arrays for both NHDF time-course experiments and to the Mouse Gene 2.0 ST Arrays for the MEF experiment using an Affymetrix GeneAtlas system. Raw data of experiments have been submitted to Gene Expression Omnibus repository (GSE104619). For cross-validation, the Fantom5 (F5) primary cell atlas of human promoter expression (Balakrishnan et al., 2013) was used, including 495 samples from about 100 human primary cell types. The data are publicly available at http:// fantom.gsc.riken.jp/5/ (Lizio et al., 2015).

Data pre-processing
Raw data (.cel files) derived from the three microarray experiments were pre-processed using Bioconductor (www.bioconductor. org). The package ArrayQualityMetrics was used to perform QC on the data. All arrays passed the various tests carried out by the package and expression levels were normalized using Robust Multiarray Averaging (RMA) normalization using the Oligo package. The two normalized NHDF datasets were also adjusted with the batch correction algorithm ComBat (Johnson et al., 2007) to adjust for variations in the average intensity between experiments. Low-intensity signal probesets (<20) were removed (a total of 9408 probesets). Likewise, a filtering of low-end signal was applied on the FANTOM5 primary cell atlas removing promoters with <5 tags per million (TPM) reads. Probe set annotations were retrieved with the hugene11transcriptcluser.db package for the human data and with mogene20sttranscriptcluster.db package for the mouse data. Mouse to human orthologues were retrieved from the web resource Mouse Genome Informatics (MGI) (http://www.informatics.jax.org/).

Network analysis
The NHDF, FANTOM5 (primary cell and mouse development datasets) (Forrest et al., 2014), Tissue atlas dataset (Fagerberg et al., 2014) and MEF datasets were subjected to network-based correlation analyses. Data was loaded into the tool Graphia Professional (Kajeka Ltd.) and Pearson correlation matrices were calculated comparing expression profiles between individual samples or genes, and these were used as the basis to construct GCNs as described previously (Theocharidis et al., 2009). Correlation thresholds for all analyses were set to allow minimal contribution of random correlations to the analyses. These were based on a comparison of the correlation distributions of the experimental datasets vs. permuted measurements from 2000 randomly selected measurements. Values selected also minimized the number of edges whilst maintaining a maximum number of nodes (Supplementary Figure S1A). The two NHDF time-course experiments were analysed together. A correlation network was constructed using a threshold of r ≥ 0.88 and the graph clustered to identify the co-expression modules of genes with a broadly similar expression pattern using the MCL clustering algorithm (Enright et al., 2002) with the inflation value (which controls the granularity clustering) set to 1.3 (MCLi = 1.3). Clusters of genes whose expression varied for technical reasons, i.e. profile associated with a batch or experiment, were removed. The correlation network of the remaining data comprised of 4735 nodes (probesets) connected by 153809 edges. Using different inflation values, the network was divided into a few (MCLi 1.3) or many (MCLi 2.2) clusters of transcripts. Transcripts within the S/G2-M cluster (Cluster 3) plus all nodes immediately adjacent to them (n+1), were then selected for further analysis. The node walk expansion was to capture a number of similarly expressed genes on the periphery of the main cluster. Entrez IDs of the cell cycle-associated transcripts identified in the NHDF data were used to subset the FANTOM5 primary cell atlas, prior to network analysis. The subset FANTOM5 primary cell atlas data was then parsed at r ≥ 0.5 and clustered at MCLi = 1.7. The MEF time-course data were parsed at r ≥ 0.88 and the resultant networks clustered using MCLi = 2.2. The Tissue Atlas and FANTOM5 mouse development datasets were subset for the curated S/G2-M gene list, plotted at r ≥ 0.5, and clustered at MCLi 3.2 and 1.7, respectively.
Assembly of evidence and annotation of the cell cycle 'parts list' In assembling a list of cell cycle genes, we have sought to bring together various sources of evidence to support this association. This includes whether they were implicated by the current studies of their expression in experiments performed on NHDF, MEF, or human primary cell atlas, previous transcriptomics studies on human cells (Whitfield et al., 2002;Bar-Joseph et al., 2008;Peña-Diaz et al., 2013;Grant et al., 2013;Dominguez et al., 2016), the Mitocheck database (Neumann et al., 2010) and human protein atlas (HPA) (Uhlen et al., 2010) resource, and finally, our own functional assays. Furthermore, we examined evidence from the literature as well as annotation and pathway resources to provide, where possible, a functional grouping and annotation for each gene. This was carried out by retrieving UniprotKB biological process terms (UniprotKB keywords) and when none were found for a given gene, annotation was supplemented from other sources, namely GO (Gene Ontology, 2015) and Reactome (Fabregat et al., 2016). These efforts were backed up by extensive review of the published literature. The full list of genes with their corresponding functional annotation can be found in Supplementary Table S3. Based on this work, genes were further classified based on evidence of their involvement in cell cycle: the 'Known' group defines genes for which there is robust evidence of their involvement in one of the pathways associated with the cell cycle, whereas the 'Putative' group includes genes for which there is little or no direct evidence of them being involved in the cell cycle. This group also includes a number of functionally uncharacterized genes. Finally, a simple confidence score was used to order the cell cycle list based on the weight of evidence supporting a gene's involvement in the cell cycle; One point was awarded to all genes for each line of evidence supporting their association with the cell cycle, i.e. they were identified by the current or five previous human transcriptomics studies (Whitfield et al., 2002;Bar-Joseph et al., 2008;Peña-Diaz et al., 2013;Grant et al., 2013;Dominguez et al., 2016), their knockdown generated a mitosis-related phenotype in the current Mitocheck screen (Neumann et al., 2010) and whether the gene has been associated with a cell cycle-related phenotype in human and mouse (Blake et al., 2017;Kohler et al., 2017).

GO and motifs enrichment analysis
GO enrichment analyses were conducted with the Database for Annotation, Visualization and Integrated Discovery (DAVID, v6.8), a web-based tool for GO enrichment analysis (http:// david.abcc.ncifcrf.gov/). Gene sets within the clusters generated by the MCL algorithm were analysed for GO_BP terms using the Functional Annotation clustering tool. Motifs enrichment analysis was conducted using HOMER (Heinz et al., 2010) through the CAGEd-oPOSSUM web tool (Arenillas et al., 2016). Genomic loci of the cell cycle-associated promoters were inputted in the software. Enrichments for known motifs were searched between 1000 bp upstream and 300 bp downstream from the TSS.

RTCA analysis following gene knockdown by RNAi
NHDF cells were cultured in Dulbecco Modified Eagle Medium (DMEM, Sigma-Aldrich) with 10% (v/v) foetal bovine serum (FBS, GE Healthcare) and 25 U/ml penicillin and 25 μg/ml streptomycin (Life technologies). The xCELLigence (Roche) real time cell analyser (RTCA) system was used to monitor the effect of gene knockdown on cell impedance, taken as a proxy for cell proliferation. Background impedance for the E-plates 96 (ACEA Biosciences) was standardized by the addition of culture medium (DMEM with 10% FBS, 25 U/ml penicillin, and 25 μg/ml streptomycin) following the manufacturer's instructions. Following trypsination, cells were seeded at density of 6000 cells/cm 2 in each well of the 96-well E-plates with the additional of 100 μl complete medium. Baseline levels of cell impedance index recorded and 24 h later, esiRNA transfection esiRNAs (Sigma-Aldrich) was performed while plates were undocked from the RTCA station. Transfection of esiRNA was carried out using the transfection reagent SilenceMag (OZ bioscience). esiRNA was combined with 3.3 μl SilenceMag and 3.0 μl H 2 O, and then mixed with antibiotic-free medium in a final volume of 100 μl and a concentration of 50 nM esiRNA per well. Complete medium was then replaced with the transfection mix, placed on magnetic plates (OZ bioscience) for 30 min in the incubator under the condition of 5% CO 2 at 37°C. The transfection mix was then replaced with 200 μl complete medium before placing the plates back to the RTCA system. Cells were then incubated monitoring the cell impedance index every 15 min for 200 sweeps in first stage, 30 min for 200 sweeps in second stage, and continued at 60 min intervals for 100 sweeps in final stage. Time series cell impedance indices were extracted at regular time intervals. Negative controls were tested across each plate used to screen known (Fabregat et al., 2016) and potentially novel cell cycle-associated genes. At the time of screen, a number of the known genes were considered uncharacterized. Each assay was based on results gained from running three replicate assays per plate, and repeated on three separate runs. The raw dataset was exported as a cell impedance (CI) index with rows named by N time points (measurement time points at 30 min intervals following the transfection) and columns named by well of samples. Statistical analysis of the data was performed using the R package 'RTCA' to transform cell-impedance values into cellindex growth rate (CIGR) at regular time intervals during the measurement time (Zhang et al., 1999). For scoring of the effect of gene silencing-induced proliferation arrest, the package 'cellHTS2' was used to normalize average CIGR across samples (Boutros et al., 2006).
The library of esiRNAs (endoribonuclease-prepared short interfering RNAs, Sigma-Aldrich) employed here has been described elsewhere (Kittler et al., , 2007. Negative control esiRNA reagents against sucrose isomaltase (SI), a gene not expressed by fibroblasts and collagen 1A2 (COL1A2), a gene expressed by fibroblasts but not associated with cell division. All control esiRNA reagents were used in each assay to verify the lack of non-specific effects of esiRNA treatment.

Clone preparation
Entry clones in pDONR223 and containing open reading frames for candidate genes were sourced from the ORFeome collection (The ORFeome Collaboration, 2016). First, 50-150 ng of each entry clone was combined with 150 ng destination vector pcDNA-DEST47 or pcDNA-DEST53 (Life technologies) and 2 μl LR Clonase II enzyme mix (Life technologies). The reaction was incubated at 25°C for 1 h. Next, 1 μl (2 μg/μl) proteinase K was added to terminate the reaction, incubating for 10 min at 37°C. Then, 2 μl of the recombination reaction was added to chemically competent DH5α bacterial cells on ice and incubated for 20 min. DH5α cells were subjected to heat shock for 45 sec at 42°C followed by 2 min on ice. Finally, 1 ml SOC medium was added and incubated at 37°C with aeration. Cells were centrifuged at 2000 rpm and resuspended in 100 μl LB before plating out on LB plates with 100 μg/ml ampicillin. Plates were incubated at 37°C overnight.

DNA transfection and confocal microscopy
Transfection of HEK293T cells with Gateway destination clones and K2 transfection system (Biontex Laboratories) was performed following manufacturer's instructions. Following optimization studies, 1 μg of expression plasmid and 2 μl of the transfection reagent were diluted in 500 μl in each well (24-well plate). For GFP fluorescence protein imaging, cells were fixed in 4% paraformaldehyde and labelled for 30 min with Texas RedX Phalloidin (1:40) (Invitrogen) and then stained for 5 min in 300 nM DAPI. For centrosomal staining, polyclonal anti γ-tubulin antibody (Sigma-Aldrich) was used. Alternatively, a polyclonal antibody anti α-tubulin (Abcam) was used to stain microtubules during the formation of mitotic spindles. Fixation was carried out by applying 300 μl of cooled methanol per well for 2 min on ice. Cells were washed three times with PBS and then blocked with 5% goat serum (Sigma-Aldrich) and 0.1% Triton in PBS for 1 h. Primary antibodies were then diluted accordingly in blocking solution, applied on coverslips and incubated either overnight at 4°C or for 2 h at room temperature. Cells were then washed three times for 1 h with PBS. The secondary antibody was diluted in blocking solution (1:500), applied on coverslips and incubated for 1-2 h at room temperature. Alexa Fluor® 594 raised in donkey anti-rabbit IgG (H+L) (Life Technologies) was used as secondary for both primary antibodies, since they were used separately. Fluorescence images were captured on a Nikon EC-1 confocal microscope using Nikon EZ-C1 software. The following laser/filter combinations were used: DAPI nuclear stain (excitation 405 nm, emission BandPass 460/50 nm), eGFP (excitation 488 nm, emission BandPass 509 nm) and Texas Red X Phalloidin (Invitrogen) (excitation 543 nm, emission BandPass 605/70 nm).

Supplementary material
Supplementary material is available at Journal of Molecular Cell Biology online.