We report a database of circadian genes in eukaryotes (CGDB, http://cgdb.biocuckoo.org), containing ∼73 000 circadian-related genes in 68 animals, 39 plants and 41 fungi. Circadian rhythm is the ∼24 h rhythm in behavioral and physiological processes that exists in almost all organisms on Earth. Defects in the circadian system are strongly associated with a number of diseases such as cancers. Although several databases have been established for rhythmically expressed genes, a comprehensive database of cycling genes across phyla is still lacking. From the literature, we collected 1382 genes whose transcript-level oscillations were validated using methods such as RT-PCR, Northern blot and in situ hybridization. Given that many genes exhibit different oscillatory patterns in different tissues/cells within an organism, we have included information regarding the phase and amplitude of the oscillation, as well as the tissue/cells in which the oscillation was identified. Using these well-characterized cycling genes, we then conducted an orthologous search and identified ∼45 000 potential cycling genes from 148 eukaryotes. Given that significant effort has been devoted to identifying cycling genes by transcriptome profiling, we have also incorporated these results, a total of over 26 000 genes, into our database.
Circadian rhythms are fundamental phenomena of life that are manifested as daily oscillations in a vast range of biological processes and are driven by endogenous clocks that exist in most if not all organisms on Earth (1–4). The molecular clock consists of a number of genes that form transcriptional–translational feedback loops operating with a near 24 h period, resulting in rhythmic expression of its components, which are referred to as core clock genes (3–5). In mammals, these genes include BMAL1 and CLOCK, which form the positive limb of the feedback loops, while PERIOD1-3 (PER1-3) and CRYPTOCHROME1-2 (CRY1-2) form the major negative limb of the feedback loops (3,4). Since most of our physiological processes and behaviors are under circadian regulation, disruptions of circadian rhythms have been implicated in various diseases and disorders, such as sleep disorders, metabolic diseases, psychiatric disorders, neurological diseases and cancers (6–10). Indeed, genetic variations in core clock genes have been shown to be associated with numerous metabolic diseases, cancers, sleep disorders and mental disorders (11). Animal models revealed that core clock genes are involved in sleep and mood regulation, as well as the development of obesity, diabetes and cancer (7,12–15). At the molecular level, ∼1% to over 60% of the transcriptome exhibits oscillatory expression, and the majority of the best-selling drugs as well as the essential medicines of the World Health Organization target cycling genes (1,16,17). As many of these drugs have short half-lives, delivering them at the appropriate time of the day will likely improve their efficacy (17,18). Therefore, identifying circadian genes and understanding the role of the circadian clock in regulating rhythmic global gene expression will benefit the development of treatments and therapies for human diseases.
Numerous efforts have been undertaken to identify oscillating genes (1,19,20). Owing to rapid progress in the development of high-throughput techniques such as microarray and RNA sequencing (RNA-seq), a number of computational methods, such as COSOPT (21), Fisher's G-test (22), HAYSTACK (23), JTK_CYCLE (24), ARSER (25) and LSPR (26), were developed for identifying potential circadian genes from temporal transcriptome data. Recently, a regularized supervised learning algorithm was used, and ZeitZeiger was developed as an excellent R package for identifying periodic genes from genome-wide gene expression datasets (27). Also, Agostinelli et al. adopted deep learning approaches and constructed a more accurate predictor, BIO_CYCLE (28). Because more and more cycling genes have been identified, the collection and integration of these data are helpful for further experimental consideration. Previously, four major databases were developed for circadian genes, including DIURNAL (23), CircaDB (29), SCNseq (30) and Bioclock (31–33). The DIURNAL database maintains circadian microarray data for three plants, including Arabidopsis thaliana, poplar and rice (23). CircaDB contains over 3000 potential circadian genes in human and mouse (29). SCNseq contains 4569 cycling genes and 3187 intergenic non-coding RNAs in mice, whereas Bioclock contains 1674 potential circadian genes exclusively in Aedes aegypti and ∼1000 potential circadian genes in Anopheles gambiae (31–33). Although considerable computational effort has been devoted to this area, an integrative resource is still not available.
In this study, we aimed to collect known oscillating genes across phyla. Given that many genes exhibit different temporal expression patterns in different tissues/cells within an organism, and that each tissue/cell type has a distinct collection of cycling genes, we have included information regarding the phase and amplitude of the oscillation, as well as the tissue/cells in which the oscillation was identified (3). From the scientific literature, we first collected 1382 cycling genes that were validated by methods such as RT-PCR, Northern blot and in situ hybridization. We then conducted an orthologous search and further identified 44 836 potential oscillating genes. We also collected 26 582 oscillating genes that were identified by microarray and/or RNA-seq. Finally, we developed a comprehensive circadian gene database (CGDB), containing 72 800 non-redundant cycling genes in 148 eukaryotes. We believe CGDB can serve as an integrated and convenient source for genes with oscillatory expression, which will provide useful information for circadian biologists working in any eukaryote. Moreover, since a substantial portion of the transcriptome exhibits daily oscillation, it would be important to take this into consideration when studying the expression and function of a particular gene. In this sense, CGDB will be valuable to a much broader group of researchers, and not limited to those working in the field of circadian rhythms.
CONSTRUCTION AND CONTENT
In this study, we defined circadian genes as genes that have daily oscillatory expression patterns. To construct a reliable benchmark data set, we manually curated experimentally identified circadian genes by searching the literature in PubMed directly with the keyword ‘circadian gene’ (published before 1 March 2016, Figure 1). First, we collected 1382 cycling genes identified using methods such as RT-PCR, in situ hybridization and Northern blot. These genes came from 25 eukaryotes: 2 fungi (Saccharomyces cerevisiae and Neurospora crassa), 13 animals (Macaca mulatta, Caenorhabditis elegans, Gasterosteus aculeatus, Meleagris gallopavo, Gadus morhua, Ovis aries, Equus caballus, Gallus gallus, Drosophila melanogaster, Danio rerio, Rattus norvegicus, Mus musculus and Homo sapiens) and 10 plants (Arabidopsis lyrata, Oryza indica, Glycine max, Triticum aestivum, Solanum tuberosum, Physcomitrella patens, Zea mays, Hordeum vulgare, Oryza sativa and Arabidopsis thaliana).
The oscillation of these genes was measured either under entrained conditions (zeitgeber time, ZT), with environmental cues such as light/dark cycles or temperature cycles, or under free-running conditions (circadian time, CT), which lack environmental cues. We have included this information along with the peak and trough time points of the oscillation, as well as the amplitude values (amplitude is defined as peak value/trough value). Considering that many genes exhibit distinct temporal expression patterns in different tissues/cells in an organism, we have included information regarding the tissue/cells in which the oscillation was identified and a brief description of the potential function of the oscillation.
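As an illustration of the cycling information described above, the following minimal Python sketch derives the peak time, trough time and amplitude (peak value/trough value) from a sampled expression time course. The function name, the record layout and the toy expression values are assumptions made for illustration only; they do not reflect how the CGDB entries were actually curated or stored.

```python
from dataclasses import dataclass


@dataclass
class OscillationSummary:
    condition: str      # "ZT" (entrained) or "CT" (free-running)
    peak_time: float    # time point (h) with the highest expression
    trough_time: float  # time point (h) with the lowest expression
    amplitude: float    # peak value / trough value, as defined in the text


def summarize_oscillation(times, levels, condition="ZT"):
    """Summarize one sampled time course (hypothetical helper)."""
    if condition not in ("ZT", "CT"):
        raise ValueError("condition must be 'ZT' or 'CT'")
    if len(times) != len(levels) or len(times) == 0:
        raise ValueError("times and levels must be non-empty and equally long")

    peak_idx = max(range(len(levels)), key=levels.__getitem__)
    trough_idx = min(range(len(levels)), key=levels.__getitem__)
    if levels[trough_idx] <= 0:
        raise ValueError("trough value must be positive to form a ratio")

    return OscillationSummary(
        condition=condition,
        peak_time=times[peak_idx],
        trough_time=times[trough_idx],
        amplitude=levels[peak_idx] / levels[trough_idx],
    )


# Toy relative expression values sampled every 4 h (not real data).
example = summarize_oscillation(
    times=[0, 4, 8, 12, 16, 20],
    levels=[1.0, 1.5, 3.0, 4.0, 2.0, 1.2],
    condition="ZT",
)
print(example)
```

With these toy values, the summary reports a peak at ZT12, a trough at ZT0 and an amplitude of 4.0.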
To provide information for species beyond the 25 organisms, we performed homology detection to search for orthologs, which might be potential cycling genes and could be helpful in further studies (Figure 1). The complete reference proteomes of 148 eukaryotes with sequenced genomes were downloaded from the Ensembl database (34). These include 68 animals, 39 plants and 41 fungi. As described previously (35), the strategy of reciprocal best hits (RBHs) with the BLAST package (36) was applied to detect potential orthologs of cycling genes. In total, 44 836 orthologs were computationally identified as potential cycling genes, which were also integrated into the CGDB database. Lastly, we included 26 582 oscillating genes identified by microarray and RNA-seq studies (Figure 1). Detailed information is provided for the 27 964 known cycling genes (34,37). All cycling information and protein sequences in CGDB are available for download at http://cgdb.biocuckoo.org/download.php.
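The reciprocal best hit criterion can be sketched as follows. This is a minimal Python illustration, assuming that BLAST results have already been parsed into (query, subject, bit score) tuples for both search directions; the function names, identifiers and scores are hypothetical, and the sketch does not reproduce the exact parameters or filters of the pipeline described in (35).

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query_id, subject_id, bit_score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward_hits, reverse_hits):
    """Return (query, subject) pairs that are each other's best hit.

    forward_hits: known cycling proteins searched against a target proteome.
    reverse_hits: target proteome searched back against the source proteome.
    """
    forward = best_hits(forward_hits)
    reverse = best_hits(reverse_hits)
    return sorted(
        (query, subject)
        for query, subject in forward.items()
        if reverse.get(subject) == query
    )


# Toy identifiers and scores (illustrative only, not real BLAST output).
forward = [
    ("src_gene1", "tgt_geneA", 900.0),
    ("src_gene1", "tgt_geneB", 400.0),
    ("src_gene2", "tgt_geneC", 300.0),
]
reverse = [
    ("tgt_geneA", "src_gene1", 880.0),
    ("tgt_geneC", "src_gene3", 700.0),
    ("tgt_geneC", "src_gene2", 290.0),
]
orthologs = reciprocal_best_hits(forward, reverse)
print(orthologs)
```

In this toy example only the pair (src_gene1, tgt_geneA) is retained, because the best hit of tgt_geneC in the reverse search is src_gene3 rather than src_gene2.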
The CGDB database web interface was constructed in an easy-to-use manner. Four search options are provided: a simple search option, which queries the CGDB database with one or multiple keywords or database accession numbers such as UniProt ID or CGDB ID (Figure 2A); ‘Advanced search’, based on combined keywords with up to two search terms (Figure 2B); ‘Multiple search’, using multiple keywords or accession numbers in a line-by-line format (Figure 2C); and ‘BLAST search’, based on protein sequence (Figure 2D). For example, if the keyword ‘PER2_HUMAN’ in ‘UniProt_Accession’ is submitted for a simple search (Figure 2A), the website will return the circadian gene PER2 from H. sapiens in a tabular format with CGDB ID, UniProt/Ensembl accession, species and gene name/alias (Figure 2A). In the advanced search option, two terms specified in two areas are combined with the operators ‘and’, ‘or’ and ‘exclude’ to conduct a complex query (Figure 2B). For example, querying the database with ‘Per’ in ‘Gene name’ and ‘Human’ in ‘Organism’ will return five PER genes in H. sapiens (Figure 2B). Moreover, users can input a list of keywords to perform a multiple search. For example, three core clock genes can be retrieved by querying a list of their UniProt accessions (Figure 2C). In addition, users can search for identical or homologous proteins by submitting a protein sequence in FASTA format in ‘BLAST Search’ (Figure 2D). For example, the sequence of the mouse CLOCK protein can be input in FASTA format to search for homologous proteins in the database. In particular, there is a checkbox labeled ‘ONLY experimentally identified circadian genes’ for each search option (Figure 2). Once it is selected, only experimentally identified cycling genes will be queried and returned.
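The logic of combining two search terms with the operators ‘and’, ‘or’ and ‘exclude’ can be illustrated with a short Python sketch. The record fields, the case-insensitive substring matching and the interpretation of ‘exclude’ as "first term matches, second term does not" are assumptions made for illustration; the actual implementation of the CGDB web server is not described in the text, and the records below are toy entries rather than database content.

```python
def field_matches(record, field, term):
    """Case-insensitive substring match on one field (assumed behavior)."""
    return term.lower() in record.get(field, "").lower()


def advanced_search(records, first, second, operator):
    """Combine two (field, term) conditions with 'and', 'or' or 'exclude'."""
    if operator not in ("and", "or", "exclude"):
        raise ValueError("operator must be 'and', 'or' or 'exclude'")

    results = []
    for record in records:
        a = field_matches(record, *first)
        b = field_matches(record, *second)
        if operator == "and":
            keep = a and b
        elif operator == "or":
            keep = a or b
        else:  # 'exclude': first condition holds, second does not
            keep = a and not b
        if keep:
            results.append(record)
    return results


# Toy records (illustrative only).
TOY_RECORDS = [
    {"gene_name": "PER1", "organism": "Human"},
    {"gene_name": "PER2", "organism": "Human"},
    {"gene_name": "Per1", "organism": "Mouse"},
    {"gene_name": "CLOCK", "organism": "Human"},
]

human_per = advanced_search(
    TOY_RECORDS, ("gene_name", "Per"), ("organism", "Human"), "and"
)
print([r["gene_name"] for r in human_per])
```

Switching the operator to ‘exclude’ in the same call would instead keep only the mouse entry, since it matches ‘Per’ but not ‘Human’.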
To allow convenient browsing of the CGDB database, two strategies were implemented: ‘by species’ and ‘by external condition’. In the ‘Browse by species’ option, the right tree is a representation of phylogenetic relations or classification of eukaryotic species in Ensembl (34), while the left tree lists the Ensembl taxonomy categories, including primates, rodents, laurasiatheria, afrotheria and so on (Figure 3A). By clicking on the ‘Homo sapiens’ icon, with the checkbox ‘ONLY experimentally identified circadian genes’ selected, experimentally validated circadian genes in H. sapiens can be shown (Figure 3B). Furthermore, CGDB can be browsed by external condition (Figure 3C). The numbers on the ZT clock can be clicked to browse genes that have been experimentally verified to peak or trough at a specific time point under entrained conditions. The checkboxes ‘Peak only’ and ‘Trough only’ can also be selected individually to show genes that peak or trough at a specific time point (Figure 3C). The CGDB ID was adopted to organize the database, while the UniProt/Ensembl ID was used as a secondary accession (Figure 3D). Users can click on ‘CGD-HoS-021288’ to view the detailed information of human HTR1B, including cycling information, primary references, post-translational modification (PTM) information and its orthologs (Figure 3D).
RESULTS AND DISCUSSION
Early physiological experiments have established that circadian clocks exist endogenously in almost every organism and drive oscillatory changes in a myriad of behavioral and physiological processes (1–3,5). Circadian rhythms are believed to arise from endogenous molecular clocks consisting of transcriptional and translational feedback loops that are more or less conserved across phyla (1,4). At the heart of these feedback loops are transcriptional activators, which promote the expression of transcriptional repressors. The repressors suppress the activity of the activators, thus repressing their own expression. These repressors also undergo a series of translational and PTM regulations, leading to their degradation (1–3,5). As the protein levels of the repressors decrease, the activators can once again initiate transcription, starting a new cycle. Here we establish CGDB as a database of genes with oscillatory expression at the transcript level, accompanied by relevant phase and amplitude information, which are the two key indices of oscillation. We have also included known PTMs, which are a critical part of circadian regulation.
To further understand the function of circadian clocks and how they regulate physiological rhythms, we analyzed the functional distribution of human cycling (and potential cycling) genes by mapping the non-redundant human proteome set to two databases: Gene Ontology Annotation (UniProt-GOA, http://www.ebi.ac.uk/GOA) (38) and Kyoto Encyclopedia of Genes and Genomes (KEGG) (39). In total, we detected 6352 and 19 194 human proteins annotated with at least one KEGG or GO term, respectively. The hypergeometric test was then adopted to perform enrichment analyses for 1889 human circadian genes, and the statistically over-represented KEGG pathways (p-value < 0.001) and GO terms (p-value < 10⁻⁵) are shown, respectively (3,4). Interestingly, the most enriched GO cellular component is ‘caveola’ (GO:0005901), which has not been known to be oscillatory, revealing the power of our database to identify new potential roles of the circadian clock.
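The over-representation test can be written out explicitly. The sketch below is a minimal Python illustration of a one-sided hypergeometric test: given N annotated background proteins, K of which carry a given term, and n annotated circadian proteins, k of which carry the term, it computes the probability of observing k or more by chance. The function name and the toy counts are assumptions for illustration; the text does not specify the software used or whether any multiple-testing correction was applied.

```python
from math import comb


def enrichment_p_value(N, K, n, k):
    """One-sided hypergeometric p-value P(X >= k).

    N: annotated proteins in the background (e.g. the human proteome set)
    K: background proteins annotated with the term of interest
    n: annotated proteins in the gene list (e.g. circadian genes)
    k: gene-list proteins annotated with the term of interest
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(K, n)):
        raise ValueError("counts are inconsistent")

    # Sum the upper tail of the hypergeometric distribution with exact integers.
    tail = sum(
        comb(K, i) * comb(N - K, n - i)
        for i in range(k, min(K, n) + 1)
    )
    return tail / comb(N, n)


# Toy counts (not taken from the study): 10 background proteins, 4 with the
# term, 3 proteins in the list, 2 of them with the term.
p_value = enrichment_p_value(N=10, K=4, n=3, k=2)
print(p_value)  # (C(4,2)*C(6,1) + C(4,3)*C(6,0)) / C(10,3) = 40/120

# Significance cutoffs quoted in the text.
KEGG_CUTOFF = 0.001
GO_CUTOFF = 1e-5
print(p_value < KEGG_CUTOFF, p_value < GO_CUTOFF)
```

Using exact integer binomial coefficients avoids floating-point underflow for the very small p-values expected with proteome-scale counts, at the cost of speed compared with a statistics library.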
Besides circadian rhythm, several metabolic pathways are significantly enriched (40–43). Among these pathways, insulin signaling has the largest number of cycling (or potential cycling) genes in humans (data not shown). We therefore use the insulin signaling pathway as a detailed example of how cycling genes may carry out time-dependent functions (Figure 4). We have indicated the peak expression time point in human and mouse in the figure, which likely reflects the time of day when these genes exert most of their effects. For example, the great majority of genes involved in glucose homeostasis peak at night (ZT/CT12-ZT/CT0) in the mouse. This is consistent with the mouse being nocturnal and consuming most of its food at night (44).
Because human cycling genes were mostly obtained from blood samples, it is possible that the enriched pathways are biased by the specific physiological properties of blood and do not reflect a more general mechanism. We next conducted KEGG analyses of mouse cycling genes, which were identified from various different tissues/organs in the body (
To get a better understanding of potential functions of the circadian clock across phyla, we also performed the KEGG-based enrichment analysis on Drosophila, Neurospora and Arabidopsis, three other commonly used model organisms in circadian research (
Taken together, our study provides a highly useful data resource for further analyzing temporal patterns of gene expression and function. Some of these oscillating genes are part of the clock or directly controlled by the clock, whereas others are driven by rhythmic changes in the environment (e.g. daily changes of light, temperature, food availability, etc.) or indirectly regulated by the clock (e.g. driven by rhythms in behavior or physiological processes such as the sleep/wake cycle or timed meals). Nonetheless, it is the combined actions of all these genes that carry out the rhythmic biological processes in response and adaptation to the daily changes in the environment. CGDB will be continuously maintained and updated when new cycling genes are experimentally characterized. Since some of these oscillations are clock-driven, the database should offer insights into the function of circadian clocks. However, not all clock genes or clock-controlled genes are rhythmic at the transcript level, as rhythmic regulations at the translational or post-translational level and/or rhythmic activities are also critical steps in circadian regulation (1–3,5). Therefore, in the future, genes with oscillating protein levels and PTMs will be integrated for an even more comprehensive understanding of the regulatory mechanism and function of daily rhythms.
National Basic Research Program (973 project) [2013CB933900]; Natural Science Foundation of China [31471125, 81272578, 31671360, J1103514]; International Science & Technology Cooperation Program of China [2014DFB30020]; Natural Science Foundation of Anhui Province [1608085QC51]. Funding for open access charge: National Natural Science Foundation of China.
Conflict of interest statement. None declared.