Co-Expression Networks in the Green Alga Chlamydomonas reinhardtii Empower Gene Discovery and Functional Exploration

The unicellular green alga Chlamydomonas reinhardtii is a choice reference system for the study of photosynthesis, cilium assembly and function, lipid and starch metabolism and metal homeostasis. Despite decades of research, the functions of thousands of genes remain largely unknown, and new approaches are needed to categorically assign genes to cellular pathways. Growing collections of transcriptome and proteome data now allow a systematic approach based on integrative co-expression analysis. We used a dataset comprising 518 deep transcriptome samples derived from 58 independent experiments to identify potential co-expression relationships between genes. We visualized co-expression potential with the R package corrplot, to easily assess co-expression and anti-correlation between genes from manually-curated and community-generated gene lists. We extracted 400 high-confidence cilia-related genes at the intersection of multiple co-expressed lists, illustrating the power of our simple method. Surprisingly, Chlamydomonas experiments did not cluster according to an obvious pattern, suggesting an underappreciated variable during sample collection. One possible source of variation may stem from the strong clustering of nuclear genes as a function of their diurnal phase, even in samples collected in constant conditions, indicating substantial residual synchronization in batch cultures. We provide a step-by-step guide into the analysis of co-expression across Chlamydomonas transcriptome datasets to help foster gene function discovery. One-sentence summary we reveal co-expression potential between Chlamydomonas genes and describe strong synchronization of cells grown in batch cultures as a possible source of underappreciated variation.


Remapping and Normalization Steps of the Chlamydomonas Transcriptome 126
The analysis of changes in gene expression typically covers a limited number of 127 conditions on selected genotypes to identify treatment-specific modulators of the 128 transcriptome in a given organism. While this approach is powerful, we wished to 129 integrate multiple transcriptome datasets that represent multiple variables in growth 130 conditions and genotypes. To this end, we collected 58 transcriptome deep-sequencing 131 (RNAseq) datasets generated by the community and by our own laboratory that 132 correspond to 518 samples. We remapped all reads to version v5.5 of the 133 Chlamydomonas genome to remove changes in gene models between experiments as 134 a variable, as our collection of datasets span about 10 years. We did not attempt to 135 compensate for batch effects or variation in sequencing platforms, which were all 136  We then assessed the global expression of all 17,741 Chlamydomonas nuclear 138 genes across our set of 518 samples. Most nuclear genes were expressed at levels of 1 139 Fragment Per Kilobase of transcript per Million mapped reads (FPKM) in the majority of 140 samples, although a large subset of nuclear genes was seldom expressed even at this 141 low expression cut-off (Supplemental Table 1). With a higher threshold for expression, 142 the fraction of expressed nuclear genes decreased (Supplemental Table 1). This pattern 143 indicated that most genes are expressed at moderate levels and only in a limited 144 number of conditions. 145 We next normalized our RNAseq dataset following the same steps used for the 146 ALCOdb gene co-expression database for microalgae (illustrated in Supplemental 147 Figure 1; Aoki et al., 2016). The final normalization step centered expression estimates 148 to zero, as a Z-score normalization would (Supplemental Figure 1B). ability to identify bona fide cilia and centriole components based on co-expression also 344 offered the opportunity to subject larger lists to a similar analysis. The cilium proteome 345 is predicted to comprise close to a thousand proteins based on proteomics analysis 346 (Pazour et al., 2005), although a fraction is likely to correspond to contaminants. 347 Likewise, a comparative genomics approach uncovered around 200 genes conserved 348 between ciliated species and absent in all other species: Ciliacut (Li et al., 2004). These 349 two lists overlap only partially, with 81 genes belonging to both. We wondered if co-350 expression profiling might allow to pull high-confidence cilia components: we measured 351 co-expression in three groups (Ciliacut only; Ciliacut+cilium proteome overlap; cilium 352 proteome only). The resulting correlation matrix is shown in Figure 4B. Genes only 353 included in Ciliacut were on average not co-expressed with each other (mean PCC: 0. 354 03 ± 0.24) and consisted of many MOTILITY (MOT) genes not found in Caenorhabditis 355 elegans (which lacks motile cilia) and SENSORY, STRUCTURAL AND ASSEMBLY 356 (SSA) genes. Similarly, about 550 genes only present in the cilium proteome gene list 357 showed no pattern of co-expression, with a mean PCC of 0.01 ± 0.22. In sharp contrast, 358 76 genes that belonged to both lists were highly co-expressed (mean PCC: 0.63 ± 359 0.20). Equally highly co-expressed was a set of ~300 genes whose encoded proteins 360 are only found in the cilium proteome (mean PCC: 0.63 ± 0.15), with many 361 uncharacterized FLAGELLAR ASSOCIATED PROTEIN (FAP) genes. Together, these 362 two sets comprised over 400 co-expressed genes that are prime candidates for 363 functional dissection. They are listed in Supplemental Data Set 8. 364 365 Ribosome Protein Genes. Nucleus-encoded ribosomal protein genes (RPGs) 366 code for proteins with three cellular destinations. The co-expression pattern observed 367 between RPGs largely reflected the organelle in which their encoded subunits will 368 function ( Figure 4A). Plastid RPGs exhibited the strongest degree of co-expression 369 (mean PCC = 0.88 ± 0.06). The sole exceptions were PLASTID SPECIFIC 370 RIBOSOMAL PROTEIN1 PSRP1 and PSRP4, which are among the lowest expressed 371 genes encoding small subunits proteins, and the gene encoding the Chloroplast Stem-372 Co-expression networks in Chlamydomonas 14 loop binding Protein of 41 kDa CSP41 (mean PCC = 0.27 ± 0.09) ( Figure 4B). Neither 373 PSRP1 or CSP41 are thought to be plastid ribosomal proteins, but both participate in 374 efficient translation, either by inducing conformational changes within the ribosome 375 (PSRP1, Sharma et al., 2010) or by stabilizing target plastid RNAs (CSP41, Qi et al., 376 2012). Large and small plastid ribosomal subunits were co-expressed equally strongly 377 (PRPLs: 0.89 ± 0.04; PRPSs: 0.86 ± 0.09 excluding PSRP1 and PSRP4) ( Figure 4C). 378 Plastid translation factors also displayed a high degree of co-expression with one 379 another (mean PCC: 0.52 ± 0.18) and with plastid RPGs (mean PCC: 0.59 ± 0.20). Co-380 expression between chloroplast translation regulators defined three sub-groups: one 381 group that was highly co-expressed with plastid RPGs (11 genes), one group that was 382 not co-expressed (4 genes The co-expression of RPGs encoding proteins destined for the mitochondrion or 388 cytosol was less pronounced, but similar between large and small subunits RPGs 389 ( Figure 4C). For both compartments, correlation coefficients between RPGs followed a 390 bimodal distribution, with a fraction of PCCs around zero. For mitochondrial RPGs, high 391 expression levels appeared to come at the cost of lower PCCs, whereas the opposite 392 was true for cytosolic RPGs. Mitochondrial RPGs tended to be weakly co-expressed 393 with plastid RPGs (mean PCC: 0.13 ± 0.14) while anti-correlated with cytosolic RPGs 394 (mean PCC: -0.08 ± 0.15) ( Figure 4D). There was no clear correlation between the 395 expression of most plastid and cytosolic RPGs (mean PCC: -0.0006 ± 0.14) (Figure 396 4D). As the single exception, the cytosolic RPG RPS27E2/RPS27B, which is generally 397 expressed at much lower levels than all other cytosolic RPGs, stood out with a 398 pronounced anti-correlation with plastid RPGs (mean PCC: -0.54 ± 0.05) ( Figure 4B). 399 Nitrogen deficiency results in a sharp increase in RPS27E2 expression, concomitant 400 with a global arrest in plastid translation until more auspicious conditions return 401 (Schmollinger et al., 2014;Kajikawa et al., 2015;Plumley and Schmidt, 1989), which 402 may explain the pattern observed here. 403 Given the strong correlation between sets of RPGs in Chlamydomonas, we 404 wondered how conserved this pattern might be. We determined the correlation between 405 Arabidopsis RPGs using normalized data from microarrays downloaded from 406 AtGenExpress. The Arabidopsis genome contains 429 RPGs; as in Chlamydomonas, 407 their encoded products will locate to one of three compartments (cytosol, mitochondria 408 or chloroplasts). We observed a correlation matrix very reminiscent of that of 409 Chlamydomonas RPGs: indeed, each organellar RPG set is co-expressed. A subset of 410 Arabidopsis RPGs lacked a clear functional localization; however, co-expression with 411 other RPGs clearly predicted their localization as being either plastidic ("unknown 1") or 412 cytosolic ("unknown 2" in Figure 4E). We provide the Arabidopsis normalized microarray 413 dataset as Supplemental Data Set 9. We obtained similar results with Physcomitrium 414 patens (not shown), although the exact interpretation is likely muddled by the multiple 415 splice variants listed for each gene. 416 417 Transcription factors. As regulators of gene expression, transcription 418 factors and other DNA-binding proteins will bind to their cognate cis-regulatory elements 419 to modulate gene expression. We wished to test whether co-expression cohorts 420 associated with transcription factors may help in deciphering their biological function. To 421 this end, we calculated the mean PCC between a given transcription factor and its co-422 expressed cohort from networks N1, N2 and N3. As shown in Figure 5A, PCC values 423 ranged from 0.42 to 0.92, with a mean of 0.64 ± 0.09. The gene encoding the 424 transcription factor NITROGEN RESPONSIVE REGULATOR (NRR1) showed one of 425 the highest PCCs (0.885 for its N1 cohort) and was highly co-expressed with two other 426 transcription factor genes, both encoding Helix-Loop-Helix proteins (Cre01.g011150 and 427 Cre04.g216200. The genes LOW-CO2 RESPONSE REGULATOR (LCR1) and 428 CIA5/CCM1 participate in the induction of gene expression in response to low CO 2 429 conditions (Fang et al., 2012;Xiang et al., 2001;Fukuzawa et al., 2001;Yoshioka et al., 430 2004), with LCR1 predicted to act downstream of CIA5 (Yoshioka et al., 2004). Both 431 genes showed high correlation with their N1 cohorts (CIA5: mean PCC of 0.61 with 20 432 genes and LCR1: mean PCC of 0.715 with 34 genes), but their cohorts did not overlap. 433 In addition, we failed to identify LCR1 as a gene co-expressed with CIA5, suggesting 434 Co-expression networks in Chlamydomonas 16 that each transcription factor may act in parallel rather than in converging pathways. We 435 also looked at the extent of co-expression between transcription factors, as illustrated in 436 Figure 5B and 5C. When subjected to hierarchical clustering with the First Principle 437 Component (FPC) method from corrplot, transcription factor genes showed weak to 438 moderate co-expression, as well as anti-correlations. The potential for co-expression (or 439 anti-correlation) did not appear to follow simple rules related to the family of the 440 transcription factors. The dataset generated here will provide an interesting opportunity 441 to compare the output from methods such as DNA Affinity Purification and sequencing 442 (DAP-Seq) (O'Malley et al., 2016). 443 We performed the same analysis with Arabidopsis transcription factors. We 444 calculated the mean PCC for 1,864 transcription factors represented by a probe on the 445 ATH1 Affymetrix microarray: mean PCCs per gene ranged from 0.49 to 0.96, with a 446 mean of 0.86 ± 0.06 ( Figure 5D. Co-expression between Arabidopsis transcription 447 factors was much more evident than in Arabidopsis, as shown in Figure 5E and 5F. The 448 vast majority of genes encoding transcription factors showed strong co-expression, but 449 a subset was anti-correlated ( Figure 5E) that was not explained based on a crude 450 separation into leaf-and root-specific expression pattern. We thus selected a list of 150 451 genes exhibiting the highest anti-correlation and subjected them to Gene Ontology (GO) 452 enrichment analysis to determine their function. The biological processes associated 453 with these genes included "heterochronic regulation of development", "photoperiodism", 454 "regulation of seed development" and "regulation of flower development", raising the 455 possibility that the observed pattern may reflect the temporal rather than the spatial 456 specificity of regulatory proteins. 457 Returning to Chlamydomonas genes encoding DNA-binding proteins, we took a 458 closer look as histone genes, most of which are coordinately expressed with a peak in 459 expression shortly before cell division (Strenkert et al., 2019a;Zones et al., 2015). 460 However, a small group of histone genes remain constantly expressed over the diurnal 461 cycle and are termed "non-replication" (or emergency) histones. Histone genes 462 displayed a striking co-expression pattern, with all replication histones being highly co-463 expressed ( Figure 5G). Similarly, non-replication histones were strongly co-expressed 464 as a group, but less so when probed against replication histones. Histone variants showed little correlation in their expression with either group. While assembling the 466 gene list for histones, we noticed their high numbers (117 histone genes, not counting 467 histone variants) and their tight clustering along only five chromosomes ( Figure 5G). 468 Even more remarkable was their arrangement as divergent gene pairs: all histone H2A 469 and H2B genes were present as divergent pairs, and all histone H3 genes occurred as 470 a divergent partner to a histone H4 gene. In many cases, each major histone class was 471 represented in a 4-gene cluster, corresponding to 84 (out of 117) histone genes ( Figure  472 5H). The high number of histone genes appeared to be unique to Chlamydomonas, as 473 the genomes of the other unicellular algae Micromonas sp., Chromochloris zofingiensis 474 and Volvox carteri encoded far fewer histones. However, the clustering of histones as 475 divergent gene pairs was partially maintained, as summarized in Figure 5I. In 476 Micromonas, the four histone genes were arranged as two divergent pairs, with H2A 477 and H2B belonging to one pair, and H3 and H4 found in the second pair. Likewise, most 478 histone genes from C. zofingiensis and V. carteri grouped in divergent pairs. 479 480

Co-Expression Cohorts and Co-Expression Modules 481
Testing co-expression between members of a gene family may help assign 482 specific functions, or group them in functionally homologous groups. We applied the co-483 expression cohort approach to the ferredoxin gene family, consisting of 12 genes 484 (FDX1-FDX12). FDX1, also known as PETF, is the main electron acceptor from 485 Photosystem I during photosynthesis. FDX5 has been shown to function in fatty acid 486 desaturation (Yang et al., 2015), while FDX9 likely plays a critical role in fermentation 487 (Strenkert et al., 2019b). We extracted the co-expression cohort associated with each 488  Table 2). 512 We restricted our efforts to the N3 network as a good compromise between larger 513 module sizes and significant GO enrichment within modules. Out of 117 N3 modules, 514 we grouped 37 modules into 8 functional groups based on their significant enrichment in 515 biological processes: transcription, translation, ribosome biogenesis, protein 516 degradation, DNA replication, transport, photosynthesis and flagella biogenesis and 517 function (Supplemental Table 3, Supplemental Data Set 11). A single module defined a 518 ninth group associated with response to phytohormones, specifically cytokinin, whose 519 signaling cascade is incomplete in the microalga (Lu and Xu, 2015). These categories 520 were not surprising and satisfying all the same: they broadly mapped to conserved 521 cellular functions, or to processes where Chlamydomonas is a premier model organism 522 for their study. 523 To obtain genes that are co-expressed with a list of interest, we separately used 524 manually-curated gene lists as baits to extract their co-expressed genes from the N1, 525 N2 and N3 networks. As stringency decreases from the N1 to the N3 networks, the 526 number of selected genes increased, but the resulting lists were nested. Co-expression cohorts associated with gene lists expanded the number of potentially informative genes 528 2-20 fold, with an average increase of 10-fold (Supplemental Figure 8). Using genes 529 from co-expression modules as baits, we thus identified their associated co-expressed 530 cohorts and determined the extent of overlap with other user-defined lists (as illustrated 531 in Figure 3C) to obtain high-confidence genes. We also established the timing of peak 532 expression over the diurnal cycle for each module, group and co-expressed cohorts, 533 using the diurnal phase of all genes considered rhythmic in two diurnal datasets 534 (Supplemental Figure 9)  propose that these non-annotated genes play a role in some aspect of cell division. 540 Only 19 out of the 245 genes overlapped with 79 genes identified by forward genetic 541 screens for defects in cell cycle progression; this overlap was limited to the highly co-542 expressed genes within both sets ( Figure 4A) (Tulin and Cross, 2014;Breker et al., 543 2018). We then determined the co-expression cohorts associated with each gene list 544 and assessed their overlap. By definition, all genes within our modules are highly inter-545 connected, but they also exhibited co-expression with ~ 400 additional genes that define 546 a larger cohort with presumptive function in cell division ( Figure 6B). Similarly, hundreds 547 of genes showed strong co-expression with the 30 co-expressed genes from the 548 genetics list ( Figure 6C). Finally, we defined a third list comprising genes critical for 549 DNA replication, chromosome segregation and cell division proper, for which we 550 determined the co-expression cohorts ( Figure 6D, Supplemental Data Set 12).Notably, 551 although the initial gene lists were quite distinct ( Figure 6E), their cohorts shared more 552 genes as network stringency decreased, suggesting that the intersection of co-553 expression cohorts converged on a common set of genes. 554 555 Proteasome-Dependent Protein Degradation. Two modules shared a function 556 in protein degradation. They largely overlapped and defined a set of 96 genes that 557 included all but two of the 26S proteasome subunit genes. Most subunits of the 26S 558 Co-expression networks in Chlamydomonas 20 proteasome were highly co-expressed (mean PCC: 0.67 + 0.13). CSN2 and CSN6 were 559 however not part of the protein degradation modules; they exhibited the weakest co-560 expression profile with other 26S proteasome subunit genes, although clearly still quite 561 high (CSN2 mean PCC: 0.54 ± 0.15; CSN6 mean PCC: 0.53 ± 0.06) (Supplemental 562 Figure 10A). The Chlamydomonas ortholog for the E3 ubiquitin ligase CONSTITUTIVE 563 PHOTOMORPHOGENIC 1 (COP1), Cre13.g602700 (currently annotated as SPA1, 564 Gabilly et al., 2019), showed no co-expression with the 26S proteasome (mean PCC: -565 0.09 ± 0.10), consistent with a role as a regulatory component of the proteasome. We 566 observed the same absence of co-expression in Arabidopsis between COP1 and the 567 remaining subunits of the proteasome, indicating a conserved mode of control from 568 unicellular algae to land plants. 569 Proteasome-dependent proteolytic degradation entails the addition of ubiquitin 570 onto the protein targeted for removal by the concerted action of E1 ubiquitin-activating 571 enzymes, E2 ubiquitin-conjugating enzymes and E3 ubiquitin ligases. The 572 Chlamydomonas genome bears 13 ubiquitin genes, three genes encoding potential E1 573 enzymes (Cre09.g386400, Cre06.g296983, and Cre12.g491500) and 17 genes coding 574 for E2 enzymes. We did not compile a list of all E3 ubiquitin ligase genes, as they form 575 large gene families. Our protein degradation modules only incorporated a single gene 576 each for ubiquitin (UBQ2), E1 activating enzyme (Cre12.g491500, annotated as UBA2) 577 and E2 conjugating enzyme (UBC21, although it was the second lowest-expressed 578 UBC gene in our dataset; Supplemental Figure 10A). No other ubiquitin gene displayed 579 a co-expression pattern with our protein degradation modules. By contrast, both 580 remaining E1 enzyme genes (Cre09.g386400 and Cre06.g296983) were highly co-581 expressed with genes from our protein degradation modules. Likewise, we identified a 582 subset of genes encoding E2 conjugating enzymes that were co-expressed with 26S 583 proteasome subunit genes: UBC3 (Cre03.g167000), UBC9 (Cre16.g693700, also the 584 most highly expressed UBC gene) and UBC13 (Cre01.g046850) and present in the co-585 expression cohort linked to our modules. In addition, the gene UBC22 (Cre12.g515450) 586 appeared anti-correlated with other 26S proteasome subunit genes, hinting at a 587 previously unexpected level of control. 588 Co-expression networks in Chlamydomonas 21 We used the 96 genes that formed the protein degradation modules as baits to 589 identify their co-expressed cohorts in each of our three most stringent networks (N1-590 N3). Via guilt-by association prediction, we thus assigned a potential function in protein 591 degradation for 350-760 genes in addition to those already found within our modules 592 (Supplemental Figure 10B, Supplemental Data Set 13). 593 594 Cilia Modules. Four modules were associated with GO terms with a function in 595 cilia assembly or intraciliary transport. They also demonstrated partial overlap between 596 themselves, indicating that these four modules defined a single, larger cilia group 597 consisting of 221 nuclear genes ( Figure 6F). The genes making up these modules were 598 highly co-expressed with a fraction of genes identified in CiliaCut and the cilium 599 proteome ( Figure 6F). The intersection of the initial gene lists (modules, CiliaCut, 600 overlap and cilium proteome) defined a set of 44 genes, nine of which (ODA1, DRC3, 601 IFT121, IFT46, IFT74, MBO2, MIA1, PF16 and PF20) were previously identified through 602 forward genetic screens. We also extracted the co-expression cohorts associated with 603 cilia modules, CiliaCut and the cilium proteome ( Figure 6G-I, Supplemental data Set 8), 604 linking several hundred genes to cilia. Their overlap (when using the N1 network) 605 consisted of a set of 193 high-confidence cilia-related genes. 606 607 Photosynthesis modules. Four modules defined a larger photosynthesis group 608 ( Figure 6K) that we sub-divided into three modules containing many of the genes 609 encoding tetrapyrrole biosynthetic enzymes, while the last module was related to 610 photosystems components. We extracted their co-expression cohorts ( Figure 6L-N), 611 resulting in hundreds of genes exhibiting strong co-expression. We also determined the 612 overlap between the initial gene lists ( Figure 6O) and their N1 cohorts ( Figure 6P): the 613 co-expression modules clearly included both photosynthesis-and tetrapyrrole 614 biosynthesis-related genes. As might be expected for genes necessary for proper 615 chloroplast function, the overlap between N1 cohorts was substantial across all 616 categories tested (modules, photosynthesis and tetrapyrroles), highlighting interesting 617 genes for potential follow-up studies within the modules and the N1 cohort. 618

Genes in Co-Expression Modules Cluster Based on their Diurnal Phase 620
During our analysis of co-expression modules, we noticed a high proportion of 621 diurnal synchronization between co-expressed genes within modules and their 622 associated co-expression cohorts, even though diurnally-expressed genes occupy the 623 entire diurnal time landscape (Figure 7A, 7B). We therefore asked how frequently genes 624 within co-expressed modules shared the same phase. Out of 117 modules extracted 625 from the N3 network, 110 contained at least two rhythmic genes ( Figure 7C). with a 626 mean percentage of rhythmic genes of 65% and a median value of 71.6% ( Figure 7C). 627 Modules with few rhythmic genes tended to be associated with large standard 628 deviations, indicative of little synchronization between the genes comprising them 629 ( Figure 7C). By contrast, modules consisting of a higher frequency of rhythmic genes 630 showed high synchrony; their mean phase provided information relating to the biological 631 function of each module, as illustrated below. Notably, the anti-correlated cohorts to 632 most modules exhibited a mean phase that was 6-12 h out of phase with that of their 633 related module (not shown), highlighting the importance of time-of-day when 634 considering co-expression. 635 Molecular events leading to cell division are coordinately expressed with a phase 636 distribution between 10-12 h after dawn: in agreement, we determined that the phase 637 distribution of cell division modules and genes from the cell division "genetics" list 638 showed the same phase preference, with 232 out of 245 genes being rhythmic, as did 639 their associated co-expressed cohorts from the N1 network, (Figure 7D 7F). The degree of synchrony may provide an additional selection criterion for co-644 expressed genes, as seen with phase distributions of genes belonging to CiliaCut only 645 (that is, CiliaCut genes whose gene products were not detected in the cilium proteome). 646 Indeed, CiliaCut only genes displayed a wide range of diurnal phases, whereas co-647 expressed cilium proteome genes and genes at the intersection of CiliaCut and the 648 cilium proteome were highly rhythmic and synchronized to the middle of the night 649 ( Figure 3C and Figure 7G).
We used the 96 genes ( Figure 7H, inset) that form the protein degradation 651 modules as baits to identify their co-expressed cohorts. They displayed a high degree of 652 synchronized rhythmicity across diurnal datasets ( Figure 7H). Only two out of the 96 653 genes from the protein degradation modules did not show rhythmic expression over a 654 diurnal cycle. The occurrence of diurnal rhythmicity remained very high in their 655 associated co-expression cohorts, with 391 rhythmic genes out of 450. The distribution 656 of their diurnal phases was also quite narrow for both sets of genes, with a peak in the 657 second half of the day ( Figure 7H). We speculate that timed protein degradation offers a 658 mechanism for the removal of photo-oxidized proteins, which is broadly consistent with 659 the recent characterization of Chlamydomonas mutants lacking activities for the E3 660   Table 3). Cytosolic RPGs were constitutively 667 expressed and thus had no clear diurnal phase, whereas both plastid and mitochondrial 668 RPGs exhibited preferred diurnal phases between 1-2 h and 3-5 h after dawn, 669 respectively ( Figure 7J), as expected (Zones et al., 2015). 670 Four modules defined a larger photosynthesis group that we sub-divided into 671 three modules containing many of the genes encoding tetrapyrrole biosynthetic 672 enzymes, while the last module was related to photosystems components. Both sub-673 groups were highly rhythmic over the diurnal cycle and restricted to a small time-674 window. Their respective phases agreed with their underlying biological function: genes 675 encoding tetrapyrrole biosynthetic enzymes peaked ~ 2 h prior to components of both 676 photosystems ( Figure 6K). While highly co-expressed, photosynthesis-and 677 tetrapyrroles-related modules did not substantially overlap (Supplemental Data Set 14), 678 indicating that a diurnal phase difference of 2 h was sufficient to form independent 679 clusters.
We conclude that co-expression modules are strongly influenced by the diurnal 681 phase of their constituent genes. While this result may in itself not be surprising, it also 682 raised the question of the overlap contribution of diurnal phase to clustering in our 683 dataset, which we addressed next. we plotted their diurnal phase (on the y axis) as a function of AOE gene order (on the x 718 axis). As shown in Figure 8C, the relationship between AOE gene order and diurnal 719 phases was far from random, and instead followed a linear pattern, whereby genes that 720 appeared first in the AOE correlation matrix had phases with peaks in the late evening. 721 As gene row number increased, diurnal phases gradually decreased, demonstrating the 722 widespread influence of diurnal phase on correlation potential between gene pairs. In 723 addition, the overall pattern of the AOE correlation matrix was reminiscent of that seen 724 for diurnal experiments ( Figure 1C relationship between the two gene order series ( Figure 8D). FPC clustering also sorted 738 genes according to their diurnal phase, although along distinct parameters 739 Supplemental Figure 11B). 740 We conclude that diurnal phase contributes substantially to the clustering of 741 genes, even for samples obtained from cells grown in constant light. Such samples appear to retain diurnal information that shapes the clustering outcome at the genome 743 level. 744 745

Chlamydomonas Transcriptome 747
That genes clearly clustered according to their diurnal phases even in a dataset 748 comprised solely of samples collected from cells grown in constant light raised the 749 possibility that these samples exhibited residual rhythmicity. We thus applied the 750 molecular timetable method (Ueda et al., 2004) to all RNAseq samples to determine the 751 extent of rhythmicity they might exhibit. The molecular timetable method, whose 752 principle is briefly explained in Supplemental Figure 12, extracts the rhythmic (diurnal or 753 circadian) information from single time-point transcriptomes using the known phases 754 and expected expression levels from a reference diurnal (or circadian) dataset. We 755 selected 480 genes across 24 phase bins; their peak time of expression is known 756 exactly, as well as their expression levels. We then extracted their normalized 757 expression from RNAseq4 and calculated the mean expression for each phase bin. 758 Finally, we plotted this mean for each RNAseq sample and each diurnal phase bin as a 759 heatmap. 760 We first looked at the two large diurnal time-courses, shown in Figure 9A, to 761 validate out methodology. Indeed, each diurnal sample (one row) showed a rhythmic 762 pattern with each peak and trough separated by ~12 h. In addition, successive time-763 points were more similar to one another than to later time-points, as observed earlier in 764 the correlation matrix ( Figure 1E). These results demonstrated the applicability of the 765 molecular timetable method to Chlamydomonas RNAseq samples, paving the way for 766 the extraction of the internal time of the collected sample, as determined by the phase 767 bin with maximal normalized expression. 768 We next subjected all remaining RNAseq samples to the same analysis and 769 clustered them based on their underlying pattern while generating the heatmap shown 770 in Figure 9B. Completely asynchronous samples should appear off-white across all 771 phase bins ("as", bottom of Figure 9B); overwhelmingly, Chlamydomonas RNAseq 772 samples instead displayed remarkable residual rhythmicity. Diurnal time-courses were easy to distinguish from other samples when we plotted the minimum and maximum 774 normalized expression values associated with each sample ( Figure 9C). Notably, most 775 other samples, collected from cells grown in constant light, retained strong global 776 oscillations, which we estimate to represent a synchronization between cells ranging 777 from 21-96%, with a mean rhythmicity of 48%, based on the amplitude between minima 778 and maxima relative to diurnal time-course samples ( Figure 9C). 779 The timing of minimum and maximum gene expression should be ~ 12 h apart in 780 diurnal and rhythmic samples: we therefore plotted peak and trough times predicted for 781 all samples based on the molecular timetable data. As shown in Figure 9D autotrophic (no reduced carbon source provided, but cultures were bubbled with CO 2 ) 800 or heterotrophic (with acetate as reduced carbon source) conditions. We normalized 801 FPKM counts to the mean inferred from the full RNA-seq dataset, and used the same 802 diurnal phase values as above. As shown in Figure 10A, autotrophic cultures exhibited 803 a very similar molecular timetable profile, with an estimated internal phase around dawn 804 Co-expression networks in Chlamydomonas 28 across all three iron concentrations. In sharp contrast, heterotrophic cultures responded 805 very differently: indeed, iron-limited cultures (0.25 µM FeEDTA) were 12 h out of phase 806 with the other two samples. Iron-limited heterotrophic cultures grow more slowly than 807 iron-deficient (1 µM FeEDTA) or iron-replete cultures (20 µM FeEDTA). We hypothesize 808 that the difference in internal phase between heterotrophic samples may thus partially 809 reflect the time at which cultures were sampled, as cells were harvested at the same 810 cell density (Urzica et al., 2012c). However, we cannot exclude a contribution to a 811 slower circadian clock under low iron conditions, as described for land plants (Chen et 812 al., 2013;Salomé et al., 2013;Hong et al., 2013). Nonetheless, we conclude that the 813 molecular timetable method is applicable to Chlamydomonas samples after performing 814 log 2 and mean normalization. 815 We then explored the applicability of this method to other algae where a high-816 density diurnal time course is not available: Vovox carteri and Chromochloris 817 zofingiensis. V. carteri samples consisted of two technical replicates each collected from 818 somatic and gonidial cells (Matt and Umen, 2018). We obtained one-to-one orthologs 819 between Chlamydomonas and V. carteri from Phytozome, after which we subjected all 820 C. carteri genes with a rhythmic Chlamydomonas ortholog to log 2 normalization and to 821 normalization with Chlamydomonas means. We then calculated the average normalized 822 expression for all genes, in 1 h bins. Gonidial cells appeared strongly rhythmic, with a 823 peak phase around 4-5 h after dawn and a trough ~12 h later ( Figure 10B). Remarkably, 824 somatic cells exhibited a completely different profile with a peak phase in the middle of 825 the night. We performed the same analysis of transcriptome samples collected in C. We initially set out to analyze multiple RNAseq datasets to prioritize genes 847 whose expression responded to changes in iron status in Chlamydomonas and 848 Arabidopsis. Our working hypothesis was that genes with a prominent role in iron 849 homeostasis should closely follow the expression pattern of known iron-responsive 850 genes like the iron transporters IRT1 and IRT2 or NATURAL RESISTANCE 851 ASSOCIATED MACROPHAGE PROTEIN 4 (NRAMP4), the Fe ASSIMILATION (FEA) 852 genes FEA1 and FEA2, or the FERRIC REDUCTASE (FRE). We quickly realized that 853 the assembly of 518 RNAseq samples into one dataset offered a unique opportunity to 854 explore the transcriptome landscape of the alga. We believe that we have only skimmed 855 the surface during our meta-analysis and invite others to use this dataset for their own 856 research questions. 857 We were surprised to see how little correlation existed between Chlamydomonas 858 experiments, even though several queried the same biological question, such as 859 responses to nitrogen deficiency or metal deficiencies (Figure 2). Samples collected in 860 the same laboratory similarly failed to show strong correlations, although growth 861 conditions are likely to be similar. We do not fully understand the underlying source of 862 variation, but we propose that strong residual rhythmic gene expression may contribute 863 to the observed pattern. As a test of our analysis pipeline, we determined the correlation 864 matrix of Arabidopsis microarray datasets, downloaded from AtGenExpress. As shown 865 in Supplemental Figure 13, samples (using the expression data for all genes as data 866 points) clearly grouped as a function of the tissue of origin, with shoot and leaf samples 867 generally strongly correlated, while anti-correlated with root samples. It is likely that 868 Arabidopsis samples show strong differentiation of their expression profiles as a 869 function of the tissue of origin, as might be expected, thus validating our pipeline. More 870 puzzling though is the fact that Chlamydomonas samples behave as independent units 871 that share little correlation with others. In this regard, it would be informative to perform 872 a comparative analysis of transcriptome datasets from multiple uni-and multi-cellular 873 organisms, to determine whether multicellularity drives the more polarized differentiation 874 of expression profiles seen across Arabidopsis samples relative to Chlamydomonas.
Co-expression modules assemble the most consistent gene pairs into a coherent 876 list that is characterized by high connectivity. However, each gene is itself co-expressed 877 with many genes not included in the module (Supplemental Figure 8). These co-878 expression cohorts can provide cues as to the function of a gene, especially when it 879 does not belong to a module. In addition, genes with the opposite expression profile can 880 give hints as to the function of a gene of interest. We have extracted co-expression and 881 anti-correlation cohorts for all Chlamydomonas genes, provided as Supplemental Data 882 Sets 2-7. We also provide an example script to run the same analyses presented here 883 on any gene list, from extracting the co-expression cohort to plotting the corresponding 884 correlation matrix (Supplemental File 1). We hope that this type of analysis spurs new 885 discoveries, not only in Chlamydomonas but also in Arabidopsis and other plants. Our 886 results with Arabidopsis RPGs ( Figure 4E) demonstrates the applicability of the method 887 to other organisms. 888 We do not anticipate all candidate genes identified based on co-expression to be 889 functionally tested, at least not in Chlamydomonas. Rather, we expect co-expressed 890 genes to be compared to other gene lists, generated by other means, in order to narrow 891 down the number of interesting candidates for follow-up studies further. For example, 892 large-scale non-targeted mutant screens in Chlamydomonas pave the way for the 893 systematic genetic dissection of phenotypes (Li et al., 2015(Li et al., , 2019; we envisage that the 894 intersection of co-expression and large-scale genetic screens will empower research, 895 not only in Chlamydomonas, but also in other algae. 896 The Chlamydomonas life cycle resolves around cell division, the timing of which 897 can be synchronized to dusk by light-dark cycles (Zones et al., 2015;Strenkert et al., 898 2019;Cross and Umen, 2015). When maintained under entraining conditions, at least 899 80% of the Chlamydomonas transcriptome exhibits rhythmic expression. It is unclear 900 how quickly algal cells become asynchronous when transferred to constant light 901 conditions. It is thought that cultures grown in constant light are largely arrhythmic at the 902 population level due to loss of synchrony. When applying the molecular timetable to 903 Chlamydomonas RNAseq samples, we discovered that the vast majority of samples 904 exhibited substantial rhythmicity, even when collected from cells grown in constant light 905 ( Figure 9). About one third of all samples appeared to have been collected 5-6 h after subjective dawn (that is, the dark-to light transition, had the cells been maintained under 907 entraining conditions). Based on the amplitude between minima and maxima extracted 908 from phase marker genes, we estimate that 21-96% of cells within a given culture were 909 synchronized, with a mean of 48%. Chlamydomonas strain stocks are typically kept in 910 constant light on solid medium before inoculating a liquid culture, which will itself be 911 placed in constant light. Pre-cultures are common before inoculating the test culture; 912 cells are generally collected by centrifugation when they reach mid-log. It is therefore 913 possible that diluting cells at the beginning of an experiment sends a resetting signal to 914 the Chlamydomonas circadian clock, the signature of which is still present 2-3 d later, 915 as evidenced by the degree of residual synchronization in all samples analyzed. We are 916 here only seeing the bulk behavior of Chlamydomonas cultures. Single-cell RNAseq 917 (scRNA-seq) analysis will allow a more detailed dissection of the diurnal contribution to 918 the Chlamydomonas transcriptome landscape. To begin to explore this possibility, we 919 recently performed scRNA-seq on almost 60,000 Chlamydomonas cells grown under 920 three growth conditions and from two genotypes. Indeed, we observed a substantial 921 heterogeneity among the cells that was partially explained by the endogenous phase of 922 individual cells (Ma et al.). Although cultures were grown in constant light for several 923 weeks, we hypothesize that diluting cells at the beginning of an experiment may act as 924 a resetting signal for the endogenous cell cycle and circadian clock. 925 Our observations also raise a question regarding the design of RNA-seq experiments: 926 when assessing the effect of a mutation or a treatment on cultures, Is it more important 927 to collect samples at the same cell density or at the same time? Our results suggest that 928 sampling time exerts a far greater influence on expression outcomes than sampling 929 density would. Best practices for RNAseq analysis may therefore dictate that a matched 930 control sample be collected at each time-point in order to remove any contribution to 931 differential gene expression from the strong rhythmicity exhibited by cultures. Genes 932 belonging to the same co-expressed modules tended to have the same diurnal phase 933 ( Figure 9C); the narrow window of expression seen in rhythmic genes would thus be 934 missed when comparing samples collected hours apart. In Arabidopsis, samples 935 collected 30 min apart already exhibited differential expression (Hsu and Harmer, 2012 In conclusion, we describe here an analysis of co-expression in the green 947 unicellular alga Chlamydomonas. We observed known and new connections between 948 genes ad provide the tools to take this analysis further for any gene of interest, in both 949 Chlamydomonas and other system with a body of transcriptome data available. 1000000000. We assembled all expression estimates as FPKM into one file and did not 963 attempt to correct for batch effect at this stage, with the thought that such effects would 964 contribute to the variation in expression. We then log 2 -transformed mean FPKMs across 965 replicates was with a pseudo-count of "1" added prior to conversion, followed by 966 quantile normalization with the R package preprocessCore. Finally, we subtracted mean 967 expression across all experiments for each gene, which removed any potential batch 968 effects from the data. We calculated Pearson's correlation coefficients (PCC) with the cor function in R and visualized for each gene pair using the R package corrplot, using 970 all 518 expression estimates. We maintained four expression datasets following each 971 normalization step: RNAseq1 (mean FPKMs); RNAseq2 (log 2 -normalized); RNAseq3 972 (quantile-normalized); RNAseq4 (normalized to mean). 973 We calculated the rank for all gene pairs by inverting the sign of PCCs by 974 multiplying the data frame by -1, then converting PCC values for each gene into ranks 975 with the function rank in R. We derived the mutual ranks (MRs) for two genes a and b 976 from the formula MR(a,b) = √(rank a  b x rank b  a ). Considering a matrix of ranks, the 977 ranks rank a  b and rank b  a are geometrically linked on either side of the diagonal: if 978 rank a  b has the coordinates (x,y) in the rank matrix, then rank b  a will have the 979 coordinates (y,x). We therefore transposed the rank matrix with the t function in R. We 980 obtained MR values for each gene pair by multiplying each cell from the rank matrix by 981 their counterpart in the transposed rank matrix, then square-rooted. 982 For network selection and visualization, we calculated edge weights from MR 983 values with the formula: Nx = e -(MR-1)/x , with x = 5, 10, 25, 50 or 100. Only Nx ≥ 0.01 984 were considered significant. We extracted gene pairs with significant edge weights from 985 the full edge weight matrix and loaded them into Cytoscape 3.5.1. We detected modules 986 of co-expressed genes with ClusterONE with default parameters. Modules with a p-987 value ≤ 0.1 were considered significant. 988 We also determined lists of anti-correlated genes by ranking PCC values from 989 the non-inverted PCC matrix generated by corrplot, and by calculating associated edge 990 weights as above. In this case, we limited our analysis to identifying anti-correlated 991 genes, as ClusterONE cannot detect modules using edge weights from anti-correlated 992 genes. 993 994

Co-expression Analysis Network in Arabidopsis 995
Microarray datasets were downloaded from the AtGenExpress project site 996 (http://jsp.weigelworld.org/AtGenExpress/resources/), and collated into a single file that 997 consisted of 34 Arabidopsis accessions, 16 sets of etiolated seedlings exposed to 998 various light treatments, 36 sets of seedlings exposed to pathogens, 13 cell culture 999 samples, 68 sets each for shoots and roots exposed to various abiotic stresses, 79 developmental samples (72 from shoots or leaves, 7 from roots), and 18 sets each for 1001 leaves and roots subjected to iron deficiency, with controls included. We log 2 -1002 normalized all data when not already done, and followed the same normalization steps 1003 described for the Chlamydomonas data set. 1004 1005

Analysis of Co-Expression from ClusterONE Modules 1006
We extracted normalized expression data (from RNAseq4) for genes belonging 1007 to a given cluster in R using the stack and unstack functions, and generated the 1008 corresponding co-expression matrix with corrplot. We tested for overlap between co-1009 expression modules with similar predicted function with the online tool Venny (Oliveros, 1010(Oliveros, 2007, and redrew co-expression matrices with a non-redundant gene list as input. 1011 Unless stated otherwise, we ordered genes based on the FPC (First Principle 1012 Component) clustering method built into corrplot. 1013 1014

Analysis of Co-expression from Manually-Curated and Community Gene Lists 1015
We extracted normalized expression data for genes that belonged to manually-1016 curated or community-generated lists as described above for co-expression modules. 1017 We maintained the same gene order when working with community lists, as the genes 1018 were sorted and grouped based on shared function. We sorted genes from manually-1019 curated lists following the FPC method in corrplot. 1020 1021

Identification of Co-Expression Cohorts 1022
We extracted the sets of genes co-expressed with each gene belonging to our 1023 co-expression modules in R by merging each module-specific gene list with a file 1024 representing all nodes and edges from networks N1 to N3. We collapsed each co-1025 expression cohort into a non-redundant list by using the function unique in R and tested 1026 each subset for overlap with merge or join. 1027 Manually-curated and community-generated gene lists presented an initial 1028 challenge, since not all of their constituents are necessarily co-expressed (for example, 1029 only a fraction of the genes defined by the mutant screen carried out by Fred Cross for 1030 cell cycle mutants is co-expressed). We therefore 1) ordered genes using the FPC