Exploring large-scale gene coexpression networks in peach (Prunus persica L.): a new tool for predicting gene function

Abstract Peach is a model for Prunus genetics and genomics, however, identifying and validating genes associated to peach breeding traits is a complex task. A gene coexpression network (GCN) capable of capturing stable gene–gene relationships would help researchers overcome the intrinsic limitations of peach genetics and genomics approaches and outline future research opportunities. In this study, we created four GCNs from 604 Illumina RNA-Seq libraries. We evaluated the performance of every GCN in predicting functional annotations using an algorithm based on the ‘guilty-by-association’ principle. The GCN with the best performance was COO300, encompassing 21 956 genes. To validate its performance predicting gene function, we performed two case studies. In case study 1, we used two genes involved in fruit flesh softening: the endopolygalacturonases PpPG21 and PpPG22. Genes coexpressing with both genes were extracted and referred to as melting flesh (MF) network. Finally, we performed an enrichment analysis of MF network and compared the results with the current knowledge regarding peach fruit softening. The MF network mostly included genes involved in cell wall expansion and remodeling, and with expressions triggered by ripening-related phytohormones, such as ethylene, auxin, and methyl jasmonate. In case study 2, we explored potential targets of the anthocyanin regulator PpMYB10.1 by comparing its gene-centered coexpression network with that of its grapevine orthologues, identifying a common regulatory network. These results validated COO300 as a powerful tool for peach and Prunus research. This network, renamed as PeachGCN v1.0, and the scripts required to perform a function prediction analysis are available at https://github.com/felipecobos/PeachGCN.


Introduction
The advent of omics technologies has allowed the scientific community to generate enormous amounts of biological information.In parallel, increasingly efficient bioinformatic tools help us transform this information into structured biological knowledge.To date, more than seven million RNA-Seq libraries are available at the National Center of Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov/),representing a great opportunity for large-scale bioinformatics exploration and biological data integration.Therefore, taking advantage of this valuable resource is essential in the age of big data analysis.
In transcriptomics, representing this complex data as gene coexpression networks (GCNs) is becoming a widespread practice.GCNs are usually represented as undirected graphs, where nodes correspond to genes and edges correspond to correlations in gene expression patterns.GCNs can be built across multiple experimental conditions (condition-independent GCNs) or in specific experimental conditions (condition-dependent GCNs, e.g.tissue specific GCNs).They are based on the 'guilt-by-association' (GBA) principle [1], which states that genes with related functions share similar expression patterns.Following this principle, and using the functional annotation of the genes within a network, GCNs can be a very powerful tool to infer gene functions and to understand the regulation of specific metabolic pathways.For this reason, GCNs are extremely useful in crop species, where most of the bioinformatic and genetic tools are modest and our understanding of gene function is still limited [2].Several studies have already created GCNs in the plant model Arabidopsis thaliana ( [3-6]), maize ( [7]; S. [8]), rice [9,10], wheat [5], and grapevine ( [11][12][13][14]).
Peach [Prunus persica L. (Batsch)] has been used as a model organism for genetics and genomics in the Rosaceae, and more specifically in the Prunus genus, which encompasses other crops such as sweet and tart cherry, European and Japanese plum, apricot, and almond.However, in peach, the validation of genes responsible for breeding traits is a complex task.Long intergeneration times, phenological cycles, and space constraints due to the large size of the individuals under study are some of the hindrances for the work of peach geneticists [15].Moreover, Table 1.General topological characteristics of non-aggregated and aggregated GCNs with 100 and 300 top coexpressed genes (HRR100, HRR300, COO100, and COO300) there is a lack of efficient genetic transformation systems [16,17].

GCN
As a result of these limitations, only two genes, DRO1 and TAC1, have been biologically validated based on mutant analysis [18,19].
Although small-scale condition-dependent GCNs have been reported in peach and other Prunus species [20][21][22][23][24][25], these were created ad-hoc to study specific biological processes and so cannot be used in other experimental contexts.Therefore, a GCN capable of capturing robust gene-gene relationships across various experimental conditions, developmental stages and tissues, is needed.A GCN with these characteristics would help researchers overcome the intrinsic limitations of peach genetic studies and outline future research opportunities.
In this study, we present the first large-scale GCN in peach.We constructed four GCNs from publicly available RNA-Seq data and evaluated the performance of every GCN using a machinelearning algorithm based on the GBA principle.The GCN with the best performance was validated by predicting gene functions of well-characterized genes.Finally, we provide the scripts and data needed for function prediction analyses using the GCN presented in this study.These resources can be found at https://github.com/felipecobos/PeachGCN.

Aggregated GCNs included 82% of the protein-coding genes annotated in the peach reference genome
Six hundred and eight public RNA-Seq libraries, belonging to different organs and developmental stages of P. persica, were downloaded from SRA, classified and reanalyzed to generate different GCN types.To understand the differences between each GCN generated, we analyzed general topological characteristics of the four GCNs inferred in this study (Table 1).The two aggregated GCNs (COO100 and COO300) had 21 956 genes, while the other two, built by non-aggregated methods (HRR100 and HRR300) had 17 505 genes.From the total number of 26 873 protein-coding genes annotated in the peach reference genome, this represented 81.7% for aggregated and 65.1% for non-aggregated networks.The number of genes present in the aggregated GCNs represented 16.6% more genes (4451) from the peach whole-genome annotation, compared to non-aggregated GCNs.
Node degree refers to the number of connections between nodes in a network.The different methods used not only affected the number of genes included in the network, but also the node degree connectivity across all nodes of the GCN.Average node degree connectivity was higher in networks with relaxed sparsity (442 in COO300 and 470 in HRR300) in comparison to stringent sparsity (149 in COO100 and 161 in HRR100).The range between minimum and maximum node degree connectivity is wider in non-aggregated GCNs compared to aggregated GCNs with the same sparsity threshold (comparing HRR300 with COO300 and HRR100 with COO100).The minimum node degree connectivity was set by the sparsity threshold in all the networks: 100 for stringent sparsity (HRR100 and COO100) and 300 for relaxed sparsity (HRR300 and COO300).The highest node degree connectivity was found in HRR300, with a maximum of 1790 coexpressed genes with one single gene.In addition, aggregated GCNs showed a bimodal node degree connectivity distribution while non-aggregated GCNs had a unimodal distribution (Figure 1).

COO300 was the GCN with the highest AUROC value
When considering sparsity threshold, both GCNs with relaxed sparsity (HRR300 and COO300) had AUROC values above 0.7 for all the annotated databases (Table 2).COO300 was the GCN with the highest average AUROC value (0.746), outperforming the other GCNs.COO300 had the highest mean AUROC in almost all the datasets, except for Pfam and PANTHER, where the performance of HRR300 was better than that of COO300.HRR100 and COO100 had AUROC values under 0.7 in almost all the functional annotation databases, except for GOcc and PANTHER datasets.The best functional annotation performance in all the networks was for the functional annotation GOcc with an average AUROC value of 0.761, followed by PANTHER (0.724), KEGG (0.718), Pfam (0.714), GObp (0.709), Mapman (0.706), and GOmf (0.693).
The method used for network building also affected its performance, but the effect was not consistent.Considering the effect of the sparsity threshold, average AUROC values for relaxed sparsity threshold were always higher (HRR300 and COO300 = 0.741) than for the stringent threshold (HRR100 and COO100 = 0.694).When comparing GCNs by aggregation method, at relaxed sparsity (HRR300 and COO300), the average AUROC value for the aggregated method was higher but comparing at the stringent threshold (HRR100 and COO100), the average AUROC value was better for the non-aggregated method.
Finally, we evaluated the effects of adding further experiments (i.e.Bioprojects) on the AUROC value of every GCN built in this study.Figure 2 shows the correlation between the network AUROC value and the number of Bioprojects used.For every combination of GCN building method (aggregated or non-aggregated), threshold (top 300 or top 100) and dataset used (GObp, GOmf, GOcc, and MapMan) we observed similar trends, where the AUROC value increased with the number of Bioprojects.This trend was more pronounced for aggregated GCNs than non-aggregated GCNs, reaching a plateau after adding 10 to 12 bioprojects.In all cases, the standard deviation of aggregated GCNs decreased as the number of Bioprojects increased.Violin plot of node degree connectivity in each of the aggregated and non-aggregated networks with relaxed or stringent sparsity (COO300, COO100, HRR300, and HRR100).Boxplots of node degree connectivity were added for each violin plot.

Aggregated GCNs showed a positive trend between average node degree connectivity and AUROC score of individual functional annotations for GObp, GOmf, GOcc, KEGG, and Mapman
To assess the relationship between the AUROC score of individual functional annotations and the average node degree connectivity of the genes sharing that annotation we used a Loess regression (Figure 3).For example, for an individual functional annotation, such as GOcc: cell wall, we studied if the individual AUROC score of GOcc: cell wall was related to the average number of connections of the genes sharing that particular GO annotation.We then repeated the analysis for all the functional annotations within a dataset (GOcc: apoplast, GOcc: extracellular region, etc).In the case of aggregated GCNs, there was a positive trend between average node degree connectivity and AUROC score of individual functional annotations for GObp, GOmf, GOcc, KEGG, and Mapman.In the case of non-aggregated GCNs, the only dataset with a positive trend between average node degree connectivity and AUROC score of individual functional annotations was KEGG.The average node degree connectivity had no effect on the AUROC score of individual functional annotations in the Pfam dataset in any of the GCNs studied.

Case study 1: The melting flesh network
We selected two well-characterized genes responsible for fruit f lesh softening in peach, the endopolygalacturonases PpPG21 and PpPG22, located on chromosome 4 (Table 3) [29][30][31][32][33][34].Based on the evidence available to date, the variability of f lesh softening and stone adhesion during fruit ripening is due to the allelic combination of these two homologous genes.Both genes, PpPG21 or PpPG22, are associated with the development of melting, nonmelting, or non-softening fruits, while PpPG22 is also associated with the development of freestone or clingstone fruits.
The PpPG21 and PpPG22 gene-centered networks were constituted by 485 and 354 genes, respectively.Despite PpPG21 and PpPG22 were not found to be mutually coexpressed, their networks shared 238 genes.These shared genes were selected and named as the melting f lesh (MF) network (Figure 4A; Supplementary Material S2).MF network was annotated in GObp, GOcc, GOmf, and Mapman datasets.From the 238 genes in the MF network, 136 genes showed significantly enriched terms in GObp, 123 in GOcc, 156 in GOmf and 116 in Mapman (Supplementary Material S2).
After MF network annotation, we performed an enrichment analysis.MF network was enriched in 33 different terms (33 terms were significantly over-represented in this subnetwork).Out of these 33 terms, 12 belonged to the GOmf dataset, 9 to Mapman, 8 to GObp, and 4 to GOcc (Figure 4B; Supplementary Material S2).Within GOmf terms, up to 26 genes were annotated as hydrolase activity or as its child term (direct descendant), hydrolase activity, acting on glycosyl bonds.The next term was 'xyloglucan:xyloglucosyl transferase activity', with four genes annotated.With three genes annotated, we found the terms 'methyl indole-3-acetate esterase activity', 'methyl salicylate esterase activity', 'methyl jasmonate esterase activity', 'oxidoreductase activity, acting on paired donors, with oxidation of a pair of donors resulting in the reduction of molecular oxygen to two molecules of water and metal ion transmembrane transporter activity'.Finally, with two genes annotated, we found the terms 'inositol hexakisphosphate binding', 'phosphate ion transmembrane transporter activity', 'protein-disulfide reductase activity', and 'acid-amino acid ligase activity'.
Using Mapman as the annotation dataset, 27 genes were annotated as 'enzyme classification'.There were eight genes annotated Table 3. Candidate genes selected for network validation.The gene IDs were referred to the peach reference genome version 1 and 2.0 [26,27] and NCBI [28] while genomic coordinates and annotation were referred to the peach reference genome version 2.0 [27]

Gene ID Gene name Genomic coordinates Annotation
Prupe Within GObp, there were 26 genes annotated as 'oxidationreduction process'.Up to 11 genes were annotated as 'metabolic process'.There were nine genes annotated as 'cell wall organization,' and four as 'cell wall biogenesis', child terms of 'cell wall organization or biogenesis'.There were four genes annotated as 'cellular glucan metabolic process' and its child term, 'xyloglucan metabolic process', four as 'jasmonic acid metabolic process', and three as 'salicylic acid metabolic process'.Using GOcc as the annotation dataset, 16 genes were annotated as 'extracellular region', 7 genes as 'apoplast', child term of 'extracellular region', and up to 13 genes were annotated as 'cell wall'.
We further explored the expression patterns of the MF network genes by inspecting all the SRA runs used to create the wholegenome networks.Those runs with enough metadata to be classified in organs were kept and used to calculate normalized expression values.Most MF-genes had specific or at least preferential expression in fruit tissues (Supplementary Figure S2).
Since PpMYB10.1 is a transcription factor, we explored its GCN with the idea to identify potential targets.The color-related PpMYB10.1 network was compared with tissue-independent and berry-dependent gene-centered networks of its grapevine orthologs VviMYBA1/VviMYBA2 and VviMYBA7, responsible for the anthocyanin pigmentation of fruits (Walker et al., 2007) and stress-induced pigmentation of vegetative organs (Matus et al., 2017), respectively.The comparison of both gene-centered networks show 10% of shared co-expressed orthologs (Figure 5B), from which many are known as direct targets of grapevine MYBA regulators, such as the glycosyl-transferase-coding gene VviUFGT1, the glutathione-S-transferase VviGST4 and chalcone synthase VviCHS3, among others.VviNAC33 is known as a regulator of Subgroup 6 MYBs in grape (D'Incà et al., 2023) and both peach and Vitis networks show co-expression of the NAC ortholog with PpMYB10 and VviMYBA genes, respectively.The peach FC network shows differences in conserved coexpression depending on the MYBA genes considered.The biggest similarity is found with fruit-specific VviMYBA1 and VviMYBA2 genes.

Different GCN topological features are affected by the selected algorithms
To achieve the best results when building gene coexpression networks (GCNs), two variables were tuned, aggregation method and sparsity threshold.We chose these variables since they are the ones that have the greatest inf luence on the subsequent performance of the GCNs [13].The four GCNs obtained were evaluated, with substantial differences in the general topological characteristics of the inferred GCNs.
When considering GCN building methods, a major difference between aggregated and non-aggregated GCNs was the number of genes constituting the network.Aggregated GCNs had 21 956 genes (81.7% of P. persica genes), while non-aggregated GCNs only had 17 505 (65.1% of P. persica genes).This difference comes from the low-expression gene filtering.In non-aggregated GCNs all the genes with less than 0.5 FPKM in 50% of the 498 RNA-Seq libraries were filtered, while in aggregated GCNs this filtering is independently performed for each of the 26 Bioproject groups.This allowed the inclusion in the GCN of genes expressed in more precise conditions and therefore involved in more specific processes.This indicates that both aggregated and non-aggregated networks are able to capture stable gene-gene relationships expressed in most of the RNA-Seq libraries used in the analysis, but only aggregated GCNs are able to detect gene-gene interactions produced in specific conditions.Condition-independent gene-gene connections could be related to basal metabolic pathways, while condition-dependent gene-gene interactions could be associated to specific metabolic pathways.This could explain the difference in the distribution of node degree connectivity between aggregated and non-aggregated GCNs.In scale-free topology networks (most cases in biology-related networks), degree is not distributed homogeneously across nodes; instead, some nodes may have a very high degree, highlighting them as putative network hubs.This property is what actually makes GCNs a suitable tool to precit gene function.As shown in Figure 1, aggregated GCNs had a bimodal distribution of node degree connectivity (it had two peaks), while non-aggregated GCNs had a unimodal distribution (only one peak).As previously mentioned, aggregated GCNs may be able to detect genes involved in specific and basal metabolic processes.The two modes detected in aggregated GCNs node degree connectivity distribution could be associated with these two groups of genes.The group with the lower node degree distribution could be associated with genes involved in more specific metabolic pathways, coexpressed with a lower number of genes.The group with the higher node degree distribution could be associated with genes involved in basal metabolic pathways and coexpressed with a higher number of genes.On the other hand, non-aggregated GCNs may only detect genes involved in basal metabolic pathways, having only one mode in their node degree distribution.
Another factor affecting the topology of the networks was the sparsity threshold selected.HRR300 and COO300 had a node degree connectivity higher than HRR100 and COO100.This was an expected result, since a higher number of ranked genes allows a higher number of connections between genes.

Sparsity threshold and the number of bioprojects determine network performance
According to the results, sparsity was a key factor affecting network performance.The average AUROC of relaxed sparsity threshold networks (HRR300 and COO300) was 0.741, while that of stringent sparsity threshold networks (HRR100 and COO100) was 0.694.Applying relaxed sparsity threshold during network building represented an increment of 6.3% in the AUROC score in comparison to stringent sparsity threshold.
The number of bioprojects used to build the GCN was a key factor in the case of aggregated methods, indicating the minimum number of Bioprojects necessary to reach a sufficiently high AUROC score (Figure 1).In every case, aggregated methods had a lower AUROC value than non-aggregated methods using a low number of Bioprojects.By increasing this number, aggregated methods overtook non-aggregated methods, as found in other studies [11,12].For future GCNs construction, increasing the number of Bioprojects could improve the overall performance of the GCNs.
While studying the effect of functional annotations average node degree on the AUROC value, we found major differences depending on the type of used dataset.There was a positive correlation between functional annotations average node degree and functional annotations individual AUROC in evidence-based datasets such as GObp, GOmf, GOcc, KEGG, and Mapman.On the other hand, this correlation was lost with datasets based on domain identification by sequence similarity, such as PANTHER and Pfam.These results are in agreement with the GBA principle, which states that coexpressed genes share function, and not necessarily similar sequences.

COO300 validated as a powerful tool for peach and Prunus research
In peach, fruit f lesh softening has been extensively studied at fruit ripening and postharvest due to its implication in fruit shelf life.Fruit softening involves several cellular processes, such as the disassembly of the cell wall and the dissolution of the middle lamella.These modifications are the result of hydrolytic changes in the polysaccharides that form the cell wall, including celluloses, hemicelluloses (mainly xyloglucan) and pectins.In case study 1, several terms found in the melting f lesh (MF) network were associated to this process, such as 'GOcc: cell wall', 'GObp: cell wall organization', and 'GObp: cell wall biogenesis', 'GOmf: hydrolase activity', 'GOmf: hydrolase activity, actin on glycosyl bonds', 'Mapman: enzyme classification.EC_2 transferases.EC_2.4glycosyltransferase', and 'GOmf: xyloglucan:xyloglucosyl transferase activity'.
We found 25 genes in the MF network that have previously been reported as associated to ripening and softening (Supplementary Material S2).Among them, we identified several genes involved in the enzymatic machinery responsible for cell wall disassembly, such as a pectin methylesterase (Prupe.7G192800),a pectin methylesterase inhibitor (Prupe.1G114500),a pectate lyase (Prupe.4G116600),a β-galactosidase (Prupe.3G050200),and a xyloglucan endotransglycosylase hydrolase (Prupe.1G255100).Additionally, we found an expansin, a cell wall structural protein (Prupe.6G075100).Related to ethylene, we identified a 1-aminocyclopropane-1-carboxylate synthase (PpACS1, Prupe.2G176900) and 1-amino-cyclopropane-1-carboxylate oxidase (PpACO1, Prupe.3G209900),both genes codifying the key enzymes catalyzing the final steps of the ethylene biosynthetic pathway [47].In fact, PpACS1 has been previously reported as a regulator of PpPG21 [48].Another gene related to ethylene production was the ethylene receptor 2 (PpETR2, Prupe.1G034300).The implication of this gene in the ethylene transduction signal has been verified at the transcriptional level in the final stages of fruit ripening in melting f lesh peaches [49].Regarding genes related to auxin biosynthesis, we found a YUCCA-like auxin-biosynthesis gene (PpYUC11, Prupe.6G157500) and an IAA amino acid synthase (PpGH3, Prupe.6G226100).Both genes have been reported to have the same expression pattern as PpACS1 at late ripening stages in response to high auxins levels in melting f lesh fruits [50].
Based on these results, we can affirm that the MF network is mainly constituted by genes involved in cell wall organization and biogenesis, with expression regulated by ripening-related phytohormones, such as ethylene, auxin, and methyl jasmonate.Moreover, we found 25 genes previously reported as involved in softening, some taking part in key steps of these processes.These results demonstrate that the MF network is closely related to peach fruit softening and therefore to the function of PpPG21 and PpPG22.
In case study 2, we prospected the fruit color (FC) network, related to the transcriptional regulation of anthocyanin accumulation in peach fruit tissues (i.e.mesocarp and exocarp).Anthocyanins belong to the f lavonoid compounds and are water-soluble pigments that determine the red-to-blue color of plant tissues in many species as peach [51,52].Their biosynthesis starts through the phenylpropanoid pathway in the cytosol, and then they are modified and transported to the vacuoles, where they finally accumulate.Anthocyanin spatial and temporal distribution is determined by the expression pattern of structural and regulatory genes.Furthermore, anthocyanin biosynthesis is controlled by environmental factors, such as light incidence in fruits and temperature in leaves [53][54][55][56].One of the key genes is PpMYB10.1, that has been proposed as a strong candidate for the red pigmentation of the f lesh around the stone and the red skin of the fruit [35-37, 39-43, 57].
The PpMYB10.1-gene centered network, called here FC network, was enriched with many terms directly involved in anthocyanin metabolism, such as 'Mapman: secondary metabolism.phenolics.Flavonoid biosynthesis.chalcones.chalcone synthase activity.chalcone synthase (CHS)' and 'Obp: polyketide biosynthetic process'.This and other related terms contain many genes that are orthologs of well-known anthocyanin-related genes in grapevine (e.g. the CHS gene Prupe.1G002900,Prupe.1G003000 and Prupe.I005800).In fact, our Prunus-Vitis dual-species comparison allowed us to identify putative targets of PpMYB10 that can be subject of further molecular characterization.For instance, one of the CHS genes reported here is known to be directly regulated by the MYB10.1/bHLH3complex [39].
Finally, a set of annotated terms related to cell wall organization was found in the FC network, probably due to the temporal and spatial co-localization of anthocyanin biosynthesis and pectin degradation during fruit ripening 'Mapman: cell wall organization.pectin.modification and degradation.pectate lyase' and 'GObp: pectin catabolic process'.These genes are involved in pectin modification during cell wall organization and include pectinesterase inhibitors (PEI) (Prupe.1G131900and Prupe.1G123800),pectate lyase (PL) (Prupe.1G239900,Prupe.1G268500 and Prupe.1G268700), a polygalacturonase (Prupe.2G014800),and two pectin methylesterases (PME) (Prupe.3G003700and Prupe.3G147000).Anthocyanins and cell wall degradation enzymes have overlapping expression patterns in peach fruit at ripening stage, as already observed at the transcriptional level in the mesocarp and exocarp under UV-B radiation [54].The gene pectate lyase 6 (PL6) is in fact also co-expressed to S6-MYBs in grapevine, representing either a potential MYB target or suggesting a common regulator with these type of MYB genes.
Taken the results obtained in case studies 1 and 2 together, we can affirm that COO300 is validated as an accurate and powerful tool for gene function prediction in Prunus sp.

Gene coexpression networks as catalysts for Prunus research
While large-scale GCNs have not been explored yet as gene function-prediction tools in Prunus research, they have been widely used in the model organism A. thaliana and other crop species, such as grapevine.Depending on the needs of the researcher, GCNs can be exploited in different ways.One of the most common is to identify different modules (also known as clusters or hubs) within the GCN through a clusterization analysis.These gene modules, which represent groups of genes highly connected between them and relatively isolated from the rest of the GCN, are particularly useful to study uncharacterized biological processes.For example, [9] used this approach in rice to annotate 13 537 genes, from which 2980 had no previous annotation.
Another approach that uses group of genes to study specific biological processes is the gene group-guided analysis.In this case, a list of well-characterized genes involved in a specific biological process are selected and genes coexpressing with the list of genes of interest are extracted from the network.In this way, the selected genes are used as a guide to study the transcriptional regulation of the biological process of interest.Huang et al. [7] successfully applied this approach to study the cell wall biosynthesis in maize.Pathway-centered network analysis has also been helpful in the identification of members or regulators of secondary metabolic pathways [12].
Finally, GCNs can be used to infer the function of a gene of interest by extracting gene-centered GCNs.These can also be used to study specific gene families, being particularly useful for studying transcription factor families.For instance, [14] developed R2R3 MYB-centered GCNs to study the potential secondary metabolic processes regulated by this family in grapevine.Gene-centered GCNs are of special interest in peach and Prunus research, where most trait-loci analyses lead to a list of candidate genes associated with the trait under study.With poor or no functional information, identifying the responsible gene from this list of candidates can be almost impossible.Even when a high-confidence candidate gene is identified, the lack of an efficient genetic transformation system is still one of the main limitations for functional, mutant, or transgenic based validation.Having a tool, such as the GCN presented in this study, with which obtaining useful information about the biological processes in which a gene is involved, may be of critical importance.

Conclusions
In this study, we performed the widest overview of transcriptomic analysis carried out to date in peach or other Prunus species.The GCN inference methods used, aggregated or non-aggregated, affected the topological characteristics and performance of the GCNs created.Using three well-characterized genes in peach, PpPG21, PpPG22, and PpMYB10, we were able to validate the network with the best performance, COO300.The GCN tool presented in this study will help Prunus researchers overcome the intrinsic limitations of working with crop tree species, prioritize research lines and outline new ones.COO300, named as PeachGCN v1.0, and the scripts necessary to run a function prediction analysis using it, are available at https://github.com/felipecobos/PeachGCN.

Mapping and quality filtering
We performed a sequencing-quality filtering and adapter removal using Trim Galore! version 0.6.1 (https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/).Reads with terminal Ns were trimmed, then reads with a Phred score lower than 28 or smaller than 35 nucleotides were filtered.Filtered libraries were quality checked using FastQC version 0.11.5 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).HISAT2 version 2.1 [73] was used to map RNA sequencing libraries to the reference peach genome 'Lovell' version 2.1 [26,27] with default parameters.Mapped Binary Alignment Map (BAM) files were filtered by alignment quality using SAMtools version 1.9 [74,75].Reads with mapping quality lower than 40 were filtered out.After this filtering, BAM files with less than 5 000 000 reads were discarded, leaving a total of 498 RNA-Seq libraries from 43 independent Bioprojects for further analyses.

Aggregated and non-aggregated GCNs inference
A raw count matrix was calculated using featureCounts [76], from Subread R package version 2.0.0 (http://subread.sourceforge.net/).For the raw count matrix construction, we excluded chimeric fragments and we used the coding DNA sequences as feature type and gene IDs as attribute type.The raw count matrix was then normalized to fragments per kilobase million (FPKM) mapped fragments (Z.[77]), obtaining a FPKM matrix.We then applied two different methodologies: aggregated and non-aggregated network inference with two sparsity thresholds set at top 100 (stringent threshold) and 300 (relaxed threshold) ranked genes (Supplementary Figure S1).
For non-aggregated analysis, genes with less than 0.5 FPKM in 50% of the RNA-Seq libraries were removed.Pearson's correlation coefficient (PCC) was calculated for the remaining genes and ranked according to descending PCC, giving a PCC matrix.High reciprocal rank networks for the top 100 (HRR100) and top 300 (HRR300) were constructed according to the formula: HRR x, y = max rank x, y , rank y, x whereby rank(x, y) is the descending sorted rank of gene y according to the coexpression list of gene x and vice versa for rank(y, x).
For aggregated analysis, we clustered the samples into 43 different groups according to the Bioproject study ID.We filtered Bioprojects with less than six RNA-Seq libraries, leaving 26 different groups with a total of 450 RNA-Seq libraries.Genes with less than 0.5 FPKM in 50% of the libraries within each group were removed and from each filtered FPKM matrix, a high reciprocal rank network for the top 100 and top 300 was constructed.Frequency of gene coexpression interactions in all groups was calculated and ranked in a co-occurrence matrix.Finally, co-occurrence networks for top 100 (COO100) and top 300 (COO300) interactions were obtained.

Networks performance assay
Networks were evaluated for their ability to connect peach genes sharing functional annotations.For this purpose, GBA neighbor voting, a machine learning algorithm based on the GBA principle [78], was assessed over the GObp, GOmf, GOcc, Pfam, KEGG, PANTHER, and MapMan datasets.Each network was scored by the area under the receiver operator characteristic curve (AUROC) across all functional categories annotated for the seven datasets.Annotations were limited to groups containing 20 to 1000 genes to ensure robustness and stable performance when using neighbor voting.The AUROC value threshold for an acceptable network functional annotation was set at 0.7.
We also evaluated the impact of adding individual Bioprojects to the different networks created, HRR300, HRR100, COO300, and COO100.For this purpose, we selected five subsets each of two Bioprojects computing the top 100 and top 300 HRR and COO GCNs, evaluating their AUROC using GObp, GOmf, GOcc, and MapMan datasets.We repeated this process adding one Bioproject to the initial subset to reach five subsets each with 26 Bioprojects, the maximum number of Bioprojects used in this study.The final subsets corresponded to the full HRR300, HRR100, COO300, and COO100.

Network validation
To validate the performance of COO300 in predicting gene functional annotations, we performed two case studies.In Case Study 1, we selected two well-characterized genes responsible for fruit f lesh softening in peach, the endopolygalacturonases PpPG21 and PpPG22, located on chromosome 4 (Table 3) [29][30][31][32][33][34].Based on the evidence available to date, the variability of f lesh softening and stone adhesion during fruit ripening is due to the allelic combination of these two homologous genes.Both genes, PpPG21 or PpPG22, are associated with the development of melting, non-melting or non-softening fruits, while PpPG22 is associated with the development of freestone or clingstone fruits.
To validate the performance of COO300 in predicting gene functional annotations, we performed two case studies (Table 3).As part of Case Study 1, genes coexpressed with PpPG21 and PpPG22 were extracted.Since both genes are involved in the peach fruit f lesh softening process, we selected genes present in both subnetworks.The selected subnetwork was named melting f lesh (MF) subnetwork.
MF and FC networks were functionally analyzed with enrichment analyses using GObp, GOmf, GOcc and Mapman classifications.Functional annotations statistically over-represented were selected if passed the significance threshold of q-value <0.1.Finally, we compared the enriched terms (the functional annotations statistically over-represented) of MF and FC networks with the current knowledge on the peach fruit softening process and skin anthocyanin accumulation, respectively.

Figure 1 .
Figure 1.Violin plot of node degree connectivity in each of the aggregated and non-aggregated networks with relaxed or stringent sparsity (COO300, COO100, HRR300, and HRR100).Boxplots of node degree connectivity were added for each violin plot.

Figure 4 .
Figure 4. MF network analysis.(A) PpPG21 and PpPG22 GCNs and the MF network.Biological processes of interest are highlighted in the MF network.TF: transcription factor, CYP450: cytochrome P450.(B) Gene set enrichment analysis of the MF network.

Figure 5 .
Figure 5. FC network analysis.(A) Gene set enrichment analysis of the FC network.(B) Dual species comparison of FC network and VviMYBA1/VviMYBA2 gene centered networks.For details see Supplementary Material S3.