The state of the art in soybean transcriptomics resources and gene coexpression networks

In the past decade, over 3000 samples of soybean transcriptomic data have accumulated in public repositories. Here, we review the state of the art in soybean transcriptomics, highlighting the major microarray and RNA-seq studies that investigated soybean transcriptional programs in different tissues and conditions. Further, we propose approaches for integrating such big data using gene coexpression network and outline important web resources that may facilitate soybean data acquisition and analysis, contributing to the acceleration of soybean breeding and functional genomics research.


IN TROD UCTION
Soybean [Glycine max (L.) Merr.] is a major economic crop and is an important source of edible protein and oil. Besides, its ability to host symbiotic diazotrophic bacteria contributes to sustainable agricultural practices by reducing the use of chemical fertilizers because of bacteria-mediated fixation of atmospheric nitrogen (N 2 ) (Hungria et al. 2015). An important milestone in soybean genomics-based research was the availability of high-throughput genomic and transcriptomic sequencing data, which can help identify and characterize genes and pathways underlying plant growth and development. Over the past years, there has been a significant increase in soybean transcriptomic studies (Libault et al. 2010a;Severin et al. 2010;Garg and Jain 2013a;O'Rourke et al. 2017), which explored the transcriptional landscape in several tissues (Irsigler et al. 2007;Libault et al. 2010b;Severin et al. 2010;Liu et al. 2019), developmental conditions (Lu et al. 2016;Gazara et al. 2019;Yang et al. 2019) and genotypes (Goettel et al. 2014;Prince et al. 2015). This short review aims to summarize major transcriptomic studies and discuss recent advances in coexpression network analyses for soybean, outlining important web resources to facilitate data integration.
The major transcriptomic datasets available in soybean were generated by hybridization-based (i.e. microarrays) and sequence-based (i.e. RNA-seq) methods. Although microarrays are relatively easy to conduct and inexpensive, dependence on oligo probes limits the identification of novel transcripts. In addition, there are technical limitations that reduce reproducibility of gene expression estimates, such as incorrect probe-to-gene assignments (Dai et al. 2005), non-specific probe hybridization (Kroll et al. 2008) and low accuracy and precision (Draghici et al. 2006;Jaksik et al. 2015). In contrast, RNA-seq uses second and third generation sequencing methods to estimate transcriptional profiles independent of genomic information (Lee et al. 2002). In addition to the numerous technical advantages, absolute transcript levels from RNA-seq are also more correlated with the proteomic profiles (Fu et al. 2009). Thus, this method is considered the best technique available for transcriptomics studies, particularly in non-model species (Garg and Jain 2013b).

M ICROA R R AY-BA SED ST UDIE S
Initially (i.e. prior to 2010), soybean microarray studies led to a considerable accumulation of data in public repositories such as Gene Expression Omnibus (GEO). Many of those studies explored differential gene expression during pathogen attack (Moy et al. 2004;Zou et al. 2005;Ithal et al. 2007;Li et al. 2008). Besides, expression profiles in nodules have been investigated to understand the complex mechanism of symbionts survival in the host plant (Brechenmacher et al. 2008). Thibaud-Nissen et al. (2003) reported important transcript changes associated with somatic embryogenesis in soybean that could be useful for tissue culture, whereas Hudson (2010) explored the role of circadian clock in gene expression regulation and identified important circadian clock-controlled genes. A list of important soybean transcriptomic studies using microarrays is shown in Table 1.

RNA-seq-based studies
Due to its many advantages (Fu et al. 2009;Jaksik et al. 2015), RNA-seq quickly replaced microarrays as the method of choice for transcriptomics. To date, several studies have explored the spatiotemporal changes occurring in various soybean tissues using RNA-seq. The two first studies were reported by Libault et al. (2010b), who sequenced cDNA obtained from 14 tissues and provided experimental support for soybean gene annotations, and Severin et al. (2010), who investigated the transcriptomes from seven tissues and seven seed developmental stages and observed that tissues were grouped in three large clusters with underground, aerial and seed parts samples. While these studies focused on genome annotation and major differential gene expression patterns across a broad range of tissues, many other research groups targeted important specialized tissues or organs (e.g. seeds).
Seeds are the primary source of nutrient reserves for germination and early seedling growth. Jones and Vodkin (2013) examined the expression changes of soybean seed development from 4 days post-fertilization to the mature seed. In another study, Goettel et al. (2014) sequenced the transcriptomes of soybean seeds from nine lines varying in oil composition and content and reported transcript nucleotide polymorphisms associated with oil quality traits. In an original approach, Lambirth et al. (2015) investigated the physiological effects of expressing some recombinant proteins at high levels in soybean, such as human thyroglobulin and myelin basic protein.
In addition to the cotyledons, the embryonic axis plays key regulatory roles during seed germination and early seedling growth. Bellieny-Rabelo et al. (2016) explored the transcriptional landscape of soybean embryonic axes during germination and described the activation of several biochemical pathways that are critical for germination, such as lipid mobilization and glyoxylate cycle. To understand the role of the key hormone gibberellin (GA) in regulating seed germination, Gazara et al. (2019) analysed the transcriptomes of embryonic axes from seeds germinating in the presence of paclobutrazol, a GA biosynthesis inhibitor.
The development of soybean cultivars capable of tolerating diverse biotic and abiotic stresses is a major challenge, as plant responses to such stresses involve many complex pathways. Hence, transcriptome profiling provides an opportunity to identify genes involved in stress tolerance (Deshmukh et al. 2014). Transcription factors (TFs) often play a critical role in regulating stress response. Belamkar et al. (2014) characterized the expression of HD-Zip TFs under dehydration and salt stress. In another experiment, Prince et al. (2015) compared the transcriptomes of two lines contrasting for drought tolerance and found that several genes associated with biotic stress were upregulated in the drought-tolerant genotype. Further, this study also identified genomic variations in aquaporin genes of the drought-tolerant line that could facilitate marker-assisted selection in soybeans with improved drought tolerance.

FROM GENE S TO NET WOR K S : NET WOR K M ODELLIN G FROM TR A NSCR IP TOM IC DATA
The large number of soybean gene expression studies has resulted in the accumulation of a huge amount of data on public repositories, offering opportunities to perform more integrative and complex analyses, such as the construction of gene coexpression networks (GCNs). In GCNs, nodes represent genes and edges represent the strength of association between gene pairs. GCN-based studies can be used to extract biological signals from the system as a whole and can often identify genes that are important regulators of various biological processes, but individually lack high statistical significance or have small expression changes in differential gene expression analyses (Elo et al. 2007). GCN inference consists in identifying densely connected gene clusters using similarity measures and clustering algorithms. The genegene similarity measures can be correlation-based, such as Pearson's correlation coefficient (PCC, parametric) and Spearman's correlation coefficient (SCC, non-parametric), or rank-based, such as highest reciprocal rank (HRR), mutual information (MI) and partial correlations (PC) (Liesecke et al. 2018). From an n × n similarity matrix, various different clustering algorithms (e.g. hierarchical clustering and graph-based clustering) can identify coexpression modules (i.e.

References Contribution
Moy et al. (2004) Identified soybean genes that were strongly upregulated during infection with Phytophthora sojae and hypothesized that pathogen transits from biotrophy to necrotrophy between 12 and 24 h after infection Ithal et al. (2007) Identified 429 differentially expressed genes between uninfected and soybean cyst nematode-infected root tissues Zou et al. (2005) Reported significant changes in transcript abundance for 3897 genes in response to Pseudomonas syringae infection Dhaubhadel et al.
Reported a transcriptome profiling of soybean embryos during development and indicated differences in isoflavonoid content Li et al. (2008) Reported 40 genes that showed specific responses in a soybean aphid-resistant cultivar Irsigler et al.
Reported a list of candidate regulatory components that may integrate the osmoticand ER-stress signalling pathways in plants Brechenmacher et al. (2008) Explored the complexity of regulatory elements that are likely essential for the development of the symbiosis in root nodules O'Rourke et al.
Reported 43 differentially expressed genes between iron efficient and inefficient genotypes Soybean transcriptomics resources and GCNs • 3 groups of coexpressed genes). In plant biology, WGCNA (weighted gene coexpression network analysis) (Langfelder and Horvath 2008) has become the most popular algorithm, and this trend also stands true for soybean GCNs (Table 2). This algorithm calculates gene-gene expression similarities from an adjacency matrix (PCC or SCC raised to a power β to amplify disparities) and attributes association strengths between gene pairs using both correlation coefficients and shared neighbours. Module detection is then performed using dynamic tree cutting, which identifies gene clusters from hierarchical clustering dendrograms (Langfelder and Horvath 2008). Further, GCNs can be integrated with TF-DNA interactions to reconstruct gene regulatory networks (GRNs), which depict interactions between TFs and their target genes. This approach assumes that coexpressed genes are co-regulated and, thus, are more likely to share a conserved motif (i.e. TF binding site) in their promoter sequences (Ko and Brandizzi 2020). From a coexpression module, GRNs can be inferred with two main approaches: (i) mapping known TF motifs to the promoter sequences of its genes via enrichment methods or (ii) de novo motif detection followed by TF mapping (Bailey et al. 2009;Ko and Brandizzi 2020). Information on each TF's motifs can be obtained by ChIP-seq (chromatin immunoprecipitation followed by deep sequencing), but this approach can identify motifs of one TF at a time, limiting its application in a genome-wide scale (Kulkarni and Vandepoele 2020). Besides, less than 20 ChIP-Seq experiments have been performed for soybean genes, what reduces the accuracy of soybean GRN modelling.
Currently, the major bottleneck of GCN inference is the lack of a standard pipeline to process data and model networks, especially for large heterogenous data sets. This can lead to a considerably high number of false positives that can bias biological interpretations in highly heterogenous data sets. Despite the challenges, the number of studies reporting soybean GCNs has increased significantly ( Table 2). Although only a few soybean GCN studies have been conducted, this is certainly a promising approach to explore the vast amount of accumulated RNA-seq data in the coming years. GCNs can be used to answer many biological questions, and the most prominent of them are discussed below.

GCNs for the inference of gene function
As a consequence of extensive gene duplications, inferring gene function in plants exclusively through homology mapping is extremely challenging, as paralogues often lack functional conservation (Mutwil et al. 2011). GCNs can help address this issue by deploying guilt-byassociation (GBA) approaches, which assume that functionally related genes tend to be co-regulated and, hence, coexpressed. Almeida-Silva et al. (2020) have recently predicted the function of 93 hub genes without functional description in soybean by using GBA from a global soybean GCN. Likewise, Mutwil et al. (2011) proposed an integration of sequence data with GCN comparison across seven plant species to transfer knowledge of gene function from one species to another. As a small fraction of soybean genes have had their functions experimentally characterized, GCN-based function inference can be used complementarily to other approaches to accelerate soybean functional genomics approaches.

GCNs for evolutionary analyses
The use of GCNs to study the evolution of genes and genomes has gained popularity over the past few years. In this sense, Almeida-Silva et al. (2020) analysed the distribution of paralogue pairs across modules of a global soybean GCN and observed that most of the paralogous genes tend to diverge at the transcriptional level after any mode of gene duplication, but particularly upon local (e.g. tandem or proximal) duplication events. From a global microarraybased GCN, Wu et al. (2019) identified a nodulation-related module and compared its genes with their homologues in other species, such as Medicago truncatula, Phaseolus vulgaris and Lotus japonicus. Table 2. GCN studies in soybean. For the objectives, 1: inference of gene function; 2: evolutionary analysis and/or network comparison; 3: guide-directed discovery of genes involved in a biological process of interest; 4: module-trait association to identify gene clusters involved in a biological process of interest. *NA: not specified by the authors. † Average number of modules from 10 different networks. ClusterONE: graph-based clustering method; MCL: Markov clustering; Kappa: statistic to measure agreement between two or more classes of qualitative data; HCCA: Heuristic Cluster Chiseling Algorithm.  Wisecaver et al. (2017) integrated coexpression data for several species and identified gene clusters likely involved in the biosynthesis of specialized metabolites. Yu et al. (2017) inferred GCNs for four species and identified conserved and species-specific modules, as well as their functions based on enrichment analyses. Lu et al. (2016) analysed genes involved in seed weight and oil content in wild and cultivated soybean and identified different connectivity patterns between the two species, suggesting a transcriptional rewiring during soybean domestication.

GCNs uncover genes likely involved in processes of agronomic relevance
The most common use of GCN aims at identifying genes involved in a biological process or pathway of interest. For that, researchers rely on two main strategies. The first one consists in using guide genes, which are genes known to be involved in the process, most often with reliable experimental support. Using a guide-directed approach, Bian et al. (2020) constructed a GCN of differentially expressed TFs under salt stress and identified a hub of TFs involved in stress tolerance. Lima et al. (2017) identified TFs potentially related to seed longevity, whereas Liu et al. (2019) identified hubs that are strongly connected to genes differentially expressed under salt stress, several of which are involved in phytohormone signalling. Dias et al. (2016) identified genes that are coexpressed with previously reported drought stress-related genes. The second approach consists in calculating module-trait correlations, which can identify gene clusters associated with a particular phenotype. Wu et al. (2019) identified a nodulation-related module whose central genes were experimentally validated as critical for nodulation. Yang et al. (2019) found a module whose genes are likely involved in oil accumulation in seeds. Similarly, Qi et al. (2018) identified hub genes related to seed oil composition. Du et al. (2017) identified hub genes involved in seed set and size. Wisecaver et al. (2017) found groups of genes putatively related to the biosynthesis of specialized metabolites in four species. Almeida-Silva et al. (2020) identified pathways that are enhanced or inhibited in specific tissues in a global analysis. Lu et al. (2016) identified genes that were differentially expressed and differentially connected between early and mid-maturation in seeds, many of which are likely involved in seed weight and oil content. . Through this user-friendly interface, users can visualize the expression levels of genes of interest in several soybean tissues as expression maps, heatmaps or bar plots. Additionally, expression data (TPM-normalized or raw read counts) can be downloaded by tissue, PubMed ID or DOI. A short tutorial on how to use the Soybean Expression Atlas for data download and analysis is described in Supporting Information-File S1. Furthermore, web resources containing precomputed soybean GCNs have also been built. PlaNet (http://aranet.mpimp-golm. mpg.de) can be used to analyse co-functional networks within and across species of photosynthetic organisms (Mutwil et al. 2011). ATTED-II (https://atted.jp) is a coexpression database that was originally developed for Arabidopsis thaliana, but now includes data for nine plant species, of which seven have both microarray and RNA-seq data (Obayashi et al. 2018). Finally, a collection of soybean context-specific networks entitled SoyCSN (http://soykb.org/ SoyCSN/) has recently been developed and made available for users . Our group plans to integrate the recently reported GCN (Almeida-Silva et al. 2020) in the Soybean Expression Atlas described earlier.

P UBLICLY AVA IL A BLE SOYBE A N E X PR E SSION DATA A ND FU T UR E PROSPECTS
With the decreasing cost of sequencing technologies, the number of RNA-seq experiments deposited on NCBI's Sequence Read Archive (SRA) has been growing exponentially. Not surprisingly, the top 3 countries in number of soybean RNA-seq samples (China, USA and Brazil, respectively) are also among the top 4 global leaders in soybean production (http://soystats.com/internationalworld-soybean-production/), indicating a correlation between soybean production and investments in soybean research (Fig. 1A). Expectedly, many RNA-seq samples have accumulated on SRA since the Soybean Expression Atlas has been released ( Fig. 1B; information from 2 October 2020). However, the quality of the new samples is unknown, as we are yet to incorporate them in the next update of the Soybean Expression Atlas. For instance, 15% of the samples processed for the first version of the atlas were excluded in the quality check steps of our pipeline. Interestingly, there was a 5-and 8-fold increase in the number of seed and root samples, respectively. New seed samples comprise mostly studies targeting seed oil biosynthesis, suggesting that this may be a promising area in soybean genomics-based research. Recently deposited root RNA-seq samples are mostly represented by microbial inoculation experiments (both pathogenic and plant growth-promoting microorganisms), demonstrating increasing global efforts to better understand soybean association with microorganisms. The Soybean Expression Atlas is currently being updated to include these new RNA-seq samples, and we hope to release new versions yearly. This will help accelerate soybean genomics-based research by providing researchers with easy download and analysis of large-scale expression data.

SUPPORTIN G INFOR M ATION
The following additional information is available in the online version of this article. File 1. User manual.  For each sample, dark blue bars represent BioSamples that have been analysed in the Soybean Expression Atlas (http://venanciogroup.uenf.br/cgi-bin/gmax_atlas/index.cgi), whereas light blue bars represent BioSamples that were deposited after the release of the Soybean Expression Atlas (verified on 2 October 2020). Some samples were not assigned to any tissue (i.e. 'unknown') because submitters did not include this information in the SRA metadata. Both country and tissue information were extracted from the SRA metadata.

ACKN OWLED GE M EN TS
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior-Brasil (CAPES; Finance Code 001) and Conselho Nacional de Desenvolvimento Científico e Tecnológico. The funding agencies had no role in the design of the study and collection, analysis and interpretation of data and in writing.