Soybean kinome: functional classification and gene expression patterns

Highlight Identification and functional classification of soybean kinase gene family revealed wide expansion and extensive divergence in gene structure, subcellular localizations, and tissue and stress gene expression patterns.


Introduction
The protein kinase (PK) gene family is one of the largest and most highly conserved gene families in plants. PKs phosphorylate proteins for functional changes and are involved in nearly all cellular processes, thereby regulating almost all aspects of plant growth, development, and responses to biotic and abiotic stresses. PK genes exist by the hundreds in all plant species in which they have been surveyed.
Genome-wide identification of PKs in a number of plant species have indicated the presence of more than 3% of the annotated proteins coding for PKs, an indication of their functional importance (Lehti-Shiu et al., 2009;Lehti-Shiu and Shiu, 2012). The functional classification of PKs was initially conducted based on the conservation and phylogeny analysis of the catalytic domains of eukaryotic PKs (Hanks et al., 1988;Hanks and Hunter, 1995). These classifications resulted in defining five major groups, divided into 55 subfamilies, with related substrate specificity and mode of regulation. The Hanks and Hunter classification was extended further by Manning et al. (2002) and Niedner et al. (2006) where the entire PK superfamily was classified into nine groups, 81 families, and 238 subfamilies based on sequence comparison of the catalytic domains as well as sequence similarity and domain structure outside the catalytic domains. Very recently, PKs from 25 plant species were classified into nine main groups and 115 families (Lehti- Shiu and Shiu, 2012). The defined groups included AGC (PKA-PKG-PKC), CAMKs (calcium-and calmodulin-regulated kinases), CK1 (casein kinase 1), CMGC (including cyclin-dependent kinases, mitogen-activated protein kinases, glycogen synthase kinase and cyclin-dependent kinases), STE (including many kinases functioning in MAP kinase cascades), TK (tyrosine kinases), TKL (tyrosine kinase-like kinases), plant-specific, and 'other' (a group of kinases that could not be classified easily into the previous groups). Each plant PK repertoire or kinome is significantly larger in size than that of other eukaryotes, including those residents in animal genomes. The dramatic expansion of plant kinomes over other eukaryotes may be the result of recent duplication events and a high retention rate of duplicates (Hanada et al., 2008;Lehti-Shiu and Shiu, 2012).
Several studies have indicated that extensive expansion of specific PK families such as receptor-like kinase (RLK)/Pelle has also contributed to the large number of plant PKs (Shiu and Bleecker, 2003;Shiu et al., 2004;Vij et al., 2008;Lehti-Shiu et al., 2009;Gao and Xue, 2012;Wei et al., 2014). The RLK/Pelle family has expanded at a significantly higher rate than other kinases and in general constitutes more than 60% of the PKs in flowering plants. The large size of this family can be attributed to the expansion of a few subfamilies, specifically those bearing leucine-rich repeat (LRR) domains (Lehti-Shiu et al., 2009). Because members of these subfamilies are known to have roles in defence/resistance responses, it has been suggested that the wide expansion was probably a result of adaptation to rapidly evolving pathogens (Lehti-Shiu et al., 2009). In contrast, the RLK subfamilies with developmental functions are conserved in size (Shiu et al., 2004). The expansion of the PK superfamily seems to correlate well with increasing developmental complexity. For example, the green alga Ostreococcus tauri and moss (Physcomitrella patens) contain 93 and 685 PKs, respectively, compared with 1008 PKs in Arabidopsis. In addition, there are considerable differences in the sizes of kinomes among plant species by more than 4-fold. However, because the kinome of various plant species constitutes a relatively similar percentage of the whole-genome protein-coding genes, it seems likely that the PK superfamily has undergone similar mechanisms of expansion in flowering plants (Lehti-Shiu et al., 2009).
Functional characterization studies of PKs in Arabidopsis thaliana during the last two decades have positioned PKs as core components of various signalling pathways controlling multiple biological processes (Champion et al., 2004;Morillo and Tax, 2006;Colcombet and Hirt, 2008;Rodriguez et al., 2010;Gish and Clark, 2011;Tena et al., 2011). In soybean, however, only a limited number of PKs have been functionally characterized, despite the recent development of functional genomics tools and a published soybean genome (Schmutz et al., 2010). Soybean PKs have been found to function in the response to biotic stress (Srour et al., 2012) and abiotic stress (Im et al., 2012;Sun et al., 2013;Yang et al., 2014), as well as in the equilibrium of gene expression between disease resistance and growth and development (Liu et al., 2011. Whole-genome sequencing of several plant species has allowed large-scale identification of plant kinomes with higher resolution than earlier studies. With the availability of the soybean whole-genome sequences along with annotation of the encoded proteins (Schmutz et al., 2010), it is now possible to identify and functionally classify the PK family using large-scale phylogenetic approaches, which was our goal for this study. We identified 2166 putative PK genes, which were functionally classified into groups, families, and subfamilies. Our analysis of soybean PKs also included their chromosomal location, gene structure, duplication events, expansion, subcellular localization, expression profiles, and co-expression relationships.

Identification and classification of soybean PKs
To identify soybean PKs at the genome level, all protein-coding genes were downloaded from the latest version of the soybean genome (Wm82.a2.v1) from Phytozome (v. 10.0.). Hidden Markov models (HMMs) of the 'typical' Pkinase clan [Pkinase (PF00069) and Pkinase_Tyr (PF07714)] (Finn et al., 2010) were used to search for putative PKs using HMMER v. 3.0 (Finn et al., 2011) with an E-value cut-off of <1.0. After this initial screen, 2352 sequences containing the PK domain were identified. Only the longest variant of each gene was retained and all other redundant sequences were deleted. All the remaining sequences were further aligned with the PFAM kinase domain models to confirm the presence of kinase domains (Pkinase and Pkinase_Tyr domains) and eliminate pseudogenes as described previously by Lehti-Shiu and Shiu (2012). In this analysis, the putative PKs were considered typical PKs only if the domain alignments covered at least 50% of the PFAM domain models. Finally, a total of 2166 proteins containing at least one kinase domain were identified and assigned as soybean 'typical' PKs. Classification of identified kinases to groups, families, and subfamilies was defined using HMMs of the different subfamilies developed by Lehti- Shiu and Shiu (2012) based on PK sequences obtained from 21 plant species. The classification was further confirmed using phylogenetic approaches using neighbour-joining (NJ) and maximum-likelihood (ML) methods.

Chromosome location and synteny analysis of soybean PKs
Information about the chromosome locations of all soybean PKs was obtained from the Phytozome database. Putative homologous chromosomal regions and segmental/tandem duplication events of the PK superfamily were determined using the Multiple Collinearity Scan toolkit (MCScanX, http://chibba.pgml.uga.edu/mcscan2/) (Wang et al., 2012c. Synonymous (K s ) and non-synonymous substitution (K a ) rates were calculated using the 'add_ka_and_ks_ to_collinearity.pl' function of MCScanX. Collinearities of duplicated PK genes were visualized using the Genome Pixelizer program (http://www.atgc.org/GenomePixelizer). Tandemly repeated PKs in soybean genome were mapped to position on each chromosome with Mapchart v2.2 (https://www.wageningenur.nl/en/show/Mapchart. htm).

Subcellular localization
To determine the putative transmembrane proteins of soybean PKs, the N-terminal signal peptides and transmembrane domains were predicted using the SignalP v. 4.1 (Petersen et al., 2011) andTMHMM v. 2.0 (Krogh et al., 2001) programs, respectively. The subcellular localization of soybean PKs was predicted by WoLF PSORT (Horton et al., 2007). To test the predicted nuclear localization patterns in planta, the coding sequences of six soybean PKs (Glyma.02G276000, Glyma.04G029700, Glyma.14G000800, Glyma.04G097100, Glyma.11G050000, and Glyma.12G182600) were amplified using gene-specific primer pairs containing XhoI and SalI restriction enzyme sites in the forward and reverse primers, respectively. PCR amplification was performed using the high-fidelity PrimeSTAR GXL DNA polymerase (Clontech) according to the manufacturer's instructions. PCR products were digested by XhoI and SalI, gel purified, and fused to the N terminus of the yellow fluorescent protein (YFP) reporter gene in the pSAT6-EYFP-N1 vector and under the control of a double 35S promoter. The constructs were delivered into onion epidermal cells by biolistic bombardment as described previously by Hewezi et al. (2010). After bombardment, onion epidermal peels were incubated at 26 °C for 24 h in the dark. The subcellular localization of the fused proteins was visualized using an EVOS ® FL Auto Cell Imaging System (Life Technologies). The subcellular localization of each protein was determined at least three times.

Microarray data analysis
To analyse the expression patterns of soybean PKs for tissues and stress treatments, microarray data were retrieved from the Gene Expression Omnibus (GEO) database at the National Center for Biotechnology Information. Fifty microarray datasets for various tissues and organs and 24 datasets for biotic and abiotic stress treatments were selected from the largest Affymetrix platform, GPL4592. The raw data were normalized using the Affymetrix MAS5 algorithm using the R package Affy (Gautier et al., 2004). The normalized gene expression values of 817 individual PK genes were used to calculate the expression value for each gene relative to the mean expression level of each tissue/organ included, as described previously (Schmid et al., 2005) or relative to the corresponding controls of stress treatments. In the case of subfamily, the normalized gene expression values were averaged for all members within each subfamily and used to calculate the expression value for each subfamily as described above. Because only 817 out of 2166 PKs have probes on the Affymetrix microarray GPL4592 platform, we included in our analysis only subfamilies in which gene expression data were available for at least one-third of the subfamily members. Heat maps of expression levels of individual PK genes (817 genes) or 82 PK subfamilies were generated using MeV (MultiExperiment Viewer) software, v. 4.9 (http://www.tm4.org/mev.html).

Co-expression networks
Co-expression networks were generated for 82 PK subfamilies where gene expression data were available for at least one-third of the subfamily members. For each tissue and stress treatment, gene expression values for all members in each subfamily were averaged and the Pearson correlation coefficient (PCC) was computed between each pair of the subfamilies using SPSS v. 20.0. Co-expressed subfamilies were selected with a PCC cut-off of 0.7. This cut-off value was used previously in several studies and is well suited to generate moderatesized networks (Lin et al., 2011;Wang et al., 2012b;Lan et al., 2013). Two independent co-expression networks were generated using tissue and stress response gene expression data described above. The constructed co-expression networks were visualized using the Cytoscape v. 3.1.1 program (http://www.cytoscape.org/) (Saito et al., 2012).

Genome-wide identification and classification of soybean PKs
We searched all annotated genes of the latest version of the soybean genome (Wm82.a2.v1) for genes encoding PKs using HMMER v. 3.0 (Finn et al., 2011) with HMMs of the 'typical' Pkinase clan (Pkinase and Pkinase_Tyr) (Finn et al., 2010). When the putative PK sequences were aligned with PFAM kinase domain models, we down selected genes for further analysis if the domain alignments covered at least 50% of the PFAM domain models, which resulted in the identification of 2166 PKs (Supplementary Table S1 at JXB online) and exclusion of 186 atypical kinases (Supplementary Table S2 at JXB online). The soybean kinome was further classified into groups and subfamilies using the HMM search approach described in Materials and methods and confirmed by phylogenetic tree analysis using NJ ( Fig. 1 and Supplementary  Fig. S1 at JXB online) and ML methods (Supplementary Fig. S2 at JXB online). Interestingly, both NJ and ML methods generated the same classification for 2152 PKs, and only 14 genes showed different classification between both methods including 12 unclassified genes by the NJ method (Supplementary Table S3 at JXB online). The 2166 soybean PKs were then classified into 19 groups in which RLK (1418), CMGC (166), CAMK (158), TKL (115), STE (86), AGC (70), and CK1 (32) comprised 2045 genes, representing 94% of the soybean kinome. These 19 groups were further classified into 122 subfamilies (Supplementary Table S3). The size of the subfamilies was highly variable and varied between one gene (PEK_GCN2, TKL-Pl-3, and CMGC_GSKL) and 238 genes (RLK-Pelle_DLSV). Finally, a set of 20 PKs was arbitrarily classified by HMM search with low E-values but did not cluster with any of the known subfamilies and was thus assigned as an unclassified group (Supplementary Table S3). These genes might be novel and soybean specific, and might possibly constitute new gene families.

Conserved domain and gene structure of soybean PKs
All the 2166 identified PKs were scanned against the PFAM database using HMMs to identify additional conserved domains that may be common among subfamilies. All identified conserved domains are provided in Supplementary Table  S4 at JXB online. We found that 943 genes (43.5%) had only a kinase domain, whereas the remaining 1223 genes (56.5%) had additional conserved domains. A large fraction of PK genes with additional conserved domains was identified in the AGC (88.6%), RLK-Pelle (67.4%), CAMK (65.2%), and TKL (53%) groups (Supplementary Table S4), demonstrating that various subfamilies have various domain compositions. In contrast, members of each subfamily shared a common domain arrangement, suggesting common evolutionary history among PKs within the same subfamily. Interestingly, we found that 74 PKs contained two or three PK domains (Supplementary Table S5 at JXB online). These genes were found to be distributed among different subfamilies, but a significant proportion of these genes were found in specific subfamilies including RLK-Pelle_RLCK-XI (7/7), Group-Pl-2 (2/2), AGC_RSK-2 (38/43), CMGC_CDK-CCRK (3/4), and CMGC_SRPK (3/7) (Supplementary Table S5).
To gain further insights into the structural diversity of soybean PKs, we analysed the intron number of all soybean kinase genes. The number of predicted intronless genes (262) was found to be much less than that of intron-containing genes (1904) (Supplementary Table S3). The number of introns per gene varied from 0 to 28, with 336 (15.5%) genes containing more than 10 introns. Comparison of the exon/ intron arrangement at the subfamily level revealed that the number of introns in 33 subfamilies was conserved in all subfamily members (Supplementary Table S3). The gene structure of some subfamilies was highly variable. For example, four members of the AGC_RSK-2 subfamily were intronless and 34 members contained one or two introns, whereas the five remaining members contained 21 introns. When we carefully examined the phylogenetic relationships of these genes, we found that members of this subfamily could be separated into two subclades. This analysis suggested that PKs with varied intron numbers within subfamilies had their own evolutionary trajectories during the course of evolution, which can be used as an evolutionary imprint for further phylogenetic classification of soybean PK subfamilies. While the majority of the soybean PK subfamilies included members with varied intron numbers, specifically those encompassing relatively many members, other families showed low numbers or intronless structures. For instance, the RLK-Pelle_CR4L family (23 members) contained 18 intronless genes and four genes with a one-intron/two-exon structure. Similarly, out of the 94 genes belonging to the RLK-Pelle_LRR-III subfamily, 59 had a single-intron structure. The intron/exon distribution patterns appeared to correspond well with the phylogenetic analysis and subfamilies classification, which provided another layer of verification of our classification process. Furthermore, we investigated the structural configurations of soybean PKs relative to other species, two eudicot species (A. thaliana and Medicago truncatula) and two monocot species (Oryza sativa and Zea mays), by comparing the number of introns across all PK genes. Interestingly, the distribution patterns of intron numbers in these five species were very comparable ( Supplementary Fig. S3 at JXB online), suggesting that similar intron gain and loss events contributed to the structural evolution of the PK gene family before the eudicot-monocot split.

Comparative analysis of the soybean kinase subfamilies with monocot and eudicot species
We compared the distribution of soybean kinases across the 122 families with those of four other plant species comprising two eudicot (A. thaliana and M. truncatula) and two monocot (O. sativa and Z. mays) species. As expected, we found that the size of the majority of PK families was larger in soybean than in the other species, apparently because of its large kinome repertoire. Nevertheless, the expansion of the size of specific subfamilies belonging to the RLK-Pelle group was remarkable. For example, the RLK-Pelle_CrRLK1L-1 subfamily was comprised of 62 members in soybean, whereas the other four species contained between 13 and 18 members (Fig. 2). Other gene subfamilies including RLK-Pelle_DLSV, RLK-Pelle_ LRR-III, RLK-Pelle_LRR-XI-1, RLK-Pelle_RLCK-VI, and RLK-Pelle_WAK_LRK10L-1 also contained much higher member numbers in soybean than in Arabidopsis, M. truncatula, rice, or maize. In addition, other non-RLK-Pelle subfamilies including AGC_RSK-2, CAMK_CAMKL-CHK1, TKL_Pl-4, CAMK_CDPK, and STE_STE11 contained much higher numbers of genes relative to the other four species. Thus, the large size of the soybean kinome could be attributed in part to the expansion of these specific subfamilies. In contrast, we found that 20 subfamilies contained only one or two members including, for example, PEK_ GCN2, AGC_PDK1, AGC_PKA-PKG, CAMK_CAMK1-DCAMKL, CMGC_Pl-The, SCY1_SCYL2, STE_STE20-Pl, STE_STE-Pl, ULK_Fused, ULK_ULK4, RLK-Pelle_URK-1, and RLK-Pelle_URK-3 (Fig. 2). Notably, the small size of these subfamilies was also conserved in the other four plant species. It seems likely that these subfamilies have been subjected to limited expansion processes.
Two monocot-specific kinase subfamilies (CMGC_ CDKL-Os and Others_BUB) were absent not only in soybean but also in the other two eudicot species (A. thaliana and M. truncatula) (Fig. 2). The CMGC_CDKL-Os kinase family is specific to monocots, and rice and maize contain, respectively, 18 and 27 genes belonging to this subfamily. Other subfamilies including CMGC_DYRK-YAK, CMGC_ GSKL, PEK_PEK, plant-specific_Group-Pl-2, TKL_Pl-3, and RLK-Pelle_RLCK-XIII were identified in Arabidopsis and/or M. truncatula but were not detected in soybean, suggesting that these families have limited expansion ability and were subsequently lost in soybean. Consistent with this hypothesis, these six families are single-member gene families in Arabidopsis and M. truncatula.

Chromosomal locations and segmental duplication events of PKs
All 2166 PKs were mapped to the 20 soybean chromosomes to define their genomic distribution and localization. The soybean PKs were localized mainly near the ends of the chromosomes (subtelomeric regions) and to a lesser extent in the non-telomeric and centromeric regions (Fig. 3A). The soybean PKs are not equally distributed across the 20 chromosomes. Chromosomes 8 and 13 contained the highest numbers of PKs per chromosome, 162 and 173 genes, respectively. In contrast, chromosomes 16 and 19 contained the lowest numbers of PKs per chromosome, with 78 and 74 genes, respectively. PK gene subfamilies were generally distributed on several chromosomes, with the exception of the RLK-Pelle_RLCK-Os gene family, which was located mainly on chromosomes 9 and 15 (Supplementary Table S3).
It has been suggested that the soybean genome experienced whole-genome-wide duplications (WGDs) approximately 59 and 13 million years ago (Mya) (Schmutz et al., 2010). Therefore, we investigated the contribution of various duplication events to the expansion of soybean PKs. We first searched for segmental duplicates by examining soybean genome for duplicate blocks that contains kinase genes. We identified 1357 collinearity events among 1547 PK genes, indicating that segmental duplication events have played a major role in increasing the number of soybean PK genes (Supplementary Table S6 at JXB online).
Genome duplications are believed to have a significant impact on the expansion of certain gene families in plants, specifically those with regulatory functions (Maere et al., 2005;Freeling, 2009). Thus, the contribution of the 13-and 59-Mya duplication events to the expansion of specific subfamilies was examined by determining the synonymous distance (K s values) of collinearity events (1357) for all syntenically duplicated PKs identified above (1547 genes) (Fig. 3B). The K s value of 0.06-0.39 was then used to differentiate the events contributed by the 13-Mya WGD from those contributed by the 59-Mya WGD events (K s =0.40-0.80) (Fig. 3B). We found 708 and 476 collinearity events arranged at the 13-Mya and 59-Mya WGD regions, respectively. The finding that both 13-and 59-Mya collinearity events were associated with 1470 PK genes out of 2166 (67.87%) indicated that the expansion of the soybean PK superfamily was significantly contributed by the 13-and 59-Mya WGD. Careful examination of the collinearity events of the 13-and 59-Mya WGD revealed their occurrence in several chromosomes at specific regions ( Fig. 3C-E), suggesting that the expansion of soybean PKs was associated with polyploidization and subsequent genome shuffling.
Furthermore, we compared the collinearity events contributed by 13-Mya WGD (K s =0.06-0.39) and those contributed by the 59-Mya WGD (K s =0.40-0.80), to test the potential contribution of these two duplication events to the expansion of specific PK subfamilies. While both 13-and 59-Mya WGDs seemed to be responsible for the collinearity relationships of the majority of PK family paralogues, we identified 31 PK subfamilies in which collinearity events could be attributed only to the 13-Mya WGD (Supplementary Table  S7 at JXB online). These 31 gene families may represent the evolution of additional PKs that function in specific developmental, environmental, or physiological conditions.

Tandem duplication of soybean PKs
We searched for PK genes that were tandemly arranged at the same chromosomal location. We found that 229 genes (10.57%) were clustered at specific chromosomal locations (Supplementary Table S8 at JXB online), suggesting that they may have been generated from tandem duplications. The tandemly arrayed genes were found in 73 clusters distributed unevenly among the 20 soybean chromosomes (Fig. 4), where chromosomes 6, 8, 13, and 18 contained the highest number of genes. The largest cluster was identified on chromosome 6 and contained 14 genes all of which belong to the RLK-Pelle_DLSV subfamily (Fig. 4). Interestingly, the 229 genes were distributed into 24 subfamilies, from which 21 were assigned to the RLK group, indicating that tandem duplication mainly contributed to the expansion of RLK group. Furthermore, we tested whether tandem duplication events had contributed to the expansion of specific subfamilies. Interestingly, we found that tandem duplications accounted for the generation of 58.3% (7/12 genes) of the RLK-Pelle_ RLCK-Os subfamily. Similarly, tandem duplications seem to have contributed to the generation of about 52.2% (12/23 genes) and 46.2% (6/13 genes) of the RLK-Pelle_WAK and RLK-Pelle_RKF3 gene subfamily members, respectively (Supplementary Table S9 at JXB online). Also, RLK-Pelle_ DLSV and RLK-Pelle_LRR-XI-1 had many genes generated through tandem duplications (Supplementary Table S9).
Previous reports have indicated that PKs expanded via tandem duplication tend to function in biotic stress responses (Hanada et al., 2008;Lehti-Shiu et al., 2009). Therefore, we   cytoplasm, RLKs are transmembrane proteins containing extracellular receptor domains and intracellular kinase domains. However, the cellular localizations of the PKs belonging to other groups are largely unknown. Thus, the subcellular localization of these proteins was predicted using WoLF PSORT software (Horton et al., 2007), CELLO (Yu et al., 2006), and NucPred (Brameier et al., 2007). The analysis provided suggestions for extracellular and cytoplasmic as well as nuclear localizations (Supplementary  Table S3). Interestingly, several subfamilies included members that were predicted to localize to various cellular compartments, suggesting that gene subfamily members might have experienced subfunctionalization or neofunctionalization. In contrast, members belonging to 11 PK subfamilies were predicted to have the same cellular localization. One striking finding was that 312 proteins were predicted as nucleus-localized kinases (Supplementary Table S3). This number is surprisingly high because, unlike animals, plant systems contain a very limited number of exclusively nucleus-localized PKs (Dahan et al., 2010). Therefore, we tested the subcellular localization of six predicted nucleuslocalized PKs in planta using onion epidermal cells. The coding sequences of these genes were fused to the N terminus of an enhanced YFP (eYFP) reporter gene and delivered into onion epidermal cells using biolistic bombardment. The YFP signals accumulated exclusively in the nucleus, confirming the predicted subcellular localization of these proteins (Fig. 5).

Expression patterns of PK families
We analysed gene expression profiles of individual PK genes using publicly available Affymetrix microarray datasets from the GPL4592 platform, which includes 53 experiments with 3785 samples. Gene expression data were available only for 817 PK genes that have probes on the Affymetrix GPL4592 platform. The normalized expression values of these genes were examined in 14 organs and tissues comprising roots, root tips, root apex, root hairs, nodules, cotyledons, leaves, shoots, pollen grains, axillary meristem, embryos, seeds, seed coats, and pods ( Supplementary Fig. S5 and Supplementary  Table S10 at JXB online), as well as under various biotic and abiotic stress treatments ( Supplementary Fig. S6 and Supplementary Table S11 at JXB online). In general, we found that members belonging to the same PK subfamilies showed similar expression patterns. Therefore, we decided to investigate the expression patterns of PK subfamilies rather than individual genes. Because gene expression data for many PKs are not available, we included in our analysis only subfamilies in which gene expression data were available for at least one-third of the subfamily members. As a result, 82 gene subfamilies were included in our analysis out of 122 total subfamilies. The average expression values of these 82 subfamilies were first examined in the 14 organs and tissues indicated above (Fig. 6A). We observed distinct expression patterns of the kinase subfamilies. One pattern was the high expression levels of several subfamilies in almost all tissues, including CK1_CK1-PI, CMGC_RCK, CMGC_GSK, RLK-Pelle_RLCK-VIII, CAMK_CAMKL-LKB, and STE_ STE20-PI, suggesting a key role of these subfamilies in fundamental cellular processes. In contrast, certain subfamilies exhibited tissue-specific expression patterns. For example, in root, root tips, and root hairs, RLK-Pelle_RLCK-VIII, NAK, CMGC_CDK-Pl, CAMK_CAMKL-LKB, CMGC_ RCK, and CMGC_GSK gene subfamilies exhibited the highest expression levels among all 82 families included in the analysis, suggesting a role for these subfamilies in root development and functions. Similarly, the CMGC_CDK-Pl subfamily showed the highest expression levels in embryos, seeds, and pods, suggesting a role for genes belonging to this subfamily in seed development. One remarkable finding was that many kinase subfamilies exhibited high expression in pollen. Among these subfamilies, RLK-Pelle_LRR-VI-2, RLK-Pelle_RLCK-VIIa-1, TKL-Pl-4, AGC-Pl, RLK-Pelle_ RLCK-VI, and CMGC_CDK-CRK7-CDK9 were probably pollen specific because their expression was not detected in the other tissues tested. Similarly, Group-Pl-4, RLK-Pelle_ Singleton, and CMGC_DYRK-YAK seemed to be specific to leaf, seed coat, and embryo, respectively, as a high abundance of transcripts belonging to these subfamilies was detected only in these tissues. Notably, the CAMK_CAMK1-DCAMKL, RLK-Pelle_LRR-Xb-1, PEK_GCN2, RLK-Pelle_Extensin, CMGC_P1-The, RLK-Pelle_LRR-I-1, CMGC_CDK-CDK7, STE_STE20-YSK, NEK, STE_STE11, and AGC_ NDR subfamilies showed the lowest expression levels across almost all tissues tested.
When we analysed the expression of the 82 kinase subfamilies under various biotic and abiotic stress treatments, these subfamilies showed apparent responses to specific biotic and/or abiotic stress treatments with up-or downregulation (Fig. 6B). For example, the AGC_NDR subfamily was induced by heat stress, drought,and soybean rust. CMGC_ CDK-CDK7 showed upregulation in response to drought and salt treatments and downregulation in response to heat stress. The CAMK_CAMK1-DCAMKL subfamily was induced in response to several abiotic/biotic stresses. Similarly, genes of the RLK-Pelle_Extensin subfamily exhibited upregulation in response to lipo-chitooligosaccharides (LCOs) and iron deficiency and downregulation in response to submergence. NEK and RLK-Pelle_LRR-Xb-1 showed upregulation by LCO treatment. PEK_GCN2 and CMGC_Pl-The were highly induced by heat and drought. Certain subfamilies responded specifically to particular stress conditions. For example, TLK, RLK-Pelle_LRR-VII-1, RLK-Pelle_RLCK-XI, and Group-PI-4 were predominantly upregulated in response to LCOs, iron deficiency, heat, and submergence treatments, respectively (Fig. 6B). Similarly, RLK-Pelle_LRR-XI-2 and RLK-Pelle_RLCK-II were specifically upregulated in response to infection by Phytophthora sojae and soybean rust, respectively. Consistent with the expected redundant and overlapping functions of PKs, we found that various subfamilies, specifically those belonging to the same groups, responded similarly to different stress conditions, pointing to common signalling pathways controlled by these subfamilies.

A global gene co-expression network of soybean PK families
The gene expression patterns above pointed to potential coexpression relationships between PK subfamilies. Thus, we constructed the co-expression network of the soybean PK subfamilies using the microarray datasets described above. Construction of the network was performed by determining all pairwise gene expression correlations between PK subfamilies using PCC values and a P value of <0.01. Two independent co-expression networks were generated using tissue and stress response gene expression data. The tissue co-expression network contained 48 nodes (kinase subfamilies) and 275 edges (co-expression events) (Fig. 7A) separated into one main and six subnetworks. The frequency distribution of co-expression events for each subfamily revealed both low and broad interconnected relationships. Out of the 27 subfamilies included in the main network, 25 had at least 10 edges and were thus considered hubs. The other six subnetworks contained between two and six subfamilies with the number of edges ranging between one and four. The narrow interconnectivity of these subfamilies implied their association with specific developmental processes.

Discussion
Our genome-wide bioinformatic analysis identified 2166 putative soybean PKs, which represents 4.67 % of all proteincoding genes in the genome. This proportion is higher than that of Arabidopsis (3.7%), maize (3.8%), and rice (4.1%) (Lehti- Shiu and Shiu, 2012;Wei et al., 2014). The size of the soybean PK superfamily is also apparently among the largest in plants. For example, There are 1008 PKs in Arabidopsis, 911 members in M. truncatula, 1417 members in rice, and 1241 members in maize (Lehti- Shiu and Shiu, 2012;Wei et al., 2014). The large size of the soybean PK superfamily could partly be attributed to the large genome size and two WGD events that occurred 59 and 13 Mya. In addition, high retention of duplicate genes could be another contributing factor. In fact, about 75% of the 46 430 protein-coding genes predicted in soybean were found in multiple copies and are generally arranged in duplicated blocks (Schmutz et al., 2010). Consistent with these data, our analysis revealed that segmental duplication events have played the key role in expanding the soybean PK repertoire and account for the generation of 71.42% of the soybean kinome. Tandem duplications also might have contributed to the expansion of the soybean kinome but to a much less extent, and their role appears to be restricted to specific subfamilies. It should be noted that subfunctionalization of kinase duplicates and insufficient time for pseudogenization or degeneration may also have contributed to the large size of the soybean PK superfamily. In accordance with this suggestion we found that K a /K s values of the syntenically duplicated PKs contributed by the WGD events occurring 59 and 13 Mya were <1, an indication of purifying selection rather than neutral selection for which the K a /K s value is expected to be equal to 1 (Li et al., 1981;Shiu et al., 2004).
The large numbers of soybean PKs relative to other plant species can be explained mainly by the large numbers of members of a few PK groups. The RLK group in soybean is remarkably large, in that it contains 1418 genes. Similarly, over half of the 2500 PKs in Eucalyptus grandis are RLK members (Lehti- Shiu and Shiu, 2012). Previous studies have indicated that this group has been subjected to extensive expansion over the course of plant evolution (Shiu et al., 2004;Gao and Xue, 2012;Lehti-Shiu and Shiu, 2012;Wei et al., 2014). The expansion of the RLK/Pelle group appears to be associated with specific families within this group, including DLSV, LRR, and RLCK, which contain far more members in soybean than in other plant species. LRR is the largest family within the RLK-Pelle group in soybean with 479 members, which are grouped into 23 subfamilies. Functional characterization of several members of these gene families in a number of plant species has revealed their implication in a wide range of biological processes including pathogen recognition, organ Fig. 7. Co-expression networks of soybean PK subfamilies. Two independent co-expression networks were generated using tissue (A) and stress response (B) gene expression data. Nodes indicate subfamilies and edges indicate significant co-expression between subfamilies. Green nodes with blue-coloured edges indicate co-expression events between subfamilies that are common in tissue and stress networks. (This figure is available in colour at JXB online.) patterns, steroid signalling, and stem cell control (reviewed by Gish and Clark, 2011).
The dramatic expansion of the LRR subfamilies in soybean is very remarkable and may point to a specific adaptation to sense and respond to variable developmental signals and rapidly evolving pathogen secretions. One obvious exception is the LRR-XII-1 subfamily, which contains 3.8 times more members in rice than in soybean (96 in rice versus 25 in soybean) (Fig. 2). The LRR-XII subfamily is also overrepresented in tomato (Sakamoto et al., 2012), and the expansion of this subfamily in cultivated plants such as tomato and rice could be the result of intense selection for disease-resistant varieties. Members of this subfamily have a well-known role in disease resistance, including EF-Tu RECEPTOR (EFR) and FLAGELLIN-SENSITIVE 2 (FLS2) in Arabidopsis, and Xa21 in rice (Shiu et al., 2004;Sakamoto et al., 2012). RLCK is the second largest subfamily of the RLK/Pelle group and contains 267 members belonging to 16 subfamilies.
Unlike most of the RLK/Pelle families, which are membrane-located, RLCKs are localized to the cytoplasm because of the absence of extracellular and transmembrane domains (Jurca et al., 2008). A limited number of RLCKs have been functionally studied, but some well-characterized RLCK genes indicate their implication in disease resistance, abiotic stress response, and hormone signalling (Tang et al., 1996;Shen et al., 2001;Swiderski and Innes, 2001;Muto et al., 2004;Yang et al., 2004). The DLSV subfamily also exhibits considerable differential expansion in soybean and is comprised of 238 members relative to only 89 members in Arabidopsis and 146 in rice. Initially, it was suggested that RLK/Pelle duplicated genes are those with defence/resistance-related functions, whereas those implicated in developmental control have been hardly duplicated (Shiu et al., 2004). It is widely known that pathogen recognition receptors are generally expressed at very low levels (Meyers et al., 2002) because high expression of these genes can lead to cell death and dwarf phenotypes (Chern et al., 2005;Kim and Wang, 2010), and hence low expression of RLK/Pelle duplicated genes is anticipated. Nevertheless, recent studies indicated that the RLK/Pelle gene family tends to be enriched in genes that are highly responsive to biotic and abiotic stresses and during various vegetative and reproductive developmental stages (Vij et al., 2008;Lehti-Shiu et al., 2009;Gao and Xue, 2012;Wei et al., 2014). Similarly, our gene expression analysis indicated that many subfamilies of DLSV, LRR, and RLCK are differentially expressed in a number of soybean tissues and organs and in response to different biotic and abiotic stress treatments. Taken together, these data argue against the hypothesis that the majority of RLK/Pelle duplicated genes are implicated in defence/resistance functions but rather suggest that large-scale expansion of this group, including DLSV, LRR, and RLCK families, in soybean may represent a specific adaptation to detect and respond to both developmental signals and environmental stimuli by changing cellular metabolism, gene expression, or growth and development. In support with this suggestion, several RLK-Pelle subfamilies displayed the highest connectivity in both tissue and stress coexpression networks, suggesting a role in synchronizing the interactions between defence signalling and plant growth and developmental pathways.
It has been suggested that lineage-specific expansion of biotic stress-responsive RLKs has been generated through tandem duplications because tandem duplications can initiate quick variations in gene content relative to other types of duplication (Hanada et al., 2008;Lehti-Shiu et al., 2009). In this context, it may be important to mention that all PK genes that are arranged as tandem repeats mainly belong to the RLK group. Thus, tandem duplications seem to have contributed to the expansion of the RLK group in soybean even though these tandemly arranged genes constitute only 16% of the RLK members. In Arabidopsis, both tandem and segmental duplications have contributed equally to the large number of RLK members, while in rice only tandem duplications have been assumed to play a major role in RLK expansion (Shiu et al., 2004). When the soybean tandemly repeated PK genes were grouped by associated biological processes using GO categorization, we found that the most abundant group corresponded to defence responses. These data are consistent with previous studies that RLKs with a role in defence tend to be in tandem repeats (Shiu et al., 2004;Lehti-Shiu et al., 2009).
Other non-RLK/Pelle subfamilies appear to be expanded through lineage-specific and species-specific gene duplication to cope with specific environmental stresses. In this regard, we identified 31 subfamilies in which expansion events had been contributed only by the 13-Mya WGD (Supplementary  Table S6). Among these subfamilies, the extensive expansion of several TKLs was very apparent. The TKL family in soybean has 115 members compared with 55 and 57 members in rice and Arabidopsis, respectively. While none of the soybean TKL genes has been functionally characterized, our gene expression data point to a functional role of these family members in the abiotic stress response, specifically aluminium toxicity and iron deficiency. Similarly, various subfamilies of the well-expanded STE family exhibited more restricted, abiotic stress-specific patterns of expression. Therefore, the extensive expansion of specific PK families in soybean could be the result of strong selection pressure for adaptation to a wide range of environments. Other widely expanded gene subfamilies such as AGC_RSK-2, CAMK_CAMKL-CHK1, and CAMK_CDPK were expressed in at least one tissue type and induced by one or more stress treatments.
The expression patterns of the extensively expanded families suggested that the kinome of soybean has been shaped by its developmental habits and physiological characteristics. Consistent with this hypothesis, we found that almost all of the soybean PK subfamilies were highly upregulated in response to LCOs. LCOs are rhizobial signalling molecules playing a fundamental role in the initiation of nitrogenfixing symbiosis in legumes. LCOs are perceived by several receptors in host roots and initiate a cascade of signalling events leading to the formation of root nodules (Oldroyd and Downie, 2008). While the pronounced response of soybean PKs to LCOs is in agreement with the well-documented effects of LCOs on plant development and morphogenesis as well as biotic and abiotic stress responses (Wang et al., 2012a), it provides the foundation to speculate that PK genes may have been co-opted to regulate a wide range of biological functions including symbiotic interactions. In addition to the RLK/Pelle family, the soybean kinome contains a significant percentage (7.29%) of CAMKs with 70 genes coding for calcium-dependent PKs (CDPK). The expansion of this family may reflect an adaptive evolution that allows soybean plants to perceive different calcium signals that mediate the plant response to environmental stress. On the other hand, several PK subfamilies have been subjected to limited expansion in soybean and remained low in copy number. These subfamilies probably represent ancient families in which new members have not been retained because they did not confer a selective advantage over the ancestral copy. Alternatively, these subfamilies may be involved in more basic and less environmentdependent cellular processes where the driving forces behind continuous expansion and evolution are very limited (Shiu et al., 2004).
Domain organization and intron/exon arrangement have frequently been used as a supporting indication for evolutionary relationships among genes and species. Domain organization analysis revealed that 74 PK genes contained two or three kinase domains. Because many PKs are known to form homo-and heterodimers (Ding et al., 2009), the twoand three-kinase domain structure of soybean PKs may be functionally equivalent to dimerized PKs where cooperative activity of two or three kinase domains may be required for specific substrates. Functional studies of truncated variants of these PKs will provide valuable insights into their mechanism of regulation and substrate specificity. The gene structure of the PK subfamilies clearly elucidated their differential propensity to lose or acquire introns. We found that the proportion of intronless PK genes (12.1%, 262 genes) was much less than that of total intronless genes in the soybean genome (21.7%), suggesting that intron gain has significantly contributed to the structural divergence of the soybean PK superfamily. It seems likely that the structural evolution of soybean PKs is conditioned by similar mechanisms controlling intron gain and loss in both monocot and eudicot species because the patterns of intron number distribution are very similar ( Supplementary Fig. S3). We observed that intron/exon position and arrangement patterns were tightly conserved in a number of subfamilies, specifically those whose expansion had been contributed by the WGD occurring 13 Mya, consistent with their close evolutionary relationship. In contrast, other members of subfamilies displayed a high structural diversity but a strong conservation of the kinase domain. This may reflect gene family expansion from old paralogues or multiple ancestral origins of the gene family. Finally, the wide structure differences of the PK supergene family associated with functional divergence may have contributed significantly to the high retention of the duplicate genes in soybean.
In conclusion, our analyses of soybean PKs revealed their wide expansion, and extensive divergence in gene structure, tissue and stress gene expression responses, and subcellular localizations. Functional characterization of subfamily members using high-throughput reverse genetics approaches will facilitate deciphering the multifunctional roles of the soybean kinome in plant development and physiology.

Supplementary data
Supplementary data are available at JXB online.
Supplementary  Supplementary Fig. S1. Phylogenetic classification of soybean protein kinases using the neighbour-joining method. Supplementary Fig. S2. Phylogenetic classification of soybean protein kinases using the maximum-likelihood method. Supplementary Fig. S3. Patterns distribution of intron numbers of protein kinase gene family in soybean and other plant species. Supplementary Fig. S4. GO-term annotation of 229 tandemly arrayed soybean protein kinases. Supplementary Fig. S5. Heat map of the expression patterns of individual soybean PK genes in different tissues. Supplementary Fig. S6. Heat map of the expression patterns of individual soybean PK genes under various biotic and abiotic stress conditions.