Evolution and functional diversification of R2R3-MYB transcription factors in plants

Abstract R2R3-MYB genes (R2R3-MYBs) form one of the largest transcription factor gene families in the plant kingdom, with substantial structural and functional diversity. However, the evolutionary processes leading to this amazing functional diversity have not yet been clearly established. Recently developed genomic and classical molecular technologies have provided detailed insights into the evolutionary relationships and functions of plant R2R3-MYBs. Here, we review recent genome-level and functional analyses of plant R2R3-MYBs, with an emphasis on their evolution and functional diversification. In land plants, this gene family underwent a large expansion by whole genome duplications and small-scale duplications. Along with this population explosion, a series of functionally conserved or lineage-specific subfamilies/groups arose with roles in three major plant-specific biological processes: development and cell differentiation, specialized metabolism, and biotic and abiotic stresses. The rapid expansion and functional diversification of plant R2R3-MYBs are highly consistent with the increasing complexity of angiosperms. In particular, recently derived R2R3-MYBs with three highly homologous intron patterns (a, b, and c) are disproportionately related to specialized metabolism and have become the predominant subfamilies in land plant genomes. The evolution of plant R2R3-MYBs is an active area of research, and further studies are expected to improve our understanding of the evolution and functional diversification of this gene family.

A number of studies have shown that this gene family underwent a large expansion with functional diversification in land plants [6,9,10,18,19,21,49]. This may have contributed to the origin and diversity of this kind of plant, as it conferred on them the ability to adapt from aquatic to terrestrial environments during evolution [50]. To capture the most significant developments in this area, this review focuses on (a) the identification and classification of the R2R3-MYB gene family in plants, (b) the characterization of conserved and diverse functions in different species and biological processes, and (c) the use of comparative genomics to clarify the expansion and functional diversification of the R2R3-MYB family. Finally, we discuss new research directions to improve our understanding of R2R3-MYB evolution and functions.

Identification and classification of R2R3-MYBs
Since its initial identification in 1982 [1], the R2R3-MYB family has been an important area of research. In the past 20 years, an increasing number of sequenced plant genomes have enabled integrative genome-wide overviews of this gene family in the plant kingdom. To date, the R2R3-MYB gene family has been evaluated in about 74 plant species, ranging from aquatic plants (such as Mesostigma viride [51]) to angiosperms (e.g. A. thaliana [9], Z. mays [49], Populus trichocarpa [21], Gossypium spp. [23], and Solanum tuberosum [52]) ( Fig. 1 and Supplemental Table 1), and a pipeline for the automatic identification of MYBs has been created in order to make genome-or transcriptome-wide investigations more consistent [53]. These studies have revealed that R2R3-MYBs are widely distributed in the plant kingdom, with a sharp increase in number from aquatic to angiosperm plants, forming one of the largest and most diverse TF gene families (ranging from one to hundreds of members) in land plants, especially angiosperms (Fig. 1).
The R2R3-MYB gene family clearly underwent functional diversification to form different subfamilies/groups during evolution. Accordingly, there are many classification schemes for plant R2R3-MYBs, and the differences primarily ref lect the existence of lineageor species-specific "orphan" genes [54,55], differences in sampling coverage, and phylogenetic analysis methods [6]. On the whole, three landmark classification systems have been developed and improved. First, the canonical classification and nomenclature of this gene family were initially established in A. thaliana [9] and included 25 groups based on the MYB domain and the amino acid motifs in the C-terminal regions [5]. Thereafter, many studies have applied this system with varying degrees of modification [22,49,[56][57][58]. Notably, in A. thaliana [9], around 30 of the 126 (∼23.81%) R2R3-MYBs were not assigned to any subfamily/subgroup in this classification owing to the lack of different representative species and the limited gene number (generally only one per species), resulting in scattered studies with relatively small sample sizes. Second, we performed a systematic comparative analysis of 50 major eukaryotic lineages with a total of 1548 R2R3-MYBs [10]. As expected, we not only confirmed the 25 well-defined subfamilies of the R2R3-MYB gene family in A. thaliana [9] but also defined many new species-or lineage-specific subfamilies in major eukaryotic lineages that had previously been neglected [21]. Ultimately, we classified the gene family into as many as 73 subfamilies supported by highly conserved gene structures and motif compositions [10]. The validity of this classification was further supported by our recent study of Brassicaceae [18]. In addition, a more systematic classification with a high resolution based on 87 species and 4312 sequences has recently been released, covering the major lineages of Archaeplastida. In that study, ten clades were designated as land plant R2R3-MYB subfamilies by applying a progressive phylogenetic analysis strategy aimed at eliminating the imbalance in resolution and species sampling [6]. We speculated that there may be more subfamilies specific to individual species or lineages, as there are still many "orphan" genes in the phylogenetic analyses. Third, the genome sequences for Marchantia polymorpha [59] and M. viride [51] have enabled the development of a macroscopic classification regime, resulting in three clades of R2R3-MYBs that correspond to their biological functions. Each clade is correlated with different functions associated with essential developmental processes, from the basic life cycle to specialized metabolism and organismal complexity, corresponding to a very recent classification in 2020 [60]. In that study, genes encoding R2R3-MYBs from Rhodophytes, Glaucophytes, Chromista, Chlorophytes, Charophytes, and Embryophytes were divided into three major clades (I, II, and III) (named subgroups in the original paper). Subgroup I is the most ancient group and includes members from all plant lineages, whereas subgroup III has become predominant in land plants [60]. The former two classification schemes share overlaps and are based on domain conservation, whereas this final classification system provides a macroscopic perspective based on broad evolutionary patterns.
Notably, a uniform nomenclature within the classification of this gene family in plants is still lacking. Currently, there is almost no dispute at the family (R2R3-MYB) and superfamily (MYB) levels, except for a few studies that refer to R2R3-MYBs as a subfamily or subgroup [50,61]. However, the nomenclature after R2R3-MYB is inconsistent, e.g. ∼7% of studies refer to groups [62,63], ∼31% refer to subfamilies [10,64], and ∼54% refer to subgroups [23,65], with clade or cluster used occasionally [66,67]. Subfamilies and subgroups are the most common subsequent appellations after family. Given that subfamily is used after the family level in many other TF gene Phylogenetic relationships among these species were obtained from the NCBI taxonomy database (https://www.ncbi.nlm.nih.gov/taxonomy). In total, 10 270 sequences of 75 plant species from Rhodophyta to angiosperms have been reported. Numbers in round brackets indicate the number of R2R3-MYB family members in each species. Detailed information, including corresponding references for each species, is provided in Supplemental Table 1. The end date for these statistics is November 2021.

Structural conservation of R2R3-MYBs
The R2R3-MYB protein consists of two major functional parts: a DNA-binding domain (MYB domain) located at the N terminus and a regulatory region (non-MYB region) located at the C terminus. The MYB domain is a signature, highly conserved feature in the gene family, whereas the non-MYB regions have diverged among plant species (Fig. 2a). In addition, the intron patterns in the MYB domains of the R2R3-MYBs are highly conserved within each subfamily across land plants [6,9,10,22,49]. In most cases, the MYB domain has one or two conserved intron insertion sites, with rare intron-less and multiintron genes [6,10]. Although the number of intron patterns differs slightly among taxonomic groups, there is clearly high structural conservation of MYB domains across land plants. In our previous study, we identified 12 highly conserved intron patterns across the plant kingdom [10] that may have arisen early in the transition from aquatic to terrestrial plants based on conserved intron numbers, insertion positions, and intron phases (Supplemental Fig. 1). The majority (∼70%) of the tested R2R3-MYBs share three types (patterns a, b, and c) of the 12 intron patterns (patterns a to l) [10], suggesting biased expansion during evolution. A recent study has shown that some charophyte R2R3-MYBs exhibit intron-exon structures identical to those of their corresponding land homologs in the same subfamily, whereas others show clear differences, indicating a difference among ancestral genes [6].
R2R3-MYB proteins can act as transcriptional activators as well as repressors [42]. Most are activators The DNA-binding domain, also called the MYB region, contains conserved R2 and R3 repeats. In most activators and some repressors [43], there is a conserved bHLH-interacting motif ([D/E] Lx2[R/K]x3Lx6Lx3R) within the first two helixes of the R3 domain that enables interactions with bHLH proteins [48] to form the MYB-bHLH-WDR transcriptional complex. The protein sequences in the C-terminal region often show divergence, with one or a few typical repressor motif(s) such as the EAR motif (ERF-associated amphiphilic repression), SID motif (Sensitive to ABA and Drought 2 protein interact motif), and TLLLFR. H, Helix; T, turn; W, tryptophan; X, amino acid. * indicates that the motif is not included in all R2R3-MYBs, but only in some. b. Number of species for which the functions of one or more R2R3-MYB genes were identified as of 2020. In the last decade (2011-2020), there was explosive growth in the breadth of taxa for which R2R3-MYB data were available. It is noteworthy that most of the increase in species number is ascribed to the vast number of orthologous MYB genes characterized in horticultural plants that have similar functions, especially in phenylpropanoid biosynthesis. c. Number of Arabidopsis R2R3-MYBs for which biological role(s) have been identified. In Arabidopsis, all the increase is due to paralogs with new functions. When a gene had different functions published in different years, we sorted it into the year of the first publication. The dark grey bar represents the number of R2R3-MYBs with unknown (i.e. not experimentally verified) functions. that bind to cis-elements in gene promoters to recruit transcriptional machinery and stimulate gene expression [43]. For instance, many R2R3-MYBs have a bHLHinteracting motif "(D/E)Lx2(R/K)x3Lx6Lx3R" in the R3 MYB domain (Fig. 2a), enabling physical interactions with the bHLH protein to form the MYB-bHLH-WDR (MBW) transcriptional complex [48]. Notably, although the conservation of the non-MYB region is much lower than that of the MYB domain, the C-terminal regions of closely related members within each subfamily or close subfamilies in an evolutionary clade generally share short conserved protein motifs (some called short linear motifs, SLiMs), such as EAR (ethylene-responsive element binding, "pdLNL(D/E)L" or "DLNxxP" in S4) [71], "(R/K)PRPRx(F/L)" in S6, SID (sensitive to ABA and Drought 2 protein interact motif), and TLLLFR [72] (Fig. 2a), indicating their common origin, functional similarities, and structural conservation [9]. Recently, Millard et al. [54] reported that the interactions between MYB TFs from S12 and bHLH were mediated by one of the conserved motifs ([L/F]LN[K/R]VA) located in the center of the non-MYB region. Two novel SLiMs with roles in protein-protein interaction were uncovered by Rodrigues et al. [73] in the disordered C-terminal region. For more detailed information, see the recent reviews by LaFountain and Yuan [47] and Chen et al. [35]. MYB activators and repressors often operate in a hierarchy [74]. Furthermore, many subfamily-and/or branch-specific motifs have been detected within R2R3-MYBs. For instance, based on a large number of species, up to 102 highly conserved motifs have been identified in non-MYB regions [10]. Interestingly, non-MYB regions across the entire A. thaliana R2R3-MYB gene family contain extensive intrinsically disordered regions (IDRs) with special post-translational modification sites, binding sites for physical interactions, and activation domains. These features may explain the substantial functional diversity and the structure-function relationships in plant R2R3-MYBs [54]. However, owing to the high divergence in non-MYB regions, a systematic analysis of motif(s) and/or IDRs covering broad and representative plant lineages is still lacking. Moreover, further experimental evidence is needed, including for IDRs.

Functional conservation and diversification of R2R3-MYBs in land plants
The biological functions of R2R3-MYBs have been verified in an increasing number of non-model crops with high economic value, especially in the last decade. More than 500 R2R3-MYBs have been functionally characterized in around 130 plant species (Fig. 2b and Supplemental Table 2), and their roles in many biological processes have been reported [5, 27-29, 32-34, 38, 41, 75-77]. Notably, research progress related to R2R3-MYB gene function is based mainly on studies of the model plant A. thaliana; approximately 80% of its 126 R2R3-MYBs have been experimentally characterized (Fig. 2c). Overall, the functions of R2R3-MYBs can be classified into three major processes: development and cell differentiation, specialized metabolism (especially the phenylpropanoid biosynthesis pathway), and stress responses (biotic and abiotic stresses).
In summary, R2R3-MYBs are a class of important TFs necessary for the basic life cycle of plants. Therefore, there is a need for more extensive field studies of R2R3-MYBs in the future to determine their functions in basic developmental processes.
In summary, R2R3-MYBs could improve plant responses to multiple abiotic and biotic stresses in agriculture and could be targets of advanced technologies, such as CRISPR technology, to improve crop resistance to environmental stimuli in order to maintain the food supply.

Key roles of R2R3-MYBs in metabolite biosynthesis
The principal roles of plant R2R3-MYBs (∼70%) reported to date are in the regulation of plant specialized metabolism, including the benzenoid, phenylpropanoid, terpenoid, and glucosinolate (GSL) pathways (Supplemental Table 2). Nearly half of the functionally characterized A. thaliana R2R3-MYBs are related to core and specialized metabolism reactions ( Fig. 3 and Supplemental Table 2) [5,9]. Notably, the majority of these R2R3-MYBs are involved as activators or repressors in the transcriptional regulation of the phenylpropanoid biosynthesis pathway, which gives rise to a class of phenylpropanoid-derived compounds such as f lavonoids (proanthocyanidins, anthocyanins, f lavones, f lavonols, isoflavonoids, and phlobaphenes), lignins, and other general phenylpropanoids (Fig. 3a) [43,77,189].
Flavonoids are a vast group of plant specialized metabolites with a common diphenylpropane (C6-C3-C6) backbone [45] that have important functions as pigments and/or light protectants. Typical subfamilies in the R2R3-MYB gene family that are involved in the regulation of flavonoid biosynthesis are S6, S5, S4, S7, S44, and S79 (Supplemental Table 2). In addition to their roles in A. thaliana, there is substantial emerging evidence for the roles of R2R3-MYBs in the transcriptional regulation of flavonoid metabolism in various plants, including important food crops (41), horticultural crops (fruits, 77; ornamentals, 61; vegetables, 37), economically valuable woody crops (27), and edible and medicinal plants (30). With respect to edible crops, including fruits and vegetables, research has mainly focused on substances such as anthocyanins and f lavonols that determine their quality and nutritional properties. However, f lower colors are key traits for f lower crops [42]. Potato StMYB44 has recently been identified as a repressor of anthocyanin biosynthesis in tuber f lesh under high temperatures [192], and whole-genome resequencing-based QTL-seq indicated that AhTc1 controls the purple testa color in peanut [193]. In addition, the overexpression of MdMYB24L resulted in higher anthocyanin contents in transgenic 'Orin' apple calli [113]. Similar studies have examined R2R3-MYBs in other fruit trees, including PaMYB10 in apricot [194], AcMYB123 in kiwifruit [32], LcMYB5 in litchi [195], and FcMYB123 in fig [112]. In M. polymorpha, MpMYB14 and MpMYB02 regulate anthocyanin accumulation [196]. Moreover, different types of flavonoid regulation are broadly observed. For instance, regulatory roles have been reported for PhMYB15 and PpMYBF1 in flavonol biosynthesis in peach fruit [197], FhMYB21L2 in flavonol biosynthesis in Freesia [109], CsPH4 in proanthocyanidin biosynthesis in citrus [198], SlMYB72 in carotenoid biosynthesis in tomato [199], CaMYB108 in capsaicinoid biosynthesis in pepper [200], AgMYB1 in apigenin biosynthesis in celery [137], HaMYB111 in floral ultraviolet patterning in sunflower [201], and SmMYB2 in phenolic acid biosynthesis in Salvia miltiorrhiza [202]. Apart from model flowers such as Petunia and Antirrhinum, a number of flowering bulbs have also been studied recently, including Freesia [203,204], Cattleya [205], and Narcissus [206].
Lignin is another major end-product of the phenylpropanoid pathway and a key component of secondary cell walls (SCWs) in wood; it is polymerized from phenylpropanoid-derived monolignols [207]. The role of R2R3-MYBs in the lignin biosynthesis pathway is a major focus of research. As wood is widely used for pulp, papermaking, and biofuels and wood quality is largely determined by lignin synthesis, most studies of the regulatory effects of R2R3-MYBs on lignin synthesis are based on woody plants. For example, the overexpression of PtoMYB055 upregulates the expression of lignin biosynthetic genes in transgenic poplar, resulting in an increase in the thickness of the SCW [208]. By contrast, PtMYB189 acts as a repressor to regulate SCW biosynthesis in poplar [209]. Likewise, the positive or negative regulation of the SCW by R2R3-MYBs has also been reported in Z. mays [210], G. hirsutum [211], O. sativa [212], Pyrus × bretschneideri [213], and Eriobotrya japonica [214]; however, additional studies are still needed.
In addition to phenylpropanoid metabolism, which has been investigated extensively in plants, other representative specialized metabolic compounds have been studied, such as those produced by the GSL and terpenoid pathways (Fig. 3b-c). GSLs have wellestablished anticarcinogenic and antioxidative effects in humans [215]. Interestingly, consistent with the distribution of GSLs, members of S12 are widely distributed in species from the family Brassicaceae and are predominantly related to GSL biosynthesis in this family (Supplemental Table 2). It has been reported that indolic GSL synthesis is regulated mainly by AtMYB34, AtMYB51, and AtMYB122, whereas aliphatic GSL synthesis is regulated by AtMYB28, AtMYB29, and AtMYB76 [191]. For example, silencing BjMYB28 homologs reduces the seed GSL content in Brassica juncea [216], BoMYB29 increases the methylsulphinyl GSL contentin Brassica oleracea var. acephala [217], and BnMYB28.1 was identified by a QTL analysis using near-isogenic lines [218]. These studies Typical groups of f lavonoids are shown in the yellow box, several representative groups for specific branches of f lavonoid biosynthesis are shown in the deep yellow box, and typical groups for lignins are shown in the light blue box. More information regarding the phenylpropanoid pathway has been reviewed recently [190]. b. R2R3-MYB regulation of glucosinolate biosynthesis in S12 has primarily been studied in Brassicaceae. The synthesis pathway involves three phases, with two types of starting amino acids, that are regulated by different R2R3-MYBs. A fine review with more details has recently been published by Mitreiter and Gigolashvili [191]. c. The best-studied terpenoids regulated by R2R3-MYBs are f loral volatiles.
provide good examples of the adaptive evolution of the R2R3-MYB gene family in land plants, especially in angiosperm diversification. Some plants can produce and emit fragrant molecules such as benzenoids and terpenoids, especially in f lowers, and these processes are strictly regulated at the transcriptional level by R2R3-MYBs. For instance, two R2R3-MYBs, FhMYB21L1 and FhMYB21L2, are expressed synchronously with FhTPS1 and can activate its expression to affect monoterpene synthase synthesis when overexpressed [110]. In addition, BpMYB21 and BpMYB61 in B. platyphylla are involved in triterpenoid synthesis [219]. R2R3-MYBs regulate terpenoid compounds in several medicinal plants, including S. miltiorrhiza [220,221], Panax ginseng [222], and Mentha spicata [223].
The research summarized above clearly demonstrates that plant R2R3-MYBs, which are related to plant specialized metabolism, are mainly involved in the phenylpropanoid biosynthesis pathway, and each subfamily member has key roles in distinct processes, such as proanthocyanidin or lignin biosynthesis. Accordingly, this gene family has probably contributed to functional evolution in land plants.

The origin and expansion of R2R3-MYBs in land plants are associated with functionality
Gene duplication is a prominent event in plant genome evolution and contributes to the establishment of multi-gene families. Genes can be duplicated by vari-ous mechanisms, such as whole genome duplication (WGD), chromosomal segmental duplication, tandem duplication, and retrotransposition. Based on systematic analyses of many land plants, we previously found that WGD/polyploidy events and small-scale duplications (e.g. segmental duplications) accounted for the large expansion of the R2R3-MYB gene family [18,22,209]. Similar results have been obtained in studies of broad plant lineages [21,56,215,224]. It is well known that WGD events are a driving force in angiosperm diversification. The tremendous expansion of R2R3-MYBs in land plants is consistent with WGD in angiosperms, indicating an important role for the expansion of R2R3-MYBs in the increased complexity of angiosperms. Small-scale duplications (e.g. tandem/segmental duplications) have also contributed to the rapid expansion and large size of this gene family in land plants, as well as the evolution of novel gene functions [10,18,21,56,224,225].
R2R3-MYBs underwent a rapid expansion over the course of plant evolution (i.e. from algae to land plants), resulting in a gradual increase in number together with an increase in organismal complexity. Accordingly, the number of genes in this family in land plants is large, showing a huge expansion after the divergence of angiosperms from other vascular plants [10,60]. For example, only eight R2R3-MYBs were detected in the single-celled chlorophyte Chlamydomonas reinhardtii, whereas up to 429 R2R3-MYBs were observed in the allotetraploid Brassica napus [18] (Fig. 1). Based on an analyses of 50 eukaryotes [10], the R2R3-MYB gene family was classified into 73 subfamilies, most of which were newly defined (S26-S73), with the exception of the first 25 subfamilies from Arabidopsis [9]. Based on the distribution of subfamily members across the 27 plant genomes investigated in the study [10], it was speculated that a few subfamilies (S18, S21, S22, S26, and S27) were established soon after the origin of land plants and may be the ancestors of land plant R2R3-MYBs, whereas the major subfamilies formed a monophyletic clade in land plants [10]. In summary, plant R2R3-MYBs experienced three major expansion events: one early in the origin of land plants from Chlorophyta, one after the divergence of spermatophytes from vascular plants, and one in the common ancestor of angiosperms before the divergence of monocots and eudicots, forming the majority of R2R3-MYB subfamilies [10,60]. Notably, members of this gene family exhibit an uneven phylogenetic distribution; the expansion was significantly biased toward subfamilies with three highly homologous intron patterns (a, b, and c) during evolution, resulting in an enormous expansion in the number of R2R3-MYBs in the plant lineage [10,18] (Fig. 4a and Supplemental Fig. 1). For instance, in the 12 land plants investigated in our previous study 10 , 2R-MYBs with patterns a-c generally accounted for ∼66-81% of this gene family in 10 angiosperms, including Z. mays (66%), Arabidopsis (73%), and Vitis vinifera (81%). A similar situation (73%) was found in B. napus [18]. By contrast, the corresponding percentages were 59% in P. patens and 20% in S. moellendorffii [10] (Fig. 4a). Accordingly, up to 43 subfamilies have intron pattern a, and 12 and 19 subfamilies have intron patterns a and b, respectively, and have generally been integrated into the subfamilies with pattern a 10 ( Fig. 4b and Supplemental Fig. 2). Novel subfamilies have commonly been derived from these three categories [10,18]. By contrast, the numbers of genes with the last nine intron patterns (d-l) are relatively conserved in land plants [10,18] (Fig. 4a). Relatively few genes have the last nine intron patterns (dl), which generally account for less than 30% of this gene family in most angiosperms [10]. With the exception of patterns d and j, which are shared by 7 and 2 subfamilies, respectively, each of the last 7 patterns are shared by only one subfamily (e.g. pattern e in S21 and pattern f in S19) 10 ( Fig. 4b and Supplemental Fig. 2). Thus, the number of genes with patterns d-l is relatively conserved in land plants [10,18].
As discussed above, plant R2R3-MYBs are crucial for the regulation of many plant-specific processes related to metabolism (22 subfamilies), development (16 subfamilies), and biotic and abiotic stress processes (14 subfamilies) ( Fig. 4b and Supplemental Fig. 2). Notably, the expansion of this gene family is closely accompanied by increases in the functional diversity of R2R3-MYBs in the plant kingdom. Interestingly, older subfamilies are generally involved in development and/or stress-related processes, whereas the majority of new/derived subfamilies in angiosperms are related to metabolism ( Fig. 4b and Supplemental Fig. 2), suggesting a bias in functional diversification toward specialized metabolism. In particular, the evolution of R2R3-MYBs in plants is related to specific expansions giving rise to species-or lineage-specific subfamilies [224]. For example, 5 of 43 B. napus subfamilies are Brassicaceae specific [18]. Several of the 43 pineapple subfamilies, such as C2, C8, and C22, are likely to be species or lineage specific [64]. Similarly, three dicot-specific and six grass-specific subfamilies have been observed based on a comparative genomic analysis of R2R3-MYBs in Arabidopsis, poplar, rice, maize, and switchgrass [226]. Although the functions of most species-or lineagespecific subfamilies are unknown, the distributions suggest that they have lineage-specific functions. For instance, members of the Brassicaceae-specific S12 subfamily (e.g. AtMYB28/HAG1, AtMYB29/HAG3, AtMYB34, AtMYB51/HIG1, AtMYB76/HAG2, and AtMYB122 in A. thaliana) are regulators of Brassicaceae-specific GSL biosynthesis 5 (Supplemental Table 2 and Supplemental Fig. 2). Three expanded subfamilies in Eucalyptus grandis, including a significantly higher number in woody perennial species (E. grandis, V. vinifera, and P. trichocarpa), and five subfamilies preferentially found in woody plants are potentially involved in cambium-derived woody growth [224]. Several sugar beet MYBs with an atypical amino acid composition in the R3 domain are species-specific regulators of the betalain red pigment pathway [227,228]. Together, these results indicate an obvious expansion and functional trend toward specialized metabolism in this gene family during angiosperm evolution.

Concluding remarks and prospects
R2R3-MYBs are ubiquitous in eukaryotes and are not plant specific, but the numbers of this type of MYB gene are generally higher in plants (especially in angiosperms) than in other eukaryotes [10], and they constitute one of the largest TF gene families in plant genomes. As the number of sequenced plant genomes has increased, the plant R2R3-MYB gene family has been systematically identified and analyzed in about 74 species at a genome-wide level (Fig. 1). These data have substantially improved our understanding of R2R3-MYB distribution, origin, classification, expansion, and evolutionary mechanisms in plants.
As sessile organisms, plants have had to adapt to constantly changing environments during evolution. Through time, plants have differentiated into highly complex organisms and have developed a large number of specialized metabolites with beneficial functions for survival and adaptation. R2R3-MYBs have undergone rapid expansion during plant evolution via WGD and small-scale duplications. These drastic expansion/ duplication events have played a key role in generating diversity, producing many conserved subfamilies and also lineage-specific subfamilies with specific biological functions (e.g. S12 in GSL biosynthesis), especially those  Table 2), as well as representatives of 73 subfamilies from our previous results [10]. The tree was rooted using S29 (the CDC5-like protein) as the outgroup. The scale bar represents 2 substitutions per site. Detailed information on the ML tree is provided in Supplemental Fig. 2. Detailed information on the sequences used for phylogenetic analyses is provided in Supplemental Table 3. The subfamily classification and nomenclature of the 598 proteins were based on those of Arabidopsis [9] and our previous study [10]. The new subfamilies, with the exception of the 25 subfamilies from Arabidopsis [9], are underlined, and their detailed functions and classification information are provided in Supplemental Table 2 and Fig. 2. The colored squares indicate the intron pattern to which each subfamily member belongs (Supplemental Fig. 1). Nodes with bootstrap values ≥70% and ≥50% are shown as black and gray dots, respectively, in the phylogenetic tree. related to specialized metabolism. About 70% of functionally characterized R2R3-MYBs and most subfamilies are related to specialized metabolic processes focused on phenylpropanoid-derived secondary metabolites ( Fig. 4b and Supplemental Table 2). The rapid expansion and diversification of R2R3-MYBs have been ongoing processes throughout the evolution of angiosperms. It is not surprising that this large gene family has contributed to the evolution of physiological or developmental processes that are specific to plants. The expansion of R2R3-MYBs in land plants may underpin the emergence of tremendous plant diversity, providing new functional characteristics related to protection against stress and changing environmental conditions. Thus, uncovering the genomic and molecular basis of the origin and evolution of such functional characteristics is a major research goal.
Owing to the availability of resources for gene functional analyses (e.g. genomic data, genetic populations, and experimental tools), a large proportion of the Arabidopsis R2R3-MYB gene family has been functionally characterized (about 100 of 126 genes). However, our current knowledge of the functions of plant R2R3-MYBs is based mainly on the model plant Arabidopsis, whereas studies of non-traditional models (such as crops) are in an early stage. More information about the functions of R2R3-MYBs in a broader array of plant taxa will undoubtedly improve our understanding of the mechanisms underlying their evolution and functional diversification. Newly developed genomic technologies (e.g. third-generation long-read sequencing technology) combined with traditional and novel molecular technologies (e.g. genome editing tools) offer excellent opportunities for further improving our understanding of the functions and functional diversification of R2R3-MYBs.