Comprehensive analysis of SSRs and database construction using all complete gene-coding sequences in major horticultural and representative plants

Simple sequence repeats (SSRs) are one of the most important genetic markers and widely exist in most species. Here, we identified 249,822 SSRs from 3,951,919 genes in 112 plants. Then, we conducted a comprehensive analysis of these SSRs and constructed a plant SSR database (PSSRD). Interestingly, more SSRs were found in lower plants than in higher plants, showing that lower plants needed to adapt to early extreme environments. Four specific enriched functional terms in the lower plant Chlamydomonas reinhardtii were detected when it was compared with seven other higher plants. In addition, Guanylate_cyc existed in more genes of lower plants than of higher plants. In our PSSRD, we constructed an interactive plotting function in the chart interface, and users can easily view the detailed information of SSRs. All SSR information, including sequences, primers, and annotations, can be downloaded from our database. Moreover, we developed Web SSR Finder and Batch SSR Finder tools, which can be easily used for identifying SSRs. Our database was developed using PHP, HTML, JavaScript, and MySQL, which are freely available at http://www.pssrd.info/. We conducted an analysis of the Myb gene families and flowering genes as two applications of the PSSRD. Further analysis indicated that whole-genome duplication and whole-genome triplication played a major role in the expansion of the Myb gene families. These SSR markers in our database will greatly facilitate comparative genomics and functional genomics studies in the future.


Introduction
Since molecular marker technology was developed in the 1980s, an increasing number of molecular marker types have been identified, which has rapidly accelerated genetic improvements in species 1 . The development and comparative analysis of molecular markers could help us reveal genetic variation underlying various biological functional genes [2][3][4] . To date, researchers have found several molecular markers, such as restriction fragment length polymorphisms, random amplified polymorphism DNA, sequence tag sites, amplified fragment length polymorphism, diversity array technology markers, singlenucleotide polymorphisms, specific locus amplified fragments, and simple sequence repeats (SSRs) 1,5,6 .
These molecular markers play important roles in genetic map construction, quantitative trait locus detection, marker-assisted selection (MAS), and fine localization of important functional genes to fulfill various demands of breeders 7,8 . There have been many studies of molecular markers in model plants 1,9 . For example, several kinds of molecular markers were used to identify genes related to leaf senescence, leaf shape, chlorophyll, and embryogenesis in Arabidopsis [10][11][12] . Similarly, most genes determining disease resistance and major agronomic traits, such as grain quality, grain weight, and grain size, were also detected using molecular markers in rice [13][14][15][16] . In horticultural plants, molecular markers are also widely used for plant breeding in most species, including Brassica rapa, Brassica oleracea, Solanum lycopersicum, Cucumis melo, Vitis vinifera, Fragaria ananassa, and pear [17][18][19][20][21][22] . Furthermore, progress in molecular genetics, genomic selection, and genome editing has provided deep insights into the understanding of molecular markers and greatly complemented breeding strategies 1 .
SSR markers are present in almost all species, particularly in eukaryotes. These markers have many applications, such as constructing linkage maps, fine mapping of genes, and selective breeding through genomic selection 2, [23][24][25] . SSRs have become extremely popular for phylogenetic analysis and have expanded our knowledge related to plant breeding [26][27][28] . The development of bioinformatics technology has enabled the development of SSR markers for many species [29][30][31] . Recently, there have been many reports on SSR development and application [32][33][34][35][36][37][38] . These studies have confirmed that SSRs are the classic, popular molecular markers used in plant science.
With an increasing number of plant genomes being released, it has become possible to construct a plant SSR database (PSSRD) using the SSRs identified from all genes in these plants. Compared with those in existing databases, all the species in the database in this study have undergone complete genome sequencing. In addition, the PSSRD provides primer information and Pfam function annotation, which allows researchers to use these SSRs in a more convenient manner than those in other databases. More importantly, we not only provide more comprehensive and representative SSR information with the construction of this database but also conduct large-scale systematic and comparative analyses of SSRs in 112 plants.

Overview of the main interface of the PSSRD
We identified 249,822 SSRs from 3,951,919 gene sequences of 112 plant species. Specifically, 132,114, 64,980, 9478, and 43,250 SSRs were detected in 70 eudicots, 27 monocots, 7 other higher plants (1 basal angiosperm, 2 gymnosperms, 1 Lycopodiophyta, 2 Bryophyta, and 1 Marchantiophyta), and 8 lower plants, respectively ( Fig. 1a and Table S1). Among these species, Fig. 1 The architecture of the plant SSR database (PSSRD) and related species. a The phylogenetic relationship of 112 species used for constructing the PSSRD according to NCBI taxonomy 66 . b The PSSRD architecture mainly includes the home, browse, download, tool, chart, and resource interfaces many are horticultural plants, such as vegetables (B. rapa, Brassica oleracea, Capsicum annuum, Daucus carota, and S. lycopersicum), fruits (Citrus clementina, C. melo, Fragaria vesca, Prunus persica, and V. vinifera), and flowers (Prunus mume, Aquilegia coerulea, and Catharanthus roseus). On average, primers were successfully designed for 98.82% of the SSRs for further study. Using these available datasets and related bioinformatics tools, we built a PSSRD, which helps users easily query, compare, and download SSR markers, primers, and functional annotations of several or all species simultaneously. All species used in this study were taxonomically classified to facilitate selection and use. The SSR information was stored in backend tables using MySQL (MySQL AB, Sweden) that can be accessed using the frontend web application of PSSRD (Fig. 1b). Here, we provide a detailed description of the interactive interfaces in this database, including the browse, chart, download, tool, resource, contact, and help interfaces ( Fig. 2 and Fig. S1).

Browse
To make the database easy to use by researchers, we divided all species into different groups according to their taxa (Fig. 2). For each taxon, the species were further sorted by the first letter of their Latin names. We provided detailed information for each species, such as SSR information (type, sequences, size, start, and end), primer information (forward and reverse sequences, melting temperature (T m ) value, and size), amplified production size, and related gene information (gene ID and links of Pfam annotation). Furthermore, we also integrated the search function at the browse interface, which allows the users to find related information according to gene ID, SSR type, and SSR sequences. Moreover, we provided a variety of export formats, including Excel, pdf, csv, duplicate, and print functions.

Chart interface
The chart interface provides several interactive plots to view the SSR data of all species (Fig. 2). First, the SSR number of each species is shown in the main interface, and the multiselect dropdown allows users to select the taxon for their needs. Furthermore, bar plots and line charts are used to show the SSR number of each species, which makes it easier and faster for users to compare SSRs between different species. Finally, all the information of these displayed SSRs can be downloaded at the lowerright corner of these pages as Excel files. These documents will allow researchers to conduct local batch SSR comparative analysis and perform relevant markerassisted selective breeding experiments.
In addition, we provide further graphical representations of the SSR information for each species. Each species has six plots with pie charts, bar plots, and line charts, which show detailed information on SSRs, including SSR type, SSR length, product size, most frequent SSR, base Fig. 2 Overviews of the main interfaces and internal features of the plant SSR database (PSSRD). The overviews of the PSSRD mainly contain the home, browse, download, tool, chart, resource, contact, and help interfaces number, and frequency of SSR distribution for each type. These diagrams could help users intuitively understand the SSR information of each species.

Download interface
The SSR information and statistics for each species can be downloaded from this interface (Fig. 2). Four files, including best primers, all primers, Pfam annotation, and position information of SSRs for each species, can be obtained from the download page of the PSSRD. The downloaded file is a tab-separated format, which can be browsed using Excel or other related text editors, such as EditPlus or Sublime text.

Tool interface
In addition to providing SSR information retrieval, graphical display, and download services for existing species, we developed two tools, the Web SSR Finder (WSF) and Batch SSR Finder (BSF) programs (Fig. 2). These two tools can assist researchers in conducting SSR identification and analysis for a new species.
For the WSF, users can upload nucleic acid sequences in the FASTA format and then set the minimum number of repetitions for various types of SSRs. Finally, the start button can be clicked, and after a moment, the relevant SSR identification results are obtained.
The BSF program can batch-detect SSRs in multiple species on the local server. Although the previous MISA program could identify SSRs, it only detected the SSR of one species at a time. Therefore, we have modified and updated the MISA program and named the new program BSF. In addition to some basic SSR identification files, we also provide comparative analysis files of SSRs between different species. With the completion of additional genome sequencing, a batch-comparison study needs to be conducted on the SSR information of a large number of species. Therefore, the updated BSF program is more convenient for users to carry out batch SSR identification and multispecies comparative studies. Anyone engaged in scientific research can download and freely use or further edit this program according to their own analysis needs.

Resource, help, and contact interfaces
For the resource interface, we collected most of the SSR research-related databases and provide relevant links for users to easily query and compare studies (Fig. 2). For the help interface, we provide the researcher with a detailed PSSRD user manual. In addition, we provide contact information to help users contact us conveniently and quickly.
Comprehensive comparative analysis of the SSRs in 112 species Trinucleotide SSRs were dominant according to the frequency distribution analysis In our study, all the SSRs were divided into nine types from mono-to nonanucleotides ( Fig. 3a and Table S1). We found that trinucleotides were the most common SSR type in all four groups, and the average percentages of the SSR numbers were 64.14%, 79.81%, 74.27%, and 84.87% for eudicots, monocots, other higher plants, and lower plants, respectively (Fig. 3c). Nevertheless, we found that the number of trinucleotide SSRs varied considerably among different species, ranging from 114 (eudicot plant: Chenopodium quinoa) to 12,663 (lower plant: C. reinhardtii). The average number of trinucleotide SSRs was 1610 in 112 plants, followed by dinucleotide SSRs (229) and hexanucleotide SSRs (219) ( Fig. 3 and Table S1). This result might have occurred because the trinucleotides in the genecoding regions did not lead to the transcoding of genes. This theory could be further verified by considering hexanucleotides, the percentage of which was also greater than that of the other SSR types in the four groups (Fig. 3b).

Correlation analysis of the factors related to different SSR characteristics
To explore the relationship between the factors related to different SSR characteristics, we conducted a correlation analysis for these factors. Here, we investigated several factors related to SSR characteristics, including SSR number, SSR density (SSR number per Mb), number of genes containing SSRs, and percentage of genes containing SSRs. In addition, the factors total gene number and total length of gene sequences were also used for the comparative analysis in all examined plants.
A significant correlation was detected between the percentage of genes containing SSRs and the SSR number or SSR density in plants (correlation coefficients > 0.80 and P value < 0.01) (Fig. 4). However, there was no significant correlation between SSR number and total gene number or the total length of gene sequences.

Comparative analysis indicated that more SSRs were present in lower plants than in higher plants
Our analyses showed that among the plants, the different lower plants had the largest SSR variations, including variations in SSR number, SSR density, number of genes containing SSRs, and percentage of genes containing SSRs (Fig. 5a, b and Fig. S2). The average SSR density in lower plants was the largest (256.90), followed by that in monocots (55.92), other higher plants (46.34), and eudicots (40.54) (Table S1).
To obtain detailed information about the SSRs in each species, we carried out a further analysis. Overall, more SSRs were detected in lower plants than in higher plants    Correlation analysis of different SSR characteristics, including total gene number, total gene sequences, SSR number, SSR density, number of genes containing SSRs, and percentage of genes containing SSRs. The lower-left corner represents the correlation analysis scatter diagram for different SSR characteristics. The plots in the middle are a bar chart for each SSR feature. The upper-right corner represents the correlation values between different SSR features. The 1, 2, and 3 red asterisks represent P < 0.05, P < 0.01, and P < 0.0001, respectively. The yellow background represents the SSR characteristics with significant differences (correlation coefficients > 0.80 and P value < 0.0001) (Fig. 5). Among the top 15 species with a high percentage of genes containing SSRs, six (40.00%) species belonged to lower plants (Fig. 5c). Two species with the highest percentage of SSR genes were lower plants, Micromonas pusilla CCMP1545 and C. reinhardtii ( Fig. 5c and Table  S1). In M. pusilla CCMP1545, 3768 genes contained SSRs, accounting for 35.35% of the total number of genes. This result might have been due to the special role played by SSRs in lower plants and provides a new perspective for the study of SSR function.

Functional enrichment analysis of genes containing SSRs in 112 species
To further explore the function of SSRs, we conducted functional annotation using the Pfam database. A total of 69.75% of the annotated genes contained SSRs in monocots, followed by those in eudicots (69.25%), other higher plants (65.29), and lower plants (60.27%) (Table S2). We further performed functional enrichment analysis of these SSR-related genes in 112 plants, and 155 terms were enriched with a q value < 0.05 and fold change ≥2 (Table  S3). Our enrichment analysis required that the annotation ratio of the term for SSR genes was twice as high as that of the whole-genome genes. The most enriched term was AP2, followed by Myb_DNA-bind 4, Myb_DNA binding, and TCP family genes. Interestingly, we found that the most significantly expanded terms belonged to the transcription factors associated with the regulation of abiotic stress, such as Myb, TCP, AP2, WRKY, and various zincfinger (zf-CxHx) proteins (Table S3). This result indicated that SSRs might play a very important role in the regulation of plant stress.
Furthermore, we selected the 20 most significantly enriched terms for graphic presentation, and all had q values < 3.32e − 78 (Fig. 6a). Among the 20 top enriched terms, the largest fold change was over 11.73 for Gua-nylate_cyc, followed by that for PTEN_C2 (7.91) and LIM_bind (7.68). This result indicates that these enriched proteins might play critical roles through SSRs in plants.
Further analysis showed that Guanylate_cyc (PF00211) was found in 27,984 sequences from 4096 species according to the Pfam database. Among these sequences, 12,485 sequences from 918 species belonged to Eukaryota, while most of the other sequences belonged to bacteria ( Fig. 6b and Fig. S3). In Eukaryota, most sequences (9235) were from 310 species of Metazoa, while only 391 sequences belonged to 21 species of green plants (Viridiplantae). In Viridiplantae, 12 sequences were from five Streptophyta species, and 379 sequences were from 16 Chlorophyta species (Fig. 6b). Therefore, more genes containing the Guanylate_cyc domain were found in lower plants than in higher plants.
Among the five species from Streptophyta, two species belonged to Charophyta (Klebsormidium nitens and Chara braunii), which contained six and two genes with the Guanylate_cyc domain, respectively (Fig. 6b). The other three species were from land plants, including one Bryophyta (Physcomitrella patens), one Lycophyte (Selaginella moellendorffii), and one angiosperm (Ricinus communis). All identified SSRs located in these genes with the Guanylate_cyc domain could be used as markers for functional studies in the future.

Functional enrichment analysis of genes containing SSRs in eight representative species
We further explored the function of genes containing SSRs in eight representative lower plants (Chlorophyta: C. reinhardtii) and higher plants, including the horticultural plant B. rapa, eudicot model plant Arabidopsis thaliana, monocot model plant O. sativa, basal angiosperm Amborella trichopoda, gymnosperm Picea abies, Lycopodiophyta S. moellendorffii, and Bryophyta P. patens (Fig. 7a).  Table S4). However, no enriched functional terms were found in P. patens.
Further Venn diagram analysis showed 23, 5, 4, and 2 enriched functional terms specific to O. sativa, B. rapa, P. abies, and C. reinhardtii, respectively (Fig. 7b). Two specific functional terms for the lower plant C. reinhardtii were zf-MYND and Guanylate_cyc (Fig. 7b). This result was also consistent with the above analysis of the Gua-nylate_cyc domain; that is, this domain mainly existed in lower plants. Interestingly, we found that Myb_DNA-bind_4 was detected in most plants as an enriched functional term, including B. rapa, O. sativa, P. abies, and S. moellendorffii. In addition, Myb_DNA binding was enriched in O. sativa and C. reinhardtii. This phenomenon indicated that Myb-related genes might play important roles mediated by SSRs in plants.

PSSRD application 1: Myb-related gene families Phylogenetic and comparative analysis of Myb-related gene families
Since the above analysis showed that Myb family genes were significantly enriched in SSR-related genes, we further conducted phylogenetic and comparative analysis of several Myb gene families.
To explore the evolution and function of Myb gene families, we constructed a phylogenetic tree using Mybrelated genes from five families in eight representative species, including B. rapa, A. thaliana, O. sativa, A. trichopoda, P. abies, S. moellendorffii, P. patens, and C. reinhardtii ( Fig. 9 and Fig. S4-7). According to the topology of the phylogenetic tree, the genes of each Mybrelated gene family were classified into different groups. We marked the main functions of most groups according to the Myb family gene functions in Arabidopsis. This result provided a good reference for studying other genes H ig h e r p la n t L o w e r p la n t 17      with unknown functions in the same group. Interestingly, we found that most Myb_DNA-binding family genes of the lower plant C. reinhardtii were clustered on the same branch in the evolutionary tree, while the genes of the other seven species were scattered on different branches (Fig. 9a). This result indicated that the genes of this gene family have experienced changes in the base sequences or gene structure. Thus, Myb_DNA-binding family genes might have evolved to have a greater variety of functions in higher plants than in lower plants, which might have allowed higher plants to become better adapted to terrestrial environments. In addition, we performed a comprehensive analysis of four other Myb-related gene families (Figs. S4-7).

Gene duplication and loss inference of Myb-related gene families
We analyzed the duplication and loss of Myb-related gene families in these eight plants using the Notung software through reconciliation between species and gene phylogenetic trees.
Among the eight species, the most genes were identified in B. rapa for all five Myb gene families (Fig. 9, Figs. S4-7, Table S10). In B. rapa, the number of Myb_DNA-binding family gene duplications was higher than the number of gene losses (193 vs. 15), whereas in Arabidopsis, the number of gene duplications was lower than the number of gene losses (Fig. 9c). Brassica rapa underwent an additional whole-genome triplication (WGT) event since its divergence from Arabidopsis according to a previous report 39 . Therefore, we inferred that WGT events might play important roles in the expansion of the Myb_DNAbinding gene family in B. rapa.
Similarly, there were more gene duplications than gene losses in O. sativa and P. patens, and these duplications occurred in one or several whole-genome duplication (WGD) events. For the other four Myb gene families, we found that they had similar trends in gene duplications and losses as those of the Myb_DNA-binding gene family (Figs. S4-7). Therefore, we believe that WGD or WGT plays a major role in the expansion of Myb gene families. This finding provides new insights and guidance into SSRs and other gene family analyses using datasets from our PSSRD.

PSSRD application 2: flowering-time gene analysis
SSRs are often located in some important functional genes related to plant development and various abiotic stress responses 2,40,41 . Here, we took flowering-time genes as an example to show the application of SSRs stored in our PSSRD. In plants, flowering is critically important for successful sexual reproduction and fruit and seed development 42,43 . A diverse range of environmental and endogenous signals regulate flowering 44,45 . Previous reports have indicated that many genes are involved in regulating plant flowering, and they could be assigned to several regulatory pathways, including photoperiod, vernalization, gibberellin, ambient temperature, autonomous, and aging pathways 43,46 .
Most flowering-time genes have been reported and functionally characterized in Arabidopsis and Brassica species 42,43,[46][47][48] . In Arabidopsis, 306 flowering-time genes have been identified, including 295 coding and 11 noncoding genes according to previous reports 47,48 . Based on these coding genes, we identified 514 homologous flowering-time genes in the horticultural plant B. rapa when compared with those in Arabidopsis by the Blastp program ( Fig. 10 and Table S11). Further analysis showed that 30 genes contained SSRs, accounting for 5.84% of all 514 flowering-time genes in B. rapa (Fig. 10). For example, the flowering locus KH domain (FLK, BraA03 g031700), phytochrome-dependent late flowering (PHL, BraA07 g036800), and cryptochrome 2 (CRY2, BraA10 g002940) genes contained SSRs in B. rapa. These SSRs will be useful for MAS breeding for flowering in Brassica in the future. Similarly, users could also search for SSRs in other functional genes of 112 species from the PSSRD. Therefore, our database can provide researchers with plentiful SSR resources.
The distribution of flowering-time genes on ten chromosomes in B. rapa. The green indicates that the flowering-time genes contained SSR markers.

Discussion
In this study, we comprehensively identified SSRs from all the gene-coding sequences (CDSs) of 112 plants and further performed functional enrichment analysis for SSR-related genes. Among the top 20 significant functional enrichment terms, the Guanylate_cyc term had the largest fold change for SSR-related genes relative to the (see figure on previous page) Fig. 9 Phylogenetic and gene duplication or loss analysis of the Myb_DNA-binding gene family in eight representative species. a Maximum-likelihood (ML) trees were generated based on the amino acid sequences of the Myb_DNA-binding gene family. The phylogenetic tree was constructed using FastTree software with 1000 bootstrap repeats in eight species. Bootstrap values >40% are shown with circles on each branch. b The gene number of the Myb_DNA-binding gene family in each species. c Gene duplication and loss analyses of the Myb_DNA-binding gene family using the Notung software in eight species. Differential gene duplications and losses are indicated by numbers with a + and − on each branch. Whole-genome duplication (WGD) and whole-genome triplication (WGT) events are indicated with a quadrilateral and hexagon, respectively whole-genome level. Interestingly, further investigation showed that the Guanylate_cyc domain existed in lower plants and other nonplant species, while it was rarely found in higher plants. Based on previous reports, guanylate cyclases catalyze guanosine triphosphate to cyclic guanosine monophosphate (cGMP). As an intracellular BraA10g000850 BraA10g001200 messenger, cGMP activates kinases and regulates ion channels 49,50 . Guanylate cyclases are part of the G-protein signaling cascade, which is inhibited by high intracellular calcium levels but activated by low calcium levels 51,52 . Therefore, the genes with the Guanylate_cyc domain might play critical roles in lower plants, animals, and bacteria. This finding provides a new perspective for the functional study of SSR-related genes.
Our findings showed that the most significantly expanded functional terms were transcription factor families related to the regulation of abiotic stresses, such as Myb, AP2, and WRKY. Most of these gene families played important roles in stress resistance in plants according to previous reports [53][54][55][56][57] . This result indicated that SSRs might play critical roles in regulating plant stresses. Further comparative analysis of eight representative plants showed that several specific and common enriched functional terms were detected. Among all functional enriched genes, Myb-related gene families existed in most plants. The Myb gene family has a wide range of effects on plant growth, development, and stress resistance, such as anther development, axillary meristem formation, cell-wall thickening, and sperm cell formation [58][59][60] . The Myb gene family is also involved in several biosynthesis pathways, such as anthocyanin and flavonol synthesis, and hormone responses 59,61,62 . Our further analysis indicated that WGD and WGT played a major role in the expansion of the Myb gene families. This finding provided new insights and guidance into SSRs and other gene families.
Currently, an increasing number of genomes have been sequenced, and it is possible to develop a large number of SSR markers at the whole-genome level in different species from each main kingdom. To date, several databases have been constructed to collect SSRs from one or more species, such as the Plant Microsatellite Database, FishMicroSat, and Microsatellite Database [63][64][65] . However, most existing SSR databases were constructed several years ago and have not been updated with novel sequence information, or they cannot be accessed. Therefore, we constructed a PSSRD in this study, and it will be updated with new SSR datasets and information promptly in the future. With the increasing number of genome sequences released, we will continuously collect novel genomic datasets and identify SSRs and store them in our PSSRD for users. We also encourage users to submit their new SSR datasets to us to further enrich and refine the database. Moreover, we welcome all users to send us feedback for further improvement of our database. We believe that the PSSRD will be a useful and user-friendly database for researchers.

Conclusion
In conclusion, we constructed a PSSRD for widely collected SSR sequences from 112 plants. Interestingly, we found that more SSRs were detected in the lower plants than in the higher plants. Moreover, a comprehensive comparative analysis of SSRs was conducted to reveal their basic characteristics and functional enrichment in different plants. This PSSRD can be used for comparative genomic analysis and molecular MAS studies of plants in the future.

Sequence collection
The CDSs and protein sequences of each plant in Fasta format were downloaded from the ensemble database (http://useast.ensembl.org/index.html). The alternative splice sequences within the species were removed by custom Perl script to ensure no redundancy of the datasets. We have provided detailed information on the 112 plants used in this study, such as the classification, genome information, and related references in Table S12. Based on the relationship of these species in the NCBI taxonomy, the phylogenetic trees were further edited and shown using the iTOL program 66,67 .

Primer design for SSR markers
The primers were designed for the identified SSRs using the Primer3 program 72 . The main parameters were set as follows according to a previous report 2 : (i) the optimum primer length was 20 nucleotides, and the range was from 18 to 27 bases. (ii) The optimum temperature of the T m was 60°C, and the range was from 55 to 65°C. (c) The optimum size of the target PCR products was 150 bp, and the range was from 100 to 280 bp. All other parameters were set to the default values according to the Primer3 program.

SSR statistics and correlation analysis of different factors
Violin plots with boxplots of SSR number, SSR density, and the percentage of genes containing SSRs were drawn using the ggviolin function in the ggpubr package of the R program (https://cran.r-project.org/web/packages/ggpubr/ index.html). Correlation coefficients and significance tests were performed using the Hmisc and Performance Analytics packages of the R program (https://www.r-project.org/). The definition of significant correlation was an absolute value of correlation coefficients > 0.80 and a P value < 0.01.

Functional annotation and enrichment analysis
The functional annotation of the genes containing SSRs and all other genes was conducted using the localized Pfam database (http://pfam.sanger.ac.uk) 73 . The Venn diagram was drawn by TBtools 74 . The functional enrichment analysis of the SSR-related genes compared with the whole-genome genes was conducted using the SciPy package of Python 75 . Then, R was used to perform Benjamini and Hochberg correction on the P value of significance test, and the parameters for significant functional enrichment terms were defined as q value < 0.05 and fold change ≥ 2 76,77 .

Identification and analysis of important functional gene families
Pfam was used to perform a domain search on the amino acid sequences of each species. The genes containing the domains of "Myb_DNA binding" (PF00249), "Myb_DNA-bind_3" (PF12776), "Myb_DNA-bind_4" (PF13837), "Myb_DNA-bind_5" (PF13873), "Myb_DNA-bind_6" (PF13921), and "Myb_DNA-bind_7" (PF15963) were extracted by self-programmed Perl with an e value <1e − 4. In addition, the Simple Modular Architecture Research Tool and Conserved Domains Database were used to conduct domain validation on these genes to ensure accuracy 78,79 . Arabidopsis flowering genes were collected from FLOR-ID and previous reports 47,48 . The homologous flowering genes in B. rapa were identified by a comparison with those in Arabidopsis by the Blastp program (e value <1e − 5, identity >70%).

Phylogenetic tree construction and gene duplication or loss inference
The amino acid sequences of each Myb gene family were aligned using Mafft v7.471 with the maxiterate set as 1000 80 . FastTree (v2.1.11) software was used to perform phylogenetic analysis using the maximum-likelihood method 81 . The Jones-Taylor-Thorton model was adopted, and the bootstrap replications were set as 1000. The phylogenetic trees of each Myb gene family were illustrated using the iTOL program to add SSR-related information or gene function 67 . Gene duplication and gene loss analysis were performed using the Notung2.9 software 82 .

Database construction
The PSSRD was constructed by applying various software packages, including MySQL database management, PHP, JavaScript, HTML, and CSS. The collected datasets were processed using Python or Perl, and several bioinformatics programs were used for interpreting biological data analysis and mining. The PSSRD contains several databases that store processed SSR-related data in MySQL. The interactive Web interface was constructed to enable users to conveniently access the PSSRD and obtain information for basic research using any popular browser on their devices. PHP, HTML, and JavaScript were used to transmit query requirements and extract data rapidly from the MySQL database to create report pages. The interactive plotting system was developed using d3.js and nvd3 helper libraries 83 . More importantly, two tools, WSF and BSF, are provided, which were rewritten according to the MISA 68 . These two tools will greatly facilitate the online or local batch identification of SSRs for users.