Deleterious and Adaptive Mutations in Plant Germplasm Conserved Ex Situ

Abstract Conserving more than 7 million plant germplasm accessions in 1,750 genebanks worldwide raises the hope of securing the food supply for humanity for future generations. However, there is a genetic cost for such long-term germplasm conservation, which has been largely unaccounted for before. We investigated the extent and variation of deleterious and adaptive mutations in 490 individual plants representing barley, wheat, oat, soybean, maize, rapa, and sunflower collections in a seed genebank using RNA-Seq technology. These collections were found to have a range of deleterious mutations detected from 125 (maize) to 83,695 (oat) with a mean of 13,537 and of the averaged sample-wise mutation burden per deleterious locus from 0.069 to 0.357 with a mean of 0.200. Soybean and sunflower collections showed that accessions acquired earlier had increased mutation burdens. The germplasm with more years of storage in several collections carried more deleterious and fewer adaptive mutations. The samples with more cycles of germplasm regeneration revealed fewer deleterious and more adaptive mutations. These findings are significant for understanding mutational dynamics and genetic cost in conserved germplasm and have implications for long-term germplasm management and conservation.

tolerated, and tolerated with low confidence.Note that the five SIFT databases (for oat, soybean, maize, rapa, and sunflower) are accessible as described in A11 below.

A5c. GERP++ Rejected substitution score
We also applied GERP++ (Davydov et al. 2010) to measure the phylogenetic constraint from the substitution of a locus by generating a Rejected Substitution (RS) score.Gerpcol, specifically, estimates constraint for each column of an alignment of several genomes of increasing taxonomic distance.Thus, considerable efforts were made to acquire multiple genome sequence alignments for the seven species.Table S5a lists the related crop species used to do multiple genome sequence alignments for each of the seven studied species, along with their reference genome sequence information and citations.Note that the same reference genome sequence files were used as those for SNP identification with minor exceptions of formatting for barley.All genome sequences to be aligned were prepared by removing the plastid and mitochondrial genome sequences and any contigs and scaffolds not assembled into pseudomolecules.For barley, the unsplit chromosome version 2 from Ensembl Plants was used.For the allohexaploid wheat, the whole genome sequences were separated into three constituent genomes, A, B, and D, for their separate alignments with the genomes of other species.For oat, the whole genome sequences were separated into individual chromosomes.Multiple whole-genome alignments were conducted using the large-scale genome alignment tool, LASTZ (Harris 2007) and converted to multiple alignment files (MAF) using: axtChain, chainNet, netSyntenic, netToAxt, and axtToMaf (UCSC Genome Browser Toolkit, Anaconda distribution).For barley, wheat, and oat, the alignments were made at the chromosomal level with the genomes of other species.For sunflower, the entire reference genome sequences were aligned with the genomes of other species using Mugsy v1.2.3 (Angiuoli and Salzberg 2011).
Single_cov2.v11 from the MultiZ package (Blanchette et al. 2004) was used to remove any low-scoring alignments where there was overlap in alignments against each reference genome to generate a single coverage alignment across each reference genome.The individual MAF files representing the three separate wheat genomes compared to the reference genomes were combined into a single MAF file for each reference species once single_cov2.v11completed.ROAST (Hou and Riemer 2008) was used to find orthologous alignments for each reference genome across all related crop genomes and to generate a single MAF file for each reference genome, or for each reference chromosome, in the case of barley, wheat, and oat.For maize, rapa, soybean, and sunflower a copy of the single MAF file was split into multiple MAF files each representing a single chromosome.The resulting individual chromosome MAF files were converted to aligned FASTA format with msa_view.Phylogenetic tree and neutral branch length (estimated from fourfold degenerate sites) analyses were made using PhastCons (msa_view and phyloFit; Siepel et al. 2005) and were used to quantify the constraint intensity at every position across a representative chromosome for each reference genome species.RS scores for each chromosome of each reference genome were calculated with Gerpcol following the same procedures used by Ramu et al. (2017), and the resulting files were concatenated and filtered to remove RS scores ≤0.The seven generated RS files with RS scores >0 are accessible as described in A11 below.

A6. SNP calling and annotation
The following six steps A6a to A6g were applied to each collection and the three specific paired groups (Table S2), with some exceptions in A6g.

A6a. SNP calling
Single nucleotide polymorphisms (SNPs) were called based on BAM files using ANGSD (Korneliussen et al. 2014) with the following parameters: -dovcf 1 -gl 1 -dopost 1 -domajorminor 4 -domaf 1 -snp_pval 1e-6.A VCF file was generated for each collection and each group of samples.Barley split-, wheat halved-, and oat halved-, chromosome VCF files were recombined using a custom Perl script.For each VCF file, the number of SNPs was calculated.

A6b. Annotation using stand-alone Ensembl Variant Effect Predictor (VEP)
All SNPs were annotated based on an ANGSD-generated VCF file using the stand-alone Ensembl VEP (McLaren et al. 2016;Naithani et al. 2017).For barley and wheat, VEP cache databases were downloaded from the Ensembl Plants website, and for the other five species, VEP cache databases were custom-made from the related GTF or GFF files following the instructions of VEP API and software (McLaren et al. 2016).This analysis allowed for the classification of SNPs into 17 different classes, including missense and synonymous variants.We also calculated the proportions of the detected variants associated with loss of function (LOF) based on the total count of variants from seven VEP annotation classes (Splice_acceptor_variant, Splice_donor_variant, Stop_gained, Stop_lost, Start_lost, Splice_region_variant, and Stop_retained_variant).

A6c. Identifying putative deleterious SNPs (dSNPs)
Amino acid substitutions and their effects on protein function were predicted with the SIFT algorithm.Nonsynonymous mutations with SIFT scores <0.05 were defined as putative deleterious mutations.SIFT (<0.05) and GERP++ RS (>0) annotations were combined to identify the deleterious mutations in constrained portions of the genome.These identified dSNP files formed the basis for further analyses below.

A6d. Identifying fixed dSNPs
ANGSD was applied to generate allele frequency estimations (or MAF files) based on dSNP sites that were identified with SIFT and GERP++ RS scores.Based on the allelic frequency data, fixed dSNPs were also identified.Plots were made using custom R (R Core Team, 2019) scripts to assess the distributions of allele frequencies and RS scores with respect to dSNP.

A6e. Expression of identified dSNPs
An effort was made to identify deleterious genes associated with the dSNPs and fixed dSNPs.A deleterious gene in this study was defined as a gene or canonical transcript associated with one or more of the identified dSNPs and they were extracted from the VEP output file based on the identified dSNPs and fixed dSNPs.The expression of these deleterious genes at the stage of 3leaves in monocots and of the first true leaf in dicots present in the RNA-Seq data was extracted and quantified using a custom Shell script based on StringTie (Pertea et al. 2015).The proportion of the dSNPs associated with expressed deleterious genes was estimated as the count of expressed deleterious genes over all the identified dSNPs.Such estimation represents only the lower bound, as an expressed deleterious gene may harbor more than one dSNP.

A6f. Jackknife estimation of dSNPs
To understand the variation in the detection of dSNPs and fixed dSNPs, we used a jackknife resampling approach.The computational effort was considerably large, given the 49 combinations of the seven collections each having seven combinations of collection and groups.Thus, only 10 jackknife sub-samples of size 69 for a collection and of size (group size minus one) for a group were generated, followed by the SNP calling and annotation from A6a to A6d to estimate dSNPs and fixed dSNPs.The mean and standard deviation of an estimate were calculated from 10 jackknife sub-samples.

A6g. Estimating deleterious base-substitution mutations
We also calculated the deleterious base-substitution mutations per sample (dBSMs) by dividing the detected dSNP count by the reference genome size (Table S5a) and sample size for each collection.This was similarly done for fixed dSNPs for each collection.To assess the variation of these dSNPs or fixed dSNPs, we performed a jackknife sub-sampling, as mentioned above.An effort was also made to calculate the dBSMs per year for the SY group of each collection and estimate the change in dBSMs per year with respect to dSNPs and fixed dSNPs, as this paired group represents the difference in storage years.

A7a. Estimating mutation burden for each assayed sample
With identified dSNP information, ANGSD was run for genotyping and obtaining deleterious allelic frequency files with respect to each collection and its three specific paired groups.Mutation burden per deleterious locus for individual samples was calculated from sample genotype data based on the numbers of deleterious alleles present in three models: homozygousmutation burden, heterozygous-mutation burden, and total mutation burden (Ramu et al. 2017;Wang et al. 2017).The homozygous-mutation burden per deleterious locus is the number of derived deleterious alleles in the homozygous state, divided by a product of 2 × total dSNP count.The heterozygous-mutation burden per deleterious locus is the number of derived deleterious alleles existing in the heterozygous state, divided by a product of 2 × total dSNP count.The total mutation burden per deleterious locus is the number of derived deleterious alleles existing in an accession (2 × homozygous-mutation burden + heterozygous-mutation burden), divided by a product of 2 × total dSNP count.A collection or group mutation burden was calculated based on the mean of individual total mutation burdens over all the respective samples.

A7b. Correlating estimated mutation burdens with accession features
An effort was also made to assess if the estimates of three individual mutation burdens for each sample were associated with its accession features such as the year of acquisition, the storage years since the last regeneration, and the germination levels.Linear regressions were performed between mutation burdens and accession features using basic R packages (R Core Team, 2019).Such assessments were made only for the collection samples with or without technical replicate samples.Sample-wise mutation burdens were also plotted against the storage years since the last regeneration.

A8. Inferring adaptive mutation
We applied polyDFE (Tataru et al. 2017) to infer adaptive mutations with the proportion of adaptive substitutions (i.e., those with a selection coefficient greater than zero) by generating the alpha-dfe value as the ratio of the estimated adaptive substitutions over the observed selected divergence counts (Loewe et al. 2006).The inference to the whole collection was made with only 40 randomly selected samples, as the polyDFE version 2.0 is limited to a sample size of 49 diploid plants.It was based on site frequency spectrum (SFS) data, and such SFS data was obtained by running ANGSD based on synonymous and non-synonymous SNP sites that were annotated as above from the respective VCF file for each collection or group.The ANGSD runs were conducted with -anc option using an ancestral sequence reconstructed (ASR) with two outgroups (Table S5a).The ASR was generated following the steps of A5c up to ROAST and using a custom script to filter alignments shorter than 500 to 1000 bp (depending on the crop species; with 1000 bp for wheat, barley, and oat) before the application of FastML (Ashkenazy et al. 2012).
For each polyDFE analysis, the total sequence length of the selected regions (TSLs) was first estimated by the product of the non-synonymous SNP count times the density of SNPs for the species, which was roughly estimated by the total length of the whole reference genome sequences divided by the total SNPs detected from the collection samples in this study (see the step A6a).The total sequence length of the neutral regions was estimated following the proportionality principle with 1/3 of the TSLs (Tataru et al. 2017).Such estimation of the sequence length may be conservative with RNA-Seq data and could scale up the alpha-dfe estimation (personal communication with Dr. Thomas Bataillon).Three models (A, C, and D) using the -w option were examined and the resulting alpha-dfe estimates were compared.The final analysis was done with Model A, as it was the most stable model with convergence.Each polyDFE analysis was done with 30 bootstrapped datasets, and the standard deviation of the alpha-dfe estimate was obtained accordingly.Only 30 bootstrapped samples were generated and analyzed, because the effort required for the whole polyDFE analysis was substantial, given the 49 combinations of collection and groups with the seven collections.Also, it is worth noting that our polyDFE analyses did not consider degenerated codon positions as discussed by Bierne and Eyre-Walker (2003), as the required effort was too large for our large genome datasets (Table S5a).Thus, our estimates of synonymous and non-synonymous SNP sites were not accurate and can bias the alpha-dfe estimation.

A9. Testing significance
Efforts were made to assess the significance of the difference in an estimate between paired groups.We applied a non-parametric Kruskal-Wallis test (Kruskal and Wallis 1952) to assess the difference in median value conservatively for each paired group in each collection for the following estimates: the counts and proportions of dSNPs and fixed dSNPs, individual mutation burdens, and the 30 bootstrapped alpha-dfe estimates for adaptive mutations.

A10. Gene ontology (GO) analysis of deleterious genes
An effort was also made to do GO analysis of deleterious genes identified for the SY group pair of each collection to identify the common and unique GO features.Genes or canonical transcripts associated with the identified dSNPs were extracted from the VEP output file.The extracted genes were analyzed by Blast2GO Pro v.5.2.5 (Conesa and Götz 2008) using the Gene Ontology Annotation workflow (BLAST, mapping, and annotation) and Enrichment Analysis (Fisher Exact Test).Non-redundant GO term sets were visualized using REVIGO (Supek et al. 2011) with treemaps and tag clouds to assist biological interpretation.These GO analyses were 11.1 million MSR (Table S3).Generally, the percentages of mapped reads in bam files were lower in outcrossing, than selfing, crops.Calling variants including SNP using ANGSD identified a range of 0.51 to 2.94 million SNPs for the seven collections with a mean of 1.56 million SNPs (Table S4).SNP annotation using Ensembl-Variant Effect Predictor allowed for the classification of SNPs into 17 different classes (Table S4).Most SNPs were located at intergenic, upstream, and downstream genic regions.The annotation also revealed abundant intron variants, indicating the presence of substantial unspliced pre-mRNAs in the sequence reads.The proportions of missense SNPs over all detected SNPs ranged from 0.122 (sunflower) to 0.210 (rapa) with a mean of 0.159.The proportions of synonymous SNPs ranged from 0.126 (barley) to 0.401 (rapa) with an average of 0.193.More missense than synonymous variants were detected in barley, oat, and soybean collections (Table S4).We also assessed the proportions of the detected variants associated with loss of function (LOF) based on the total count of variants from seven VEP annotation classes (Splice acceptor variant, Splice donor variant, Stop gained, Fu et al 11 S1 to S11 Table S1.Summary of all plant materials used for this study, including RNA-Seq sequencing and sequence deposition.Additional inventory and sequence information on the materials can be found in Tables S2 and S3, respectively.The assayed samples were acquired from the corresponding germplasm collections held at the Plant Gene Resources of Canada (PGRC), Saskatoon, Saskatchewan, Canada.

Group composition
Fu et al 13

Group composition
Fu et al 18    S5a.Composition of 12 reference genome sequences used for GERP++ 'rejected substitution' (RS) scoring for identification of deleterious SNPs in the samples of seven germplasm collections and outgroups used for ancestral sequence reconstruction for each crop.
Table S5b.Frequency distributions of GERP++ RS values for seven crop species inferred from the alignments of 12 other reference genome sequences (Table S5a).Indica Indica Thale Indica Grape Thale Indica a Genome size, in basepair, from the published and updated version of the sequence available in the host database at the time of analysis.d Two outgroups were used for ancestral sequence reconstruction for each crop b Host database and sequence release verison.EP = Ensembl Plants (https:// http://plants.ensembl.org).Also, barley-split reference geome sequence used for barley analysis was obtained from https://doi.ipk-gatersleben.de/DOI/a056fc05-2530-4122-8cc1-98cf265b2f06/ef4f19fc-8f1a-477c-a4b5-332d2eee2a14/2.GG for https://wheat.pw.usda.gov/GG3/content/avena-sang-download.Table S6.Estimates of deleterious base-substitution mutations per sample (dBSMs) for the samples of the seven collections.The standard deviation (SD) estimates for dSNPs and fixed dSNPs were generated from 10 jackknife sub-samples of the 70 assayed collection samples.
Table S7.Inferences of the changes in deleterious base-substitution mutations per sample per year (dBSMsy) based on the estimation of dSNPs (dS) and fixed dSNPs (fdS) from 10 jackknife sub-samples of the SY groups for each collection (see Table S2).Each SY group pair reflects the difference in the years of storage (YS) in the genebank since the last regeneration, while SY2 samples have more years of storage than SY1 samples.The standard deviations (SD) of the estimates are presented in parenthesis, and the results for significant tests of differences for each SY group pair are given in Table S9 below.Sample size (SZ) for each SY group pair is also shown.The results show some heterogeneity of the changes in dBSMsy with respect to dS and fdS under variable years of storage among the seven collections.S11.List of the Kruskal-Wallis tests for the six estimates inferred from each GD (or germination difference) group pair of the seven collections.The estimates and their standard deviations are 1) dSNP proportion, 2) fixed dSNP proportion, 3) individual total mutation burden, 4) individual homozygous mutation burden, 5) individual heterozygous mutation burden, and 6) alpha-dfe for adaptive mutation.The standard deviations of dSNP and fixed dSNP proportions were obtained from 10 jackknife sub-samples of the GD group in each collection.The standard deviations of alpha-dfe were obtained from 30 bootstrapped samples of each GD group.The statistical significance is shown with *,**, or *** for P <0.05, 0.01, or 0.001, respectively.Our research effort was largely spent on the development of SIFT databases for five plant species (oat, soybean, maize, rapa, and sunflower) and of GERP++ RS scores for seven species by performing multiple sequence alignments of each species genome against 12 other species genomes (Table S5a).
Major steps used to identify putative deleterious SNPs  In each panel, the total number of dSNPs and their mean and median allelic frequencies are also presented.Clearly, a majority of the identified dSNPs had a low allelic frequency, but fixed dSNPs were also observed in each collection.These patterns are more obvious in the barley, rapa, and wheat collections.
Fig. S3a.No marked associations between the mutation burdens per deleterious locus and the storage years since the last regeneration for the assayed barley and wheat samples.Three mutation burdens per deleterious locus (total, heterozygous, homozygous) were presented for each sample in each collection.For each panel, the samples were ordered for the burden extent and shown in their storage years since the last regeneration on the X-axis.
Fig. S3b.No marked associations between the mutation burdens per deleterious locus and the storage years since the last regeneration for the assayed soybean and maize samples.Three mutation burdens per deleterious locus (total, heterozygous, homozygous) were presented for each sample in each collection.For each panel, the samples were ordered for the burden extent and shown in their storage years since the last regeneration on the X-axis.
Fig. S3c.No marked associations between the mutation burdens per deleterious locus and the storage years since the last regeneration for the assayed rapa and sunflower samples.Three mutation burdens per deleterious locus (total, heterozygous, homozygous) were presented for each sample in each collection.For each panel, the samples were ordered for the burden extent and shown in their storage years since the last regeneration on the X-axis..

Fu et al 43
Fig. S3d.No marked associations between the mutation burdens per deleterious locus and the storage years since the last regeneration for the assayed oat samples.Three mutation burdens per deleterious locus (total, heterozygous, homozygous) were presented for each sample in each collection.For each panel, the samples were ordered for the burden extent and shown in their storage years since the last regeneration on the X-axis.
Fu et al 44 Fig. S4a.Regressions of individual mutation burdens per deleterious locus (total, heterozygous, homozygous) over the years of the accession acquisition in the seven collections.Four collections (barley, soybean, rapa, and sunflower) displayed trends of larger total mutation burdens for the samples acquired earlier, while three collections (wheat, maize, and oat) showed trends of lower total mutation burden in the samples acquired earlier.The regression panels highlighted in pink were statistically significant at p<0.05.Three collections (wheat, maize, and sunflower) showed trends of larger total mutation burden in the samples stored longer, while four collections (barley, oat soybean, and rapa) displayed trends of lower total mutation burdens for the samples stored longer after the last regeneration.The regression panels highlighted in pink were statistically significant at p<0.05.Regressions of individual mutation burdens per deleterious locus (total, heterozygous, homozygous) over the accession germination rates in the seven collections.Four collections (wheat, soybean, maize, and sunflower) displayed trends of larger total mutation burden for the samples with lower germinations, while three collections (barley, oat, and rapa) showed trends of larger total mutation burden in the samples with higher germinations.The regression panels highlighted in pink were statistically significant at p<0.05.Comparative allele frequency distributions for deleterious SNPs (dSNPs) identified in the samples of the SY1 (in blue) and SY2 (in red) groups in the seven collections.For example, the blue line and dots in the barley collection represent the allelic frequency distribution for the barley 10Y samples, while the red line and dots are those for the barley 27Y samples.The four left panels are the selfing plants and the three right panels are the outcrossing plants.There were no marked differences in deleterious allelic frequency distribution with respect to the years of storage between SY1 and SY2 groups in each collection.The rapa 16Y (or SY2) samples seemed to have more low allelic frequencies and fixation of dSNPs than the rapa 5Y (or SY1) samples.
Fig.S1.The flowchart showing the major steps used to identify putative deleterious SNPs in this study.The whole process required a SIFT database and GERP++ RS score data for one species.Our research effort was largely spent on the development of SIFT databases for five plant species (oat, soybean, maize, rapa, and sunflower) and of GERP++ RS scores for seven species by performing multiple sequence alignments of each species genome against 12 other species genomes (TableS5a).

Fig. S2 .
Fig. S2.The frequency distributions of deleterious SNPs (dSNPs) identified in seven collections each with 70 individual samples.The four left panels are the selfing plants and the three right panels are the outcrossing plants.In each panel, the total number of dSNPs and their mean and median allelic frequencies are also presented.Clearly, a majority of the identified dSNPs had a low allelic frequency, but fixed dSNPs were also observed in each collection.These patterns are more obvious in the barley, rapa, and wheat collections.
Fig.S4b.Regressions of individual mutation burdens per deleterious locus (total, heterozygous, homozygous) over the years of storage since the last field regeneration in the seven collections.Three collections (wheat, maize, and sunflower) showed trends of larger total mutation burden in the samples stored longer, while four collections (barley, oat soybean, and rapa) displayed trends of lower total mutation burdens for the samples stored longer after the last regeneration.The regression panels highlighted in pink were statistically significant at p<0.05.
Fig.S4c.Regressions of individual mutation burdens per deleterious locus (total, heterozygous, homozygous) over the accession germination rates in the seven collections.Four collections (wheat, soybean, maize, and sunflower) displayed trends of larger total mutation burden for the samples with lower germinations, while three collections (barley, oat, and rapa) showed trends of larger total mutation burden in the samples with higher germinations.The regression panels highlighted in pink were statistically significant at p<0.05.
Fig. S5.Venn diagrams showing the comparative counts of deleterious SNPs (dSNPs) that are shared and unique among three groups of samples (two SY group pairs and all of the assayed samples) in each collection.The four left panels are the selfing plants, while the three right panels are the outcrossing plants.Four collections (wheat, maize, rapa, and sunflower) displayed more dSNPs (and unique dSNPs) in the SY2 group of longer storage than the SY1 group.These results indicate that deleterious mutations generally increased with more years of storage in the genebank since the last field regeneration in the seven collections.

Fig
Fig. S6.Comparative allele frequency distributions for deleterious SNPs (dSNPs) identified in the samples of the SY1 (in blue) and SY2 (in red) groups in the seven collections.For example, the blue line and dots in the barley collection represent the allelic frequency distribution for the barley 10Y samples, while the red line and dots are those for the barley 27Y samples.The four left panels are the selfing plants and the three right panels are the outcrossing plants.There were no marked differences in deleterious allelic frequency distribution with respect to the years of storage between SY1 and SY2 groups in each collection.The rapa 16Y (or SY2) samples seemed to have more low allelic frequencies and fixation of dSNPs than the rapa 5Y (or SY1) samples.

Table S1a .
The relevant biological information for the seven crops studied.

Table S2a .
List of inventory and RNA-Seq information for 72 assayed barley samples representing their respective PGRC collection, along with the composition of three paired groups for group analysis.CN=PGRC accession number, Country=country of origin, YA=year acquired, YR=year last regenerated, NR=the number of field regenerations done since the acquisition, GR=germination rate, SY=grouping based on the year last regenerated, RF=grouping based on NR, GD=grouping based on GR, and SEQ-Name=sequence file name.

Table S2b .
List of inventory and RNA-Seq information for 72 assayed wheat samples representing their respective PGRC collection, along with the composition of three paired groups for group analysis.CN=PGRC accession number, Country=country of origin, YA=year acquired, YR=year last regenerated, NR=the number of field regenerations done since the acquisition, GR=germination rate, SY=grouping based on the year last regenerated, RF=grouping based on NR, GD=grouping based on GR, and SEQ-Name=sequence file name.Note NA is not available.

Table S2c .
List of inventory and RNA-Seq information for 72 assayed oat samples representing their respective PGRC collection, along with the composition of three paired groups for group analysis.CN=PGRC accession number, Country=country of origin, YA=year acquired, YR=year last regenerated, NR=the number of field regenerations done since the acquisition, GR=germination rate, SY=grouping based on the year last regenerated, RF=grouping based on NR, GD=grouping based on GR, and SEQ-Name=sequence file name.

Table S2d .
List of inventory and RNA-Seq information for 72 assayed soybean samples representing their respective PGRC collection, along with the composition of three paired groups for group analysis.CN=PGRC accession number, Country=country of origin, YA=year acquired, YR=year last regenerated, NR=the number of field regenerations done since the acquisition, GR=germination rate, SY=grouping based on the year last regenerated, RF=grouping based on NR, GD=grouping based on GR, and SEQ-Name=sequence file name.

Table S2e .
List of inventory and RNA-Seq information for 72 assayed maize samples representing their respective PGRC collection, along with the composition of three paired groups for group analysis.CN=PGRC accession number, Country=country of origin, YA=year acquired, YR=year last regenerated, NR=the number of field regenerations done since the acquisition, GR=germination rate, SY=grouping based on the year last regenerated, RF=grouping based on NR, GD=grouping based on GR, and SEQ-Name=sequence file name.

Table S2f .
List of inventory and RNA-Seq information for 72 assayed rapa samples representing their respective PGRC collection, along with the composition of three paired groups for group analysis.CN=PGRC accession number, Country=country of origin, YA=year acquired, YR=year last regenerated, NR=the number of field regenerations done since the acquisition, GR=germination rate, SY=grouping based on the year last regenerated, RF=grouping based on NR, GD=grouping based on GR, and SEQ-Name=sequence file name.

Table S2g .
List of inventory and RNA-Seq information for 72 assayed sunflower samples representing their respective PGRC collection, along with the composition of three paired groups for group analysis.CN=PGRC accession number, Country=country of origin, YA=year acquired, YR=year last regenerated, NR=the number of field regenerations done since the acquisition, GR=germination rate, SY=grouping based on the year last regenerated, RF=grouping based on NR, GD=grouping based on GR, and SEQ-Name=sequence file name.

Table S3a .
Summary of RNA-Seq sequence reads before trimming, after trimming, and after alignment for the samples of the barley and wheat collections.

Table S3b .
Summary of RNA-Seq sequence reads before trimming, after trimming, and after alignment for the samples of the oat and soybean and collections.

Table S3c .
Summary of RNA-Seq sequence reads before trimming, after trimming, and after alignment for the samples of the maize and rapa collections.

fastq paired reads Percent trimmed paired reads Total reads in bam file Total mapped reads in bam file Percent of mapped reads in bam file Sample Original fastq paired reads Trimmed fastq paired reads Percent trimmed paired reads Total reads in bam file Total mapped reads in bam file Percent of mapped reads in bam file MaizeTable S3d .
Summary of RNA-Seq sequence reads before trimming, after trimming, and after alignment for the samples of the sunflower collection.

Table S4b .
Summary of SNP discovery, SNP annotation, deleterious SNP identification, expressed deleterious genes, deleterious mutation burdens, and adaptive mutation inference in the wheat samples.

Table S4c .
Summary of SNP discovery, SNP annotation, deleterious SNP identification, expressed deleterious genes, deleterious mutation burdens, and adaptive mutation inference in the oat samples.

Table S4d .
Summary of SNP discovery, SNP annotation, deleterious SNP identification, expressed deleterious genes, deleterious mutation burdens, and adaptive mutation inference in the soybean samples.

Table S4e .
Summary of SNP discovery, SNP annotation, deleterious SNP identification, expressed deleterious genes, deleterious mutation burdens, and adaptive mutation inference in the maize samples.

Table S5c .
The distributions of GERP++ RS sites and dSNP counts per chromosome detected in the assayed samples of the seven collections.

Table S8a .
Summary of gene ontological (GO) analyses of all dSNPs detected in each storage year (SY) group of the seven collections.Biological processes in bold represent the most common ones across seven SY group pairs and biological processes in italics represent the ones unique within each SY group pair.Several major biological processes such as protein phosphorylation, organic substance metabolism, responses to chemical, stress, and stimulus were identified across seven SY group pairs and some distinct biological processes for each SY group pair.

Table S8b .
Summary of gene ontological (GO) analysis of all fixed dSNPs for each storage year (SY) group of the seven collections.Biological processes in bold represent the most common ones across seven SY group pairs, and biological processes in italics represent the ones unique within each SY group pair.Several major biological processes such as cellular process, macromolecule metabolism, nitrogen compound metabolism, and metabolism were identified across seven SY group pairs and some distinct biological processes for each SY group pair.

Table S9 .
List of the Kruskal-Wallis tests for the six estimates inferred from each SY (or storage year) group pair of the seven collections.The estimates and their standard deviations are 1) dSNP proportion, 2) fixed dSNP proportion, 3) individual total mutation burden, 4) individual homozygous mutation burden, 5) individual heterozygous mutation burden, and 6) alpha-dfe for adaptive mutation.The standard deviations of dSNP and fixed dSNP proportions were obtained from 10 jackknife sub-samples of the SY group in each collection.The standard deviations of alpha-dfe were obtained from 30 bootstrapped samples of each SY group.The statistical significance is shown with *,**, or *** for P <0.05, 0.01, or 0.001, respectively.

Table S10 .
List of the Kruskal-Wallis tests for the six estimates inferred from each RF (or regeneration frequency) group pair of the seven collections.The estimates and their standard deviations are 1) dSNP proportion, 2) fixed dSNP proportion, 3) individual total mutation burden, 4) individual homozygous mutation burden, 5) individual heterozygous mutation burden, and 6) alpha-dfe for adaptive mutation.The standard deviations of dSNP and fixed dSNP proportions were obtained from 10 jackknife sub-samples of the RF group in each collection.The standard deviations of alpha-dfe were obtained from 30 bootstrapped samples of each RF group.The statistical significance is shown with *,**, or *** for P <0.05, 0.01, or 0.001, respectively.