Genome assembly of the JD17 soybean provides a new reference genome for comparative genomics

Abstract Cultivated soybean (Glycine max) is an important source for protein and oil. Many elite cultivars with different traits have been developed for different conditions. Each soybean strain has its own genetic diversity, and the availability of more high-quality soybean genomes can enhance comparative genomic analysis for identifying genetic underpinnings for its unique traits. In this study, we constructed a high-quality de novo assembly of an elite soybean cultivar Jidou 17 (JD17) with chromosome contiguity and high accuracy. We annotated 52,840 gene models and reconstructed 74,054 high-quality full-length transcripts. We performed a genome-wide comparative analysis based on the reference genome of JD17 with 3 published soybeans (WM82, ZH13, and W05), which identified 5 large inversions and 2 large translocations specific to JD17, 20,984–46,912 presence–absence variations spanning 13.1–46.9 Mb in size. A total of 1,695,741–3,664,629 SNPs and 446,689–800,489 Indels were identified and annotated between JD17 and them. Symbiotic nitrogen fixation genes were identified and the effects from these variants were further evaluated. It was found that the coding sequences of 9 nitrogen fixation-related genes were greatly affected. The high-quality genome assembly of JD17 can serve as a valuable reference for soybean functional genomics research.


Introduction
Soybean (Glycine max) is an important crop for protein and dietary oil and is ranked the fourth largest crop in production in the world. The current soybean reference genome, which was based on the Williams 82 (WM82) line (Schmutz et al. 2010) has greatly enhanced the identification of genes underlying important traits and facilitated research on the function and expression of soybean genes.
Recent studies using high-throughput sequencing has revealed extensive genetic diversities in soybean (Zhou et al. 2015). Pan-genome study on wild and cultivated soybeans has uncovered numerous genetic differences among soybean strains , suggesting that a single reference genome is inadequate for representing the genetic richness of soybean lines.
The latest progress in sequencing technologies has greatly advanced the ability to construct high-quality genome assemblies with chromosome-level continuity with dramatically reduced cost (Risse et al. 2015;Deschamps et al. 2018). The application of Iso-Seq protocol has enhanced genome annotation (Jiang et al. 2017;Jiao et al. 2017;Li et al. 2018;Magrini et al. 2018). Benefiting from improved technologies, more and more reference genomes of various soybean materials have been sequenced and reported, including the Chinese cultivar ZH13 (Shen et al. 2019), 3 reference genomes (G. max WM82v4, G. max Lee, and Glycine soja PI 483463) (Valliyodan et al. 2019), the wild strain W05 genome (Xie et al. 2019), the pan-genome constructed from 26 different soybean species , and the recently published Korean Hwangkeum genome (Kim et al. 2021) and 8 soybean genomes (Chu et al. 2021).
JD17 is a major soybean cultivar in the Huang-Huai-Hai region of China, and it is also the main soybean variety recognized by the Ministry of Agriculture since 2010. It is the offspring of Hobbit (maternal parent) and Zao 5241 [7476 Â 7527-1-1 (Yanli Â Williams)] (paternal parent) (Qin et al. 2014), and is famous for its lodging resistance, high yield, and strong adaptability (Zhao et al. 2013(Zhao et al. , 2015. The goal of this project was to construct a highquality genome assembly and provide annotations for JD17. Based on comparative analysis, we aim to obtain JD17-specific genes, with the goal to explain differences of important traits. Based on the analysis of genes for nodulation and symbiotic nitrogen fixation in legume (Roy et al. 2020), are there any JD17specific SNF genes that may dictate JD17-specific phenotypes (such as rhizobia number, morphology, and nitrogen fixation capacity).
In this study, we extracted genome DNA from developing underground tissues of JD17, and used PacBio single-molecule realtime (SMRT) sequencing and Hi-C mapping technologies to construct a high-quality soybean reference genome. After being annotated more completely, the JD17 pseudomolecules were used to identify structural differences in a comparative analysis with 3 published soybean reference genomes. We identified JD17specific presence-absence variations (PAVs) and a large number of SNPs and Indels, and also resolved the influence of these variants on the coding structure of nitrogen fixation-related genes.

Plant and sample preparation
Soybean seeds of G. max cv. JD17 used in this study were from Hebei Academy of Agricultural and Forestry Sciences. The seeds are planted and extracted in Xuelu Wang's Laboratory in Huazhong Agricultural University. The seeds were sterilized with chlorine gas [5 ml of 32% (w/w) HCl to 100 ml 4-5% (wt/vol) sodium hypochlorite in a beaker] for 15 h (Kereszt et al. 2007) and then left in a sterile hood for 2 h. The sterilized seeds were sown in growth bottles filled with sand after being soaked in sterile Milli-Q water for 30 min and watered with sterile Fahraeus solution (Få hraeus 1957) containing 2 mM KNO 3 . Seeds were grown in a growth chamber (light) at 28 C and 8 h (dark) at 23 C with 60% humidity for 16 h.

RNA preparation and sequencing
Underground tissues of inoculation and uninoculation from the 9 timepoints (1,4,6,8,10,15,20,25, and 30 day post inoculation, dpi) were collected and used for RNA extraction, respectively. After that, 9 RNA samples of inoculation and uninoculation were mixed equally as a sample, respectively. In addition, we also selected different tissues including root, nodule, stem, leaf, pod, seed, and flower for mixed RNA-Seq. All the RNA was extracted by TRIzol reagent (Invitrogen 15596026). we performed RNA-Seq on illumina platform and produced approximately 10 Gb raw data with 150 bp pair-end reads.

Whole-genome sequencing using SMRT technology
A total of 10-dpi root tissue of plants was used for SMRT wholegenome sequencing. Underground tissues were collected for genomic DNA preparation with modified CTAB method (Bergman and Quesneville 2007). Using 97 mg DNA, PacBio sequencing libraries were produced following manufacturers protocols as described for the greater than 30 kb-SMRTbell Libraries Needle Shearing (SMRTbell Template Prep Kit 1.0) with Blue Pippin size selections (Sage Science, http://www.sagescience.com/; Last accessed: 22 January 2022), and the SMRTbell libraries were constructed through Pacific Biosciences SMRTbell Template Prep Kit 1.0 (http://www.pacb.com/; Last accessed: 22 January 2022). SMRT sequencing was performed on a PacBio RSII instrument using P6/C4 sequencing chemistry (DNA/Polymerase Binding Kit P6) and 6 h movies. We used a total of 118 SMRT cells and produced 127.3 Gb of raw data with an average subread length of 15 kb (Supplementary Table 1). At the same time, we also used a part of DNA samples for resequencing on the Illumina HiSeq 2500 platform for 150 bp paired end reads, with a sequencing depth of approximately 45Â, for a total of 44.5 Gb. These data are mainly used for the evaluation of genome size and its heterozygosity, post-assembly error correction, and genome quality assessment.

Hi-C library construction and sequencing
For samples used for Hi-C-assisted assembly, leaves fixed in 1% (volume/volume) formaldehyde were used for library construction. Cell lysis, chromatin digestion, proximity ligation treatment, DNA recovery, and subsequent DNA manipulation were performed as previously described (Lieberman-Aiden et al. 2009). The restriction enzyme used in chromatin digestion is Mbol. Finally, the Hi-C library was sequenced on the Illumina HiSeq X 10 platform for 150 bp paired end reads, with a sequencing depth of approximately 150Â, for a total of 152.0 Gb.

De novo genome assembly of JD17
To perform de novo assembly of the JD17, we combined 3 different assemblers, including CANU (Koren et al. 2017 Table 2). The main assembly was performed on whole SMRT sequenced long reads. All assembly softwares were performed with a presumed 1-Gb genome size. If not specified, all programs in our study were run with default parameters. CANU was run with default parameters, and FALCON was run with "length_cutoff ¼ -1" for initial mapping of seed reads for the error-correction phase ( Supplementary Fig. 1). For a better FALCON assembly, we additionally optimized parameters as "DBsplit ¼ -x500, -s400, pa_HPCdaligner ¼ -v -B128 -t16e.70 -l1000 -s1000 -T8 -M24, ovlp_HPCdaligner ¼ -v -B128 -t32 -h60 -e.96 -l500 -s1000 -T8 -M24 and overlap_filtering ¼ -max_diff 60 -max_cov 60 -min_cov 2." The stats of 3 initial assemblies were show in Supplementary Table 2. We subsequently used the CANU assembly as the working set because it was able to generate more accurate and more contiguous genomes compared to FALCON and CANU. Subsequently, the draft assembly was polished twice using Quiver (SMRT Link v 5.0.1.9585) and finally corrected using about 45Â Illumina short reads with Pilon (v1.22) ( Supplementary Fig. 2). Then, these contigs were aligned to the NT library using blastn, and those identified as not belonging to plants were filtered out.

Continuation and connection of contigs
Due to the complementarity among 3 different assembly results from CANU (Koren et al. 2017), FALCON, and HGAP4, we optimized our assembly results using the GPM (Zhang et al. 2016) pipeline to extend and connect contigs for better contiguity. First, GPM loaded WM82 as a reference genome, and CANU assembly as the back-bone contigs. These contigs were ordered and located on chromosome based on WM82 (Glycine_max_v2.0) using blastn. This step produced a draft chromosome assembly, as JD17 v0.1. Second, we loaded FALCON assembly result and aligned with CANU assembly using blastn (Camacho et al. 2009), and got the potential overlapping relationships between CANU and FALCON contigs. Then CANU contigs were extended or connected by FALCON contigs as needed to produce the JD17 v0.2 assembly. Third, we repeated the above steps with HGAP4 assembly and the JD17 v0.3 was generated. Fourth, we loaded contigs from an assembly performed with the longest 70 Gb PacBio reads (extracted from the total 120 Gb sorted raw reads) using CANU. Alignment with the v0.3 version contigs to get the correspondence, then extend and connect contigs from v0.3 version. The optimization contigs is defined as JD17 v0.4. Finally, the contigs of chloroplast and mitochondrial sequences were identified and removed from JD17 v0.4 (Supplementary Table 3). All connection and extension events were validated by aligning subreads against the assembled contigs. Then, the JD17 v0.4 was re-polished using Quiver over twice iterations and corrected using Illumina short reads with Pilon (Walker et al. 2014). The finalized JD17 genome assembly (named as Glycine_max_JD17v1.0) was 995.0 Mb in size, with a contig N50 of 18.0 Mb ( Table 1). The assembly processing details are shown in Supplementary Fig. 3.
To anchor hybrid contigs into chromosome, the Hi-C sequencing data were aligned into contigs using bwa. According to the orders and orientations provided by the alignment, those contigs were clustered into chromosomes by ALLHiC v0.9.8 (Zhang et al. 2019) with recommended params in https://github.com/tangerzhang/ALLHiC/ wiki/ALLHiC:-scaffolding-of-a-simple-diploid-genome (Last accessed: 22 January 2022). According to the ALLHiC groups and assembly results create hic files, manual correction and validation were also performed by drawing contact maps with juicerbox (Durand et al. 2016). The genome assembly was finalized after this correction step (Supplementary Table 3).
The assessment of genomic heterozygosity and size is using the Genomic Character Estimator program (gce v 1.0.0, ftp://ftp. genomics.org.cn/pub/gce; Last accessed: 22 January 2022), and the heterozygous ratio based on kmer individuals is 0.029, and the corrected estimation of genome size is about 1.11 Gb.

Quality assessment of JD17 genome assembly
To assess the quality of the Glycine_max_JD17v1.0 assembly, we used our 65 Gb resequencing data. First, by aligning all reads to the assembly with BWA-MEM in BWA (v 0.7.17) (Li 2014), the mapping rate is over 98.8% and the coverage was over 99.65%, which shows the consistency between the assembly and reads. By using the GATK tools (v4.1.7.0) (McKenna et al. 2010;DePristo et al. 2011) for SNPs calling with JD17 resequencing data, we found 78,033 SNP, of which only 8,000 were homozygous, indicating that the JD17 genome has an accuracy of over 99.999%.
Merqury is also used to assess the quality of our genome assembly (Rhie et al. 2020), results show that the genome assembly error rate is 3.99303 e-05, and the integrity as high as 94.9455%. The completeness of the assembly was estimated by BUSCO with default parameters.

Annotation of TE and ncRNA sequences
To investigate the JD17 genome sequence features, we identified transposable elements (TEs) and other repetitive elements by RepeatMasker (v4.1.0) (Bergman and Quesneville 2007). Miniature inverted transposable elements (MITEs) were collected by MITE-Hunter (Han and Wessler 2010) with all default parameters. In order to get as much reliable long terminal repeat (LTR) retrotransposons information as possible, we used the LTR_retriever (v2.7) (Ou and Jiang 2018) analysis process, which integrates the output of LTR_FINDER (v1.1) (Xu and Wang 2007) and LTRharverst tools in GenomeTools (v1.5.10) (Gremme et al. 2013). Masking sequence with RepeatMasker (version 4.0.8) (http://www.repeatmasker.org/; Last accessed: 22 January 2022) based on MITEs and LTR library that has been identified. The other tandem repeats were identified by constructing a de novo repeat library using Repeatmodeler (version 1.0.11) (http://www.repeatmasker.org/; Last accessed: 22 January 2022). RepeatMasker was run against the genome assembly again, with all above library as the query library.
Noncoding RNAs were predicted by the Infernal program (v1.1.4) using default parameters (Nawrocki and Eddy 2013) and comparing the similarity of secondary structure between the JD17 genome sequence and Rfam (Nawrocki et al. 2015 (Brůna et al. 2020) as ab initial gene predictors, and a customized repeat library for RepeatMasker (Bergman and Quesneville 2007;Saha et al. 2008).

Annotation of protein-coding genes
The RNA-Seq data was de novo assembled using Trinity to obtain the assembled cDNA sequence, and annotated with the In order to obtain more accurate and complete annotation results, EVM (v1.1.1) was called to integrate the gene model prediction results from AUGUSTUS, GENEMARK, Exonerate, and PASA. After all, the PASA software is used to update these annotation results. Gene functions were inferred according to the best match of the alignments to the National Center for Biotechnology Information (NCBI) Swiss-Prot (Boeckmann et al. 2003) protein databases using BLASTP (ncbi blast v2.6.0þ) (Altschul et al. 1997;Camacho et al. 2009) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (Kanehisa et al. 2012) with an E-value threshold of 1E-5. Gene Ontology (GO) (Ashburner et al. 2000) IDs for each gene were obtained from Blast2GO (Conesa and Gotz 2008).

Genome-wide rearrangement and SV detection
To identify large-scale synteny among the 4 soybean lines, we created a genome-wide alignment using the Mauve aligner (February 13, 2015) with the progressiveMauve algorithm (Darling et al. 2010) with default parameters: default seed weights, determination of LCBs (minimum weights ¼ default), and full alignment with iterative refinement (Chakraborty et al. 2021).
The PAV sequences, PAV clusters, and PAV genes between JD17 and the 3 genomes (WM82, ZH13, and W05) were identified using the sliding window method described in the 2018 publication by Sun et al. (2018), with the same slide window size, alignment software, and parameters as theirs.

Identification and structural variation analysis of SNF genes
Identification of SNF gene sequences based on the literature published by Roy et al. in 2020(Roy et al. 2020. These sequences were first aligned to the protein sequences of JD17, WM82, ZH13, and W05 by blastp tool and filtered by coverage and identity greater than 40% to obtain possible nitrogen fixation-related genes. These filtered genes were then subjected to clustering analysis by the OrthoFinder tool (v2.5.4) (Emms and Kelly 2019). The SNF genes in the 4 soybeans were finally identified by determining the gene cluster where the published SNF genes were located. To obtain the expansion and contraction of SNF genes, these singlecopy SNF genes were further used to construct an evolutionary tree by RAxML tool (v8.2.11) (Stamatakis 2014)  Then the SNPs and Indels loci between JD17 and them were combined to assess the protein encoding-affected SNF genes. These genes (and gene sequences from the article) were then aligned using mafft (Nakamura et al. 2018) software, constructing an evolutionary tree based on fasttree (Price et al. 2010) tools, and manually checking the evolutionary tree manually to determine that these genes whose structure was affected were homologous (protein-coding genes on the same chromosome; on the same branch of the evolutionary tree, and then expanded if genes of the species were not present on that branch).

Results and discussion
High-quality genome assembly and annotation Genome assembly using PacBio SMRT data and Hi-C data (detail in Materials and Methods) resulted in a JD17 genome assembly with 411 contigs anchored to 20 pseudomolecules, with a total size of 965.8 Mb and contigs N50 up to 18.0 Mb (Table 1), accounting for 97.1% of the total contigs (Fig. 1) Table 4), indicating that the completeness of JD17 was higher than that of the WM82, ZH13, and W05 genomes (Shen et al. 2018;Xie et al. 2019).
In addition to the nuclear genome, we also constructed the full-length genomes of the soybean chloroplast (152.2 kb) and the complete genome of Bradyrhizobium japonicum USDA 110 (9.1 Mb), which are consistent with previously reported results ( Supplementary Fig. 4) (Kaneko et al. 2002;Saski et al. 2005).
We identified 52,840 protein-coding genes and 74,054 fulllength transcripts in the JD17 genome ( Table 2). The average length of mRNA, 5 0 UTR, CDS, and 3 0 UTR were 4,465, 302, 1,183, and 487 bp, respectively. Our gene annotation is relatively in agreement with the other 3 genomes. After searching with existing databases and conserved structural domains for functional annotation, a total of 72,529 (97.94%) transcripts have known domains or functions, which suggested that high-confidence annotation of JD17 genome was performed.
Among the genomic sequences of JD17, the highest proportion of repetitive sequences is the LTR retrotransposon, and a total of 631,056 LTR elements were identified, totaling 397.58 Mb, which is about 39.95% of the whole genome. Further by assessing the distribution and insertion time of LTRs, we found that the LTR enrichment was all in close proximity to the centromeric region, fewer in the gene enrichment region, and that the amplification of LTR retrotransposons in soybean occurred mainly within the past 1.5 million years and were all relatively young ( Supplementary Fig. 5). When comparing the activity of LTR elements ( Supplementary Fig. 6), Copia elements appear to be increasingly active and persistent over the past 3 million years. Similarly, Gypsy elements were relatively active during this period. In fact, between 1.4 and 3 million years, their activity levels do not differ much, but since 1.4 million years, the Gypsy element has become less active compared to the Copia element, while the unknown element has also become active over the last 3 million years, but at a much lower level compared to both the Gypsy and Copia elements. Based on the timing of the recent occurrence of WGD in soybean ($13 MYA) (Schmutz et al. 2010), it appears that these LTR insertions and expansions are different from gene duplications, but occur mainly after WGD.
To compare the differences in the composition and distribution of repetitive sequences between them, we annotated the repetitive sequences of other 3 genomes using the same approach. The results indicates that the composition of their TEs is essentially the same, except that WM82 has slightly less LTR content . Meanwhile, we found some blank regions in the genomes of W82 and W05 in the distribution of TEs, both of which are N sequences in the genome sequence. To determine whether these blank regions are the centromeric regions, we identified the centromeric regions of the JD17 genome by using the soybean-specific centromeric satellite repeats sequences (CentGm-1 and CentGm-2) (Gill et al. 2009). The evidence suggests that the location of JD17 centromeric regions is consistent with that of the blank regions in WM82 and W05 ( Fig. 2 and Supplementary Fig. 10). In general, the assemblies of JD17 and ZH13 are more complete in the centromeric regions, and the TEs of W05 has slightly richer components than the other strains.

Genomic rearrangements
To explore how many and how large-scale discordant regions of genetic mapping exist between the JD17 reference genome and the 3 published genomes, we performed a genome-wide comparative analysis. When the pseudochromosomes of JD17 were aligned to the other pseudochromosomes of WM82, ZH13, and W05, a total of 3,727 syntenic blocks were identified  Fig. 1. Overview of the JD17 reference genome. Tracks from outer to inner circles indicate: the chromosome of the genome; the gene density map; the repeat sequence density map; density distribution of SNPs between JD17 and WM82; density distribution of InDel between JD17 and WM82. PAV length distribution between JD17 and WM82; GC content of JD17.  Fig. 11), of which 1,368 were common syntenic blocks among the 4 strains (Supplementary Table 6). Approximately 88.51% of the JD17 genome sequence shares the syntenic blocks with 88.12% of WM82, 86.10% of ZH13, and 81.88% of W05. We found a total of 251 syntenic blocks that exist between Wm82, ZH13, and W05, but not in JD17, with a size between 3.87-5.27 Mb. Similarly, there are 113 such syntenic blocks between JD17, WM82, and ZH13, but not in W05, and the size is between 2.10 and 3.33 Mb. The size of the specific sequence in each of the 4 strains (i.e. the sequence that does not match any other strains) is between 82.58 and 119.37 Mb, of which the wildtype W05 has the most specific sequences and the W82 has the least. This may be due to the loss of sequences in cultivars during the long domestication process of artificial selection. The chromosomal differences between JD17 and the other 3 strains were further observed by comparative genome analysis. We observe that compared to WM82 and ZH13, JD17 has slightly more Copygains, 4.0 Mb and 3.8 Mb respectively, but W05 has more Copygains compared to JD17, with a combined total of about 7.1 Mb (Supplementary Table 7). We also found 56 inversions and 1,051 translocations events for the rearrangement events that occurred between JD17 and WM82 (Supplementary  Table 7). Similarly, 86 and 1,273 inversions and translocations occurred between JD17 and ZH13, and 93 and 2,996 inversions and translocations occurred between JD17 and W05, respectively. We observed 5 large inversions (>1 Mb) on Chr04, Chr05, Chr06, Chr07, and Chr19 and 2 large translocation events (>200 kb) on Chr02 and Chr07 ( Supplementary Figs. 12-14 and Table 8), respectively, that were present in JD17 compared to all other 3 species. Examination of the breakpoint loci of these variants by comparing the sequenced subreads to the genome showed that the assembly of JD17 was correct . Although the breakpoint at JD17 (Chr06:32,732,757) is an N sequence (i.e. gaps) and the subreads at Chr07:27,723,877 are poorly supported, the corresponding breakpoint positions in the WM82 genome are both gaps. This suggests that these structural differences such as inversions and translocations are more likely to be true genetic variation in JD17 relative to the other 3 strains.

Identified PAVs by comparison with WM82, ZH13, and W05
Comparison of the genomes of JD17 and WM82 revealed 20,984 JD17-specific fragments (total length: 13.13 Mb) and 22,635 WM82-specific genomic fragments (total length: 30.81 Mb) ( Supplementary Table 9). Similarly, comparing the genomes of JD17 and ZH13 revealed 22,818 JD17-specific genomic fragments (13.68 Mb) and 23,456 ZH13-specific genomic fragments (33.27 Mb). Comparison of the JD17 and W05 genomes also revealed 36,658 JD17-specific genomic fragments (21.86 Mb) and 37,443 W05-specific genomic fragments (46.92 Mb). We further merged PAV sequences within 100 kb from the physical coordinates to identify PAV clusters (Sun et al. 2018) (Supplementary  Table 10). The majority of these PAV sequences (99.8%) were shorter than 5 kb. Compared with Wm82, 23 PAV sequences longer than 5 kb in JD17 were identified with an average length of 7,195 bp, standard deviation is 1,789.74. These PAV sequences were unevenly distributed in the genome (Fig. 1), with some located in clusters. The largest of these PAV cluster fragments is a 1.6-Mb JD17-specific fragment containing 5 predicted genes located between 14.8 and 16.4 Mb on chromosome 1. However, there is a PAV cluster of less than 0.7 Mb that is very rich in genes, up to 73 coding genes, located between 45.3 and 46.0 Mb on chromosome 8. After GO enrichment analysis, they were found to be associated with NADH dehydrogenase, aspartate-type endopeptidase activity, histidine, protein, and sugar de metabolism ( Supplementary Fig. 22).
Based on the criterion that a gene can be designated as a PAV gene if its coding sequence is covered by !75% of the PAV sequence (Sun et al. 2018), we compared JD17 with WM82, respectively, and we identified 100 JD17-specific and 75 WM82-specific PAV genes. Similarly, by comparing JD17 with ZH13, we identified a total of 80 JD17-specific and 156 ZH13-specific PAV genes, and between JD17 and W05, 101 JD17-specific and 67 W05-specific PAV genes were identified. It can be seen that there are large differences in genomic sequences between JD17 and WM82, ZH13, and W05.
By comparing these JD17-specific genes, it was found that only 20 JD17-specific genes were identified simultaneously ( Supplementary Fig. 23), and most of the JD17-specific genes could still be found in different strains. The functional enrichment analysis of these 20 genes showed that these genes were related to signal transduction, response to stimulus, protein synthesis, and nitrogen metabolism. The results of functional annotation also showed that these genes were associated with powdery mildew and TMV resistance protein synthesis ( Supplementary Fig. 24).

Identified SNPs and Indels by comparison with WM82, ZH13, and W05
To find SNPs and Indel between JD17 and the other 3 strains, we combined MUMmer and SyRI tools. The results showed that we identified 1,695,741 (2,675,463 in ZH13 and 3,664,629 in W05) SNP,213,509 (324,109 in ZH13 and 391,977 in W05) insertions,and 233,180 (391,977 in ZH13 and 408,512 in W05) deletions variations in JD17 genome by comparative genome-wide analysis. Combined with the results of structural variation and PAVs, both showed the smallest difference between JD17 and WM82 and the largest difference with wild-type soybean W05. This may be due to the fact that JD17 has the lineage of WM82, but it also supported that wild soybean can provide richer genetic diversity.
Annotation of these variants showed that they mainly affect intergenic regions, but there are still 4,947 variants between JD17 and WM82 that have a very large impact on 1,785 protein-coding genes. Enrichment analysis of these affected genes revealed that the primary function of these genes is associated with sulfate transmembrane transport, regulation of organ growth, regulation of developmental growth, carbohydrate binding ( Supplementary  Fig. 25).
Similarly, a large effect of 7,363 variants on 2,567 proteincoding genes was observed between JD17 and ZH13, and a large effect of 6,674 variants on 3,611 protein-coding genes was observed between JD17 and W05.
Identification of SNF genes and differences among JD17, WM82, ZH13, and W05 To identify genes associated with SNF in the 4 genomes, we blast identified genes from published articles to these genomes. We identified 331, 360, 285, and 346 genes potentially related to nitrogen fixation in JD17, WM82, ZH13, and W05 genomes, respectively (Supplementary Table 11). Of these, 88.8% of the SNF genes were identified in all 4 soybeans, and in JD17, 96% of the genes were identified, while fewer genes were identified in ZH13 ( Supplementary   Fig. 26). Subsequently, by comparative analysis of SNF genes from these 4 species, there are 2 expansion genes (LjCLE-RS2 and MtNAC969), 2 contraction genes (LjCLC1 and MtCP6), and 5 lost genes (GmENOD93, GmRj2_GmRFG1, LjENOD40-1/ENOD40-2, and LjHIP, MtCAS31) (see Supplementary Table 12 for details) in JD17 relative to their common ancestor (Supplementary Table 12). By examining the structural variation of 331 genes potentially associated with nitrogen fixation in JD17, we finally found 18 genes with structural variation between WM82, ZH13, or W05 and affecting protein coding (Supplementary Table 13). Further by constructing an evolutionary tree to manually confirm the effect of these genes undergoing structural variation on them, we finally observed that 9 of them were not in the same branch of the evolutionary tree as the published sequence of WM82. For example, the LjIGN1 genes (Glyma.10G291900 and Glyma.20G241200), which encodes the synthetic Ankyrin-Repeat Membrane Protein, which is important for symbiosis and nodulation (Kumagai et al. 2007). By evolutionary analysis of their homologous genes, we found that Glyma.10G291900 and Glyma.20G241200 are located in 2 branches, and the branch where Glyma.10G291900 is located has corresponding genes in all 4 species with essentially the same protein code ( Supplementary Fig. 27). However, the branch of Glyma.20G241200 showed differences. By multiple sequence comparison, JD020G0232500 in fact encodes 192 amino acids that are very different from several other proteins ( Supplementary Fig. 28).
These sequence differences may affect its expression specificity, recognition, and interaction, which may in turn affect the recognition, infection, and symbiotic nitrogen of rhizobia. This may be one of the reasons for the difference in nodulation phenotype between JD17 and WM82, which requires further research to verify.

Conclusions
Here, we constructed a high-quality de novo assembly of the soybean cultivar Jidou 17 (JD17) with contigs N50 approaching 18 Mb. Combined use of homologous, de novo, and transcriptome evidence, 52,840 gene models, and 74,054 full-length transcriptomes were predicted. The total number of repeats was about 580.22 Mb, accounting for 58.30% of the whole genome, of which LTRs were the most abundant, accounting for 39.95%. The centromeric regions in the JD17 reference genome were identified. Analysis of LTR insertion time showed that LTR insertion was more and more active, and the 2 main LTR elements, Copia and Gypsy, showed differences. LTR insertions and expansions are different from gene duplications, but occur mainly after WGD. Through whole genome comparison between 4 genomes, identification of a large number of JD17 specific sequences, variants, and genes, including 5 large insertions, 2 large translocations, and 20 PAV genes. These genes were related to signal transduction, response to stimulus, protein synthesis, and nitrogen metabolism. At the same time, the SNPS and INDELs between the JD17 relative to other 3 genome are identified, respectively. The protein-coding structures of 1,785 genes were found to be affected by the variations. Finally, we identified the SNF gene in JD17 and assessed their genetic differences, the results found that protein-coding structures of 9 SNF genes were significantly affected. We hope that our dataset will provide a valuable resource for comparative genomics and functional genomics of soybean.

Data availability
The sequencing data (PacBio Whole Genome Sequencing data for assembly, resequencing data, Hi-C data, and Mixed-Sample RNAseq data for Annotation) used in this study have been deposited into National Center for Biotechnology Information under BioProject Number PRJNA412346 with accession number SRR12416523-SRR12416632, SRR12416444, SRR12416750 and SRR9643849-SRR9643851. The final assembly had been deposited at GenBank JACKXA000000000. Supplemental material is available at figshare: https://doi.org/10.25387/g3.17065499.