Building pan-genome infrastructures for crop plants and their use in association genetics

Abstract Pan-genomic studies aim at representing the entire sequence diversity within a species to provide useful resources for evolutionary studies, functional genomics and breeding of cultivated plants. Cost reductions in high-throughput sequencing and advances in sequence assembly algorithms have made it possible to create multiple reference genomes along with a catalogue of all forms of genetic variations in plant species with large and complex or polyploid genomes. In this review, we summarize the current approaches to building pan-genomes as an in silico representation of plant sequence diversity and outline relevant methods for their effective utilization in linking structural with phenotypic variation. We propose as future research avenues (i) transcriptomic and epigenomic studies across multiple reference genomes and (ii) the development of user-friendly and feature-rich pan-genome browsers.


The pan-genome concept
Crop species exhibit extensive phenotypic variation in agronomic characters, such as phenology, yield, metabolite biosynthesis and response to biotic and abiotic stresses. Effective utilization of genetic variation is key to crop improvement to meet future challenges of climate change and evolving pathogens. 1-3 DNA sequence polymorphisms are commonly classified into single-nucleotide polymorphisms (SNPs), short insertions and deletions (indels) and larger (>50 bp) structural variations (SVs), which comprise presence/ absence variants (PAVs) and copy number variants (CNVs) as well as balanced rearrangements, namely inversions and inter/intrachromosomal translocations. 4,5 Capturing the full spectrum of natural SV in a species is challenging. In the past decade, reference genome sequence assemblies and catalogues of sequence diversity were generated for many crop species, among them the major cereal [6][7][8][9] and legume crops. 10,11 These projects assembled genome sequences for a single genotype and detected SNPs and indels from highthroughput sequencing data mapped to the reference genome sequence. Although a single reference genome sequence is the backbone of a genomic infrastructure, it cannot represent the full complement of sequence diversity of a species. Especially challenging are large-structural variants that are difficult to capture by shortread sequencing and reference-based analysis. Nevertheless, several studies have shown that this class of variants can play a vital role in determining agronomic traits, [12][13][14][15] local adaptation and speciation. [16][17][18][19][20] The concept of a 'pan-genome' refers to the universe of genome sequences existing in a species. Representing each and every sequence variant segregating in the pan-genome is a distant goal. Firstgeneration pan-genome studies commonly aimed at discovering as many structural variants as possible with a diverse, but necessarily limited set of genotypes. Pan-genomic studies have been conducted in various model and crop plants including Arabidopsis thaliana, 21, 22 Brachypodium distachyon, 23 Brassica oleracea, 24 tomato, 25 rice, [26][27][28] soybean, 29 rapeseed, 30 wheat 31 and barley. 32 To date, the pan-genome concept has been discussed extensively regarding definitions, approaches, computational challenges and potential applications. [33][34][35][36][37][38][39] Moreover, the development of computational tools for pan-genome representations and visualizations have already been discussed in detail elsewhere. 35,[40][41][42] Here, we review strategies for (i) building pan-genomes from reference-quality genome sequence assemblies, (ii) genotyping SVs discovered in large diversity panels using short-read resequencing and (iii) linking SVs to phenotypes in genome-wide association studies (GWAS). We propose transcriptomic and epigenomic studies focusing on accessions with highquality genome assemblies as well as the development of pan-genome visualization solutions (e.g. web browser) as future research avenue.

Selecting germplasm for a sequence assembly
The first step in setting up a pan-genome infrastructure is the selection of a diverse set of representative genotypes for sequence assembly (Fig. 1).
The goal is to capture as many genetic variants as possible with a limited panel of genotypes. Genebanks, i.e. national or international germplasm repositories, host hundreds to thousands of accessions of all major crop species, but minor crops might be not as well represented in ex situ collections (http://www.fao.org/3/i1500e/i1500e00.htm). Genome-wide genotypic data for entire genebank collections or representative subsets are crucial to select diverse accessions covering all major germplasm groups in a species. Such genebank genomics studies have been reported recently for barley, 43 wheat, 44 maize 45 and rice. 46 Genotyping-by-sequencing (GBS) 47 was used to fingerprint more than 20,000 wild and domesticated barleys 43 from the German ex situ genebank. Researchers from the International Maize and Wheat Improvement Centre (CIMMYT) report GBS profiles for 44,624 wheat lines from the breeding programs 44,48 as well as DArTseq data for 80,000 wheat accessions from the genebanks of CIMMYT and the International Centre for Agricultural Research in the Dry Areas. The genomes of more than 3000 cultivated rice accessions from the International Rice Research Institute genebank were sequenced to generate a digital genebank and a pan-genome. 46 There are various approaches for selecting coresets. 49 For example, the tool Corehunter 50 implements different algorithms operating on genetic distance matrices to maximize diversity, representativeness and/or allelic richness of core sets. Custom selections may also be made from clustering the diversity space as represented by principle component analysis 51 or model-based ancestry estimation. 52 Pan-genome panels may include domesticated accessions as Figure 1. Pan-genome selection and construction. Representative genotypes are chosen from genetically diverse populations based on genome-wide genotypic data for ex situ germplasm collections. Chromosome-scale genome assemblies are built for a small, but representative core set. The pan-genome compartments such as core (i.e. genomic sequences present in all individual of a species) and variable (i.e. sequences found in some/few individuals) are identified from the de novo assemblies.
well as accessions of conspecific wild progenitors or ancestors of polyploid species, e.g. maize and teosinte or wheat and wild emmer and Aegilops tauschii. Crop-wild relatives in the secondary and tertiary gene pools 53 may be included to serve as out-groups, e.g. to determine ancestral states for SVs (Fig. 2), or because of their relevance in introgression breeding. In addition to focusing on maximizing representativeness of global diversity in a crop, a pan-genome project may also select genotypes that have played an important role in breeding and genetics such as founder genotypes of breeding programs, parents of experimental populations 54 or genotypes amenable to genetic transformation 55,56 may be included to maximize the benefits for the research and breeding community. Vice versa, the accessions included in pan-genomic studies are poised to become reference genotypes in future genetic and functional studies by virtue of the genomic resources associated with them.
3. Moving from short-read resequencing to long-read reference genomes

Alignment vs. assembly
High-throughput short-read sequencing on the Illumina platform has been extensively used for plant genome assembly, population genomics and GWAS studies, but it has important drawbacks. The intergenic space in plant genomes is mainly derived from transposable elements (TEs). 57 Since Illumina reads are only up to a few hundred basepairs in length, they cannot traverse most repeats, leading to fragmented and incomplete genome assemblies. Similarly, applying short-read sequencing data to detect SVs using read depth or paired-end information ('split reads') is prone to errors in very complex regions, such as plant resistance gene loci. Alignment of long (>10 kb) reads to a reference genome can overcome some of these challenges. Still, even with long reads, insertions exceeding the read length, tandem and segmental duplications, 58,59 as well as balanced events such as large inversions (>1 Mb), 60-62 are challenging to detect from alignments to a single reference genome.

Assembly methods
De novo assembly of multiple high-quality reference genome sequences and their comparison by pair-wise sequence alignment is arguably the most powerful and accurate approach to detect all types of sequence variant at base-level resolution. 62 The progress in genome sequencing and assembly methods in the past two decades has been tremendous. The first approaches at whole-genome assembly, namely hierarchical sequencing of bacterial artificial chromosomes on the Sanger platform could only be implemented by international consortia even for small-sized genomes like Arabidopsis 63 or rice. 9 The development of high-throughput shortread sequencing first on the 454, then on the Illumina platforms, 64 enabled the generation of draft genomes for many plant sequences, including most crops. 65,66 But still assembly contiguous genome sequences from short-reads was a complicated and resourceintensive task 67,68 and did not scale well to tens to hundreds of genomes. Multiple short-read libraries with various insert sizes were required for scaffolding contig-level assemblies that were often too fragmented to be useful on their own. Complementary approaches such as optical mapping, 69 genetic mapping 70 and chromosome conformation capture sequencing (Hi-C) 71,72 were required to increase sequence contiguity from kilobase-sized contigs to full chromosomes. Long-read sequencing on the PacBio 73 and Oxford Nanopore 74 platforms have conceptually simplified this approach as assembly of long (> 10 kb) reads result in megabase-sized scaffolds even in complex genomes. 75 Yet, the high error rate of long-read sequencing (10-15%) requires substantial computational resource for correction and overlap determination-to a degree that assembly of polyploid plant genome could take months. 76 The need for vast computational resource to assemble large (>1 Gb) plant genomes has recently been obviated by the development of accurate long-sequencing on the PacBio platform. 77 Repeated read-out of the same DNA fragment by circular consensus sequencing yield reads in the 15-25 kb range with error rate below 1%. 76 State-of-the art algorithms (HiCanu 78 and hifiasm 79 ) can now assemble human-sized genomes to megabasescale contiguity within hours on standard compute servers.

Assembly approaches for pan-genomics
We predict that accurate long-read sequencing is a breakthrough technology that will greatly improve our ability to assemble large and complex, heterozygous or polyploid genomes and to do this in timeframe that enabling scaling to pan-genomes. Highly contiguous and accurate genome assemblies will provide access to regions previously inaccessible to sequence analysis such as centromeres 80 or loci Each hexagon order represents individual genome from distinct accessions. (d) A combination of assemblies and resequencing data underpins genetic analyses such as GWAS and population genetic inquiries into pan-genome complexity. Accessory functional data on gene expression and gene profiles will decorate pan-genomes to assist hypothesis generation. All information is provided to research community in a user-friendly web interface (browser).
involved in response to pathogens. 81,82 However, it should also be kept in mind that any genome assembly can contain errors potentially giving rise to spurious SV calls. 62,83 Complementary evidence provided by independent mapping approaches, such as optical maps and Hi-C, are needed to validate and correct assemblies to increase confidence in structural variant calls, particularly for reciprocal events such as inversions and translocations.
At the time of writing, it is an ambitious, but not unrealistic research goal to generate tens of high-quality reference genomes for large-genome plant species and hundreds of reference genomes for smaller species within the timeframe of 1 year. In plants, wholegenome assembly-based pan-genomes have been reported for rice (number of accessions, n ¼ 16), 28,46,84 barley (n ¼ 20), 32 wheat (n ¼ 10), 31 maize [n ¼ 26; NAM Genomes Project (https://namgenomes.org)], Brachypodium distachyon (n ¼ 54), 23 Glycine soja (n ¼ 7), 29 Brassica napus (n ¼ 8) 30 and soybean (n ¼ 26). 85 Computational method development has focused on fast algorithms for aligning long-reads to reference genomes and reference genomes to each other as well as to call variants from such alignments. 38 Likewise, genome assembly software has kept pace with methodological advances in long-read sequencing. 78,79 Nevertheless, sequence assembly of complex plant genomes remains challenging: algorithms struggle with resolving multiple haplotypes in heterozygous or autopolyploid genomes. 79,86 Assemblies might result in fragmented sequences, produce chimeric contigs joining different haplotypes or ignore alternative haplotypes. Even when haploid genome assemblies can be constructed from rare inbred or haploid genotypes in otherwise outcrossing or polypoid species, 87 detecting and phasing heterozygous SVs remains challenging in these species.

Pan-genome graphs
Once genome sequence assemblies of a diversity panel have been obtained, a common first analysis is to compartmentalize the assembled sequences into the core and the variable genome (Fig. 1). The variable genome comprises sequences that are present in some genotypes, but absent from others. The core genome is present in all individuals of a species and may comprise sequence whose loss is incompatible with proper organismal functioning such as housekeeping genes. 88 In bacteria, where the pan-genome concept was developed first, 88 the core and variable compartments refer only to gene sequences. As bacterial genomes are small and mainly composed of coding sequence, this approach is correct and straightforward to implement because methods to cluster genes into orthologous groups are well established. In plant and animals, however, a purely genebased analysis would ignore a large proportion of diversity present in intergenic sequences. As a consequence of the frequent movement of repetitive elements, 89 much of the variable component of a plant pangenome is intergenic and derived from TEs. Since orthologous relationships are hard to establish between copies of TEs in different genotypes, recording all sequence alignments between repetitive elements would result in a data structure of inextricable complexity.
Toolkits for the construction, analysis and visualization of graph-based pan-genomes such as vg toolkit, 42 minigraph 90 the Practical Haplotype Graph 91 are under active development. 40 As of now, further evaluation and development of heuristics for pruning complex regions is needed before these approaches can be deployed on collections of tens to hundreds of plant genome assemblies in the same standardized and streamlined way as toolkits for SNP genotyping operate on short-read data. 92,93 In the meantime, different ad hoc approaches have been devised to focus on low-copy, but not necessarily genic, regions. In rapeseed, a pan-genome sequence was constructed by adding the PAV sequences from multiple individual genomes to one single reference genome. 30 In soybean, a graphbased pan-genome construction was performed with non-redundant SVs against a reference genome. 85

The single-copy pan-genome
Recently in barley, a so-called 'single-copy pan-genome' was built by clustering single-copy regions extracted from multiple chromosome-scale sequence assemblies. This work-around enabled quantitative estimates of pan-genome complexity, such as saturation analysis, and provided a reference to derive bi-allelic SV markers for use in association genetics. However, approaches targeting single-copy regions may prove ineffective in polyploids where even highly conserved house-keeping genes occur in multiple copies in the subgenomes. Moreover, as genic regions are under stronger selective pressure and have reduced sequence diversity, genebased analyses may underestimate pan-genome complexity. For instance, the gene-based pan-genome of soybean reached a plateau with 25 representative accessions, 85 but this picture could change entire genomes are considered.

Need for genotyping SV in larger germplasm panels
Despite continuous methodological advances and cost reductions in the past decade, sequence assembly is still substantially more expensive than resequencing. In large-genome plants species, the size of germplasm panels that can be subjected to de novo sequence assembly may not be large enough for GWAS or population genomic analysis. One possible approach for including structural variants into genetic analysis is the use of linked SNPs as proxies. But, several studies have shown that the rapid decay of linkage disequilibrium can result in many SVs that are not tagged by near-by SNPs. 15,94 A further conceptual drawback is that even if linked SNPs can pinpoint loci in association scans, causal variants residing in SVs whose sequence is absent from the reference genome would be inaccessible.

Graph-based methods
Low-coverage whole-genome shotgun sequencing can scale to panels comprising thousands of accessions. Thus, it can complement catalogues of SVs seeded with genome sequence assemblies to discover new, or genotype known events. There are several approaches for genotyping SVs (Fig. 3), which are discovered in a smaller discovery panel, in short-read data for more individuals. One of them is to build variations graphs from SVs discovered in the reference panel (Fig. 3a) and aligning short-reads to the graph. 42,[95][96][97] Graph-based SV genotyping requires high read coverage (10-30X) to achieve good accuracy. 42 The advantages of high read depth need to be weighed against larger panel size affording greater statistical power. An alternative approach is to extract defined short sequences (k-mers) that are diagnostic for the presence or absence of SV and whose presence can be confidently ascertained in short (< 300 bp) read data. For instance, multiple short k-mers with lengths typically in the range of 30-100 bp can be extracted from SVs and queried in short-read resequencing data. Multiple k-mers might be combined to increase specificity, mitigate the effects of missing data in lowcoverage data and differentiate between different haplotypes sharing the same SV. Choosing k-mers from single-or low-copy regions is needed to avoid unspecific matches (Fig. 3b). Single-copy regions do not only comprise genes, but also non-coding regulatory regions and unique TE insertion sites. 98 Thus, they can serve as anchor points for larger haplotypes even in repetitive regions. Presence/absence tables of the diagnostic k-mers act as biallelic marker matrices for use in genetic mapping applications, i.e. GWAS or quantitative trait locus (QTL) mapping in biparental populations. As there are fewer SVs than SNPs, commonly used GWAS methods developed for SNP genotyping or sequencing studies (such as GEMMA 99 or GAPIT 100 ) are readily applicable. As a proof-of-principle, Jayakodi et al. 32 queried single-copy k-mers from structural variants detected in 20 barley assemblies in GBS and WGS data of diversity panels and used a k-mer abundance matrix in GWAS scans for morphological characters with a simple genetic architecture. Song et al. 30 used GWAS with PAVderived markers to identify SVs associated with silique length, seed weight and flowering time in rapeseed.

Reference-free methods
A conceptually similar k-mer-based approach is reference-free association mapping with k-mer counts determined only from short read data without any sequence assemblies (Fig. 3c). Instead of diagnostic k-mers ascertained from a discovery panel of reference genomes, all k-mers occurring in a collection of short reads are catalogued and their presence/absence in individual genotypes is tabulated. As the number of distinct k-mers is on the order of billions in large plant genomes, a pre-selection of informative markers is needed for GWAS scans that test for significant marker-trait associations with linear models. Two approaches for k-mer-based GWAS in plants have been described. AgRenSeq 101 combines resistance (R) gene enrichment sequencing with fast k-mer counting and GWAS scans using general linear models accounting for population structure. Due to the pre-selection of resistance orthologues, AgRenSeq is geared toward the discovery of R genes associated with specific diseases. The kmerGWAS 102 pipeline first quantifies k-mers in either whole genome shotgun or reduced representation sequencing data and then selects a prioritized set of k-mers based on a simple and fast statistical test. This smaller set of markers is used in a linear mixed model GWAS accounting for kinship. Both AgRenseq and kmerGWAS do not require a reference genome, but can benefit from it by aligning associated k-mers to it to determine chromosomal locations of GWAS peaks. In the absence of a reference genome or a sequence assembly representing the haplotype of interest, de novo assembly of reads containing k-mers associated with phenotypes may result in complete genes. However, because of the small size of the assembled contigs in the range of 1-10 kb, genomic contextualization is lacking, Figure 3. Pan-genome representation and GWAS with SV. (a) A pan-genome graph is constructed from the alignment of chromosome-scale sequence assemblies. This graph represents all types of genetic variants. Sections of the genome are shown as coloured hexagons. Each colour represent one genotypes. SV are represented by different paths through the graph. Tools for constructing and working with pan-genome graphs under active development. Two alternative approaches to capture pan-genomic information in genetic analyses are currently being used. (b) SVs between these genomes are detected from alignments against a common reference genome. Single-copy regions are extracted from the assemblies (mauve colour) and overlapped with SV (orange colour). Singlecopy k-mers residing in SVs are extracted and their abundance is ascertained in short-read data from a diversity panel to genotype the underlying SV. (c) Reference-free approaches select k-mers directly from short-read data of a diversity panel without the need of genome assemblies. Matrices of k-mer counts from either single-copy or reference-free approaches are used as markers in GWAS.
which could complicate the differentiation between linked and causal variants, in particular, if they reside in intergenic regions for which low-copy informative k-mer may be lacking.
As sequence assemblies for more genotypes become available, the pan-genome saturates, that is, the available reference genomes capture most haplotypes segregating at a certain minimum frequency (e.g. 1%) in the population. Then, both reference-agnostic k-mer GWAS followed by aligning peak markers to multiple sequence assemblies and GWAS with diagnostic k-mers tagging pre-defined haplotypes would conceptually converge. Future work should focus on defining best practices for compiling discovery panels (i.e. highquality reference genomes), choosing sequencing depth and selecting the most appropriate analysis strategies.
6. Beyond the pan-genome

Pan-transcriptomes
SVs can influence gene expression in various ways, for instance by disrupting gene structures, by altering gene copy number or by changing the composition or positioning of cis-regulatory sequences. 59,85,103,104 In addition to changing DNA sequence, SV could affect gene expression by altering epigenomic marks. Unravelling the functional consequences of a given SV, e.g. one associated with an agronomic phenotype, can be challenging. A notable example is a 13 Mb inversion (Inv4m) on maize chromosome 4 that is associated with early flowering. 105 Expression analysis in more than 430 RNA samples from near-isogenic lines did not reveal one single variant as a convincing causal candidate. Precise perturbations by gene editing or even flipping the inverted haplotype back to the ancestral configuration are possible, 106 but technically demanding, strategies toward understanding how this inversion altered flowering time. Gene expression atlases across the development of a single genotype have been developed in many plant species 107,108 and are recognized as valuable community resources that inform about when, where and how strongly a gene is expressed.

Pan-epigenomes
In the same way, we envision that profiling gene expression and epigenomic marks across a set of genotypes for which chromosomescale reference genome sequences have been assembled will yield pan-transcriptome and pan-epigenome atlases as permanent community resources. Large-scale expression profiling and population-scale epigenomic studies have been done before, but in the absence of multiple sequence assemblies, data were mapped to a single reference. By integrative analysis of matching genomic, transcriptomic and epigenomic data, it will be possible to analyse the co-location of structural variants and epigenomic variants and gene expression differences between accessions. Such data can help prioritize variants in GWAS studies and guide the development of hypothesis for approaches targeting individual variants (Fig. 2). Recent reports have reported first results in these directions: in tomato, almost, half of the SVs detected in a pan-genome constructed from 14 sequence assemblies overlap with genes and/or flanking regulatory sequences and many of them showed subtle, yet significant changes in gene expression. 59 In soybean, more than 1,000 SVs were associated with expression changes, notably a candidate gene for iron uptake was identified with RNA-seq evidences. 85 Yang et al. reported 207 cis expression QTLs linked to SVs. Among these, 70 were found to form chromatin loops coding genes in Chromatin Interaction Analysis by Paired-End Tag Sequencing. 103

Browsers
As methods for sequence assembly and comparative analyses improve, previously inaccessible genomic variants become amenable to genetic study. An outstanding challenge is to make new and more complex data structures such as non-linear graph-based pan-genomes accessible to researchers and breeders who are inexperienced in using command-line tools. An integrated pan-genome browser capable of representing SNPs and large SVs in multi-reference coordinate system, together with their annotations, accessory transcriptomic and epigenomics datasets, as well as links to germplasm repositories would serve as a one-stop shop for genome analysis. However, before this vision can be realized, many obstacles need to be overcome. Among them are the construction of and sequence alignment to pan-genome graphs (e.g. by using vg 42 or minigraph 90 ) as well as merging and consolidating gene annotations across a large and potentially growing number of sequence assemblies. [109][110][111] As a first step in this direction, we propose the implementation of web-based tools to query and analyse multiple chromosome-scale reference genomes in a gene-centric manner. The framework needs to include query forms to retrieve allelic gene sequences from multiple reference genomes, inspect multiple-sequence alignments of alleles of genes or larger haplotypes and query the presence of alleles or haplotypes in a wider set of germplasm.

Concluding remarks and future perspectives
Recent pan-genomic studies have revealed exciting insights into crop domestication and the genetic basis of agronomic traits. We expect, while the analysis and visualization methods mature, pan-genomics will establish as indispensable component in the genomics toolbox of plant geneticists and breeders. Since workflows for sequence assemblies and association genetics are in place, future studies will extend analysis and visualization methods in population genetics, gene expression and epigenetics to the scale of pan-genomes. We anticipate that pan-genomes will become an essential component in studying the diversity of crops and their wild relatives and in developing efficient concepts for their usage in pre-breeding. Digital genebanks based on sequence-based genotyping are feasible right now. 13,43 The long-term goals of having genome assemblies for all genebank accessions 112 is still a distant goal, which, however, has just come a bit closer with the recent breakthroughs in assembly methodology.