Integrating a newly developed BAC-based physical mapping resource for Lolium perenne with a genome-wide association study across a L. perenne European ecotype collection identifies genomic contexts associated with agriculturally important traits

Abstract Background and Aims Lolium perenne (perennial ryegrass) is the most widely cultivated forage and amenity grass species in temperate areas worldwide and there is a need to understand the genetic architectures of key agricultural traits and crop characteristics that deliver wider environmental services. Our aim was to identify genomic regions associated with agriculturally important traits by integrating a bacterial artificial chromosome (BAC)-based physical map with a genome-wide association study (GWAS). Methods BAC-based physical maps for L. perenne were constructed from ~212 000 high-information-content fingerprints using Fingerprint Contig and Linear Topology Contig software. BAC clones were associated with both BAC-end sequences and a partial minimum tiling path sequence. A panel of 716 L. perenne diploid genotypes from 90 European accessions was assessed in the field over 2 years, and genotyped using a Lolium Infinium SNP array. The GWAS was carried out using a linear mixed model implemented in TASSEL, and extended genomic regions associated with significant markers were identified through integration with the physical map. Key Results Between ~3600 and 7500 physical map contigs were derived, depending on the software and probability thresholds used, and integrated with ~35 k sequenced BAC clones to develop a resource predicted to span the majority of the L. perenne genome. From the GWAS, eight different loci were significantly associated with heading date, plant width, plant biomass and water-soluble carbohydrate accumulation, seven of which could be associated with physical map contigs. This allowed the identification of a number of candidate genes. Conclusions Combining the physical mapping resource with the GWAS has allowed us to extend the search for candidate genes across larger regions of the L. perenne genome and identified a number of interesting gene model annotations. These physical maps will aid in validating future sequence-based assemblies of the L. perenne genome.


INTRODUCTION
Lolium perenne (perennial ryegrass) is widely grown in Northern Europe and in temperate areas worldwide as a forage and amenity grass. For a number of decades there have been public and private plant breeding programmes for L. perenne, and the L. perenne seed constituent is both the largest and most commercially valuable component within many marketed grass seed mixes. Due to this wide cultivation, in addition to its economic and agricultural value, it contributes to the environmental footprint of large areas of grasslands. Thus, the trait biology of this grass can have major effects on landscapes, soil structures, hydrology, water quality and carbon sequestration, in addition to the 'intended' qualities that recommend it for on-farm use.
Lolium perenne is a diploid species that contains a two-locus gametophytic self-incompatibility system (Thorogood et al., 2002(Thorogood et al., , 2017. As a consequence, both commercial and in situ ecotype populations are often highly heterozygous and variable both within and between populations (Blackmore et al., 2015(Blackmore et al., , 2016. One advantage of this is that there is considerable phenotypic variation across the gene pool that is available to plant breeders for targeted crop improvement. The disadvantage is that delimiting and controlling this variation, in order to define a variety that meets the requirements for distinctness, uniformity and stability, can be a major challenge. Historically, in order to understand the complexity of the genetic control of target traits in L. perenne and related species, there has been a major focus on the analysis of quantitative trait loci (QTL) using genetic maps and trait data developed from tightly defined families (for an overview see Shinozuka et al., 2012). These have been derived from two-way pseudo-testcrosses (i.e. a pair cross of two heterozygous individuals; e.g. Inoue et al., 2004;Ergon et al., 2006;Byrne et al., 2009;Brown et al., 2010;Hegarty et al., 2013;Paina et al., 2016) or, where available, using inbred lines or doubled haploid genotypes to derive F 2 -and BC 1 -type populations (e.g. Yamada et al., 2004;Turner et al., 2006;Kobayashi et al., 2011;Foito et al., 2015) and Lolium/Festuca introgression populations (Humphreys et al., 2005;Moore et al., 2005;Armstead et al., 2006a, b). More recent developments for rapid generation of high-density genetic maps, such as the Lolium/Festuca DArT array (Kopecky et al.,20 09), a Lolium Infinium SNP-genotyping array (Blackmore et al., 2015) and genotyping by sequencing (Ashraf et al., 2016;Pembleton et al., 2016;Velmurugan et al., 2016), have enabled a number of genome-wide association studies (GWAS) and germplasm sampling studies to be conducted across L. perenne populations, often in the context of advanced breeding populations, aimed at defining marker/trait associations with greater resolution for both candidate gene identification and genomic selection. Across these studies, associations and genomic predictions have been developed particularly for heading date, but also for crown rust resistance, seed yield, biomass, water-soluble carbohydrate (WSC) and dry matter digestibility (Blackmore et al., 2015(Blackmore et al., , 2016Fè et al., 2015;Ashraf et al., 2016;Grinberg et al., 2016;Byrne et al., 2017;Arojju et al., 2018;Cericola et al., 2018).
Concurrent with the development of new methods of genetic analysis, there have been remarkable advances in the technologies and software available for generating genome sequences and assembling long-distance contiguous DNA sequence reads, which can, or are approaching the ability to, approximate complete chromosome (pseudomolecule) sequences. In-depth transcriptome sequencing and assembly have also guided detailed annotations of many genomes that both describe gene structure and predict function. Recent published advances in this area include two initial genome assemblies for L. perenne Velmurugan et al., 2016) and a high-quality reference genome assembly for Hordeum vulgare (Mascher et al., 2017). These have built on the more established reference assemblies for rice (International Rice Genome Sequencing Project, 2005) and Brachypodium distachyon (International Brachypodium Initiative, 2010) amongst grass species. Along with direct sequence assembly methods, other approaches that rely on different metrics have been used in developing and validating sequence-based contigs, including bacterial artificial chromosome (BAC)-based physical mapping (e.g. Gu et al., 2009;Zhou et al., 2009;Fleury et al., 2010;Ariyadasa et al., 2014;Varshney et al., 2014). This technique relies on developing 'fingerprints' of individual BACs based on restriction enzyme fragment sizes, and identifying overlapping BAC clones through fingerprint matching (Sulston et al., 1988;Luo et al., 2003). Extended regions of contiguity can then be developed using software approaches implemented in Fingerprint Contig (Soderlund et al., 1997(Soderlund et al., , 2000 and Linear Topology Contig (Frenkel et al., 2010). Physical map assemblies can then be integrated with sequence-based assemblies through BAC-end sequencing (BES) and selective BAC sequencing, often based on projecting a minimum tiling path (MTP) of BACs through each physical contig. This has proved to be an important adjunct in constructing and validating sequence scaffolds in genome assemblies (Mascher et al., 2017).
For many decades, plant breeders and others have been collecting perennial ryegrass and related germplasm from diverse locations in order to preserve and curate the range of genetic variation available within this species group; the gene bank at Aberystwyth University alone contains in excess of 6000 different accessions of Lolium spp. However, characterizing such accessions genotypically or phenotypically is a major challenge, but is also fundamental to their future exploitation. With climate uncertainty and changing patterns of land use, it is pressing that we develop a greater understanding of the potential of these germplasm collections for both enhancing the agricultural relevance of forage grass varieties and addressing new challenges that may relate more directly to enhancing environmental services, such as water and nutrient use efficiency and carbon sequestration. The work we present here is part of this effort to exploit new germplasm resources through linking new genetic and genomic approaches with comprehensive phenotyping studies. Specifically, the aim of the present study has been to integrate a newly developed genomics resource for L. perenne with the results of a GWAS of a broad selection of European L. perenne ecotypes in order to define the genomic contexts of significant GWAS markers and to suggest candidate genes. This will enable the prioritization of particular L. perenne accessions and genotypes for further evaluation in terms of broadening the range of germplasm available and deepening our understanding of the biology that underpins addressing future challenges. Figure 1 illustrates the workflow described in the following sections and the Supplementary Data.

Development of a BAC-based L. perenne physical map
The development of the L. perenne BAC-based physical map is described in more detail in Supplementary Data Methods S1. Briefly, two BAC libraries, each consisting of 120 960 clones, were developed from an inbred L. perenne genotype using HindIII or BstY1 partial restriction and all clones were fingerprinted using high-information-content fingerprinting (HICF; Luo et al., 2003). Fingerprint data derived from the individual BACs were used to construct physical maps using FingerPrint Contig (FPC) and Linear Topology Contig (LTC) software. We obtained FPC v. 9.4 from http://www.agcol.arizona.edu/software/fpc/, with HICF data analysed largely as described in Kim et al. (2007). The LTC software was obtained from MultiQTL Ltd (Haifa, Israel). Initial LTC physical map construction was performed using LTCbeta 2.1 [physical map version LTC-18(2s)] and subsequent LTC physical maps were constructed using LTC 1.4.6. using a range of different stringencies from P = 1 × 10 −12 to 1 × 10 −27 (physical map versions LTC-12 to LTC-27) for the construction of networks of significant clone overlaps, from which were derived the BAC contigs constituting each version of the physical map.
Sequence tagging of the L. perenne physical map Shotgun clone sequencing. An MTP was predicted from the physical map version LTC-18(2s), and 39 202 BAC clones were shotgun-sequenced, largely using the methods of Mascher et al. (2017). Each BAC was assembled individually using ABySS with Kmer length 71 (v. 1.5.2; Simpson et al., 2009) and BACs with a total assembly length <30 or >200 kb were re-sequenced. Methods employed in the further concatenation of the assemblies, alignment with the L. perenne draft genome assembly of Byrne et al. (2015) and further characterization of the sequence database used for anchoring GWAS markers to the physical maps (designated the LpBAC5000 database) are provided in Supplementary Data Methods S2.
BAC-end sequences. BAC-end sequencing of all 24 1920 clones was carried out following standard protocols using M13 primers, Sanger sequencing and capillary fragment separation and detection (Kim et al., 2007).

Three-dimensional BAC pool screening with PCR-based markers
Previously genetically mapped markers were associated with BACs using 3-D BAC pool screening. BAC pools were constructed from both BAC libraries by Amplicon Express (Pullman, WA, USA) according to a 'superpool/matrix pool' approach (Bouzidi et al., 2006). PCR-based markers were Phenotyping-establishment year, year 1 and year 2 (Table 1) Genotyping on the L. perenne Infinium array GWAS and identification of significant markers (Table 3 and Fig. 2) BLAST searches of significant marker sequences against LpBAC5000 L. perenne ecotype population established in the field (Table S2) Production of BAC libraries from L. perenne genotype Fingerprinting (HICF) and BAC-end sequencing of BAC clones (Table S3) Construction of 3-D BAC pools Construction of physical maps/contigs using FPC and LTC (Tables 2 and S4) Marker screening of 3-D BAC pools (Table S4) Prediction and sequencing of BACs in physical map minimum tiling path Assembly of sequences from individual BACs and construction of the LpBAC5000 database Significant marker BLAST alignments identify physical contigs LpBAC5000 (Table S5) BAC sequence assemblies from identified physical contigs used in prediction of gene models, using L. perenne, B. distachyon and barley genomics resources (  Table S1.

Population genotyping and phenotyping
The development and genotyping of the ecotype population has been described previously (Blackmore et al., 2015). Briefly, 716 genotypes from a European ecotype collection sampled from 90 locations (n = 8 for 89 locations, n = 4 for one location; Supplementary Data Table S2) were maintained and grown as spaced plants in a field site at Aberystwyth, UK, over a period of 2 years. Plants were genotyped using a custom Illumina Infinium iSelect array across 3425 SNPs and 2034 informative markers (minimum minor allele frequency set at 5 %) spanning the L. perenne genome identified.
Plants were grown as spaced plants in three replicated blocks, with management of plants following standard procedures for perennial ryegrass national testing (Supplementary Data Methods S4). Plants were evaluated for a range of growthrelated phenotypes and biochemical measures as described in Table 1. Field-based measurements and analytical chemistry measurements for WSC, neutral and acid detergent fibre (ADF/ NDF), total nitrogen (N) and dry matter digestibility were conducted according to standard practices used in perennial ryegrass national variety assessments (Supplementary Data Methods S5).

GWAS
The GWAS was implemented using TASSEL (Trait Analysis by aSSociation, Evolution, and Linkage) 5.2.4.1 (https://bitbucket. org/tasseladmin/tassel-5-source/downloads/; Bradbury et al., 2007) using the mixed linear model MLM (PCA+K) workflow, in which genetic population structure is estimated using principal components analysis and fitted as a fixed effect, and genotype relatedness is estimated through a kinship matrix and fitted as a random effect (Zhang et al., 2010). TASSEL default parameters were used throughout (the first five principal components were used for population structure estimation; minimum minor allele frequency was set at 5 %). Dry weight data were square-root-transformed prior to analysis. False discovery rates were controlled using the Benjamini-Hochberg procedure at a level of 10 %.

Physical map construction
Summary statistics for the two BAC libraries and their subsequent processing are described in Supplementary Data Table S3. The details of the derived physical maps assembled using FPC and LTC are summarized in Table 2. In comparing the different versions of the physical map, LTC-15 and particularly LTC-12 incorporated fewer clones than the other physical maps. The main reason for this was that at the lower stringency levels more clones had >500 significant overlaps and so were excluded from the subsequent assemblies. Both FPC physical maps incorporated more clones than any of the LTC physical maps, presumably as a consequence of the different methods employed by the two softwares for establishing contig reliability. Overall, 29 482 clones were excluded from all of the physical map versions, 125 321 clones were common to all of the LTC physical maps from LTC-18 to LTC-27, and 121 842 of these were also common to FPC1.03. Where clones were incorporated in contigs across cut-off thresholds and assembly methods, contig composition and clone order within contigs tended to be highly conserved (for a list of the physical maps contigs generated, see Supplementary Data Table S4, available from https:// doi.org/10.20391/bb56e6d7-8913-4bd7-8167-2b7e4c01382b).

Direct anchoring of physical maps to genetic/genomic positions
Of the 1240 markers that were screened on the superpools and matrix pools, 887 could be assigned to contigs (excluding contigs with >1000 constituent BACs) across the FPC and LTC methods and stringencies with the requirement that the marker was positive for a minimum of two different BACs in the same contig. Overall, this resulted in between 522 and 711 contigs being assigned genetic/genomic positions, depending on the assembly software and stringency used. Marker/BAC/physical contig associations are given in Supplementary Data Table S4.

BAC-end sequences
Summary statistics for the BES of 241920 clones are given in Supplementary Data Table S3 (BES data are available from NCBI under GenBank accession numbers MJ032229-MJ424519 and can also be downloaded from https://doi.org/10.20391/61921116-ddd0-4d85-b0fd-e0d734bc63c8). For the clones that were incorporated in the versions of the physical maps, ~71 % had BES for both ends of the clone, 21 % had BES for one end of the clone and 8 % had no associated BES information.

Shotgun clone sequencing
We assembled 35 434 BACs individually. The average BAC assembly length was 77.5 kb, the median number of sequences was 9 and N50 contig length was 10.3 ± 4.8 kb (Supplementary Data Fig. S1A, B). From these, 33 480 BACs were extracted with total contig sequence length in the size range 30-200 kb, which formed the basis of the LpBAC5000 database. Further analysis details are provided in Supplementary Data Results S1. This sequencing project files are deposited online under Bioproject PRJNA475227, BioSample:SAMN09382314, SRA Sample:SRS3412769 and SRA Study:SRP150420.

LpBAC5000 database
Overall, LpBAC5000 consisted of 207 958 sequences totalling ~2634 Mb extracted from 33 480 clones (~85 %) of the predicted MTP. The average contig size was ~12.7 kb and the average sequence length per BAC was ~78.7 kb. In order to estimate the coverage of the gene space within LpBAC5000, the alignment lengths of LpBAC5000 sequence contigs to gene models within the lope_transcripts.V1.0.fasta, Bdistachyon_314_ v3.1.cds.fasta and Hv_IBSC_PGSB_r1_CDS_HighConf.fa databases were estimated. For the L. perenne database, ~89 % of the gene space was represented, tagging ~97 % of the gene models. For the B. distachyon and H. vulgare databases, the equivalent figures were ~73 and ~78 % for the gene space and ~74 and ~85 % for the tagged gene models, respectively. Summary statistics of the LpBAC5000 sequence database are given in Supplementary Data Table S5A, B. The LpBAC5000 sequence file can be downloaded from https://doi.org/10.20391/ dfb05330-7485-444f-a475-8310bee5d510.
Coverage of the L. perenne genome within the physical map Lolium perenne genome coverages within the physical maps were estimated to range from 1.57-2.11 Gb without correction for MTP BACs with no associated sequences to 1.85-2.5 Gb with correction for missing BAC MTP sequences, depending on the physical map version. This compares with the Byrne et al. (2015) estimate of 2.07 Gb from their draft genome assembly (Supplementary Data Table S6).

GWAS analysis
The results of the GWAS analysis are summarized in Table 3, Fig. 2 and Supplementary Data Fig. S2. Quantile-quantile (QQ) plots indicated that the distributions of probability estimates were slightly above chance expectation for a number of the field-based measurements of plant biomass and architecture, though not heading date. Measures of plant chemical composition were in line with expectation. For five of the 16 trait scores analysed, a total of 13 significant marker/trait associations, corresponding to eight different genetic map positions, were identified within the false discovery rate (FDR) of 10 % (Fig. 2). Of these, four were associated with heading date (one of which has not been assigned a genetic map position). Seven different markers were associated with WSC; however, four of these were SNPs from within the same marker contig (ctg35543) and a further two were from a marker contig (ctg40624) predicted to be a different exon of the same gene as ctg35543. The seventh marker was a separate locus genetically mapping to a different chromosome. One marker was significantly associated with plant biomass within 10 % FDR. However, the QQ plot for this trait does not indicate a major deviation from expectation for that marker, suggesting this may be a false positive.

Integration of the GWAS with the sequence-tagged physical map
BLAST searches of significant marker sequences against the LpBAC5000 database were used to identify associated physical map contigs (Table 4, Fig. 3, Supplementary Data Fig. S3). For six of the identified physical contigs there was a strong indication of conserved synteny, with both B. distachyon and H. vulgare with physical contigs encompassing between eight and 32 B. distachyon gene models. The remaining physical contig, 1969c-18, identified by marker ctg20671, could be associated with three gene models in total; two were physically close on B. distachyon chromosome 3 and the third was from chromosome 1. This last gene model had no H. vulgare orthologue at the stringencies tested. Marker ctg41386-226 was significantly associated with WSC, but could not be aligned with any of the physical contigs through the LpBAC5000 database. BLAST searches of LpBAC5000 against the published L. perenne draft genome sequence were also used to suggest scaffolds from this draft genome assembly that were contiguous according to the physical map contigs. The numbers of draft genome scaffolds aligned varied from nine for 4073c-18 to one for 1969c-18 (Fig. 3, Supplementary Data Fig. S3).

DISCUSSION
The aim of this paper has been to explore the targeted use of L. perenne BAC-based physical maps to suggest broader genomic regions, validated through the physical contig structures, that could be associated with markers found to be significant in a GWAS analysis derived from the field performance of a selection of European L. perenne ecotypes. To this end, the physical map assemblies and associated sequence tags presented in the paper encompass the majority of the L. perenne genome and extend the 'genomic reach' of the current published sequencebased assemblies for L. perenne Velmurugan et al., 2016). As a result, we have been able to identify both genomic regions and relevant gene models contained within these regions, which are associated with the GWAS markers and phenotypes. As a significance level we chose to use a 10 % FDR as, in taking this work forward, this represents a good balance between the likelihood of identifying useful variation and the resources required for screening plant material in follow-up evaluations. In terms of candidate genes, as always for nonmodel species (and particularly so for non-cereal grass species) there is a frustrating gap between our ability to identify candidate genes and our subsequent ability to validate these candidate genes experimentally. However, even in the absence of direct functional validation of candidates, knowledge of the regions of genomes that contribute to trait performances are valuable in identifying potentially useful genotypes/phenotypes present in germplasm collections for further experimentation. Blackmore et al. (2015Blackmore et al. ( , 2016 reported the establishment of this panel of 716 L. perenne ecotypes sampled from 90 locations across Europe and described the genetic relationships between population and patterns of linkage disequilibrium on the basis of the genotype scores from the Infinium marker array. The present study, in part, represents a follow-up analysis incorporating marker and phenotype data, but also integrates this within a genomic context. As is common to a number of population genetics analyses within the ryegrasses, flowering time features strongly in terms of traits that can be associated with genetic markers and genomic regions, and four of the eight loci within the 10 % FDR were associated with heading date, one from year 1 and three from year 2. As an outcome this is not particularly surprising, as the phenology of flowering is known to be a key determinant of many growth, physiological and biochemical parameters in the annual cycle of perennial grasses. Commercially, the selection of genotypes that have co-ordinated flowering times represents a primary route for controlling phenotypic variation. In addition to flowering time, significant marker/trait associations were identified for plant width, plant biomass and WSC content. The integration of the physical mapping with the GWAS is discussed below on a marker-by-marker/contig-by-contig basis. Marker ctg8613 had its closest BLAST alignment with gene model Bradi2g51467, which associated it with physical contig 4073c-18. This was a relatively large physical contig consisting of 138 BACs and spanning 2227 high-informationcontent (HIC) fingerprints. In terms of conserved synteny with B. distachyon, this contig spanned a region containing is the position of the SNP within the contig) were also significant within 10 % FDR (P = 2.79 × 10 −5 to 8.41 × 10 −5 ). 5 Markers ctg40624-321/549 were also significantly associated with WSC (P = 1.52 × 10 −4 ). Ctg40624 genetically maps to the same position on LG5 and is likely to be derived from a different exon from the same gene as ctg35543.
36 gene models from Bradi2g51255 to Bradi2g51550, 32 of which were tagged within LpBAC5000 (Table 4). Twentyseven of these gene models were associated with annotations, one of which (Bradi2g51370/HORVU3Hr1G075540) represents a member of the Casein Kinase 1 gene family (CK1). Members of this gene family have been directly associated with the modulation of flowering time in both Arabidopsis thaliana (hereafter referred to as arabidopsis) and rice (Tan et al., 2013;Kwon et al., 2015;Zhang et al., 2016) and, in the latter case, a CK1 gene underlies the Hd16/Early Flowering 1 heading date QTL (Dai and Xue, 2010;Hori et al., 2013). Marker ctg8613 has not previously been genetically mapped in L. perenne but superpool/matrix pool screens of a different marker, R1_302, identified three independent BACs from contig 4073c-18. This would place this contig on chromosome 3 of the Lolium/Festuca introgression map (King et al., 2013). Additionally, marker ctg41342 could also be aligned with physical contig 4073c-18 and this has been previously mapped in the AberMagic × Aurora genetic mapping population to chromosome 3 at 49 cM (marker ctg41342 was 79 % homozygous for one of the alleles in the present study and was not significantly associated with heading date). This placing of physical contig 4073c-18 on chromosome 3 would not suggest that this CK1 is a direct orthologue of the gene underlying the Hd16 QTL in rice (Os03g0793500/ LOC_Os03g57940), but is consistent with orthology with Os01g0772600/LOC_Os01g56580. While we are not aware of any rice heading date QTL directly associated with   Os01g0772600/LOC_Os01g56580, it is the most closely related gene model to Hd16, showing 77 % sequence identity at the amino acid level. Thus, a functional role in flowering regulation is not unlikely. Additionally, the closest orthologue in arabidopsis to Bradi2g51370/HORVU3Hr1G075540/ Os01g0772600 is AT3G03940. This arabidopsis gene model is annotated as a protein kinase family protein, specifically as photoregulatory protein kinase 3 (PPK3). The PPK protein family has been shown in arabidopsis to interact in the bluelight-dependent phosphorylation of cryptochrome 2, a protein involved in photoperiod-induced flowering, and ppk mutants are associated with delayed flowering (Tan et al., 2013). Thus, the CK1 gene model identified on physical contig 4073c-18 is a good candidate for influencing flowering time in L. perenne.    Table 4 and discussed in the text. Dark grey rectangles on the light grey background illustrate the positions of B. distachyon gene models just outside of the defined region. Wide horizontal red bars illustrate the B. distachyon gene models tagged within LpBAC5000 or within the L. perenne genomic scaffolds from Byrne et al. (2015), above and below the contig illustration, respectively. Narrow horizontal red bars indicate where gene space sequences are contiguous. Orange filled boxes indicate the aligned positions of the marker sequences. Red BACs within the contig illustrations indicate BACs for which sequence information is available. In (A), the BACs from which the aligned LpBAC5000 contigs were derived are number-coded and detailed in the key. In (B) the BAC names are given directly.
Marker, ctg50617; physical contig, 63852c-27; trait, heading date year 1 (Supplementary Data Fig. S3A) A second physical contig associated with heading date consisted of 65 BACs spanning 980 HIC fingerprints that showed conserved synteny with a region of the B. distachyon genome from Bradi1g58190 to Bradi1g58254, including ten gene models, eight of which were tagged in LpBAC5000. Of the eight gene models in this region with annotations, one, Bradi1g58230/HORVU2Hr1G060680, is predicted to code for a phytochrome-interacting factor (PIF). The PIFs are a family of transcription factors that interact with the red/far-red light receptors, phytochromes, and so are involved in light perception and photomorphogenesis (Pham et al., 2018). In arabidopsis, PIF3 has been associated particularly with seedling de-etiolation, chlorophyll and anthocyanin biosynthesis and, through interaction with circadian clock genes, diurnal physiological responses (Soy et al., 2016). It has also been shown in arabidopsis that both constitutive and tissue-specific expression of PIF3 in the plant vasculature can promote flowering; pif3 mutants can also show delayed flowering. In rice, the putative PIF3 orthologue Os07g0143200/LOC_Os07g05010, known as OsPIF14 (Cordeiro et al., 2016) or Phytochrome Interacting Factor 3-Like 14 (OsPIL14; Nakamura et al., 2007), has been suggested as being involved in the interaction between light and stress signalling through repression of Dehydration-Responsive Element-Binding 1/C-Repeat-Binding Factor (DREB1/CBF) genes (Cordeiro et al., 2016). This represents a possible route through which the PIF3 gene model within this region in L. perenne could have a significant effect on flowering time in this ecotype population. Marker, ctg9479;physical contig, 2356c-18;trait, heading date year 2 (Fig. 3B) A third physical contig associated with heading date consisted of 54 BACs spanning 977 HIC fingerprints and showed conserved synteny with a region of the B. distachyon genome from Bradi4g26270 to Bradi4g26410. The region contained 16 gene models, 13 of which were tagged within LpBAC5000. Among these, gene model Bradi4g26300/HORVU4Hr1G026680 is annotated as a PIN-FORMED auxin efflux carrier protein. Polar auxin transport is established as being a key process in plant morphogenesis in which differential auxin gradients determine the developmental differentiation of meristems (Balzan et al., 2014). Bradi4g26300 itself belongs to the Sister-of-PIN1 (SoPIN1) clade and sopin1 mutants in B. distachyon have been associated with abnormal spikelet development (O'Connor, 2014), and it is not unlikely that the orthologue of this gene in L. perenne is also involved in determining floral meristem development. Expressions of PIN genes are known to be influenced by abiotic stresses, implying that there can be an environmentally induced response in terms of auxin gradients within plant tissues (Wang et al., 2015). Particularly, for such a diverse pan-European ecotype collection grown in a single (UK) environment, a proportion of the genotypes are likely to have been under some degree of environmental stress simply due to the contrasts between the collection and evaluation sites. Thus, environmentally induced alterations in the expression patterns at the floral meristem of the L. perenne orthologue of Bradi4g26300 could result in changes in flowering patterns. Marker ctg35543;trait,WSC (Supplementary Data Fig. S3B) This physical contig consisted of 58 BACs spanning 978 HIC fingerprints. In terms of conserved synteny with B. distachyon, 4029c-18 included gene models from Bradi4g31640 to Bradi4g31720. Only six of the ten gene models within this region were tagged within LpBAC5000, partly due to lack of MTP coverage at one end of this contig. However, one of the BES within this region could be aligned with Bradi4g31720 (and Scaffold_3258_ref0017503, from Byrne et al., 2015), thus establishing the region of conserved synteny at the end of the contig that lacked MTP coverage. The most interesting annotated gene model within this region was Bradi4g31680, to which the most similar gene model in arabidopsis is AT1G50480, itself annotated as 10-formyltetrahydrofolate synthetase (THFS). THFS catalyses the conversion of formate and tetrahydrofolate to 10-formyl tetrahydrofolate and is an important enzyme in one-carbon metabolism (Hanson and Roje, 2001). The reverse of this reaction is catalysed by 10-formyl tetrahydrofolate deformylases, which have been established as being essential enzymes in photorespiration in arabidopsis under ambient CO 2 concentration. Knockouts of 10-formyl tetrahydrofolate deformylases result in accumulation of glycine and 5-and 10-formyl tetrahydrofolate (Collakova et al., 2008). While THFS is directly metabolically linked through to methionine/serine/glycine/purine metabolism, in general terms it is an enzyme that affects the partitioning of carbon between different pools in plants. Additionally, it could also influence photosynthetic efficiency through metabolic feedback on photorespiration. Thus, variation in accumulation of WSCs across this ecotype collection could be a function of overall photosynthetic efficiency and assimilate partitioning influenced by differential THFS activity. Interestingly, Blackmore et al. (2015) identified marker ctg35543 as being one of the top 50 markers contributing to the first principal component (associated with east-west geographical distribution) in their analysis of genetic diversity. This is a further indication of the potential significance of this gene. Marker ctg54379; physical contig trait,biomass (Supplementary Data Fig. S3C) Physical contig 896c-18 consisted of 60 BACs and spanned 1180 HIC fingerprints. It contained a conserved syntenic region covering Bradi2g50230 to Bradi2g50340 but also included a not-immediately-co-linear gene model at the end of the contig, Bradi2g35187. The region contained 14 gene models, 11 of which were tagged within LpBAC5000. Among the gene models, Bradi2g50280 was annotated as a gibberellin 2-oxidase (GA2ox). The closest orthologue in rice to Bradi2g50280 is LOC_Os01g55240/Os01g0757200, which is identified as gibberellin 2-oxidase 3 (GA2ox3; Lo et al., 2008). The GA2ox family in rice is responsible for regulating the levels of biologically active gibberellins in tissues by metabolizing active gibberellin (GA) forms to inactive ones. Specifically, GA2ox3 is an important enzyme in the conversion of biologically active GA1 and GA4 to the inactive GA34 and GA8 (Lo et al., 2008). In general, GA deficiency in rice is associated with dwarf phenotypes, but there can also be effects on seed dormancy and shoot and root growth, including tillering, flowering and seedset. Specifically, a GA2ox3 T-DNA insertion mutant in rice was found to have a severe dwarf phenotype with no seed production (Lo et al., 2008) and a rice GWAS study identified SNP markers close to GA2ox3 that were significantly associated with seed dormancy (Magwa et al., 2016). Marker ctg54379 was significantly associated with vegetative biomass measured in June of year 1. Biomass in ryegrass is a function of overall leaf growth, but also of tiller number. The GA2ox family will play an important role in the regulation of growth and architectural parameters in ryegrasses through their role in gibberellin metabolism, as they do in other plant species. However, the QQ plot for this trait (Fig. 2) does not obviously show the 'elbow' shape associated with the presence of unusually significant markers. Therefore, while there is close proximity between the GA2ox3 gene and the significant marker for biomass, the interpretation of this marker/trait association has to be treated with caution.

Marker ctg8394 and marker ctg20671
Marker ctg8394 and marker ctg20671 were associated with physical contigs 2309c-18 (Supplementary Data Fig. S3D) and 1969c-18 (Supplementary Data Fig. S3E), respectively. Both physical contigs had some inconsistencies in terms of conserved synteny with B. distachyon, and 2309c-18 showed inconsistent clone order and composition across the various physical map assembly parameters (Supplementary Data Table S4). Physical contig 1969c-18 was, seemingly, a reliable contig; however, gene models could only be tagged within LpBAC5000 at one end of the contig. None of the gene model annotations could be obviously related to possible functions in moderating the associated traits for either physical contig. These contigs are not discussed further.

QTL and GWAS studies in L. perenne
Heading date. There have been a considerable number of published studies of the genetic control of flowering in L. perenne and related grasses, mostly based upon QTL analyses in biparental genetic mapping families (see Shinozuka et al., 2012, for a comprehensive overview). Heading date QTL have been identified on all seven L. perenne linkage groups (LGs) and, in the present study, significant markers were associated with LGs 2, 3, 4 and 7. Often, the lack of common markers across studies and the low resolution of many QTL analyses presents problems in terms of identifying equivalent QTL, and this is certainly true when trying to put the results of the present study in a broader context. We can say that the assumed genetic position of marker ctg8613-723 on LG3 is not incompatible with the position of the heading-date QTL identified on the same LG associated with marker LpRGA5 in the study of Studer et al. (2008). Similarly, the genetic position of ctg20671-156 on LG7 is not incompatible with the minor heading-date QTL identified on LG7 associated with markers LTCOa/b in the study of Armstead et al. (2004). More certainly, we can say that none of the significant heading-date markers from the present study can be associated, in terms of genetic/genomic position, with the major heading-date QTL that have been identified on LG4 and LG7, for which there are good candidate genes, i.e. the L. perenne orthologues of Vernalisation 1 and Flowering Locus T, respectively (Armstead et al., 2004;Jensen et al., 2005;Skøt et al., 2011). More recent studies have analysed GWAS data for significant associations with flowering time. Arojju et al. (2016) used six full-sib families derived from crosses between perennial ryegrass varieties and did not identify any significant marker/trait associations in the GWAS. However, using a single-marker analysis approach, they did identify significant associations on LG2, 4 and 7. Similarly, Fè et al. (2015) presented a GWAS study using a mixture of F 2 families and synthetic populations, both derived from breeding material, and identified 19 SNPs significantly (5 % FDR) associated with heading date. In the study of Fè et al. (2015) the relationship of these SNPs to the published L. perenne sequence scaffolds  was provided. However, none of these scaffolds could be associated with the physical contigs identified in the present study.
Plant width, biomass and WSC. Marker ctg8394-538 was associated with plant width, which is a function of both tiller number and growth habit (i.e. prostrate versus erect). QTL for tiller number have been identified in different studies on all LGs of L. perenne and Lolium multiflorum (Inoue et al., 2004;Yamada et al., 2004;Kobayashi et al., 2011;Sartie et al., 2011), but the lack of common markers means that direct comparisons with the present study have very little resolution. However, in addition to tiller number, Yamada et al. (2004) also recorded prostrate versus erect growth habit and identified QTL for this trait on LG4 and 7. The QTL on LG7 had the highest LOD score of any QTL in that study and was also the only QTL that was non-co-incident with any of the other morphological traits measured. The position of this QTL [derived from one RFLP marker, RZ144, that can be linked through from the study of Yamada et al. (2004) to a previous version of the F 2 genetic map] is not incompatible with the genetic position of marker ctg8394-538, though no obvious candidate gene was identified on the associated physical contig, 2309c-18.
Marker ctg54379-73 was significantly associated with plant biomass, though the QQ plot indicates that this may have been a chance outcome. Biomass is a trait that is a fundamental productivity measure for forage grasses and, not surprisingly, QTL have been identified on a number of different L. perenne LGs, including LG3. Two studies (Turner et al., 2008;Anhalt et al., 2009) associated the QTL peak for biomass with SSR marker rv1133. In a combined F 2 genetic map that integrated some of the Infinium iSelect markers with previously genetically mapped markers (unpublished), rv1133 and ctg54379-73 genetically map ~20 cM apart. Therefore, the previously identified biomass QTL are probably not co-incident with any effect associated with marker ctg54379-73. Sartie et al. (2011) identified a QTL for tiller number (which is a component of biomass) on LG3, but there are no common markers that cross-reference to this study.
Ctg35543-1175 and ctg41386-226 identified significant associations with WSC on LG5 and LG6 respectively. Foito et al. (2015) and Turner et al. (2006) conducted a QTL analysis for WSC accumulation and polar metabolite concentrations (including soluble sugars) in L. perenne genetic mapping populations and identified QTL for various WSC fractions on LG1, 2, 3, 5, 6 and 7, which, specifically, included a QTL on LG5 for leaf glucose content in the autumn and QTL for total WSC, fructan, glucose and fructose on LG6. However, as for a number of the studies referred to in this discussion, lack of common markers makes establishing co-incidence of markers and QTL across studies problematic.

Future directions
In this study we have described the development and deployment of a BAC-based physical mapping approach in order to inform and increase the resolution of a GWAS in terms of identifying potentially useful germplasm accessions, candidate genomic regions and genes. Clearly this is not an end-point for L. perenne genomics, the ultimate goal being to be able to define an accurate genomic sequence for L. perenne in the form of seven pseudomolecules and to integrate this with both comprehensive gene annotations and descriptions of physical genetic variation. Work is currently in progress to reach these ends at a number of research centres and it is hoped that the resources reported here will aid in the achievement of these aims. Beyond genomics, major challenges also exist for grassland geneticists and biotechnologists in understanding and exploiting the considerable germplasm resources that are available. These challenges include both the establishment of protocols and, hopefully, pipelines for higher-throughput phenotyping of L. perenne accessions and, at the molecular level, methods for validating candidate genes. In the latter context, while genetic transformation has been available for many years for L. perenne and related species and so the potential to knock out or otherwise modify the expression of specific genes is there, this is often quite a blunt tool, particularly if one is trying to understand the contribution of allelic differences to trait variation. For this, the application of Crispr Cas-9 and similar genome editing technologies (Yin et al.,20 17) is likely to be the biotechnological route to a greater understanding. Thus, it is hoped that in the not too distant future we will be able to integrate phenotyping, genetics, genomics and biotechnology to more fully understand the biology of these grasses and so to contribute germplasm-based solutions to the various societal and environmental challenges and opportunities associated with grasslands.

SUPPLEMENTARY DATA
Supplementary data are available online at https://academic. oup.com/aob and consist of the following. Fig. S1: distribution of assembly lengths and N50s for sequenced BACs in the MTP. Fig. S2: QQ plots for non-significant marker/trait associations from the GWAS. Fig. S3: diagrammatic representations of physical map ctgs 62852c-18, 4029c-18, 896c-18, 2309c-18 and 1969c-18 in relation to conserved syntenic regions in the B. distachyon genome. Methods S1: details of physical map construction using FPC and LTC. Methods S2: methods used for concatenation of BAC assemblies for comparison with the published assembly of Byrne et al. (2015) and for screening of MTP clones for potential cross-contamination. Methods S3: construction of BAC superpools and matrix pools for marker screening and details of the RAD sequencing pilot study. Methods S4 and S5: field and analytical chemistry measurements and protocols used in the evaluation of the ecotype family. Results S1: concatenation of BAC assemblies, identification and removal of potential crosscontaminating sequences and RAD sequencing pilot study. Table S1: map positions, marker sequences and SNP positions for markers assayed on the BAC superpools and matrix pools using KASP. Table S2: geographical collection sites of L. perenne accessions used in the GWAS analysis. Tables S3, S5 and S6: summary statistics for BAC library production and fingerprinting, sequence size range in the LpBAC5000 database and estimates of the L. perenne genome coverage within the physical maps. Supplementary Data available via doi consist of the following. Table S4: L. perenne physical maps generated using FPC and LTC assembly softwares from BAC HICF data, https://doi.org/10.20391/bb56e6d7-8913-4bd7-8167-2b7e4c01382b; LpBAC5000 sequence database, https://doi.org/10.20391/dfb05330-7485-444f-a475-8310bee5d510; BAC-end sequence database, https://doi.org/ 10.20391/61921116-ddd0-4d85-b0fd-e0d734bc63c8.