Chromosome-level genome assembly of the European green woodpecker Picus viridis

Abstract The European green woodpecker, Picus viridis, is a widely distributed species found in the Western Palearctic region. Here, we assembled a highly contiguous genome assembly for this species using a combination of short- and long-read sequencing and scaffolded with chromatin conformation capture (Hi-C). The final genome assembly was 1.28 Gb and features a scaffold N50 of 37 Mb and a scaffold L50 of 39.165 Mb. The assembly incorporates 89.4% of the genes identified in birds in OrthoDB. Gene and repetitive content annotation on the assembly detected 15,805 genes and a ∼30.1% occurrence of repetitive elements, respectively. Analysis of synteny demonstrates the fragmented nature of the P. viridis genome when compared to the chicken (Gallus gallus). The assembly and annotations produced in this study will certainly help for further research into the genomics of P. viridis and the comparative evolution of woodpeckers. Five historical and seven contemporary samples have been resequenced and may give insights on the population history of this species.


Introduction
The European green woodpecker (Picus viridis Linnaeus 1758) is a common Western Palearctic species that can be found in all sort of wooded environment where it often forages on ants on the ground.This species was previously considered to be conspecific with the Iberian green woodpecker (Picus sharpei) and Levaillant's woodpecker (Picus vaillantii) but recent molecular studies suggest that these taxa may consist of distinct biological species (Perktas et al. 2011;Pons et al. 2011Pons et al. , 2019)).European green woodpecker and Iberian green woodpecker form a secondary contact zone in southern France where individuals from the two species sometimes hybridize in the departments of Pyrénées-Orientales, Aude, Hérault and Gard, with very limited introgression outside of the contact zone (Pons et al. 2019).The European green woodpecker is currently considered as a common species (least concern) with a global increasing population size of 1.2-2.3 million individuals (Birdlife International 2024), among which 300,000-600,000 are thought to occur in France (Issa et al. 2015).The French population size is also considered to be increasing (+50% over the 1980-2012 time period, +2% over the 2001-2015 time period) (Issa et al. 2015) and currently represents a stronghold of this species (25% of the total population).
Here, we present a chromosome level assembly of the European green woodpecker genome, as well as genome resequencing data for another 11 individuals sampled over the 1970-2020 time period.Our assembly is based on a combination of Illumina short reads, Nanopore PromethION long reads and chromatin conformation capture (Hi-C).The vast majority of the in silico annotated genes are located on 47 chromosomal-level sequences, in accordance with the described karyotype of the species (2n = 94, Hammar 1970).

Sampling scheme
We sampled 13 Picus viridis viridis individuals, 5 historical individuals collected between 1970 and 1977, and 8 contemporary individuals collected between 2016 and 2021 (Table 1).Toe pads sampling for the historical specimens was performed using gloves and sterile scalpel blades that were changed between each individual.Tissues for contemporary specimens consisted of muscle or heart are either preserved in ethanol or flash frozen.For the two specimens used to assemble the reference genome, the tissues consisted of heart, muscle, and liver (Nanopore long reads and Illumina short reads, MNHN ZO-2020-125) or liver (Hi-C, MNHN ZO-2021-2133) sampled 1-2 h after the individual's death and either immediately extracted (heart) or flash frozen at −80°C (liver).The two individuals originated from localities distant of 25 km from each other and are both around 800 km away from the secondary contact zone with its sister species Picus sharpei (Pons et al. 2019).
Historical specimens were extracted before any modern samples were processed.DNA quality and purity was assessed using a Qubit Quantification (Qubit dsDNA BR Assay Kit, Life Technologies), NanoDrop spectrophotometer (NanoDrop Technologies, Inc., Wilmington, DE) and Fragment Analyzer (Agilent) to assess DNA quality and quantity.

Reference genome sequencing
DNA-seq libraries have been prepared according to Illumina's protocols using the Illumina TruSeq Nano DNA HT Library Prep Kit.Briefly, DNA was fragmented by sonication, size selection (average insert size: 400 bp) was performed using SPB beads (kit beads) and adaptators were ligated to be sequenced.Library quality was assessed using an Advanced Analytical Fragment Analyzer (Advanced Analytical Technologies, Inc., Iowa, USA) and libraries were quantified by qPCR using the Kapa Library Quantification Kit.DNA-sequencing experiments have been performed on a NovaSeq SP lane (Illumina, CA, USA) using a paired-end read length of 2 × 150 pb with the Illumina NovaSeq Reagent Kits.
Long-read library preparation was performed according to the manufacturer's instructions "1D gDNA selecting for long reads (SQK-LSK109)".At each step, DNA was quantified using the Qubit dsDNA HS Assay Kit (Life Technologies).DNA purity was tested using the nanodrop (Thermofisher) and size distribution and degradation assessed using the Fragment analyzer (Agilent) High Sensitivity Large Fragment 50 kb Kit.Purification steps were performed using AMPure XP beads (Beckman Coulter).For one simple library, 10 µg of DNA was purified then sheared at 25 kb using the megaruptor system (diagenode).A one-step DNA damage repair + END-repair + dA tail of double-stranded DNA fragments was performed on 2 µg of sample before ligating adapters to the library.Library was loaded onto one R9.4.1 flowcell at 20 fmol then reloaded once at 11 fmol on GridION instrument within 72H.For one optimized library, 20 µg of DNA was purified then sheared at 20 kb using the megaruptor system (diagenode).A size selection step at 5 kb using Short Read Eliminator XS Kit (Circulomics) was performed on 15 µg of sample.For 5 µg of DNA, an extra repair step with SMRTbell DAMAGE REPAIR KIT (PACBIO, 100-465-900) was performed before the one-step DNA damage repair + END-repair + dA tail of double-stranded DNA fragments and adapters ligation.Library was loaded onto one R9.4.1 flowcell at 20 fmol then reloaded 3 times at 20 fmol on a PromethION instrument within 72H.All short-and long-read libraries were prepared and sequenced at the GeT-PlaGe core facility, INRAE, Toulouse.

Hi-C
Liver frozen tissue was directly put into 50 ml of formaldehyde (3%) and phosphate buffered saline (PBS) solution.Fixation was then incubated for 1 h under gentle agitation.Glycine was added to a final concentration of 0.125 mM and the reaction was incubated for another 20 min.Tissue was recovered by centrifugation (5,000g for 20 min) and washed with PBS 1X before being recentrifuged.The Supernatant was discarded and tissue was stored at −80°C until use.The Hi-C library was constructed from liver tissue starting from a mass of 20 mg.Tissue was first resuspended in 1.2 ml of TE 1X and disrupted using CK14 glass beads (Precellys, Bertin Technology) and a precellys apparatus (program: 5 × 30 s ON-30 s OFF-8700 rpm).The lysate was recovered and then used as input for the ARIMA-HiC preparation kit (Arima Genomics).Libraries were then processed for sequencing as previously described (Moreau et al. 2018) and sequenced on a Novaseq apparatus.

Resequencing of historical and contemporary samples
Resequencing of the 11 individuals was performed at the "Institut du Cerveau" (ICM, Pitié-Salpêtrière Hospital, Paris) using a target 250-300 bp insert size on a NovaSeq 6000 system sequencer.

Genome reference assembly and quality assessment
A first assembly was made using the Nanopore long reads using the Flye (v2.8.1-b1676) de novo assembler (Kolmogorov et al. 2019).The assembly was then polished using 3 iterations of PILON (Walker et al. 2014), then once with Racon (Vaser et al. 2017), finished by a last iteration of PILON.The resulting polished assembly has been used as a guide for the Hi-C scaffolding process using instaGRAAL (Baudry et al. 2020).

Genome size completeness, estimates of genome size using k-mer (SGA preqc)
SGA preqc tool (https://raw.githubusercontent.com/jts/sga/master/src/SGA/preqc.cpp) was used to do a k-mer analysis in order to estimate genome size completeness.K-mers, short DNA sequences of fixed length, were analyzed for their abundance in the genomic data.By calculating the occurrence of k-mers, it aims to make predictions about the size and completeness of the genome.
It gives a hint on the quality of the data and anticipates the difficulty of the assembly process.

K-mer composition of the genome assembly
To analyze the quality and composition of our genome assembly, we employed KAT (K-mer Analysis Toolkit) (Mapleson et al. 2017).This toolkit is designed for reference-free quality control of whole genome shotgun reads and de novo genome assemblies.It utilizes k-mer frequencies and GC composition to assess errors, bias, and contamination throughout the assembly process.By comparing the k-mers present in the input reads and the resulting assemblies, it offers valuable insights into the composition and quality of genome assemblies.

Masking repeated elements
We used RepeatMasker (v4.1.2-p1)(Smit et al. 2013) on the genome assembly obtained from the instaGRAAL step.This program screens DNA sequences for interspersed repeats and low complexity DNA sequences.It annotates the identified repeats and hard-masks them by replacing them with Ns.

Genome annotation
The gene structure annotation was performed using BRAKER (Hoff et al. 2016), an automated method that utilizes both genomic and RNA-Seq data to generate comprehensive gene structure annotations in novel genomes.Using BRAKER-2.1.6,GeneMark (Lomsadze et al. 2005(Lomsadze et al. , 2014) ) is first executed on a set of proteins coming from the reference proteome of Gallus gallus from UniProt (UP000000539).GeneMark is trained using these protein-coding sequences and produces a set of sequences for the ab initio methods used afterwards, like AUGUSTUS (Stanke et al. 2006), that produces gene and features predictions from the given P. viridis genome.The resulting gene set consists of genes strongly supported by extrinsic evidence.Some statistics on the produced annotation have been generated using AGAT (v1.2.0) (Dainat 2023).

Genome synteny
The masked genome was globally aligned with the genome of the chicken, G. gallus, the model organism, using MUMmer (Kurtz et al. 2004).We based our methodology for analyzing synteny on the one used for the reference genome assembly of Colaptes auratus (Hruska and Manthey 2021).Only regions with more than 70% identity were retained, with lengths greater than 500 bp.Next, the P. viridis chromosomes showing the greatest overlap with chicken chromosomes were associated and renamed according to their corresponding chicken chromosome name.A synteny circos plot produced with OmicCircos (v1.36) (Hu et al. 2014) shows the overall chromosome rearrangement between the two species.
Only chromosomes sharing at least 500 unique links were conserved in the figure.To reorder chromosomes, the median position of hits in chickens for each segment from the P. viridis chromosomes was used to sort them.The same approach was performed against the genome of the Northern Flicker, C. auratus, retaining fragments aligned with MUMmer (Kurtz et al. 2004) that exceed 10 kb in length and exhibit 80% identity due to the significant noise resulting from close proximity and an abundance of repeated sequences.

Assessing assembly quality
In order to estimate the quality we used BUSCO v4.1.4to determine the proportion of genes expected in birds, using the Aves_ODB10 dataset, i.e. the 8,338 single-copy orthologous genes cataloged for the Aves class by the OrthoDB (v10) database (Simão et al. 2015).

Ultraconserved elements
We also assessed genome completeness by estimating the number of ultraconserved elements (UCEs), a set of loci commonly used in phylogenetics.We retrieved the UCEs from 13 Piciformes genomes published on Gebank (cutoff date 2022 Oct 12; Supplementary Table 2).We extracted the UCEs with Phyluce (Faircloth 2016) according to the online tutorial (https://phyluce.readthedocs.io/en/latest/tutorials/tutorial-3.html).The UCEs from the 14 individuals were then aligned using MAFFT (Katoh et al. 2002) and internally trimmed with Gblocks (Castresana 2000;Talavera and Castresana 2007).We did not allow missing loci for a species.We performed a concatenated maximum likelihood analysis with the UCE loci under the GTRGAMMA model in RAXML (Stamatakis 2014).

Mitochondrial genome assembly
We used NOVOPlasty-2.7.2 (Dierckxsens et al. 2017) to assemble the mitochondrial genomes using the short reads.We specified a genome range of 16,500-19,000 bp, a K-mer size of k = 39 and an insert size of 300 bp.As a seed, we used an ATP6 sequence from another P. viridis viridis individual (MF766578, Shakya et al. 2017).When the genome was not circularized, we used the BWA algorithm, as implemented in Ugene (Okonechnikov et al. 2012) using the default option except the number of differences that we set to 4. We performed a mitochondrial genome analysis using all Piciformes sequences that are available on Genbank (cutoff date 2023 Jul 15).We restricted our analyses to 12 protein-coding genes (we did not include ND6 as it was not present in several genbank sequences).We used the mitochondrial DNA genome  of Galbula dea as an outgroup (MN356220) and included 36 other Piciformes (35 individuals available on Genbank and the newly produced sequence of P. viridis) (Supplementary Table 1).We did not include Yungipicus canicapillus MK335534 because it is a chimeric sequence between Y. canicapillus and Dendrocopos darjellensis (Fuchs et al., in preparation).Stop codons were excluded from the alignments as well as the extra nucleotide found in position 174 in ND3 in most Piciformes species.Substitution models were selected under the Bayesian Information Criterion, as implemented in MEGA X (Kumar et al. 2018).Singe locus and concatenated partitioned maximum likelihood analyses were performed using RAXML on the RAXML Blackbox (Kozlov et al. 2019).Clade robustness was estimated using 100 bootstrap replicates.

Population resequencing analysis and variant calling
From the newly assembled reference genome, we used snpArcher (Mirchandani et al. 2024) to call variants.This pipeline involves a mapping of all the reads on the reference genome and for those where the coverage and depth are sufficient, it marks each position that differs from the reference and merge all individuals into a single file in VCF format (Variant Call Format).We computed the number of segregating sites, S, and Watterson's Θ, using PopGenome 2.7.5 (Pfeifer et al. 2014) and the Nei diversity index π (Nei and Li 1979) using VCFtools (Danecek et al. 2011) (Table 2).Principal component analysis (PCA) were produced using pcadapt (v4.3.3)(Privé et al. 2020).Hierarchical clustering analysis was performed on a dissimilarity matrix of all the samples using SNPRelate (v1.32.2) (Zheng et al. 2012).snpArcher automatically applies specific filters using the HaplotypeCaller from the genome analysis toolkit (GATK) to customize our approach to the dataset's characteristics.To heighten sensitivity and include variants in areas with inadequate read support, we set the -min-pruning filter to a permissive threshold of 1, meaning that variants with as little as one supporting read are considered for inclusion in the final variant call set. the -min-dangling-branch-length filter is set to a value of 1, enabling even short "dangling branches" in the assembly graph to be conserved and considering variants with minimal support.Nevertheless, it is essential to recognize that these filter settings boost sensitivity while potentially increasing the probability of false positive variant calls.These parameter selections were made carefully to achieve an equilibrium between sensitivity and specificity while considering the characteristics of our dataset.This has been carried out for subsequent filtering, after the construction of VCF, to enable fine-tuning of filters according to the observed depth distribution.Following these primary filters, further filtering based on coverage depth (DP) is applied to each variant, after the application of GATK caller filters.After implementing the initial filters, each variant undergoes depth-based filtering based on the depth of coverage (DP) once the GATK caller filters have been applied.Each variant is retained only if the depth exceeds 10× and falls below the 90th percentile of the observed depth distribution, in order to eliminate outliers.

Site frequency spectrum
The site frequency spectrum (SFS) represents the distribution of the allelic frequencies of the mutations throughout the genome (Fu 1995).It gives the number of mutations present at each frequency.The folded SFS of a sample of n diploid individuals is described as a vector η such that η = (η 1 , η 2 ,…, η 2n−1 ), where η i is the number of mutations at frequency i/2n with i∈[1:2 n − 1].Spectra are also normalized and transformed to ϕ i = η i × i(n − I)/Ση i , except for i = n, where ϕ i = n/2Ση i (Achaz 2009).

Reference genome sequencing
Heart provided less-degraded DNA when compared to pectoral muscle or liver for long-read sequencing.We obtained 41.2 Gbp of long reads (Gridion: 3.5 Gbp, N50 close to 16 kb; Promethion: 37.7 Gbp, N50 close to 10 kb) as well as about 65 Gbp of short-read data.The SGA preqc genome size estimate for P. viridis was Chromosome-level genome assembly of the European green woodpecker | 7 1.26 Gb.From mapping information, coverage depth is around 73× (Supplementary Table 5).

Assembly quality and genome completness
The majority of the assembly presents unique k-mer content that is only present once (Fig. 1).This pattern is typical of what we expect from a complete haploid assembly, generated from a diploid genome.
Results from the BUSCO analyses indicated that 93.6% of the loci were included in the assembly (single-copy: 88.7%, duplicated: 0.4%, fragmented: 4.5%).Comparison with other Picidae genomes indicated our genome assembly ranks third out of five for genome completeness (Supplementary Table 2) and third out of four for genomes for which long reads were generated.
The second highest number of retrieved UCEs in the Piciformes was found in our P. viridis assembly, confirming its high completeness (Supplementary Table 3).Alignment of the concatenated UCEs (2906 loci) was 631,268 bp long, among which 16,077 sites were informative.The result of the maximum likelihood analyses recovered the same topology (Fig. 2) at the family level than Prum et al. (2015), with all nodes being supported by bootstrap values of 100.

Mitogenome assembly
Our final assembly of the mitochondrial genome was 16,912 bp long, with some uncertainty concerning the exact length due to the presence of a 64 bp repeat at the end of the first control region as well as one C monomeric region in 16S (Supplementary Table 4).The genome is similar to many other avian genomes with 13 protein-coding genes, 22 transfer RNA and 2 Ribosomal RNA.The control region is duplicated, presenting 1 functional control region and 1 degenerated control region.
The topology recovered from the partitioned concatenated analysis of 12 mitochondrial protein-coding loci (Fig. 3) was in strong agreement with current phylogenetic hypotheses for family-level and genus-level relationships in Piciformes (e.g.Prum et al. 2015;Shakya et al. 2017).

Genome annotation
A total of 15,848 protein-coding genes were identified, representing 81,357,234 bp in total, along with essential transcription factors, noncoding RNAs, and repetitive elements, through automated genome annotation; 16,212 mRNA, representing 82,809,862 bp in total, were identified, including 364 isoforms; 77,423 exons are numbered, at a rate of 4.8 exons per coding DNA sequence (Supplementary Table 9).

Repetitive elements
Detailed analysis of repetitive elements in the P. viridis genome revealed their abundance, comprising 30.1% of the assembled genome (385,325,029 out of 1,279,164,199 total bp).This estimate compares with the 28% recovered for C. auratus (Hruska and Manthey 2021), 25.8% for Melanerpes aurifrons (Wiley and

Discussion
Using a combination of short and long reads as well as Hi-C data, we successfully assembled a reference genome for a species that had no existing reference genome.The final genome assembly of European Green Woodpecker is estimated to be about 1.28 Gbp.
Exhibiting an N50 contig length of 37 Mbp, our assembly displays significant contig continuity, retaining almost all genetic information across the 47 contiguous scaffolds.The de novo assembly and annotation of the P. viridis genome presented here represent an important resource for the ornithological community and complements our understanding of the genetic structure of the European green woodpecker.However, it is important to acknowledge that challenges persist with any genome assembly (Peona et al. 2018;Weissensteiner and Suh 2019).Although continuity and completeness have significantly improved, some genomic regions with high GC content and repetitive elements may still present challenges for accurate assembly (Chen et al. 2013;Goldstein et al. 2019).

Genome completeness
The BUSCO completeness score of 89.4% (Table 4) is comparatively low for recently assembled bird genomes, especially considering the methods used (Hi-Ci, long reads, and short reads) when compared to other recently published genomes (e.g.Baudrin et al. 2023).Among Piciformes, our assembly ranked third out of four species (M.aurifrons, D. pubescens, P. viridis, C. auratus) for which long reads were used.We note, however, that the 2 assemblies with the lowest BUSCO scores (C.auratus, P. viridis) are part of the Picini subclade, whereas the 2 higher scores belong to the Dendropicini (D. pubescens) and Melanerpini (M.aurifrons) clades (Shakya et al. 2017).Genome fragmentation, especially the number of microchromosomes, could explain this pattern as they could be more difficult to assemble.(Zhang et al. 2014;Manthey et al. 2018).
The proportion of the genome that consists of repetitive elements is the highest in the two species (P.viridis: 30.1%, C. auratus: 28%) that have the lower BUSCO scores.It is plausible that the high proportion of repetitive elements in the genomes could also contribute to the lower recovery of BUSCO loci in the assemblies.

Spatial structure in historical and contemporaneous data
On the PCA (Fig. 6), there is a visible genomic structure following a north-south gradient on PC2, and a temporal differentiation based on PC1.Indeed the genomic information overall indicates a genetic proximity to birds from nearby regions.Yet, one individual, ZO-2017-195, from the Ile-de-France area, where several modern samples were sequenced, is an outlier on the PCA, although it is located more internally in the dendrogram (Fig. 7).
Temporal genetic structuring may exist, where current (expanding) populations are the result of expansion of only a subset of the 1970-1980 population, with potential replacement.Indeed, population expansion may not be homogenous and it is conceivable that not all subpopulations contributed equally to the strong population expansion documented in the last 50 years.Population genomic studies from different time periods will be needed to test this hypothesis.

Fig. 1 .
Fig. 1.Distribution of k-mers using KAT.In the k-mer spectrum, reads that are absent from the assembly are displayed in black.The error distribution is quite low with most of errors under 10×.

Fig. 2 .
Fig. 2. Phylogenetic tree showing relationships among Piciformes species resulting from maximum likelihood analysis of Ultra Conserved Elements loci.The tree was reconstructed from an alignment of 631,268 nucleotides (2,906 loci).The bootstrap percentage (BP) is indicated on the nodes.

Fig. 3 .
Fig. 3. Maximum likelihood phylogenetic tree showing relationships among Piciformes species.Twelve mitochondrial loci, 37 taxa, and 10,857 characters were included in the analysis.The bootstrap percentage (BP) is indicated on the nodes.

Fig. 4 .
Fig. 4. Circos plot.Woodpecker chromosomes are on the left, in blue; chicken chromosomes are on the right, in yellow.Note that the number of chromosomes differs between the Chicken Gallus gallus and the European green woodpecker Picus viridis.

Fig. 5 .
Fig. 5. Circos plot.Woodpecker chromosomes are on the left, in blue; Northern flicker (Colaptes auratus) chromosomes are on the right, in yellow.

Fig. 6 .
Fig. 6.PCA based on genomic diversity among the historical and contemporary samples (left) and map of the sampled Picus viridis individiduals (right).

Fig. 7 .
Fig. 7. Dendrogram of hierarchical clustering based on pairwise nuclear sequence dissimilarity between samples.

Fig. 8 .
Fig.8.Site frequency spectrum.On left, the raw counts of each allele frequency; on right, the normalized (rescaled [0,1]) and transformed, so that the expected spectrum for a constant population size should be flat at y = 1/11.

Table 1 .
Information of the sequenced individuals.
All individuals were sampled in France.

Table 3 .
(Paradis 2010)ozas 2009)he mitochondrial data, as estimated using DNAsp 5(Librado and Rozas 2009).Watterson's theta Θ), as estimated by DNAsp 5(Librado and Rozas 2009), are reported in Table3.The nuclear data and the genetic diversity values are higher in historical specimens than in contemporaneous specimens.The minimum spanning network, as estimated by Pegas(Paradis 2010), is shown in Supplementary Fig.1.The average P-distance among individuals is 0.11% (minimum: 0.005% between two contemporary individuals sampled 400 km apart and 0.17% between a contemporary and historical individual sampled 400 km apart as well).