SMRT long reads and Direct Label and Stain optical maps allow the generation of a high-quality genome assembly for the European barn swallow (Hirundo rustica rustica)

Abstract Background The barn swallow (Hirundo rustica) is a migratory bird that has been the focus of a large number of ecological, behavioral, and genetic studies. To facilitate further population genetics and genomic studies, we present a reference genome assembly for the European subspecies (H. r. rustica). Findings As part of the Genome10K effort on generating high-quality vertebrate genomes (Vertebrate Genomes Project), we have assembled a highly contiguous genome assembly using single molecule real-time (SMRT) DNA sequencing and several Bionano optical map technologies. We compared and integrated optical maps derived from both the Nick, Label, Repair, and Stain technology and from the Direct Label and Stain (DLS) technology. As proposed by Bionano, DLS more than doubled the scaffold N50 with respect to the nickase. The dual enzyme hybrid scaffold led to a further marginal increase in scaffold N50 and an overall increase of confidence in the scaffolds. After removal of haplotigs, the final assembly is approximately 1.21 Gbp in size, with a scaffold N50 value of more than 25.95 Mbp. Conclusions This high-quality genome assembly represents a valuable resource for future studies of population genetics and genomics in the barn swallow and for studies concerning the evolution of avian genomes. It also represents one of the very first genomes assembled by combining SMRT long-read sequencing with the new Bionano DLS technology for scaffolding. The quality of this assembly demonstrates the potential of this methodology to substantially increase the contiguity of genome assemblies.


Blood sample collection
The blood used as a source of DNA was derived from a minimally invasive sampling performed on a female individual of approximately two years of age during May 2017 on a farm near Milan in northern Italy (45.4N 9.3E). Blood was collected in heparinized capillary tubes. Three hours after collection, the sample was centrifuged to separate blood cells from plasma and then stored at −80 • C.

DNA extraction and quality control for SMRT library preparation
DNA extraction was performed on the blood cell portion of centrifuged whole blood containing nucleated erythrocytes and leukocytes with the Wizard genomic DNA purification kit (Promega, cat. no. A1125), using the protocol for tissue (not human blood). This kit employs a protocol similar to the classic phenol/chloroform DNA extraction, with no vortexing steps after cell lysis. After purification, DNA quality and concentration were assessed by Nanodrop (Thermo Fisher Scientific, cat. no. ND-1000) and subsequently by pulsed field gel electrophoresis (PFGE). Detectable DNA was greater than 23 kbp in size, with the vast majority greater than 50 kbp and even greater than 200 kbp ( Supplementary Fig. S1). PFGE quality results were further confirmed by capillary electrophoresis on FEMTO Pulse instrument (AATI, cat. no. FP-1002-0275) ( Supplementary Fig. S2). DNA was stored at −80 • C and shipped to the sequencing facility on dry ice.

SMRT library preparation and sequencing
The SMRTbell Express Template Prep Kit (PacBio, cat. no. 101-357-000) was used to produce the insert library. Input genomic DNA (gDNA) concentration was measured on a Qubit Fluorometer dsDNA Broad Range (Life Technologies, cat. no. 32850). Next, 10 μg of gDNA was mechanically sheared to an average size distribution of 40-50 kbp using a Megaruptor Device (Diagenode, cat. no. B06010001). FEMTO Pulse capillary electrophoresis was employed to assess the size of the fragments. Then, 5 μg of sheared gDNA was DNA-damage repaired and end-repaired using polishing enzymes. Blunt-end ligation was used to create the SMRTbell template. A Blue Pippin device (Sage Science, cat. no. BLU0001) was used to size-select the SMRTbell template and enrich for fragments >30 kbp, excluding the first two cells for which the library was enriched for fragments >15 kbp. The sizeselected library was checked using FEMTO Pulse and quantified on a Qubit Fluorometer. A ready-to-sequence SMRTbell Polymerase Complex was created using the Sequel binding kit 2.0 (PacBio, cat. no. 100-862-200). The PacBio Sequel instrument was programmed to sequence the library on 18 Sequel SMRT Cells 1M v2 (PacBio, cat. no. 101-008-000), taking one movie of 10 hours per cell, using the Sequel Sequencing Kit 2.1 (PacBio, cat. no. 101-310-400). After the run, sequencing data quality was checked via the PacBio SMRT Link v5.0.1 software using the "run QC module". An average of 3.7 Gbp (standard deviation: 1.7) were produced per SMRT cell (average N50 = 25,622 bp), with considerable improvements between the average 15 kbp library and the 30 kbp library (see Supplementary Fig. S3 for more detailed statistics). We observed a wide distribution in the GC content of reads (Supplementary Fig. S4). This is likely explained by the presence in avian genomes of three classes of chromosomes: macrochromosomes  Mbp, 5 in chicken), intermediate chromosomes 5 in chicken), and microchromosomes (12 Mbp on average, 28 in chicken). Microchromosomes account for only 18% of the total genome but harbor ∼31% of all chicken genes, have higher recombination rates, and have higher GC contents on average [13].

Assembly of SMRT reads
The assembly of long reads was conducted with software CANU v1.7 (Canu, RRID:SCR 015880) [14] using default parameters except for the "'correctedErrorRate", which was set at 0.075. The assembly processes occupied 3,840 central processing unit (CPU) hours and 2.2 Tb of random access memory (RAM) for read correction, 768 CPU hours and 1.1 Tb of RAM for the trimming steps, and 3,280 CPU hours and 2.2 Tb of RAM for the assembly phase. The long-read assembly contained 3,872 contigs with a N50 of 5.2 Mbp for a total length of 1,311.7 Mbp (Table 1 and Supplementary Table S1). Final polishing was performed using the Arrow v2.10 software (Pacific Biosciences) and resulted in final coverage of 45.4×.

Cell count and DNA extraction for optical mapping
High-molecular-weight (HMW) DNA was extracted from 7-8 μL of the cell portion from the same blood sample used for SMRT sequencing with the Blood and Cell Culture DNA Isolation kit (Bionano Genomics, cat. no. RE-016-10). HMW DNA was extracted by embedding cells in low melting temperature agarose plugs that were incubated with Proteinase K (Qiagen, cat. no. 158920) and RNAseA (Qiagen, cat. no. 158924). The plugs were washed and solubilized using Agarase Enzyme (Thermo Fisher Scientific, cat. no. EO0461) to release HMW DNA and further purified by drop dialysis. DNA was homogenized overnight prior to quantification using a Qubit Fluorometer.

In silico digestion
The genome assembly obtained with CANU was in silico digested using Bionano Access software to test whether the nicking enzyme (Nb.BssSI), with recognition sequence CACGAG, and the non-nicking enzyme DLE-1, with recognition sequence CTTAAG, were suitable for optical mapping in our bird genome. An average of 16.9 nicks/100 kbp with a nick-to-nick distance N50 of 11,708 bp was expected for Nb.BssSI, while DLE-1 was found to induce 19.1 nicks/100 kbp with a nick-to-nick distance N50 of 8,775 bp, both in line with manufacturer's requirements.

DNA labeling for optical mapping
For NLRS, DNA was labeled using the Prep DNA Labeling Kit-NLRS according to manufacturer's instructions (Bionano Ge-nomics, cat. no. 80001). Then, 300 ng of purified gDNA was nicked with Nb.BssSI (New England BioLabs, cat. no. R0681S) in NEB Buffer 3. The nicked DNA was labeled with a fluorescent-dUTP nucleotide analog using Taq DNA polymerase (New England BioLabs, cat. no. M0267S). After labeling, nicks were ligated with Taq DNA ligase (New England BioLabs, cat. no. M0208S) in the presence of dNTPs. The backbone of fluorescently labeled DNA was counterstained overnight with YOYO-1 (Bionano Genomics, cat. no. 80001).
For DLS, DNA was labeled using the Bionano Prep DNA Labeling Kit-DLS (cat. no. 80005) according to manufacturer's instructions. Next, 750 ng of purified gDNA was labeled with DLE labeling Mix and subsequently incubated with Proteinase K (Qiagen, cat. no. 158920) followed by drop dialysis. After the clean-up step, the DNA was pre-stained, homogenized, and quantified using on a Qubit Fluorometer to establish the appropriate amount of backbone stain. The reaction was incubated at room temperature for at least 2 hours.

Generation of optical maps
NLRS and DLS labeled DNA were loaded into a nanochannel array of a Saphyr Chip (Bionano Genomics, cat. no. FC-030-01) and run by electrophoresis each into a compartment. Linearized DNA molecules were imaged using the Saphyr system and associated software (Bionano Genomics, cat. no. 90001 and CR-002-01).
In the experiment with Nb.BssSI, molecule N50 was 0.1298 Mbp for molecules above 20 kbp and 0.2336 Mbp for molecules above 150 kbp, with an average label density of 11.8/100 kbp for molecules above 150 kbp. Map rate was 38.9% for molecules above 150 kbp. Effective coverage was 28.2×. In the experiment with DLE-1, molecule N50 was 0.2475 Mbp for molecules above 20 kbp and 0.3641 Mbp for molecules above 150 kbp, with an average label density of 15.7/100 kbp for molecules above 150 kbp. Map rate was 56.4% for molecules above 150 kbp. Effective coverage was 30.6×. Using both Nb.BssSI and DLE-1, label metrics were in line with the manufacturer's expectations.

Assembly of optical maps
The de novo assembly of the optical maps was performed using the Bionano Access v1.2.1 and Bionano Solve v3.2.1 software. The assembly type performed was the "non-haplotype" with "no extend split" and "no cut segdups". Default parameters were adjusted to accommodate the genomic properties of the barn swallow genome. Specifically, given the size of the genome, the minimal length for the molecules to be used in the assembly was reduced to 100 kbp, the "Initial P value" cutoff threshold was adjusted to 1 × 10 −10 and the P value cutoff threshold for extension and refinement was set to 1 × 10 −11 according to manufacturer's guidelines (default values are 150 kbp, 1 × 10 −11 and 1 × 10 −12 , respectively).
A total of 233,450 (of 530,527) NLRS-labeled molecules (N50 = 0.2012 Mbp) were aligned to produce 2,384 map fragments with an N50 of 0.66 Mbp for a total length of 1,338.6 Mbp (coverage = 32×). Also, 108,307 (of 229,267) DLE-1 labeled input DNA molecules with a N50 of 0.3228 Mbp (theoretical coverage of the reference 48×) produced 555 maps with a N50 length of 12.1 Mbp for a total length 1,299.3 Mbp (coverage = 23×).

Purge of haplotigs and final assembly
Notably, all hybrid assemblies were somewhat larger than the expected genome size, and, in all cases, the N50 of un-scaffolded contigs was extremely low (0.06 Mbp for the dual-enzyme hybrid assembly). We hypothesized that a significant proportion of these small contigs might represent divergent homologous haplotigs that were assembled independently [15]. Similarity searches were consistent with this possibility as almost 95% of the contigs that were not scaffolded in the dual-enzyme hybrid assembly showed >98% identity to scaffolded contigs over 75% of their length or more. These contigs were discarded, resulting in a final assembly (Table 1 and Supplementary Table S1 for detailed statistics) of 1.21 Gbp (N50 = 25.9 Mbp) made up of 273 dual-enzyme hybrid scaffolds (N50 = 28.42 Mbp) and 91 un-scaffolded contigs (N50 = 0.0644 Mbp). The final assembly is slightly smaller than the previously estimated genome size (1.28 Gbp) [16]. This potentially reflects an imprecise older estimate and/or the possibility that some repeated sequences (e.g., centromeric and telomeric low-complexity regions) were either collapsed in the initial assembly steps or discarded in the final haplotig purging step described above. The average SMRT read coverage for the genome assembly was 34.15X (implying a theoretical quality value of more than 40). Supplementary Fig. S5 provides a summary of observed sequence coverage depth.

Annotation of genes and repeats
With respect to mammals, avian genomes generally contain relatively low proportions of repetitive sequences and show strong mutual synteny [17]. This appears to be the case for the barn swallow genome. In particular, 7.11% of the final assembly was annotated as repetitive using WindowMasker [18] and RepeatMasker (RepeatMasker, RRID:SCR 012954) [19]. The major contributors to annotated repeats were L2/CR1/Rex long interspersed nuclear elements (3.37%), retroviral long terminal repeats (1.59%), and simple repeats (1.56%). Repeats were soft-masked prior to de novo gene prediction using Augustus (Augustus, RRID:SCR 008417) [20] with Gallus gallus gene models. In all, 35,644 protein coding genes were predicted, of which 9,189 were overlapped by more than 30% of their size with repetitive genomic elements. Of the remaining 26,455 predicted protein coding genes, 24,331 harbored a PFAM protein domain (as identified by PfamScan v1.6 [21]). Simple similarity searches based on blastp [22] (with default parameters) suggested that 17,895 of the predicted protein coding genes have a best reciprocal blast hit with gene models derived from G. gallus GRCg6a assembly (as available from [23]), while 2,927 of the proteins predicted by Augustus did not show any significant match (e-value < = 1 × 10 −15 , identity >35%).
Protein sequences inferred from coding sequences identified by BUSCO v3 as barn swallow orthologs of universal avian singlecopy genes were aligned to passerine orthologs present in or-thoDB v9.0 [25] when all represented passerines had an annotated ortholog. A total of 3,927 protein alignments were generated using the software muscle v3.8.31 [26] with default settings. Software GBlocks v0.91b [27] with default settings apart from allowing gaps in final blocks was used to exclude lowquality alignment regions. Trimmed protein alignments were concatenated to produce a supergene alignment with 1,707,664 amino acid positions. Maximum-likelihood (ML) phylogenetic inference and estimation of aLRT branch support indexes were performed using the software PhyML v3.0 [28], with the LG substitution matrix [29] incorporating four variable and one invariable gamma distributed substitution rate categories. Distance bootstrap proportions (100 replicates) were estimated using the BioNJ method with the Kimura protein distance correction as implemented in the software SeaView v4.6.5 [26]. ML phylogenetic analysis of concatenated protein sequence alignments yielded a robustly supported topology ( Supplementary Fig. S6) that is consistent with previous phylogenomic studies [30,31] as well as with gene-level phylogenies [32,33].

Synteny with the chicken genome
Alignment of the final assembly with the most recent assembly of the chicken genome (GRCg6a) using D-Genies [34] indicates high levels of collinearity between these two genomes with a limited number of intra-chromosomal rearrangements (Fig. 2). The high level of collinearity between independently assembled and scaffolded sequences provides circumstantial support for the quality of both the contigs and the hybrid scaffolds, and it also is consistent with previous observations of high levels of synteny and minimal inter-chromosomal rearrangements among birds [17].
Overall, 90.44% of the chicken assembly can be uniquely aligned to regions in the barn swallow assembly. Table 2 shows for each chicken chromosome (assembly GRCg6a) the number of barn swallow scaffolds aligning uniquely (by best reciprocal Basic Local Alignment Search Tool [BLAST] analysis) as well as the percentage of the chicken chromosome involved in alignments. Together with the synteny plot shown in Fig. 2, these data indicate that a high proportion (>85%) of most barn swallow autosomes are likely assembled in fewer than 10 scaffolds. Indeed, several chromosomes are likely assembled as single scaffolds. However, some alignments of chicken chromosomes to the barn swallow assembly are either notably more fragmented or partial. In particular, a large proportion of chicken chromosomes 1 and 4 are represented in unique alignments with the barn swallow assembly. However, for both of these chromosomes, a number of rearrangements are implied (Fig. 2), which is in line with previous comparisons between the chicken genome and those of other Passeriformes [35][36][37]. Unambiguous matches between chicken chromosome 16 (2.84 Mb in the chicken assembly GRCg6a, 16 Mb according to flow karyotyping [38]) and the barn swallow assembly were scarce, consistent with previous reports of difficulties in assembling this chromosome [35,39] and likely due to the unusual gene distribution, presence of rRNA repeats, and the polymorphic and often polygenic major histocompatibility complex loci [40] on this chromosome. Similarly, chromosome 31, for which RepeatMasker identified 3.57 Mb (58% of the For each chicken chromosome, the number of scaffolds aligning uniquely as well as the percentage of the chicken chromosome involved in alignments are reported. GRCg6a chromosome assembly) as repeats, was also assembled in a rather fragmented manner in the barn swallow. Of the sex chromosomes, chicken chromosome Z sequences are well represented, if somewhat fragmented, in the barn swallow assembly. The discontinuous assembly of this chromosome is likely related to the widespread presence of repeats [41,42]. For the chicken W chromosome (6.81 Mb in the GRCg6a chromosome assembly, 43 Mb according to flow karyotyping [38]), apparent orthologs of 45 (of 53 single-copy genes annotated on the W chromosome of in the G. gallus assembly) were identified in the barn swallow genome, although only 46% of the assembled chicken chromosome found best reciprocal BLAST matches. Indeed, Avian W chromosomes are gene poor and contain long, lineage-specific repeats [43,44], complicating both assembly and comparative analyses.

Conclusion
Short-read next-generation sequencing (now known as secondgeneration sequencing, or SGS) technologies have allowed the production of cost-effective genome drafts for many birds and other vertebrate species [30,45,46]. However, the reduction in genome sequencing costs has typically come at the price of compromises in contiguity and accuracy of assemblies with respect to earlier efforts based on Sanger reads and extensive physical mapping [47]. Many limitations of SGS-based assemblies stem from the occurrence of long sequence repeats. In many animal species, transposons are frequently located in introns [48], and the presence of large gene families of closely related paralogs can lead to the existence of long "genic" repeats. Accordingly, even apparently contiguous genic regions can feature juxtaposition of paralogous gene fragments [15]. Given the inception of large-scale sequencing initiatives aiming to produce genome assemblies for a wide range of organisms [49][50][51][52], it is critical to identify combinations of sequencing and scaffolding approaches that allow the cost-effective generation of genuinely high-quality genome assemblies [10]. While exhibiting higher rates of single-base errors than some SGS approaches, TGS technologies, including SMRT sequencing, offer read lengths unparalleled by SGS or Sanger sequencing [53]. Moreover, recent and ongoing improvements in TGS methods are rapidly reducing the "per-base" cost of TGS data compared to that of SGS. On the other hand, as an alternative to scaffolding with long insert mate pairs [54] or to chromatin proximity ligation sequencing [55], contiguity and accuracy of long read-based assemblies can be further improved by optical mapping. This relies on nanoscale channels that can accommodate thousands of single, ultralong (>200 kbp) double-stranded DNA filaments in parallel, subsequently stained to recognize specific 6-7 bp long motifs [56]. The combination of long reads and optical maps has already proven invaluable to produce high-quality genome assemblies, even in the case of particularly complex genomes [57]. Here, using only SMRT sequencing and Bionano optical maps, we have produced a high-quality and contiguous genome for the barn swallow. With respect to a previously reported SGS-based assembly of the American barn swallow genome using a comparable amount of raw data [2], even the contigs generated from long-read sequencing alone show a 134-fold increase in N50, similar to the increase recently obtained for the Anna's hummingbird genome using the same approach [15]. Furthermore, the 1.6-fold change in scaffold N50 attained by Bionano NLRS HS before removal of haplotigs is comparable with results obtained by other genome assemblies that have employed this method [58]. Strikingly, the new DLS method greatly outperformed the NLRS system, providing a 3.3-fold increase of N50 (before removal of haplotigs). Moreover, incorporation of both labeling systems into the hybrid scaffolding yielded a final assembly showing 5-fold improvement of the N50 with respect to the original SMRT assembly, simultaneously providing "independent" validation of many scaffold junctions. We note that the presence of numerous microchromosomes in avian genomes restricts the final N50 value potentially attainable for the assembly. For example, the fully assembled karyotype of the chicken genome assembly (GRCg6a) would have an N50 of ∼90 Mbp. Yet, after removal of putative haplotigs, our genome assembly contiguity metrics meet the high standards of the VGP consortium Platinum Genome criteria (contig N50 in excess of 1 Mbp and scaffold N50 above 10 Mbp) [10]. Accordingly, we believe that the data presented here, while attesting to the effectiveness of SMRT sequencing combined with DLS optical mapping for the assembly of vertebrate genomes, will provide an invaluable asset for population genetics and genomics in the barn swallow and for comparative genomics in birds.

Re-use Potential
Future directions for the barn swallow genome will include further scaffolding using a Genome10K-VGP approach, the phasing of the assembly to generate extended haplotypes, a more thorough gene annotation using RNA/IsoSeq sequencing data, detailed comparisons with the genome of the North American subspecies H. r. erythrogaster, studies on the genomic architecture of traits under natural and sexual selection, and re-evaluation of data from population genetics studies conducted in this species (as it was shown that the availability of a high-quality genome may change the interpretation of some results), as well as characterization of the epigenetic landscape.

Availability of supporting data
Sequencing data supporting the results of this article are in the GenBank repository under Bioproject PRJNA481100, and the optical maps, annotations, and other data are available in the Gi-gaScience GigaDB repository [59].

Ethics approval
The blood sample used to generate the genomic data derived from a minimally invasive sampling on a single individual. Appropriate consent was obtained from the local authorities (Regione Lombardia).