SMRT long-read sequencing and Direct Label and Stain optical maps allow the generation of a high-quality genome assembly for the European barn swallow (Hirundo rustica rustica)

Background: The barn swallow (​Hirundo rustica​) is a migratory bird that has been the focus of a large number of ecological, behavioural and genetic studies. To facilitate further population genetics and genomic studies, here we present a high-quality genome for the European subspecies (​Hirundo rustica rustica​ ). Findings: We have assembled a highly contiguous genome sequence using Single Molecule Real-Time (SMRT) DNA sequencing and Bionano optical maps. We compared and integrated optical maps derived both from the Nick, Label, Repair and Stain and from Direct Label and Stain technologies. For our SMRT-only assembly, the direct labelling system more than doubled the assembly N50 with respect to the nickase system. The dual enzyme hybrid scaffold led to a further marginal increase in scaffold N50 and an overall increase of confidence in scaffolds. After removal of haplotigs, the final assembly is approximately 1.21 Gbp in size, with an N50 value of over 25.95 Mbp, representing an improvement in N50 of over 650 fold with respect to a previously reported assembly based on paired-end short read data. Conclusions: This high-quality genome assembly represents a valuable resource for further studies of population genetics of the barn swallow and for studies concerning the evolution of avian genomes. It also represents the first genome assembled combining SMRT sequencing with the new Bionano Direct Label and Stain technology for scaffolding, highlighting the potential of this methodology to contribute to substantial increases in the contiguity of genome assemblies.


Context
The barn swallow is a passerine bird with at least 8 recognized subspecies in Europe, Asia and North America. The European barn swallow ( Hirundo rustica rustica ) (Figure 1) breeds in a broad latitudinal range, between 63-68°N and 20-30°N [1] . Numerous evolutionary and ecological studies have focussed on its biology, life history, sexual selection, and response to climate change. More recently, the barn swallow has become the focus of genetic studies on the divergence between subspecies [2][3][4] and on the control of phenological traits [5][6][7][8] . Due to its synanthropic habits and its cultural value, the barn swallow is also a flagship species in conservation biology [1] . The availability of high-quality genomic resources, including a reference genome, is thus pivotal to further boost the study and conservation of this species. In 2016, Safran and coworkers reported the first draft of the genome for the American subspecies ( Hirundo rustica erythrogaster ) constructed from Illumina paired-end reads at 47x coverage depth [2] .
This assembly was described as containing 1.1 Gbp of assembled sequences (average contig length 11 kbp, contig N50 = 39 kbp, contig N90 = 3.8 kbp, longest scaffold: 732 kbp), compared to an estimated genome size of 1.28 Gbp [9] . Moreover, the assembly was derived from a male individual, excluding information for the W chromosome, as females are the heterogametic (ZW) sex in birds.
To address the aforementioned limitations, we have employed two single-molecule technologies, SMRT Third-Generation Sequencing (TGS) from Pacific Biosciences (Menlo Park, California, USA) and optical mapping from Bionano Genomics (San Diego, California, USA), to produce a state-of-the-art high-quality genome assembly for the European subspecies. For optical mapping we have labeled DNA molecules both with one of the original Nick, Label, Repair and Stain (NLRS) nickases (Nb. BssSI) and with the new Direct Label and Stain (DLS) approach (enzyme DLE-1). The latter technique was officially released in February 2018 and avoids nicking and subsequent cleavage of DNA molecules during staining [10] . We show that, at least with our data, DLS allows a considerable improvement of scaffold contiguity with respect to the nickase tested. Furthermore, the "dual enzyme" approach affords additional support for scaffold junctions. To our knowledge this genome assembly is the first to incorporate DLS data, and their integration with SMRT sequencing provided assembly contiguity metrics well in excess of those specified for "Platinum genomes" by the Vertebrate Genomes Project (VGP) [11] .

Blood sample collection
The blood used as a source of DNA was derived from a minimally invasive sampling performed on a female individual of approximately two years of age during May 2017 in a farm near Milan in Northern-Italy (45.4N 9.3E). Blood was collected in heparinized capillary tubes. Three hours after collection, the sample was centrifuged to separate blood cells from plasma and then stored at -80°C.

DNA extraction and quality control for SMRT library preparation
DNA extraction was performed on blood cells portion of centrifuged whole blood containing nucleated erythrocytes and leukocytes using the Wizard genomic DNA purification kit (Promega, Cat. No. A1125). This kit employs a protocol similar to classical Phenol/Chloroform DNA extraction, with no vortexing steps after cell lysis. After purification, DNA quality and concentration was assessed by  Figure 2). DNA was stored at -80°C and shipped on dry ice.  Mbp, 5 in chicken) and microchromosomes (12 Mbp on average, 28 in chicken). These last account for only 18% of the total genome but harbor ∼31% of all chicken genes, have higher recombination rates and higher GC contents on average [12] .

Assembly of SMRT reads
The final assembly of long reads was conducted with software CANU v1.7 [13] using default parameters except for the "correctedErrorRate" which was set at 0.075. The assembly processes occupied 3,840 CPU hours and 2.2 Tb of RAM for read correction, 768 CPU hours and 1.1 Tb of RAM for the trimming steps, and 3280 CPU hours and 2.2 Tb of RAM for the assembly phase. The assembly contained 3,872 contigs with a N50 of 5,2 Mbp for a total length of the assembly of 1311.7 Mbp (Table 1 and Supplementary Table 1). Final polishing was performed using the Arrow v2.10 software (Pacific Biosciences) and resulted in final coverage of 45.4x.

Cell count and DNA extraction for optical mapping
High-molecular weight (HMW) DNA was extracted from 7-8 µl of the cell portion from the same blood sample used for SMRT sequencing using the Blood and Cell Culture DNA Isolation kit (Bionano Genomics, Cat. No. RE-016-10 ). HMW DNA was extracted by embedding cells in low melting temperature agarose plugs that were incubated with Proteinase K (Qiagen, Cat. No. 158920) and RNAseA (Qiagen, Cat. No. 158924). The plugs were washed and solubilized using Agarase Enzyme (Thermo Fisher Scientific, Cat. No. EO0461) to release HMW DNA and further purified by drop dialysis. DNA was homogenised overnight prior to quantification using a Qubit Fluorometer.

In silico digestion
The genome assembly obtained with CANU was in silico digested using Bionano Access software to test whether the nicking enzyme (Nb.BssSI), with recognition sequence (CACGAG), and the non-nicking enzyme DLE-1, with recognition sequence (CTTAAG), were suitable for optical mapping in our bird genome. An average of 16.9 nicks/100 kbp with a nick-to-nick distance N50 of 11,708 bp were expected for Nb.BssSI, while DLE-1 was found to induce 19.1 nicks/100 kbp with a nick-to-nick distance N50 of 8,775 bp, both in line with manufacturer's requirements.

Generation of optical maps
NLRS and DLS labeled DNA were loaded into a nanochannel array of a Saphyr Chip (Bionano Genomics, Cat. No. FC-030-01) and run by electrophoresis each into a compartment. Linearized DNA molecules were imaged using the Saphyr system and associated software (Bionano Genomics, Cat. No. 90001 and CR-002-01). In the experiment with DLE-1, molecule N50 was 0.2475 Mbp for molecules above 20 kbp and 0.3641 Mbp for molecules above 150 kbp -with an average label density of 15.7/100 kbp for molecules above 150 kbp. Map rate was 56.4% for molecules above 150 kbp.
Effective coverage was 30.6x.
In the experiment with Nb.BssSI, molecule N50 was 0.1298 Mbp for molecules above 20 kbp and 0.2336 Mbp for molecules above 150 kbp -with an average label density of 11.8/100 kbp for molecules above 150 kbp. Map rate was 38.9% for molecules above 150 kbp. Effective coverage was 28.2x. Using both DLE-1 and Nb.BssSI, label metrics were in line with the manufacturer's expectations.

Assembly of optical maps
The de novo assembly of the optical maps was performed using the Bionano Access v1.2.1 and Bionano Solve v3.2.1 software. The assembly type performed was the "non-haplotype" with "no extend split" and "no cut segdups". Default parameters were adjusted to accommodate the genomic properties of the barn swallow genome. Specifically, given the size of the genome, the minimal length for the molecules to be used in the assembly was reduced to 100 kbp, the "Initial P-value" cut off threshold was adjusted to 1 x 10 -10 and the P-value cut off threshold for extension and refinement was Notably, all hybrid assemblies were somewhat larger than the expected genome size, and in all cases, the N50 of un-scaffolded contigs was extremely low (0.06 Mbp for the dual enzyme hybrid assembly).
We hypothesized that a significant proportion of these small contigs might represent divergent homologous haplotigs that were assembled independently. Similarity searches were consistent with this possibility as almost 95% of the contigs that were not scaffolded in the dual enzyme hybrid assembly showed > 98% identity to scaffolded contigs over 75% of their length or more. These contigs were discarded, resulting in a final assembly (Table 1 and Supplementary Table 1 [14] . 2 SMRT reads assembled using CANU v1.7 [13] . 3 SMRT contigs assembled with CANU and scaffolded using Bionano dual enzyme HS, with haplotigs removed as detailed in the text. *Based on a barn swallow genome size estimate of 1.28 Gbp [9] .

Annotation of genes and repeats
With respect to mammals, avian genomes generally contain relatively low proportions of repetitive sequences and show strong mutual synteny [15] . This appears to be the case also for the barn swallow genome. In particular, 7.11% of the final assembly was annotated as repetitive using RepeatMasker [16] , with the major contributions deriving from L2/CR1/Rex LINE elements (3.37%), retroviral LTRs (1.59%) and simple repeats (1.56%). These repeats were soft-masked prior to de novo gene prediction using Augustus [17] with Gallus gallus gene models.
In all, 35,644 protein coding genes were predicted, of which 9,189 were overlapped by more than 30% of their size with repetitive genomic elements. Of the remaining 26,455 predicted protein coding genes, 24,331 harbored a PFAM protein domain. Simple similarity searches based on blastp [18] (with default parameters) suggested that 17,895 of the predicted protein coding genes have a best reciprocal blast hit with gene models derived from Gallus gallus GRCg6a assembly (as available from [19] ), while 2,927 of the proteins predicted by Augustus did not show any significant match (e-value <= 1 x 10 -15 , identity > 35%).

BUSCO genes
Of a total of 4915 conserved bird Benchmarking with Universal Single-Copy Orthologs (BUSCO) groups [20] sought, 4,598 (93.6%) were complete, 4,521 (92.0%) were complete and single-copy, 77 (1.6%) were complete and duplicated, 192 (3.9%) were fragmented and 125 (2.5%) were missing. The percentage of contiguously assembled BUSCO genes is consistent with recent results with Anna's Hummingbird ( Calypte anna ) and the Zebra Finch ( Taeniopygia guttata ) [21] . We note that 40 of the "missing" bird BUSCO genes are absent from at least 2 of the 54 available avian genome sequences, suggesting that, despite the potentially incomplete nature of some draft genomes, some of these genes may not be universally conserved among birds.

Synteny with the Chicken and Hummingbird genomes
Alignment of the final assembly with the most recent assembly of the chicken genome (GRCg6a) using D-Genies [22] indicates high levels of collinearity between these two genomes with a limited number of intra-chromosomal rearrangements (Figure 2). The high level of collinearity between independently assembled and scaffolded sequences provides circumstantial support for the quality of both the contigs and the hybrid scaffolds and is consistent with previous observations of high levels of synteny and minimal inter-chromosomal rearrangements among birds [15] .

Conclusion
During the last 20 years, nucleic acid sequencing technologies have developed over four times faster than improvements of microchip complexity predicted by Moore's law [23,24] . While short-read NGS (now known as Second Generation Sequencing -SGS) technologies have allowed the production of cost-effective genome drafts for many birds and other vertebrate species [25][26][27] , the reduction in genome sequencing costs has typically come at the price of compromises in contiguity and accuracy of assemblies with respect to earlier efforts based on Sanger reads and extensive physical mapping [28] . Many limitations of SGS-based assemblies stem from the occurrence of long sequence repeats.
In many animal species, transposons are frequently located in introns [29] and the presence of large gene families of closely related paralogs can lead to the existence of long "genic" repeats.
Accordingly, even apparently contiguous genic regions can feature juxtaposition of paralogous gene fragments [21] . Given the inception of large scale sequencing initiatives aiming to produce genome assemblies for a wide range of organisms [30][31][32][33] , it is critical to identify combinations of sequencing and scaffolding approaches that allow the cost effective generation of genuinely high-quality genome assemblies [11] . While exhibiting higher rates of single-base errors than some SGS approaches, TGS technologies, including SMRT sequencing, offer read-lengths unparalleled by SGS or Sanger sequencing [34] . Moreover, recent and ongoing improvements in TGS methods are rapidly reducing the "per-base" cost of TGS data compared to that of SGS. On the other hand, as an alternative to scaffolding with long insert mate-pairs [35] or to chromatin proximity ligation sequencing [36] , contiguity and accuracy of long-read-based assemblies can be further improved by optical mapping.
This relies on nanoscale channels that can accommodate thousands of single, ultralong (>200 kbp) double-stranded DNA filaments in parallel, subsequently stained to recognize specific 8-9 bp long motifs [37] . The combination of long reads and optical maps has already proven invaluable to produce high-quality genome assemblies, even in the case of particularly complex genomes [38] . Here, using only SMRT sequencing and Bionano optical maps we have produced a high-quality and contiguous genome for the barn swallow. With respect to a previously reported SGS-based assembly of the American barn swallow genome using a comparable amount of raw data [2] , even the contigs generated from long-read sequencing alone show an 134-fold increase in N50. In terms of N50, number and average length, these contigs are similar to those recently obtained for the Anna's hummingbird genome using the same technology [21] . Furthermore, the fold change in N50 attained by Bionano NLRS hybrid scaffolding of the European barn swallow genome (1.6 fold before removal of haplotigs) is comparable with results obtained by other genome assemblies that have employed this method [39] . Strikingly, the new DLS method greatly outperformed the NLRS system, providing a 3.3 fold increase of N50 (before removal of haplotigs). Moreover, incorporation of both labelling systems into the hybrid scaffolding yielded a final assembly showing 5-fold improvement of the N50 with respect to the original SMRT assembly, simultaneously providing "independent" validation of many scaffold junctions. We note that the presence of numerous microchromosomes in avian genomes restricts the final N50 value potentially attainable for the assembly, as the fully assembled karyotype would have an N50 of~90 Mbp. Yet, after removal of putative haplotigs, our genome assembly contiguity metrics meet the high standards of the VGP consortium "Platinum Genome" criteria (contig N50 in excess of 1 Mbp and scaffold N50 above 10 Mbp) [11] . Accordin gly, we believe that the data presented here, as well as attesting to the effectiveness of SMRT sequencing combined with DLS optical mapping for the assembly of vertebrate genomes, will provide an invaluable asset for population genetics studies in the barn swallow and for comparative genomics in birds.

Re-use Potential
Future directions for the barn swallow genome include the phasing of the assembly to generate extended haplotypes, a more thorough gene annotation using RNA/IsoSeq sequencing data, detailed comparisons with genome data from Hirundo rustica erythrogaster, re-evaluation of data from previous population genetics studies conducted in this species, as well as characterization of the epigenetic landscape.

Ethics approval
The blood sample used to generate the genomic data derived from a minimally invasive sampling on a single individual. Appropriate consent was obtained from the local authorities (Regione Lombardia). Summary statistics for each SMRT cell employed.

Additional files
GC content distribution in all sequence reads.
Cumulative coverage distribution of the final (de-haplotyped) assembly of the barn swallow genome.
Coverage is indicated on the X axis. Red lines are used to display the proportion of the genome covered by more than 10, 20, 30, 40, 50 or 60 reads respectively. Comparison of assembly metrics for contigs and scaffolds between different assemblies. In hybrid scaffolds, the first column refers to assemblies including the un-scaffolded contigs while the second column only includes scaffolded contigs metrics. The estimated genome size of 1.28 Gbp is from [9] .
Average gene size was estimated according to the latest available annotation of the Gallus gallus genome (GRCg6a).