Canfam_GSD: De novo chromosome-length genome assembly of the German Shepherd Dog (Canis lupus familiaris) using a combination of long reads, optical mapping, and Hi-C

Abstract Background The German Shepherd Dog (GSD) is one of the most common breeds on earth and has been bred for its utility and intelligence. It is often first choice for police and military work, as well as protection, disability assistance, and search-and-rescue. Yet, GSDs are well known to be susceptible to a range of genetic diseases that can interfere with their training. Such diseases are of particular concern when they occur later in life, and fully trained animals are not able to continue their duties. Findings Here, we provide the draft genome sequence of a healthy German Shepherd female as a reference for future disease and evolutionary studies. We generated this improved canid reference genome (CanFam_GSD) utilizing a combination of Pacific Bioscience, Oxford Nanopore, 10X Genomics, Bionano, and Hi-C technologies. The GSD assembly is ∼80 times as contiguous as the current canid reference genome (20.9 vs 0.267 Mb contig N50), containing far fewer gaps (306 vs 23,876) and fewer scaffolds (429 vs 3,310) than the current canid reference genome CanFamv3.1. Two chromosomes (4 and 35) are assembled into single scaffolds with no gaps. BUSCO analyses of the genome assembly results show that 93.0% of the conserved single-copy genes are complete in the GSD assembly compared with 92.2% for CanFam v3.1. Homology-based gene annotation increases this value to ∼99%. Detailed examination of the evolutionarily important pancreatic amylase region reveals that there are most likely 7 copies of the gene, indicative of a duplication of 4 ancestral copies and the disruption of 1 copy. Conclusions GSD genome assembly and annotation were produced with major improvement in completeness, continuity, and quality over the existing canid reference. This resource will enable further research related to canine diseases, the evolutionary relationships of canids, and other aspects of canid biology.

The German Shepherd Dog (GSD) is one of the most common breeds on earth and has been 60 bred for its utility and intelligence. It is often first choice for police and military work, as well 61 as protection, disability assistance and search-and-rescue. Yet, GSD's are well known to be 62 afflicted with a range of genetic diseases that can interfere with their training. Such diseases 63 are of particular concern when they occur later in life, and fully trained animals are not able to 64 continue their duties. 65

Findings 66
Here, we provide the draft genome sequence of a healthy German Shepherd female as a 67 reference for future disease and evolutionary studies. We generated this improved canid 68 reference genome (CanFam_GSD) utilising a combination of Pacific Bioscience, Oxford 69 Nanopore, 10X Genomics, Bionano, and Hi-C technologies. The GSD assembly is 70 approximately 80 times as contiguous as the current canid reference genome (20.9 Mb vs 0.267 71 Mb contig N50), containing far fewer gaps (306 vs 23,876) and fewer scaffolds (429 vs 3,310) 72 than the current canid reference genome CanFam v3.1. Two chromosomes (4 and 35) are 73 assembled into single scaffolds with no gaps. Benchmarking Universal Single-Copy Orthologs 74 analyses of the genome assembly results show 93.0% of the conserved single-copy genes are 75 complete in the GSD assembly compared to 92.2% for CanFam v3.1. Homology-based gene 76 annotation increases this value to about 99%. Detailed examination of the evolutionary 77 important pancreatic amylase region reveals there are most likely seven copies of the gene 78 indicative of a duplication of four ancestral copies and the disruption of one copy. 79

Conclusions 80
GSD genome assembly and annotation were produced with major improvement in 81 completeness, continuity and quality over the existing canid reference. This resource will 82 In Australia, early imports of GSD's were known to have arrived from 1904. In October 1928 the Federal Government of Australia placed an importation ban on the breed, which was 121 enforced in 1929. During the course of the import ban, which was to stretch for another 43 122 years, few imports were smuggled into the country. The import ban was lifted in 1972 with 123 some restrictions remaining until 1976. With the lifting of the import ban, German, New 124 Zealand, England and some American dogs were imported into Australia, and the breed 125 enjoyed a surge in popularity. Currently, the GSD is the largest breed (purebred) dog 126 population in Australia [17]. 127

128
The aim of this study is to is to provide a high resolution long read de novo assembly of the 129 genome of a GSD female that is free of known genetic diseases (Figure 1). This de novo 130 genome assembly will be an invaluable tool for advancing knowledge of both simple and 131 polygenic genetic diseases and also the evolutionary affinities of the GSD. 132  assembled sequence contigs were scaffolded sequentially using 10X linked-reads, Bionano 146 optical mapping and Hi-C proximity ligation scaffolding. To increase the contiguity of the 147 assembly we used the SMRT and ONT reads to fill gaps, which was then followed by a final 148 round of polishing. Homology-based gene prediction was performed using Canis lupus 149 familiaris and eight related mammals. The resulting chromosome-length genome assembly 150 and its gene annotation was deposited to NCBI with accession number GCA_008641055.1. 151 In addition to the nuclear genome, the mitochondrial genome was assembled and has been 152 uploaded to GigaDB   dogs finding the expansion as early as the seventh century with between 4-16 copies in modern 208 dogs [30]. No long reads were found to span the entire region. The longest read in the region 209 covered no more than three complete copies, with four copies ultimately submitted in the GSD 210 assembly. Further examination of this region was attempted using both the Bionano optical 211 map and read depth analysis from the SMRT and ONT reads (Supplementary File 1). The 212 read depth results estimate that there are between 7-8 copies of the gene while the Bionano 213 map indicated the most likely copy number is seven (Figure 2). For the Bionano analysis, 214 single molecules of Bionano data were de novo assembled using a haplotype-aware algorithm 215 To better determine whether Bionano data supports seven or eight copies of the AMY2B repeat, 235 we compared the two genome map alleles against two synthetic sequence constructs containing 236 either seven (amy2b_dom7copyext) or eight (amy2b_dom8copyext) copies of the 14,862 bp 237 AMY2B repeat with the highest read depth support (namely the third copy) from the GSD 238 assembly, flanked by ~401.5 kb sequences assembled from SMRT and ONT reads 239 (Supplementary File 1). Alignment results confirmed the presence of seven repeat units, 240 showing a perfect alignment to the seven-copy sequence construct (Figure 2A), but a 241 "deletion" of one repeat unit relative to the eight-copy construct ( Figure 2B). Watch system categorises the GSD as a Category Three breed "requiring particular monitoring 282 and additional support" and considered to be more susceptible to developing specific health 283 conditions associated with exaggerated conformation. Breed Watch points of concern include 284 cow hocks, excessive turn of stifle, nervous temperament, sickle hock, and weak hindquarters 285 [37]. 286   287 The high-quality genome assembly will advance knowledge of breed specific diseases such 288 as CHD and extend to issues related to canine personality. The severity of CHD depends on 289 both genetic and environmental factors. In GSDs, the heritability (h 2 )  generations. In both breeds, the pattern of co-inheritance was found to be similar for a broad 297 personality trait previously named shyness-boldness with heritability estimated to be 0.25 in 298 the two breeds. Currently, the underlying genes involved in these behaviors are not known. 299

300
The assembly is expected to enable the selection of GSDs for particular duties including 301 police work where their sensitive nose is frequently used to discriminate odors. Robin et al. 302 [43] analyzed the nucleotide sequences of 109 OR genes (102 genes and seven pseudogenes) 303 in six different breeds including GSDs. In this study, they showed that OR genes are highly 304 polymorphic, with a mean of one SNP per 577 nucleotides. However, the degree of 305 polymorphism observed is highly variable, with some OR genes having few if any SNPs and 306 others being highly polymorphic (1 SNP/122 nt). Yang et al. [44] conducted a preliminary 307 study of 22 SNPs from the exonic regions of 12 OR genes in GSDs and found a significant 308 correlation between SNP genotypes of OR genes and olfactory abilities of dogs. 309 We envisage these data will also facilitate understanding of the evolution of dog breeds and 311 canids in general. The evolutionary position of the GSD among extant breeds is not firmly 312 established. The Federation Cynologique International places it in Group 1 as part of the 313 Herding group. Bigi et al. [45] hypothesised that the German shepherd dog was closely related 314 to the Czechoslovakian Wolfdog. More recently Parker et al. [6] proposed that the GSD is In selecting an animal for the project, it was considered essential to select a female that had 327 been cleared, as much as possible, of any recognisable inherited conditions. The animal needed 328 to display all the hallmarks of a good quality representative of the breed but need not 329 necessarily be a show-winning specimen. Nala is an easy going and approachable 5.5 year old 330 female (born 05 December 2013) and a treasured family pet that showed typical appearance 331 for a GSD. She has had no sign of hip dysplasia, that appears in GSD (Supplementary Figure  332   6), or any other known genetic diseases. Nala had a combined hip score of three (one on the 333 left hand side and two on the right hand side) when the x-ray was taken at five years of age: 334 each hip was measured on a 0 -53 scale, with a total of 106 being crippling. The score of three 335 is well below the current Australian average of nine for GSD's. She is registered with the 336 Australian National Kennel Council (2100398550)  High molecular weight (HMW) DNA was isolated from fresh blood (stored at 4°C) using the 404 Bionano Prep Blood DNA Isolation Protocol (Bionano Genomics (BNG), Document #30033 405 revision C). Briefly, after lysing the red blood cells, white blood cells were recovered and 406 embedded in agarose plugs. These plugs were subjected to Proteinase K (Qiagen Cat# 158920) 407 digestion for two rounds (2 hours, then overnight) at 50°C. Following extensive washing as 408 prescribed in the protocol, the plugs were melted and treated with GELase enzyme (Epicentre, 409 Catalog # G31200 were then aligned to the assembled optical maps based on DLE-1 labels using RefAligner. 467 Discrete sequence maps that can be linked via a Bionano genome map were scaffolded. 468 Alignments indicating conflict between the sequence and optical maps, and hence suggestive 469 of mis-assembly, were resolved. Specifically, optical maps supported by at least ten single 470 molecules at the conflict site were indicative of sequence mis-assembly, and so the sequence 471 map would be "cut" (split) at the conflict point. In contrast, insufficient single molecule 472 support for the optical map was indicative of optical map assembly error, and so the optical 473 map would be "cut" at the conflict site. Details of the method are provided in the Bionano 474 Solve Theory of Operation: Hybrid Scaffold (Document #30073). Following hybrid 475 scaffolding, 21 arbitrary 10 kb N-gaps (introduced during the sequence assembly process) 476 were re-sized based on estimated inter-label distances from the optical maps. In all, 160 477 sequence contigs were hybrid-scaffolded into 109 hybrid scaffolds with N50 of ~46.3 Mb. 478 The remaining 1,004 sequence contigs with an N50 of ~78.8 kb could not be scaffolded 479 either because they are too short (< 100 kb) for hybrid-scaffolding with Bionano maps or 480 because they did not align to any optical maps. 481 482

Chromosome-length assembly using Hi-C data 483
The Hi-C data was processed using Juicer (Juicer, RRID:SCR_017226) [60], and used as input 484 into the 3D-DNA pipeline [61] to produce a candidate chromosome-length genome assembly. 485 We performed additional finishing on the scaffolds using Juicebox Assembly Tools [62]. 486 Figure 3 shows the contact matrices generated by aligning the Hi-C data set to the genome 487 assembly before the Hi-C upgrade (on the left), and after Hi-C scaffolding (on the right). The 488 matrices are visualised in Juicebox.js, a cloud-based visualisation system for Hi-C data [63]  The Pilon-polished genome underwent a final scaffold clean-up to generate a high-quality core 509 assembly, remove low-coverage artefacts and haplotig sequences, and annotate remaining 510 scaffolds with potential issues. 511 512

Low-coverage filter 513
The TOW5157A1 library PacBio subreads (12.5M subreads; 108Gb) were mapped onto 514 the Nala_canu_arrow2_10x_racon_bionano_HiC_pbjelly_pilon assembly using Minimap2 515 v2.16 (-ax map-pb --secondary=no) [56]. Initial read depth analysis was performed with 516 BBMap v38.51 pileup.sh [66]. Any scaffolds with median coverage less than three (e.g., less 517 than 50% of the scaffold covered by at least three reads) were filtered out as low-coverage 518 scaffolds. Of the 1,057 Pilon-polished scaffolds, 220 scaffolds were removed in the initial low-519 coverage filter, leaving 837 scaffolds. 520 depth thresholds were set to 5X, 30X and 80X. Any scaffolds with <80% at diploid read 526 depth were identified by PurgeHaplotigs for reassignment. Scaffolds with 80%+ bases in the 527 low/haploid coverage bins and 95%+ of their length mapped by PurgeHaplotigs onto another 528 scaffold were filtered as haplotigs or assembly artefacts. Any other scaffolds with 80%+ low 529 coverage bases were filtered as Low Coverage. This analysis resulted in a further 11 scaffolds 530 filtered for low coverage and 268 filtered as haplotigs or assembly artefacts, leaving 558 531 scaffolds. 532 533

Purge Haplotigs analysis -round 2 534
Subreads were re-mapped on to the remaining 558 scaffolds resulting in a further 128 scaffolds 535 filtered as haplotigs or assembly artefacts leaving 430 scaffolds. No additional scaffolds 536 with 80%+ low coverage bases were identified. Any scaffold with 80%+ bases in the 537 low/haploid coverage bins were filtered as haplotigs or assembly artefacts. Scaffolds 538 with 20%+ diploid coverage were marked as retention as probable diploids. Scaffolds with 539 <20% diploid coverage and 50%+ high coverage were marked as probable collapsed repeats. 540 A single remaining scaffold marked as JUNK by PurgeHaplotigs (over 80% low/high 541 coverage) was also filtered as a probable artefact. 542 543

Purge Haplotigs analysis -round 3 544
Subreads were re-mapped on to the remaining 430 scaffolds for a third round 545

Gene prediction including annotation of repetitive elements 578
The genome was annotated using the homology-based gene prediction program GeMoMa 579 Jonkahra Lets Elope, was from Australian breeding and the sire, CH Arkon Vom Altenberger 630 Land was imported from Germany. Sequences were generated on the Pacific Biosciences 631 Sequel instrument (V2 chemistry) and Oxford Nanopore PromethION instrument (guppy 632 bascaller Version 3.0.6+9999d81) to ~30x genome coverage, each, based on a genome size 633 estimate of 2.4 Gb (this estimate is used for all coverage estimates). All long read sequences 634 were assembled with the Canu v1.8 algorithm then error corrected twice using the Arrow 635 genomic consensus polishing module. The assembly was scaffolded with Chromium 10x 636 linked-reads (~41x coverage excluding the barcode) using Long Ranger v2.1.6 using DNA 637 from the same animal. Polishing of the assembly for residual indels was done by aligning the 638 Illumina data with Minimap2 and the Racon algorithm. Single molecule Bionano data (~57x 639 effective coverage) was then used to superscaffold the sequence assembly using DNA extracted 640 from the same canid. For this, single molecule optical maps were first de novo assembled into 641 consensus maps, which were than aligned to the sequence assembly in silico digested with the 642 same labelling enzyme for hybrid scaffolding, using Bionano Solve (v3.2.2_08022018) with 643 RefAligner (7782.7865rel). This assembly was further scaffolded to chromosome-length by 644 the DNA Zoo following the methodology described here: www.dnazoo.org/methods. Briefly, 645 an in situ Hi-C library was prepared from a blood sample of a purebred GSD male named 646 Tydus (AKC Registration DN53646601) provided by the Cornell Veterinary Biobank and 647 sequenced to 29x coverage. The Hi-C data was processed using Juicer, and used as input into 648 the 3D-DNA pipeline to produce a candidate chromosome-length genome assembly. We 649 performed additional finishing on the scaffolds using Juicebox Assembly Tools. The assembly 650 was then long-read gap filled with the PBJelly algorithm, and the additional data error corrected 651 using Arrow. The Chromium data was mapped onto the assembly with the Long Ranger v2.   (>100bp)