De novo PacBio long-read and phased avian genome assemblies correct and add to reference genes generated with intermediate and short reads

Abstract Reference-quality genomes are expected to provide a resource for studying gene structure, function, and evolution. However, often genes of interest are not completely or accurately assembled, leading to unknown errors in analyses or additional cloning efforts for the correct sequences. A promising solution is long-read sequencing. Here we tested PacBio-based long-read sequencing and diploid assembly for potential improvements to the Sanger-based intermediate-read zebra finch reference and Illumina-based short-read Anna's hummingbird reference, 2 vocal learning avian species widely studied in neuroscience and genomics. With DNA of the same individuals used to generate the reference genomes, we generated diploid assemblies with the FALCON-Unzip assembler, resulting in contigs with no gaps in the megabase range, representing 150-fold and 200-fold improvements over the current zebra finch and hummingbird references, respectively. These long-read and phased assemblies corrected and resolved what we discovered to be numerous misassemblies in the references, including missing sequences in gaps, erroneous sequences flanking gaps, base call errors in difficult-to-sequence regions, complex repeat structure errors, and allelic differences between the 2 haplotypes. These improvements were validated by single long-genome and transcriptome reads and resulted for the first time in completely resolved protein-coding genes widely studied in neuroscience and specialized in vocal learning species. These findings demonstrate the impact of long reads, sequencing of previously difficult-to-sequence regions, and phasing of haplotypes on generating the high-quality assemblies necessary for understanding gene structure, function, and evolution.


The long-read assemblies result in 150-fold to 200-fold increases in contiguity
To generate long-read assemblies, high molecular weight DNA was isolated from muscle tissue 92 of the same zebra finch male and Anna's hummingbird female used to create the current 93 reference genomes [2,6]. The DNA was sheared, 35-40 kb libraries generated, size-selected for 94 inserts >17 kb (Fig. S1), and then SMRT sequencing performed on the PacBio RS II instrument 95 to obtain ~96X coverage for the zebra finch (19 kb N50 read length) and ~70X for the 96 hummingbird (22 kb N50 read length; Fig. S2). The long reads were originally assembled into a 97 merged haplotype with an early version of the FALCON assembler [15], which we found 98 unintentionally introduced indels for some nucleotides that differed between haplotypes (tested 99 on the hummingbird; data not shown). We then re-assembled using FALCON v0.4.0 followed by 100 the FALCON-Unzip module [16] to prevent indel formation and generate long-range phased 101 haplotypes. Thus, the new assemblies, unlike the current reference assemblies, are phased 102 diploids. This PacBio-based sequencing and assembly approach does not link contigs into 103 gapped scaffolds. Scaffolding requires additional approaches, which we will report on separately 104 in a study comparing scaffolding technologies with these assemblies. The results presented here 105 were found independent of scaffolding. 106 For the zebra finch, our long-read approach resulted in 1159 primary haplotype contigs 107 with an estimated total genome size of 1.14 Gb (1.2 Gb expected; [17]) and contig N50 of 5.81 4 Mb, representing a 108-fold reduction in the number of contigs and a 150-fold improvement in 109 contiguity compared to the current Sanger-based reference ( Table 1A). The diploid assembly 110 process produced 2188 associated, or secondary, haplotype contigs (i.e. haplotigs) with an 111 estimated length of 0.84 Gb (Table 1A), implying that about 75% of the genome contained 112 sufficient heterozygosity to be phased into haplotypes by  Unzip, the primary contigs are the longest path through the assembly string graph, the secondary 114 haplotigs are by definition shorter and can be more numerous, resulting in lower contiguity for 115 the haplotigs. Regions of the genome with very low heterozygosity remain as collapsed 116 haplotypes in the primary contigs.

117
The PacBio long-read assembly for the hummingbird was of similar quality, with 1076 118 primary contigs generating a primary haploid genome size of 1.01 Gb (1.14 Gb expected; [17]), 119 and a contig N50 of 5.36 Mb, representing a 116-fold reduction in the number of contigs and a 120 201-fold improvement in contiguity over the reference ( Table 1B). The length of the assembled 121 secondary haplotigs for the hummingbird was similar to that of the primary contig backbone 122 (1.01 Gb; the long-read length PacBio-based assembly. Improvement is calculated between the 2 nd and 3 rd columns 131 for the primary PacBio-based haplotype. The higher number of contigs in the secondary haplotype (5 th 132 column) are a result of the arbitrary assignment of shorter haplotypes to the haplotig category.

134
The long-read assemblies have more complete conserved protein coding genes 135 To assess gene completeness, we analyzed 248 highly conserved eukaryotic genes from the 136 CEGMA human set [18,19] in each of the assemblies. Both the PacBio-based zebra finch and 137 hummingbird assemblies showed improved resolution of these gene sequences, with a close to 138 doubling (~71%) for the zebra finch and 26% increase for the hummingbird in the number of 139 complete or near-complete (>95%) CEGMA genes assembled, compared to the references (Fig.  5   1A). Because updating the CEGMA gene sets was recently discontinued due to lack of continued 141 funding and ease of use (http://www.acgt.me/blog/2015/5/18/goodbye-cegma-hello-busco), we 142 also searched for a set of conserved, single-copy genes from the orthoDB9 [20] gene set using 143 the recommended replacement BUSCO pipeline [21]. We observed more modest improvements 144 (~10%) in the number of complete genes in the zebra finch (and no change with the 145 hummingbird) when assessed using the BUSCO v2.0 pipeline on a set of 303 single-copy 146 conserved eukaryotic genes (Fig. 1B), and barely any change (1-3%) when using a newly 147 generated BUSCO set of 4915 avian genes (Fig. 1C) CEGMA 303 eukaryotic set that includes several higher-quality genome assemblies, the PacBio-159 based assemblies had very few fragmented genes compared to the Sanger-based and Illumina-160 based assemblies (Fig. 1B). Thus, our new assemblies have the potential to upgrade the BUSCO 161 set to more complete and more accurately assembled genes, a conclusion supported by our 162 analyses below.

164
The long-read assemblies have greater and more accurate transcriptome and regulome 165 representations 166 To assess transcriptome gene completeness by an approach that does not depend on other 167 species' genomes, we aligned zebra finch brain paired-end Illumina RNA-Seq reads to the zebra 168 finch genome assemblies using TopHat2 [23]. We generated the RNA-Seq data from 169 microdissected RA song nuclei, a region that has convergent gene regulation with the human 170 laryngeal motor cortex (LMC) involved in speech production ( Fig. S4; [4]). The PacBio-based 171 assembly resulted in a ~7% increase in total transcript read mappings compared to the Sanger-172 based reference ( Fig. 2A), suggesting more genic regions available for read alignments. This was 173 explained by a decrease in unmapped reads and increase in reads that mapped to the genome 174 more than once (multiple) compared to the Sanger-based reference (Fig. 2B), supporting the idea 175 that the long-read assemblies recovered more repetitive or closely related gene orthologs. The

185
H3K27ac activity is generally high in active gene regulatory regions, such as promoters and 186 enhancers [27]. Similar to the transcriptome, there was an increase (~4%) of HK327ac Chip-Seq 187 genomic reads that mapped to the PacBio-based assembly compared to the Sanger-based 188 reference ( Fig. 2A). Unlike the RNA-Seq transcript reads, the ChIP-Seq genomic reads showed a 189 significant 10% increase in unique mapped reads with a concomitant decrease in multiple 190 mapped reads (Fig. 2B). We believe this difference is due to technical reasons in using paired-191 end transcript (RNA-Seq) versus single-end genomic (ChIP-Seq) read data, as a multiple-192 mapped increase with the RNA-Seq transcript data was not detected when using only one read of 193 each pair-end (p=0.3, paired t-test, n=5). Overall, these findings are consistent with the PacBio-194 based assembly having a more complete and structurally accurate assembly for both coding and 195 regulatory non-coding genomic regions.

Completion and correction of genes important in vocal learning and neuroscience research 198
The genome-wide analyses above demonstrate improvements to overall genome assembly 199 quality using long reads, but they do not inform about real-life experiences with individual genes 200 where there have been challenges with assemblies. We undertook a detailed analysis of four of examined thus far, part of the GC-rich promoter region is missing (Fig. 3A, gap 1). 212 In the zebra finch Sanger-based reference, EGR1 is located on a 5.7 kb contig (on 213 chromosome 13), bounded by the gap in the GC-rich promoter region and 2 others downstream 214 of the gene; gaps between contigs in the published reference were given arbitrary 100 Ns [2]. We 215 found that the PacBio long-read assembly completely resolved all three gaps in the zebra finch 216 EGR1 locus for both alleles, resulting in complete protein coding and surrounding gene bodies in 217 a 205.5 kb primary contig and a 129.1 kb secondary haplotig ( Fig. 3B; Fig. S5A). The promoter 218 region gap, located 572 bp upstream of the start of the first exon, was resolved by an 804 bp 219 repeat region that was also not supported by the PacBio data and also had low quality scores (Fig   224   3A,B, gap 2). The third 100 N gap, located ~3.5 kb downstream of the EGR1 gene, was resolved 225 by 18 bp of sequence in the PacBio assembly (Fig. 3B, gap 3). The PacBio-based differences in 226 the assembly were supported by numerous long-read (>10,000 bp) molecules that extended 227 through the entire gene, spanning all three gaps (Fig. S6A). The two haplotypes were >99.8% intron, and the GC-rich promoter region (Fig. 3C, black). Due to this gap in the reference, the 239 corresponding NCBI gene prediction (accession XP_008493713.1) instead recruited a stretch of 240 sequence ~7 kb upstream of the gap, predicting a first exon that has no sequence homology with 241 EGR1 in the PacBio-based assembly or to sequences of other species (Fig. 3C & D). Upstream 242 of this gap in the Illumina-based assembly was also a 200 bp tandem repeat that was not 243 supported by the PacBio sequence reads and the assembly (Fig. 3C, red; Fig. S5B). These PacBio-based hummingbird and zebra finch assemblies (Fig. S7). 252 These findings indicate that relative to the intermediate-and short-read assemblies, the 253 PacBio-based long-read assembly can fill in missing gaps in a previously hard-to-sequence GC-254 rich regulatory region, eliminate low quality erroneous sequences and base calls at the edges of 255 gaps in the Sanger-based assembly, and eliminate erroneous tandem duplications adjacent to 256 gaps, all preventing inaccurate gene predictions. In addition, using one species as a reference to 257 help assemble another may not work for such a gene, as the surrounding sequence to the gene 258 body in these two Neoaves species is highly divergent.  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 8 260 DUSP1. The dual specificity phosphatase 1 (DUSP1) is also an immediate early gene, but one 261 that regulates the cellular responses to stress [33]. In all species examined thus far it is mostly 262 up-regulated by activity in the highly active thalamic-recipient primary sensory neurons of the 263 cortex (i.e. mammal cortex layer 4 cells and the comparable avian intercalated pallial cells), but 264 within the motor pathways, it is only up-regulated to high levels by activity in the vocal learning 265 circuits of vocal learners [11,34]. This specialized regulation in vocal learning circuits has been 266 proposed to be associated with convergent microsatellite sequences found in the upstream 267 promoter region of the gene mainly in vocal learning species [11]. This was determined by PCR-268 cloning of single genomic molecules from multiple species, because the reference assemblies did 269 not have this region properly assembled [11].

270
In the zebra finch Sanger-based reference, DUSP1 is located on the chromosome 13 271 scaffold, separated in 3 contigs, with 2 gaps, all surrounded by low quality sequences (Fig. 4A).

272
The NCBI gene prediction of this assembly resulted in 4 exons generating a 322 a.a.

300
In the hummingbird Illumina-based assembly, the DUSP1 region was represented by 2 301 contigs separated by a large 1005 N gap (Fig. 4C), on a 7 Mb scaffold. In the PacBio-based 302 assembly, the entire gene was fully resolved ( Fig. 4C; Fig. S8B), in a much larger gapless 12.8 303 Mb contig (the second allele is fully resolved in a 3.8 Mb contig). Comparing the two assemblies 304 revealed that because of the gap in the Illumina-based reference, it lacks about half of the 305 DUSP1 gene, including the first two exons and introns, and ~380 bp upstream of the start of the 306 gene ( Fig. 4C). As a result, the corresponding NCBI gene prediction (XP_008496991.1) 307 recruited a sequence ~44 kb upstream predicting 46 a.a. with no sequence homology to DUSP1 308 of other species, whereas the PacBio-based assembly yielded a 369 a.a. protein with 99% 309 sequence homology to the PacBio-based zebra finch and chicken DUSP1 (Fig. 4D). A 200 bp 310 tandem repeat in the Illumina-based assembly downstream of the gap, erroneously in exon 3, is a 311 misplaced copy of the microsatellite region ( Fig. 4C; Fig. S8B). This is the reason why two 312 thirds of exon 3 is erroneously duplicated in the NCBI protein prediction (Fig. 4D). These 313 PacBio-based differences in the assembly were validated by single-molecule Iso-Seq mRNA 314 long-reads of DUSP1 (Fig. S11A). The PacBio assemblies also revealed that the microsatellite 315 region was significantly shorter in the hummingbird (~270 bp) than the zebra finch genome 316 (~1100 bp; Fig. S10B). 317 These findings in both species demonstrate that intermediate-and short-read assemblies 318 not only have gaps with missing relevant repetitive microsatellite sequence, but that short-read 319 misassemblies of these repetitive sequences lead to erroneous protein coding sequence 320 predictions. Further, not only does the long-read assembly resolve them, but it helps generate a 321 diploid assembly that resolves allelic differences and prevents erroneous assembly duplications 322 and misplacement errors between haplotypes. become the most studied gene for understanding the genetic mechanisms and evolution of 333 spoken language [43], yet we find that the very large gene body of ~400 kb is incompletely 334 assembled, including in vocal learning species (Fig. 5A). 335 In the zebra finch Sanger-based reference, FOXP2 is located on the chromosome 1A 336 scaffold and separated into 10 contigs (1 to 231 kb in length) with nine 100 N gaps each (Fig.   337 5A). These include 2 gaps immediately upstream of the first exon, making the beginning of the 338   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  quality Sanger-based reads (Fig. 5D). The DNA sequence between the two assembled PacBio 347 haplotypes was >99% similar across the entire 400 kb FOXP2 gene, and identical over the 348 coding sequence, with differences occurring in the more complex non-coding gaps that were 349 difficult to sequence and assemble by the Sanger method ( Fig. 5B *61 nucleotide differences 350 total). The predicted protein sequence from the PacBio-based assembly is identical to the 351 predicted Sanger-based reference (NP_001041728.1), with the exception of a.a. residue 42 352 (threonine vs. serine) (Fig. S13A). The PacBio nucleotide call also exists in the mRNA sequence 353 of another zebra finch animal in NCBI (NM_001048263.2) and in other avian species we 354 examined, and is thus likely a base call error in the Sanger-based zebra finch reference.

390
In the zebra finch Sanger-based reference, SLIT1 is located on chromosome 6, split 391 among 8 contigs with 7 gaps, and 7 additional contigs and gaps surrounding the ~40 kb gene 392 (Fig. 6A) predictions from the reference also resulted in a truncated protein with two missing exons (Fig.   396   6B). The PacBio-based assembly fully resolved the gene region, in two alleles on 15.7 Mb and 397 5.6 Mb contigs, respectively, and completely recovered all 35+ exons (Fig. S15A). Similar to 398 above, reference sequences flanking the gaps were found to be erroneous and corrected, and an 399 erroneous tandem duplication was also corrected (not shown). Filling in these gaps recovered the 400 two missing exons: exon 1 within a 1 kb region of sequence in the PacBio-based assembly that is 401 75% GC-rich, replacing 390 bp of erroneous gap-flanking sequence; and exon 35 adjacent to a 402 gap (Fig. 6A,B). A predicted exon upstream of exon 1 in a repeat region was not supported (Fig.   403   6A,B). The PacBio-based assembly thereby generates a complete SLIT1 gene prediction of 1538 404 a.a. (Fig. 6B). The gene is heterozygous in the individual, with 3 codon differences between the 405 two alleles (Fig. 6B, positions 90, 1006, and 1363, respectively), and an additional 24 silent 406 heterozygous SNPs across the coding region. The two alleles were phased along the entire length 407 of the gene.

419
These findings demonstrate that long-read assemblies can fully resolve a complex multi-420 exon gene, as well as have a higher base-call accuracy than Sanger-or Illumina-based reads in 421 difficult to sequence regions, including exons, leading to higher protein-coding sequence 422 accuracy.

424
Other genes. We have manually compared several dozen other genes between the different 425 assemblies, and found in all cases investigated that errors in the Sanger-based and Illumina-based 426 assemblies were corrected in the PacBio-based long-read assemblies. These genes included other 427 immediate early gene transcription factors, other genes in the SLIT and ROBO gene families, and 428 the SAP30 gene family, which all had the same types of errors in the genes discussed above. In 429 addition, we also found cases were genes were missing from the Sanger-based zebra finch or represented in the PacBio-based assemblies. We also noted cases where an assembled gene was 434 incorrectly localized on a scaffold in the Sanger-based assembly whose synteny was corrected 435 with the PacBio-based assembly, such as the vasopressin receptor AVPR1B, which will be 436 reported on in more detail separately. Data for these types of errors were not shown due to space   The PacBio-based long-read assemblies corrected these problems, and for the first time 453 resolved gene bodies of all the genes we examined into single, contiguous, gap-less sequences.

454
The phasing of haplotypes, although initially done to prevent a computationally introduced indel 455 error, reveal how important phasing is to prevent assembly and gene prediction errors. Thus far, 456 we have not seen an error (i.e. difference) in the genes we examined in the PacBio-based long- 457   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 read assembly relative to the other assemblies that was supported by single sequenced genomic 458 DNA molecules, RNA-Seq and Iso-Seq mRNA molecules, or other independent evidence. With 459 these improvements, we now, for the first time, have complete and accurate assembled genes of 460 interest that we now can pursue further without the need to individually and arduously clone, 461 sequence, and correct the assemblies one gene at a time.

462
Our study highlights the value of maintaining frozen tissue or cells of the individuals 463 used to create previous reference genomes, as we could only discover some of the errors (e.g. 464 caused by haplotype differences) by long-read de novo genome assemblies of the same 465 individual used to create the reference. We are now using these PacBio-based assemblies with 466 several groups and companies as starting assemblies for scaffolding into phased, diploid, 467 chromosome-level zebra finch and hummingbird assemblies to upgrade the references, which 468 will be reported on separately. However, even without scaffolding, these more highly contiguous 469 assemblies will be helpful to researchers to extract more accurate assemblies of their genes of    3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 SMRT sequencing was performed on the Pacific Biosciences RS II instrument at Pacific 498 Biosciences using an on plate concentration of 125 pM, P6-C4 sequencing chemistry, with 499 magnetic bead loading, and 360 minute movies. A total of 124 SMRT Cells were run for the 500 zebra finch and 63 SMRT Cells for the hummingbird. Sequence coverage for the zebra finch was 501 ~96 fold, with half of the 114 Gb of data contained in reads longer than 19 kb. For the 502 hummingbird, coverage was ~70 fold, with half of the 40.4 Gb of data contained in reads longer 503 than 22 kb (Fig. S2).  To compare protein amino acid sequence size between the CEGMA and BUSCO 534 datasets, we performed blastp of each CEGMA sequence against the ancestral proteins of the 535 target BUSCO dataset. We took the single best hit with an e-value cut off of 0.001 and extracted 536 the CEGMA and BUSCO protein length values. We then ran a one-sided paired Wilcoxon 537   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 signed-rank test of the two lengths for each protein.

556
RNA sequencing was centered around vocal learning brain regions in the zebra finch and will be 557 described in more detail in a later publication. We utilized our data here for population analyses 558 of assembly quality and for initial annotations. In brief, following modifications of a previously 559 described protocol [25], nine adult male zebra finches were isolated in soundproof chambers for 560 12 hours in the dark to obtain brain tissue from silent animals. Then brains were dissected from 561 the skull and sectioned to 400 microns using a Stoelting tissue slicer (51415). The sections were 562 moved to a petri dish containing cold PBS with proteinase inhibitor cocktail (11697498001).

563
Under a dissecting microscope (Olympus MVX10), the four principle song nuclei (Area X, 564 LMAN, HVC, and RA) as well as their immediate adjacent brain regions were microdissected 565 using 2mm fine scissors and placed in microcentrifuge tubes. The samples were stored at -80 566 °C. Then RNA was isolated and quantified, and samples of two birds were then pooled for each 567 replicate, resulting in 5 replicates (one single animal in one). RNA was converted to cDNA and 568 library preparation was performed using the NEXTflex™ Directional RNA-Seq Kit (Illumina) 569 and paired-end reads were sequenced on an Illumina HiSeq 2500 system. Adapters and poor 570 quality bases (<30) were trimmed using fastq-mcf from the ea-utilities package, and reads were 571 aligned to assemblies using Tophat2 (v2.0.14).