Highly Contiguous Genomes Improve the Understanding of Avian Olfactory Receptor Repertoires

Synopsis Third-generation (long-read–based) sequencing technologies are reshaping our understanding of genome structure and function. One of the most persistent challenges in genome biology has been confidently reconstructing radiations of complex gene families. Olfactory receptors (ORs) represent just such a gene family with upward of thousands of receptors in some mammalian taxa. Whereas in birds olfaction was historically an overlooked sensory modality, new studies have revealed an important role for smell. Chromosome-level assemblies for birds allow a new opportunity to characterize patterns of OR diversity among major bird lineages. Previous studies of short-read–based (second-generation) genome assemblies have associated OR gene family size with avian ecology, but such conclusions could be premature especially when new assembly methods reshape our understanding of avian OR evolution. Here we provide a fundamental characterization of OR repertoires in five recent genome assemblies, including the most recent assembly of golden-collared manakin ( Manacus vitellinus ). We find that short read-based assemblies systematically undercount the avian-specific gamma-c OR subfamily, a subfamily that comprises over 65% of avian OR diversity. Therefore, in contrast to previous studies, we find a high diversity of gamma-c ORs across the avian tree of life. Building on these findings, ongoing sequencing efforts and improved genome assemblies will clarify the relationship between OR diversity and avian ecology.


Introduction
Our understanding of avian sensory biology has progressed substantially in recent years. Studies have discovered fantastic ways in which birds experience the world, including the visual detection of nonspectral colors, the detection of sugar via a repurposed umami taste receptor in nectivorous species, and amphibious hearing in cormorants (Baldwin et al. 2014;Larsen et al. 2020;Stoddard et al. 2020). However, while studies investigating most senses, particularly bird vision, have received considerable attention, research into olfaction has lagged behind. Misconceptions about bird olfaction date back nearly 200 years, when John James Audubon falsely concluded that turkey vultures (Cathartes aura) could not smell carrion (Audubon 1826). Darwin also performed behavioral experiments on Andean condors (Vultur gryphus) to conclude that they could not smell meat (Darwin 1891). An examination of olfactory bulb size across a diversity of bird species concluded that birds could not have anything more than a rudimentary sense of smell (Hill 1905). In response to these conclusions, bird olfaction remained relatively unexplored until behavioral studies showed odor recognition in pigeons (Michelsen 1959). Following this study, there have been a wealth of morphological and behavioral studies testing for olfaction in both captive and wild birds (Bang and Cobb 1968;Hagelin 2007;Gwinner and Berger 2008;Nevitt et al. 2008;Krause et al. 2012; Van Huynh and Rice 2019).
To follow this appreciation for the behavioral and ecological roles of olfaction in birds, researchers have characterized bird olfactory receptors at a genomic level. Olfactory receptors (ORs) are seven transmembrane domain rhodopsin-like G protein-coupled recep-Advance Access publication June 28, 2021 C The Author(s) 2021. Published by Oxford University Press on behalf of the Society for Integrative and Comparative Biology. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com tors that detect odors when expressed in the olfactory sensory neurons of the nasal epithelium (Buck and Axel 1991;Mombaerts 2004). In the protruding cilia of the olfactory sensory neurons, ORs recognize specific volatile compounds in their transmembrane domain binding pocket, which creates a signaling cascade that depolarizes the cell membrane and sends an action potential to the olfactory bulb glomeruli and later the brain (summarized in Breer 2003). Each OR may recognize one or more odorants, and each odorant may be detected by one or more ORs, and so in this way, species may perceive a wide array of odors (Saito et al. 2009).
ORs constitute one of the largest gene families in vertebrates. For example, the elephant genome contains about 2000 intact ORs (Niimura et al. 2014). In birds, there are three major subfamilies of ORs, the alpha, gamma, and gamma-c (Niimura and Nei 2005). The alpha and gamma subfamilies are shared between all amniotes (Niimura and Nei 2005;Steiger et al. 2009). The third subfamily, gamma-c, is unique to birds (Niimura and Nei 2005;Silva et al. 2020). The gamma-c subfamily is numerous in some bird genomes comprising over 65% of the OR repertoire (Steiger et al. 2009;Khan et al. 2015). Gamma-c ORs are similar in sequence and sequences cluster by species rather than by orthologs among species (Steiger et al. 2009). Gamma-c ORs likely evolve with a high level of birth and death rates and gene conversion to maintain the speciesspecific clustering (Niimura and Nei 2005;Steiger et al. 2009).
The first genomic investigations to determine bird OR repertoire counts provided further evidence for the potential of birds to recognize a wide variety of odors. A total of 214 intact ORs were reported in chicken (Gallus gallus) and 134 reported in zebra finch (Taeniopygia guttata) by Steiger et al. (2009). The finding of hundreds of ORs in the chicken and zebra finch genomes was replicated using multiple OR identifying pipelines Khan et al. 2015;Vandewege et al. 2016). All of these studies identified similar OR counts as well as similar proportions of each OR subfamily, with the gammac family dominating the OR repertoire. The majority of ORs, however, were located on unmapped contigs, including over 90% of gamma-c ORs in chicken and zebra finch.
Second-generation (Illumina, short-read-based) genomes greatly broadened genome sampling across the tree of life, including birds . However, in these assemblies, intact OR numbers were significantly lower than had been observed in the Sanger-based chicken and zebra finch assemblies (Steiger et al. 2009;Khan et al. 2015). Particularly absent from analyses was the distinct avian radiation of the gamma-c OR subfamily, with 45 of 46 species with short-read assemblies yielding fewer than 25 gamma-c ORs (Khan et al. 2015). Despite sequencing technology being a common thread in the 45 assemblies with lower OR counts, technical explanations were ruled out in favor of evolutionary explanations for the observed patterns of diversity (Khan et al. 2015).
Chromosome-level reference genomes, using longread sequencing technology, should provide more reliable information about OR repertoire diversity in birds. The Vertebrate Genomes Project recently expanded chromosome-scale-assembly methods from model organisms across the vertebrate tree of life (Rhie et al. 2021). Combining these and other new assemblies, we are now able to characterize OR diversity in five bird genomes in which long-read approaches have been deployed (Feng et al. 2020;Liu et al. 2021;Rhie et al. 2021). Included in our species analyses is the new assembly of golden-collared manakin (Manacus vitellinus) that was sequenced as part of a collaborative effort within the National Science Foundation supported Research Coordination Network for biologists studying manakins (Pipridae). We directly compare Sanger, Illumina, hybrid, and Pac-Bio based assemblies to examine the ways in which our understanding of bird OR family repertoire, and our comprehension of avian olfactory capabilities, is shaped by these higher-quality assemblies.

Assembly selection
We sought to compare OR discovery rates and assembly quality by using select bird species with multiple publicly available genome assemblies on GenBank (https://www.ncbi.nlm.nih.gov/genbank/). Assemblies for each species varied in the sequencing technology employed and assembly software used (Table 1). In order to examine how variation in genome sequencing methods impacts OR discovery and description, we included one genome of each species with long-read-sequencing technology (for example, PacBio [Pacific Biosciences], Menlo Park, California) as well as one genome without long-read sequencing. We obtained two assemblies from five bird species: emu (Dromaius novaehollandiae), chicken (G. gallus), Anna's hummingbird (Calypte anna), golden-collared manakin (M. vitellinus), and zebra finch (T. guttata) Zhang et al. 2014;Feng et al. 2020;Liu et al. 2021;Rhie et al. 2021, Table 1). In addition to the availability of multiple genomes, these five species are representative across the three major groupings of extant birds, including the Paleognathae and two groups within the Neognathae, Galloanseres, and Neoaves, represent diverse ecology, and include two important avian models, chicken and zebra finch (Fig. 1A).

OR identification
To detect putatively functional ORs in the selected genomes, we created a BLAST query with a set of 2110 OR protein sequences from 6 mammals (Ornithorhynchus anatinus, Didelphis virginiana, Bos taurus, Canis lupus, Rattus norvegicus, Macaca mulatta), 2 birds (G. gallus, T. guttata), and 1 crocodilian (Gavialis gangeticus). We obtained this query OR set by combining previously published OR subgenomes (Niimura and Nei 2007;Niimura 2009;Vandewege et al. 2016). Using this query file, we performed TBLASTN searches against all 11 bird genomes with a threshold of E < 1e-20. To remove pseudogenized and truncated ORs, we filtered for hits > 250 amino acids long. For any single location on the genome, we filtered out hits within 100 bp of each other, and selected the lowest E-value associated with that location. After obtaining unique BLAST hits, we extracted the associated nucleotide sequence from the genome as well as 300-bp regions flanking the hit both upstream and downstream. We used a modified Perl script from Using this alignment, we recorded the position of the first amino acid in the first transmembrane domain. To estimate the location of the ORF start codon, we used modified Perl scripts from Beichman et al. (2019) to find the most appropriate methionine upstream of this recorded transmembrane start position (Montague et al. 2014;Beichman et al. 2019). ORF sequences were then truncated at the 5' ends to begin with this methionine. This set of ORFs was then aligned using the E-INSI-I parameters in MAFFT to a set of T. guttata reference ORs as well as 11 non-OR rhodopsin-like G-protein coupled receptors (non-OR GPCRs) that functioned as an outgroup (Katoh and Standley 2013;Niimura 2013;Vandewege et al. 2016;Beichman et al. 2019). We then used clustalW to generate a neighbor-joining tree from this alignment with 1000 bootstraps, gaps removed, and Kimura's distance correction (Kimura 1980;Goujon et al. 2010). We then removed any ORFs that were phylogenetically more closely related to the non-OR GPCRs.

Classification of final OR set
We classified all remaining ORFs as functional ORs. Using this final set, we ran a maximum-likelihood tree using IQ-TREE with automatic model selection and 1000 Shimodaira-Hasegawa-like approximate likelihood ra- tio test replicates (Minh et al. 2020). Using maximumlikelihood support values, we collapsed all nodes with <50% support into a polytomy using the iTOL software, and rooted the tree using the ancestral branch leading to the 11 non-OR GPCRs (Letunic and Bork 2019). We classified bird ORs into subfamilies alpha, gamma, and gamma-c based on the subfamily of the query sequence used to identify the OR and the location of the OR in one of the three distinct avian OR clades (Steiger et al. 2009;Vandewege et al. 2016). We then counted the final number of OR sequences as well as the number of ORs from each subfamily.

OR totals
We identified a total of 1496 ORs across all 10 bird assemblies from 5 species (Table 2). Of these ORs, the gamma-c subfamily constituted 77% (1158) of the total, while 18% (263) of the identified ORs were gamma, and 5% (74) were alpha subfamily ORs. For assemblies with long-read sequencing, we found 946 ORs, with 42 alpha (4%), 162 gamma (17%), and 741 gamma-c (78%). Within a single assembly, the chicken Ggal6 (see assembly abbreviations in Table 1) yielded the largest number of ORs, 355 in total, 303 (85%) of which were gamma-c ORs (Fig. 1E). Gamma-c represented 97% (179/184) of the ORs in zebra finch Tgut1, the highest percentage of gamma-c out of the total OR repertoire for any assembly. Emu, Dnov2, yielded both the highest gamma counts (75) and the highest alpha counts (26), in addition to 195 gamma-c ORs (Fig. 1F). In the OR maximumlikelihood phylogenies, we elected to present each species separately for clarity of visualization (Fig. 1B-E). We noted that in these analyses the gamma OR subfamily for D. novaehollandiae was not recovered as monophyletic (Fig. 1F). In other multispecies analyses we have done, this is not the case, and this is also not the case in our Dnov2 neighbor-joining tree (analyses not shown). This unusual pattern here seems to be driven by the long branch at the base of the Dnov2 alpha OR clade.

OR counts relative to previous studies
Overall, reanalysis of previously analyzed genomes was consistent with previous findings (Khan et al. 2015;Vandewege et al. 2016, Table 2). For Ggal4, we recovered 272 ORs, 6 more than previously recovered (Vandewege et al. 2016). Two previous searches of the Tgut1 assembly yielded 182 and 190 ORs, similar to our search of 184 ORs Vandewege et al. 2016, respectively). We also found similar subfamily diversity of gamma and gamma-c ORs in zebra finch and chicken (Niimura 2009;Steiger et al. 2009;Wang et al. 2013;Khan et al. 2015;Vandewege et al. 2016). Similar patterns emerged for the other short-read assemblies in our analysis (Table 2), giving us confidence in our methods of recovering ORs.

Assembly effects on OR subgenome
We found that assembly had substantial effects on the ability to reconstruct OR subgenomes. In four of the five surveyed species, inclusion of long-read sequencing (Pacific Biosciences) increased OR counts ( Table 2). The most pronounced effect on OR repertoire was in the gamma-c family, which also constitutes the majority of known avian ORs. Between D. novaehollandiae assemblies, contigN50 improved from 0.86 Mb in Dnov1 with Illumina sequencing to 14.09 Mb in Dnov2. We detected an additional 239 ORs in Dnov2, of which 188 (78.6%) were from the gamma-c family. In the case of M. vitellinus, no gamma-c representatives were recovered from Mvit1 in our analysis or a previous analysis (Khan et al. 2015), yet our search of Mvit3 yielded 97 gamma-c ORs (Fig. 1B).
Improved assemblies also resulted in the identification of additional ORs in the alpha and gamma sub- Fig. 2 Distribution of gamma-c ORs among chromosomes and scaffolds in chromosome-level assemblies. The largest gamma-c OR cluster in each long-read assembly is located on unmapped scaffolds except G. gallus, where the largest cluster is localized to chromosome 33, a microchromosome.
families as well, but these effects were less pronounced ( Table 2). The gamma subfamily of D. novaehollandiae and in M. vitellinus more than doubled in count in Dnov2 compared to Dnov1 (42 new ORs) and in Mvit3 compared to Mvit1 (10 new ORs), respectively. In other species, the relative increase in gamma was smaller (Table 2). Alpha OR counts were similar between within-species assemblies (Table 2). Unexpectedly, we identified no alpha ORs in theC. anna Cann2 assembly, though two were previously reported based on Cann1.
The Sanger-sequencing-based T. guttata genome Tgut1 unexpectedly yielded a higher number of ORs than Tgut2, which was assembled using multiple technologies. This species was the only case in which a newer assembly based on long-read technology yielded fewer ORs than the assembly without long-read sequencing. Despite an overall higher OR count in Tgut1, we found more alpha ORs (three vs. two) and more gamma ORs (six vs. three) in Tgut2 than in Tgut1. However, the overall count of ORs in Tgut2 was substantially lower as there were 119 fewer gamma-c ORs in Tgut2 than in Tgut1. Therefore, the lower OR count in Tgut2 is entirely due to differences in gamma-c OR recovery.

Physical location of ORs in avian genomes
Although new, chromosome-scale assemblies have assigned the vast majority of genomic sequence data to chromosomes (Rhie et al. 2021), gamma-c OR regions remain primarily assigned to unmapped scaffolds (Fig. 2). The exception to this rule is the Ggal6 assembly in which 302 out of 303 identified ORs have been assigned to chromosomes. Most of these (274 ORs) map to a single microchromosome 33 (Lee et al. 2020 , Fig. 3). The remaining ORs are divided between two other microchromosomes, chromosome 16 (8 ORs) and chro-mosome 31 (16 ORs), with a single receptor on an unmapped scaffold. For C. anna Cann2, only 20% of gamma-c receptors were on scaffolds assigned to chromosomes. The main clusters of ORs were a group of 10 assigned to the W chromosome and another 18 assigned to a single scaffold (RRCD01000065.1). For D. novaehollandiae Dnov2, only 8% of gamma-c ORs are assigned to chromosomes, with a large cluster of 108 on scaffold JABVCD010000554.1 (Fig. 2). The remaining gamma-c loci were distributed among 29 chromosomes and scaffolds. Finally, only 3% (2/60) of T. guttata Tgut2 gamma-c ORs were assigned to chromosomes with a cluster of 24 loci on scaffold VOHI02000029.1.

Long-read sequencing is critical for characterizing avian OR repertoire
Our results show that Illumina short-read-based approaches were not successful in accurately characterizing the gamma-c OR subfamily. The three Illuminabased assemblies we assessed undercounted gamma-c diversity, revealing fewer than 10 gamma-c ORs in each assembly. In all five of the assemblies sequenced with long reads, we found that gamma-c ORs constituted at least 66% of the OR subgenome, and, on average, there were 148 gamma-c loci per species. The hybrid assembly approach using MaSuRCA (Zimin et al. 2013) also substantially increased gamma-c recovery in Mvit3.
In the most phylogenetically comprehensive analysis of avian ORs to date, Khan et al. (2015) analyzed OR repertoire from 48 bird species. Forty-six of the assemblies surveyed were sequenced and assembled using short-read-based methods. The two other species included were the Sanger-based chicken and zebra finch, and those had the highest OR diversity, which was attributed to ecological adaptations of these two particular species (Khan et al. 2015). Other reports in the literature also interpret a lack of gammac ORs as biologically meaningful without considering the shortcomings of short-read-based assemblies (Zhan et al. 2013). Importantly, these issues extend beyond gamma-c ORs to other complex gene families such as the major histocompatibility complex (MHC). Longread-based studies are also improving our understanding of the avian MHC (He et al. 2021). Select MHC genes are linked to ORs in mammals, providing further evidence that this family repertoire may also be obscured in Illumina-sequenced assemblies (Santos et al. 2010).
Prior to our analysis, chicken (Ggal3, Ggal4) and zebra finch (Tgut1) were the only Sanger-based assemblies analyzed (Niimura 2009;Steiger et al. 2009;Wang et al. 2013;Khan et al. 2015;Vandewege et al. 2016). Analyses of these assemblies, which involve longer read lengths and Bacterial Artificial Chromosome (BAC)-based scaffolding, provided the only previous evidence of substantial gamma-c OR diversity. Whereas the incorporation of long-read sequencing methods greatly increased the count of gammac ORs relative to Illumina-based assemblies, they also reduced the count of gamma-c in T. guttata compared to the previous Sanger-based assembly. We propose two potential reasons for this disparity. It is possible that the original Tgut1 assembly resolved alternative alleles as separate loci, artificially inflating the total gamma-c count with duplicate loci. Tgut2 and many third-generation assemblies are haplotype phased, mitigating this problem. Additionally, however, filtering steps at the end of the Tgut2 assembly curation process used for quality control may aggressively remove repetitive regions that harbor tandem gamma-c loci.

Phylogenetic implications of updated OR counts
Our finding of high OR diversity for D. novaehollandiae has potentially important implications for broad-scale patterns of OR evolution in birds. High OR diversity in this paleognath genome contrasts with lower diversity found in previous analyses (Le Duc et al. 2015) and now suggests that the avian ancestor had high gammac OR diversity. Based on our assessment, all three OR subfamilies have lower diversity in the three neoavian lineages tested relative to D. novaehollandiae and G. gallus. Due to our limited taxonomic sampling, it remains unclear whether the differences reflect broader patterns of phylogenetic change or lineage-specific adaptations. For example, D. novaehollandiae and G. gallus are both omnivorous and therefore may retain ORs as a variety of different odorants are relevant when foraging. The three Neoaves species we selected are not generalists, instead C. anna is mostly nectivorous, M. vitellinus is mostly frugivorous, and T. guttata is a granivore. Our understanding of phylogenetic patterns of OR diversity will continue to change with improved genome assembly and taxonomic sampling. For example, by sequencing additional manakin species, we will be able to determine whether OR repertoire is elaborated within a frugivorous family or if counts vary substantially within individual lineages.

Toward a better understanding of the avian OR subgenome
Many key features of OR subgenomes and olfaction generally are well-characterized in mammals, but not in birds (Olender et al. 2008;Niimura et al. 2014). There are no previous reports of using assemblies with longread sequencing to search for an OR repertoire in birds, and until now, no previous reports of an expansive gamma-c OR repertoire outside of G. gallus and T. guttata assemblies. To date, there has been only one report of transcriptome sequencing an avian olfactory epithelium, a critical undertaking considering that ORs are expressed in nonolfactory tissues and some identified ORs may be nonfunctional despite having open reading frames (Pluznick et al. 2009;Sin et al. 2019). Although Sin et al. (2019) do show the expression of at least three gamma-c ORs in the Oceanodroma leucorhoa olfactory epithelium, it still leaves the expression of a potentially large number of gamma-c genes uncharacterized. A current priority would be to sequence the olfactory epithelium transcriptome from species with high-quality genome assemblies to understand the extent to which a high number of gamma-c ORs recovered from the genome are functionally expressed in the olfactory epithelium.
Information about which ORs are expressed in the olfactory epithelium will also help with functional testing to identify the binding properties of avian ORs in a process known as "deorphanizing." To date, no avian ORs have been deorphanized, and so it remains unclear what ligands are bound by any avian ORs. The diverse gamma-c ORs, unique to birds, are of particular interest. Avian olfaction is in many fundamental ways a frontier in the field of sensory biology.
There is also a great deal left to be learned about the molecular evolutionary mechanisms that have given rise to the diversity of gamma-c genes in birds. First, enhanced spatial information on the physical location of OR genes will be informative for understanding the genetic processes at play. With most loci still scattered among unmapped scaffolds, it is somewhat unclear how clustered these loci are. That said, the G. gallus Ggal6 OR repertoire is highly spatially localized (Figs 2 and 3), a pattern that is likely to be found in other species as well. Given spatial clustering, and extremely short branch-lengths, gamma-c species-specific clades could be a result of gene conversion among loci, but further study is needed. In the passerine bird MHC, endogenous retroviral elements may have played a role in generating gene family diversity in MHC Class II (Balakrishnan et al. 2010), and the same could be the case in the gamma-c radiation across birds. We note, however, that high repeat content is a general pattern across avian microchromosomes and is not restricted to OR and MHC regions (Fig. 3, Burt 2002).

Long-read-assembly methods better characterize complex gene families
Our increased OR counts in long-read assemblies contribute to the growing literature quantifying the ad-vantages of third-generation (long-read) sequencing technology over second-generation (short-read) sequencing technology. Third-generation sequencing has greatly improved the detection of tandem repeats generated by long terminal repeat retrotransposons, microsatellites, homonucleotide stretches, and repetitive regions (Mason et al. 2016;Kapusta and Suh 2017;Korlach et al. 2017). Large gene families found in clusters like the ORs described here may show the greatest improvement following the incorporation of long-read data. Long-read sequencing has already led to better characterization of MHC loci in birds, and also greatly increased the resolution and count of vomeronasal receptors in mammals (Larsen et al. 2014;He et al. 2021).
Even with long-read sequencing, the chromosomal location of all ORs remains largely unresolved. With the exception of the chicken, ORs in the long-read assemblies that we analyzed largely mapped to unassigned scaffolds, including the largest OR cluster in each assembly (Fig. 2). The inability to assign these scaffolds to a chromosome is likely related to the expansion of duplicate regions that contain OR loci and the high repeat element content found in microchromosomes (Fig. 3, Burt 2002). Indeed the assignment of an OR-containing region to the hummingbird female-specific W chromosome is likely spurious and driven by the repetitive sequences in these regions. Bird chromosomes are highly syntenic, suggesting that the large OR cluster on the chicken microchromosome 33 likely matches homologous chromosomes (Nanda et al. 2011). Manual curation of these regions may be required to resolve the remaining uncertainties. Other solutions to this complexity include physically mapping OR loci to chromosomes, and/or using approaches less dependent on assembly. Sin and colleagues (2019) incorporated assessment of depth of coverage in their OR quantification pipeline, an approach that should be informative in the face of varying assembly quality.
terial is based upon work conducted while C.N.B. was serving at the National Science Foundation.

Funding
This work was supported by National Science Foundation awards [grant numbers RCN-1457541, IOS-1456612].

Supplementary data
Supplementary data available at ICB online.

Data availability
No new data were generated or analyzed in support of this research.