Genus-wide characterization of bumblebee genomes reveals variation associated with key ecological and behavioral traits of pollinators

Bumblebees are a diverse group of globally important pollinators in natural ecosystems and for agricultural food production. With both eusocial and solitary lifecycle phases, and some social parasite species, they are especially interesting models to understand social evolution, behavior, and ecology. Reports of many species in decline point to pathogen transmission, habitat loss, pesticide usage, and global climate change, as interconnected causes. These threats to bumblebee diversity make our reliance on a handful of well-studied species for agricultural pollination particularly precarious. To broadly sample bumblebee genomic and phenotypic diversity, we de novo sequenced and assembled the genomes of 17 species, representing all 15 subgenera, producing the first genus-wide quantification of genetic and genomic variation potentially underlying key ecological and behavioral traits. The species phylogeny resolves subgenera relationships while incomplete lineage sorting likely drives high levels of gene tree discordance. Five chromosome-level assemblies show a stable 18-chromosome karyotype, with major rearrangements creating 25 chromosomes in social parasites. Differential transposable element activity drives changes in genome sizes, with putative domestications of repetitive sequences influencing gene coding and regulatory potential. Dynamically evolving gene families and signatures of positive selection point to genus-wide variation in processes linked to foraging, diet and metabolism, immunity and detoxification, as well as adaptations for life at high altitudes. These high-quality genomic resources capture natural genetic and phenotypic variation across bumblebees, offering new opportunities to advance our understanding of their remarkable ecological success and to identify and manage current and future threats.


197
Genome-scale phylogeny of bumblebees 198 The species-level molecular phylogeny ( Figure 1A) estimated from maximum-199 likelihood analysis with IQTree (Minh, et al. 2020b) is largely consistent with 200 previously inferred phylogenetic relationships of the 15 subgenera based on five genes 201 (Cameron, et al. 2007;Williams, et al. 2008), showing only two topological differences. 202 The results support previous conclusions that: (i) subgenus Mendacibobus (Md) is the 203 sister group to all the other subgenera; and (ii) lineages named Psithyrus (Ps) are within 204 the Bombus clade, arguing they should not be named as an independent genus ( Figure  205 1A). The species phylogeny was built from the concatenated aligned protein sequences 206 of 3,617 universal single-copy orthologs from 19 bumblebee species (17 from the 207 current study, two published previously: B. terrestris and B. impatiens (Sadd,et al. 208 2015)) and four honeybee species (A. florea, A. dorsata (Oppenheim, et al. 2020), A. 209 cerana (Park, et al. 2015), and A. mellifera (Weinstock, et al. 2006)), with orthologous 210 groups delineated using the OrthoDB software (Kriventseva, et al. 2015). 211 Complementary analysis with ASTRAL (Zhang, et al. 2018) resulted in an identical 212 species tree with the exception of the placement of B. pyrosoma, which no longer forms 213 a monophyletic pairing with B. breviceps (Additional file 2: Figure S4). This type of 214 discordance between species tree methods is consistent with a known shortcoming of

231
Differences in the total content of transposable elements (C) and coding DNA sequences (D) of the 19 genomes 232 relative to that of B. superbus (which has the smallest genome assembly size) are plotted against their genome size 233 differences (relative to that of B. superbus).

234
However, inferring rooted gene trees from 3,530 single-copy orthologous groups 235 reveals extreme levels of discordance: none of their topologies match the topology of 236 the tree inferred from concatenation (Additional file 1: Table S5 and Additional file 1:  237  Table S6), and nearly every gene tree has a unique topology (Additional file 1: Table  238 S7). Such extreme levels of discordance have been observed previously in birds 239 (Jarvis, et al. 2014) and tomatoes (Pease, et al. 2016), and have been attributed to a 240 variety of sources, such as ILS and introgression (Maddison 1997 (node labels in Figure 1A). These site concordance factors, the short internal branches 247 of the species tree, and the strong correlation between them (Additional file 2: Figure  248 S5), are consistent with ILS driving the observed gene tree discordance. Gene-level 249 phylogenies are therefore used in all subsequent gene-based molecular evolution 250 analyses because such discordance can bias inferences of substitutions when mapped 251 onto a species tree (

Major genomic rearrangements in social parasites 254
The five Hi-C genome assemblies indicate that four of the five subgenera have 18 255 chromosomes (Figure 2A and 2C; Additional file 2: Figure S6A-B), consistent with 256 previous karyotypic analysis that inferred the ancestral chromosome number is 18 257 (Owen, et al. 1995). However, the social parasite bumblebee, B. turneri, subgenus 258 Psithyrus, has 25 chromosomes ( Figure 2B), consistent with previous cytological 259 work (Owen and Robin 1983). Despite the higher chromosome number, its genome 260 size is within the range of other bumblebees ( Figure 1; Table 1). Pairwise 261 comparisons between B. turneri and each of the other four chromosomal-level 262 assemblies to investigate macrosynteny relationships and understand how a 25-263 chromosome karyotype was derived from the ancestral state revealed three processes. 264 First, some chromosomes descended, structurally unchanged, from ancestral 265 chromosomes (e.g., chromosome 5; Figure 2D in blue). Second, some originated by 266 fission of an ancestral chromosome (e.g., 11 and 25 of B. turneri originated by the 267 fission of ancestral chromosome 11; Figure 2D in red). Lastly, some are derived from 268 fusions of two or more ancestral chromosome segments (e.g., B. turneri chromosome    have a high recombination rate (Wilfert, et al. 2007), their rates of chromosome 290 evolution are relatively slow, which is further supported by the observed high synteny 291 contiguity across species (average 88%, from 80-95%; Additional file 1: Table S9).

293
Transposable elements drive genome size variation 294 Genome assembly sizes (haploid) range from 230 Mb in B. superbus to 262 Mb in B. 295 sibiricus (Figure 1). Ancestral genome size inference of bumblebees produced an 296 estimate of 230-231 Mb, similar to that of members of the subgenus Mendacibombus, 297 but smaller than the genomes of all other extant bumblebees surveyed (Additional file 298 2: Figure S7).

304
Mendacibombus species have a smaller genome size than other species (Figure 1), 305 and TEs that transposed in non-Mendacibombus species after divergence from 306 Mendacibombus show copy numbers ranging from 1,992-4,755 (Additional file 2: 307 Figure S9), supporting the contribution of TEs to genome size evolution. 308 Furthermore, TE proliferation history analysis indicated that all non-Mendacibombus 309 species have more recent TE amplification peaks (Additional file 2: Figure S10), 310 consistent with increased TE activity driving genome size increases.

312
The genomic distributions of TEs include 1,074-1,786 TE loci that overlap with the 313 coding regions of protein-coding genes (Additional file 1: Gene content evolution reflects foraging and diet diversity 323 Orthology delineation results indicate that a majority of genes are found in one or 324 more copies in nearly all lineages across bumblebees ( Figure 1B). These include 53 325 Bombus-specific ortholog groups, which are present in all 19 bumblebees but absent 326 in all four honeybees ( Figure 1B; Additional file 1: Table S13), and may play roles in 327 lineage-specific traits. Functional annotation suggests that five of these Bombus-328 specific genes are associated with protein metabolism and transport (Additional file 1: 329 Table S13), potentially linked to the higher protein content of pollen collected by 330 bumblebees than honeybees (Leonhardt and Blüthgen 2011). Ortholog groups with 331 the broadest species representation are functionally enriched for core biological 332 processes such as protein transport, signal transduction (e.g. Wnt pathway), 333 (de)ubiquitination, and cytoskeleton organization (Additional file 1: Table S14). In 334 contrast, those with sparse or lineage-restricted species representation are enriched for 335 processes including smell and taste perception, amino acid biosynthesis, and 336 oxidation-reduction (Additional file 1: Table S14). On average, 465 species-specific 337 genes (those without an ortholog in any other lineage) were identified in each 338 bumblebee species (range 137-767) (Additional file 1: Table S15), which may 339 contribute to species-specific traits but whose functional roles remain to be explored.

341
Turnover analysis of gene repertoires across the Bombus phylogeny (15 species, one 342 per subgenus) using CAFE v3.0 ) identified expansions and 343 contractions among 13,828 gene families and quantified variations in gene gain/loss 344 rates across species (Additional file 2: Figure S11). After error correction, the overall 345 rate of gene gain/loss in Bombus genomes is 0.0036/gene/million years, similar to an 346 analysis of 18 anopheline species and 25 drosophilids (Additional file 1: measures of gene copy number variation also identifies these processes as enriched 357 among the most variable gene families, in contrast to the most stable that are involved 358 in processes related to translation, adhesion, and transport (Additional file 1: Table  359 S19). In terms of protein domain copy number evolution, the most highly variable 360 genes are those with protein-protein interaction mediating F-box domains, putatively 361 DNA-binding SAP motifs, and phosphate-transferring guanylate kinases (Additional 362 file 1: Table S20).

364
Stable intron-exon structures with abundant stop-codon readthrough 365 Protein-coding potential analysis using B. terrestris as the reference species identified 366 851 candidate readthrough stop codons (Additional file 2: Figure S12; Additional file 367 1: In contrast, intron-exon boundaries within Bombus genes are relatively stable. 381 Examining evolutionary histories of intron gains and losses revealed few changes, 382 representing only 3-4% of ancestral intron sites, with more gains than losses 383 (Additional file 2: Figure S13; Additional file 1: Table S22), unlike drosophilids and  384 anophelines where losses dominate (Neafsey, et al. 2015), suggesting that bumblebee 385 gene structure has remained relatively stable over the 34 million years since their last 386 common ancestor.

395
Divergence and selective constraints of protein-coding genes 396 Bumblebee genes with elevated sequence divergence and/or relaxed constraints 397 include processes related to smell perception, chitin metabolism, RNA processing, 398 DNA repair, and oxidation-reduction ( Figure 3). Measures of evolutionary rate 399 (amino acid sequence divergence) and selective constraint (dN/dS) showed similar 400 trends among different functional categories of genes. Most genes are strongly 401 constrained, with median estimates of dN/dS much lower than one. Assignment of GO 402 terms and InterPro domains is usually biased towards slower-evolving, well-403 conserved genes (Additional file 2: Figure S14). Nevertheless, functional categories 404 with the fastest-evolving genes are further supported and complemented by 405 examining molecular function GO terms (Additional file 2: Figure S15A) and 406 InterPro domains (Additional file 2: Figure S15B), which show elevated rates for 407 odorant binding, olfactory receptor activity, chitin binding, oxidoreductase activity, 408 top 20% respectively (Additional file 2: Figure S16), showed genes with the slowest 411 evolutionary rates and the lowest dN/dS ratios were enriched for essential house-412 keeping biological processes and molecular functions (Additional file 1: Table S23; 413 Additional file 1: Table S24). In contrast, genes with the fastest evolutionary rates 414 were enriched for processes linked to polysaccharide biosynthesis, tRNA 415 aminoacylation, drug binding and RNA methyltransferase activity (Additional file 1: 416 Table S23). Genes with the highest dN/dS ratios were enriched for processes and 417 functions including proteolysis, translation, ncRNA processing, and chitin metabolism 418 (Additional file 1: Table S24). 419 420 Codon usage bias driven by AT content 421 Analysis of codon usage bias showed no evidence for selection on optimal codons, in 422 contrast to drosophilids but similar to anophelines (Neafsey, et al. 2015;Vicario, et al. 423 2007). Instead, codon usage bias in bumblebees seems to be driven mainly by AT 424 content, consistent with previous reports in Hymenoptera (Behura and Severson 425 2012). Optimal codons were estimated in each species and correlation coefficients 426 were computed between relative synonymous codon usage (RSCU) and effective 427 number of codons (ENC) per gene. All species have a similar preference and intensity 428 of preference; for each amino acid, there was a consistently highly preferred codon 429 and often a secondarily preferred one, all ending in A/T (Additional file 2: Figure  430 S17). To test if codon usage could largely be explained by mutation bias, a linear 431 model was used to predict Fop (frequency of optimal codon) from overall gene AT 432 content and amino acid use. The model explained 99.2% of the Fop variation without 433 the need to include the species origin of each gene. The AT content alone explained 434 81% of the variation (Additional file 2: Figure S18). Moreover, a strong correlation 435 was observed between codon AT content and the correlation between RSCU and 436 ENC across all species (Additional file 2: Figure S19). in a subset of species, including putative pheromone receptors (Additional file 1: 463 Table S26). Compared with ORs, GRs and IRs have much lower and more stable 464 gene counts (Additional file 2: Figure S20). However, despite overall conservation of 465 gene number and widespread evidence for purifying selection, there is evidence that 466 some GR and IR genes experienced positive selection in a subset of species, including 467 receptors putatively involved in sensing fructose and temperature (Additional file 1: 468 Table S26).  Table S27), indicating a genus-wide deficit of this gene category, previously 490 observed in two bumblebees (Sadd, et al. 2015). There are 88 detoxification genes on 491 average in bumblebees, with little variation across species (Additional file 1: Table  492 S27). Despite overall conservation of gene number and widespread evidence for 493 purifying selection (mean dN/dS is 0.26), a total of 30 detoxification genes, including 494 CCEs, P450s, and GSTs, showed evidence of positive diversifying selection in a 495 subset of species (Additional file 1: Table S28). 496 497 Immune defense: Immune genes are involved in recognition of and defense against 498 pathogens. Similar to detoxification genes, counts in the 17 sequenced genomes are 499 much lower than in drosophilids and anophelines (Additional file 1: Table S29), 500 showing that the previously noted paucity in two bumblebees (Barribeau, et al. 2015; 501 Sadd, et al. 2015) extends to the whole genus. Bumblebee genomes contain 502 components of all major immune pathways described in insects, and gene counts are 503 fairly conserved across species (Additional file 1: Table S29). For example, all 504 species have two genes encoding Gram-negative bacteria binding-proteins, while 505 peptidoglycan-recognition proteins are more variable with between four and six gene 506 copies. Comparing dN/dS ratios between immune genes and all single-copy 507 orthologous genes in bumblebees showed that immune genes exhibit slightly higher 508 dN/dS ratios (P = 0.04, Wilcoxon rank sum test), and among immune genes, 509 recognition and signaling genes have higher dN/dS ratios than effector genes ( Figure  510 4B). In addition, despite widespread evidence for purifying selection, a total of 77 511 immune genes showed evidence of positive selection in a subset of bumblebee species 512 (Additional file 1: collected at elevations > 4,000 m that represent three subgenera (Figure 1). No genes 519 show signatures of positive selection in all high-elevation species but none of the low-520 elevation species. However, nine genes show evidence of positive selection in species 521 representing two of the three high-elevation subgenera, but none of the low-elevation 522 species (Additional file 1: Table S31). Two encode Myosin-VIIa and CPAMD8, 523 respectively, which are involved in eye development (Cheong, et al. 2016;Williams 524 and Lopes 2011). As bumblebees detect flowers visually (Meyer-Rochow 2019), 525 signatures of selection might be related to fine tuning eye development for optimal 526 foraging in high altitude light conditions. Three genes encode histone deacetylase, 527 synaptotagmin-12, and heterogeneous nuclear ribonucleoprotein, which are involved 528 in maintaining muscle integrity and keeping "flight state", which is critical for 529 undertaking long-distance food-searching (Liu, et Table S31). 534 The remaining gene encodes a proton channel, which may be also involved in the 535 metabolic adaptation to hypoxia (Bacon and Harris 2004). 536 537 Sex-determination: Evolutionary analysis of sex-determination genes in bumblebees 538 and related species indicated that all Bombus genomes share a duplicated copy of 539 feminizer (fem), named fem 1 ( Figure 4C). Compared to fem, fem 1 shows a higher 540 level of divergence among bumblebees (femBombus dN/dS = 0.24; fem 1Bombus dN/dS = 541 0.77; Figure 4C). These ratios are close to the range observed for Apis, in which fem 542 has evolved under purifying selection and the paralogous gene complementary sex 543 determiner (csd) has evolved by neo-functionalization ( Figure 4C) (Hasselmann,et al. 544 2008). A hypothesis branch-site testing framework (RELAX), identifies evidence for 545 relaxation of selection in fem 1Bombus compared to femBombus (P<0.001, LR = 36.34). 546 Moreover, the spurious action of diversifying selection on branches was 547 predominantly found in fem 1Bombus ( Figure 4C). A mixed effect model of evolution 548 (MEME) was applied to identify individual sites that were subject to episodic 549 diversifying selection, and at least 15 sites (p< 0.05) were found to be under positive 550 selection, with some being located in known motifs (Additional file 2: Figure S21). 551 The results of these selection analyses suggest that both fem and fem 1 contribute to 552 the Bombus sex determination pathway. For the transformer 2 (tra-2) gene, consistent 553 amino acid changes between Bombus and Apis were found within the RNA 554 recognition domain (Additional file 2: Figure S22), supporting a previous hypothesis 555 of a regulatory modification between the two groups .

557
Discussion 558 Comparative analysis of multiple genomes in a phylogenetic framework substantially 559 improves the precision and sensitivity of evolutionary inference and provides robust 560 results identifying stable and dynamic features. In this study, we performed 561 comparative analyses of genome structures and contents, as well as global and family-562 targeted gene evolutionary dynamics across the phylogeny of Bombus, using 17 563 annotated de novo assemblies and two previously published genomes. 564 565 Many attributes of bumblebee genomes are highly conserved across species. For 566 example, overall genome size and genome structure, the number of protein-coding 567 genes and non-coding RNAs, gene intron-exon structures, and the pattern of codon 568 usage are all very similar across these 19 genomes. However, other aspects of genome 569 biology are dynamically evolving. TEs are a major contributor to genome size variation 570 (Figure 1) as well as a potential source of coding and regulatory sequences (Additional 571 file 1: Table S10-12). Differential gene gain and loss also contribute to gene content 572 variation across bumblebees and lead to lineage-specific gene repertoires (Figure 4 A; 573 Additional file 2: Figure S20; Additional file 1: Table S17). Finally, for genes shared 574 by all species, the action of positive selection is different across species (Additional file 575 1: Table S26; Additional file 1: Table S28; Additional file 1: Table S30; Additional file 576 1: Table S31), which can lead to gene functional divergence possibly reflecting key 577 eco-ethological differences. 578 579 An exception to the otherwise overall conserved genome structure is the set of species 580 in the subgenus Psithyrus. These bumblebees exhibit social parasitism; they do not have 581 a worker caste, and it is not necessary for them to forage for nectar and pollen (Lhomme 582 and Hines 2019). Originally, this subgenus was argued to be a separate genus due to 583 distinct behavior and higher chromosome number, however subsequent phylogenetic 584 analysis placed Psithyrus within the subgenus Bombus (Williams, et al. 2008). Here, 585 based on a much larger genomic dataset, we confirm that species in the subgenus form 586 a monophyletic group within the Bombus clade ( Figure 1A). In addition, we show that, 587 although Psithyrus species have an increased chromosome number, their genome sizes 588 are within the range of those of the other bumblebees ( Figure 1A), and their 25 589 chromosomes reflect a mix of fission, fusion, and retention of the 18 ancestral 590 bumblebee chromosomes (Figure 2; Additional file 2: Figure S6). Chromosome 591 rearrangements (e.g., fissions, fusions, and inversions) have been posited to play roles 592 in speciation (Ayala and Coluzzi 2005), and thus may explain the diversification and 593 social parasitic behavior of Psithyrus. In addition to genome structure variation, we 594 identified a net loss of 11 odorant receptor genes in the common ancestor of Psithyrus 595 species (Figure 4), which could be a cause or consequence of their socially parasitic 596 behavior. study, we found out that genes involved in smell and taste perception are among the 602 fastest evolving gene categories, both in copy number variation and in sequence 603 divergence (Figure 3; Additional file 2: Figure S15; Additional file 1: Table S18-19).  Figure S15; Additional file 1: Table S18-19). 615 These variable patterns of chitin-related gene evolution potentially underlie observed 616 differences in morphology and insecticide resistance, which could influence the 617 suitability of different species for commercial use. Across bumblebee genomes the 618 fastest evolving genes are also related to processes including protein glycosylation, 619 methylation, proteolysis, and tRNA aminoacylation for protein translation (Figure 3; 620 Additional file 2: Figure S15; Additional file 1: Table S18-19). Protein glycosylation 621 is involved in multiple physiological processes including growth, development, 622 circadian rhythms, immunity, and fertility (Walski, et al. 2017). tRNA aminoacylation 623 for protein translation process are involved in response to the changing environment 624 (Pan 2013). Some genes that are not among the fastest evolving categories-for 625 example, immune and detoxification genes, which are involved in the interaction of 626 bumblebees with external environments¾show differential patterns of positive 627 selection in subsets of species (Additional file 1: Table S28; Additional file 1: Table  628 S30), which can lead to gene functional divergence. Taken together, identification of 629 the fastest evolving genes and those showing patterns of differential positive selection 630 reveals substantial genetic variation across bumblebees. Future experimental 631 investigations will be required to determine how the identified genetic variation is 632 linked to specific differences in traits such as food preference, morphogenesis, 633 insecticide and pathogen resistance, and the response to changing environments. 634 In addition to our discoveries regarding protein-coding genes, we found that TE-related 635 sequences likely contribute to the variation of coding and regulatory repertoires ( Figure  636 1; Additional file 1: Table S10-12). Compared with non-Mendacibombus bumblebees, 637 Mendacibombus species have smaller genomes ( Figure 1)  investigating genes involved in high-elevation adaptation. We identified genes showing 650 signs of positive selection in at least two subgenera of high-elevation species but not in 651 any of the low-elevation species (Additional file 1: Table S31). These include genes 652 putatively involved in eye development, muscle integrity maintenance, and metabolism, 653 highlighting the importance of successful food-searching in high-elevation habitats 654 where food is scarce. Exploring these further and identifying additional genomic 655 features linked to life at high altitudes will help to understand differential successes of 656 bumblebee species in a changing world.

658
Conclusions 659 We have produced highly complete and accurate genome assemblies of 17 bumblebee 660 species, including representatives from all of the 15 subgenera of Bombus. Our genus-661 wide comparative analysis of bumblebee genomes revealed how genome structures, 662 genome contents, and gene evolutionary dynamics vary across bumblebees, and 663 identified genetic variations that may underlie species trait differences in foraging, diet 664 and metabolism, morphology and insecticide resistance, immunity and detoxification, 665 as well as adaptations for life at high altitudes. Our work provides genomic resources 666 that capture genetic and phenotypic variation, which should advance our understanding 667 of bumblebee success and help identify potential threats. These resources form a 668 foundation for future research, including resequencing and population genomics studies 669 for functional gene positioning and cloning, which will inform the use of bumblebees 670 in agriculture, as well as the design of strategies to prevent the decline of this important 671 group of pollinators. 672 673 674

Materials and Methods 675
Sample collection and DNA extraction 676 Criteria including phylogenetic position, biological trait, geographic distribution, and 677 specimen availability were applied to select species for whole genome sequencing. A 678 total of 17 bumblebee species were selected (Additional file 1: Arctic. In addition, species traits (i.e. range size, tongue length, parasite incidence, 686 and decline status) vary across the selected bumblebees (Arbetman, et al. 2017). 687 Samples were collected in the summer of 2016, with location and elevation 688 information summarized in Additional file 1: Table S1. Their identities were 689 confirmed by DNA barcoding as described (Hebert, et al. 2004). Genomic DNA was 690 extracted from each specimen using the Gentra Puregene Tissue Kit (Qiagen). The 691 abdomens of each sample were removed before DNA extraction to avoid microbial 692 contamination. 693 Genome sequencing and assembly 694 Genomic DNA purified from one single haploid drone of each species was used to 695 generate one "fragment" library with an insert size of 400 or 450 bp using the 696 NEBNext ® Ultra ™ DNA Library Prep Kit for Illumina ® (NEB, USA). The prepared 697 fragment libraries were sequenced on an Illumina HiSeq 2500 platform with a read 698 length of 250 bp to produce overlapping paired-end shotgun reads (2 × 250 bp), and 699 the target sequencing coverage was 100-fold or more for each species. Genomic DNA 700 purified from multiple specimens of each species was used to generate four "jump" 701 libraries (insert sizes: 4 kb, 6 kb, 8 kb, and 10 kb) according to reported methods 702 (Heavens, et al. 2015). The prepared jump libraries were sequenced on an Illumina 703 HiSeq X Ten platform, and paired-end reads (2 × 150 bp) were generated, with a 704 sequencing depth of at least 40-fold coverage for each jump library. The sequencing 705 results of "fragment" and "jump" libraries are summarized in Additional file 1: Table  706 S2.

707
For each species, the 250 bp overlapping paired-end shotgun reads from the fragment 708 library were processed using the software Seqtk (https://github.com/lh3/seqtk) to 709 randomly subsample read pairs to achieve the total sequence length equivalent to ~60-710 fold sequencing coverage, a coverage recommended by the assembler we used 711 (https://software.broadinstitute.org/software/discovar/blog/). Then, the subsampled 712 shotgun reads were assembled using the software DISCOVAR de novo (version 713 52488), which performs well in assembling insect genomes (Love, et al. 2016), to 714 produce contiguous sequences (contigs) for each species. Finally, shotgun reads from 715 jump libraries were used to scaffold the contigs using the software BESST (Version 716 2.2.6) (Sahlin, et al. 2014). The obtained genome assemblies were checked for DNA 717 contamination by searching against the NCBI non-redundant nucleotide database (Nt) 718 using BLASTN (Camacho, et al. 2009), with an E-value cutoff of 1e-5. 719 To evaluate the quality and completeness of the genome assemblies, we compared 720 genes present in the assemblies to a set of 4,415 universal single-copy orthologs 721 (lineage dataset: hymenoptera_odb9) using the software BUSCO v3 ( instructions. Library quality was assessed on the Agilent Bioanalyzer 2100 system. 734 The prepared libraries were sequenced on the Illumina HiSeq X Ten platform, 735 generating paired-end reads with a read length of 150 bp. 736 Protein-coding gene annotation. Annotation of protein-coding genes was based on 737 ab initio gene predictions, transcript evidence, and homologous protein evidence, all 738 of which were implemented in the MAKER computational pipeline (Cantarel,et al. 739 2008). Briefly, RNA-seq samples were assembled using Trinity (Haas, et al. 2013) 740 with two different strategies using default parameters, de novo assembly and genome-741 guided assembly. Assembled transcripts were inspected by calculation of FPKM 742 (fragments per kilobase of exon per million fragments mapped) expression values and 743 removed if FPKM <1 and iso-percentage <3%. The filtered transcripts were imported 744 into the PASA program (Haas, et al. 2003) for construction of comprehensive 745 transcripts, as PASA is able to take advantage of the high sensitivity of reference-746 based assembly while leveraging the ability of de novo annotation to detect novel 747 transcripts. The nearly "full-length" transcripts selected from PASA-assembled 748 transcripts were imported to data training programs including SNAP (Korf 2004), 749 GENEMARK (Lomsadze, et al. 2005) and AUGUSTUS (Stanke, et al. 2006). 750 Afterwards, the MAKER pipeline was used to integrate multiple tiers of coding 751 evidence and generate a comprehensive set of protein-coding genes. 752 The second round of MAKER was run to improve gene annotation. The predicted 753 gene models with AED scores less than 0.2 were extracted for re-training using 754 SNAP, GENEMARK, and AUGUSTUS. In addition, the RNA-seq reads were 755 mapped to genomes using HiSAT2 and re-assembled using StringTie (Pertea,et al. 756 2016). The assembled RNA-seq transcripts, along with proteins from bees 757 (superfamily Apoidea) that are available in NCBI GenBank (last accessed on 758 01/28/2018), were imported into the MAKER pipeline to generate gene models, 759 followed by manual curation of key gene families. 760 Functional annotation of the obtained gene models 761 To obtain functional clues for the predicted gene models, protein sequences encoded 762 by them were searched against the Uniprot-Swiss-Prot protein databases (last 763 accessed on 01/28/2018) using the BLASTp algorithm implemented in BLAST suite 764 v2.28 (Altschul, et al. 1990). In addition, protein domains and GO terms associated 765 with gene models were identified by InterproScan-5 (Jones, et al. 2014). 766 To evaluate the quality and completeness of gene annotation, we compared protein 767 sequences predicted from the genome assemblies to a set of 4,415 universal single-768 copy orthologs (lineage dataset: hymenoptera_odb9) using the software BUSCO v3 769 (Waterhouse, et al. 2018). 770 miRNA annotation 771 Hairpin sequences downloaded from miRBase (http://www.mirbase.org/) were 772 aligned to each reference genome using BLASTN (Altschul, et al. 1990) with an e-773 value cut-off of 10-6. Results were further filtered based on alignment length (≥50nt) 774 and sequence similarity (≥80%). Mature sequences from miRBase were then mapped 775 against this set of selected BLASTN hits, using Patman (Prufer, et al. 2008) with 776 parameters -g 0 -e 1 (no gaps, up to one mismatch). Only genomic hits where at least 777 one mature microRNA could be mapped with these criteria were retained. These were 778 treated as a set of putative homologous microRNA genes. 779 Small RNA reads of B. terrestris were mapped to these predicted homologous loci, 780 with no gaps or mismatches allowed. Genomic loci with at least 10 mapped reads 781 were then selected, showing coverage at both the 5' and 3' ends. The final set of high 782 confidence microRNAs was obtained by selecting all loci with the expected hairpin 783 secondary structure, as predicted by RNAfold from the ViennaRNA package 784 (Hofacker 2009), as well as strong evidence of Drosha-Dicer processing from the 785 (manually inspected) patterns of small-RNA read alignments. 786 tRNA annotation 787 All of the bumblebee genomes were screened with tRNAScan-SE (Lowe and Eddy 788 1997) to identify tRNA genes, with default parameters. 789 The prediction of lncRNAs: protein-coding potential for RNA transcripts was 790 predicted using two algorithms, LGC version 1.0 ) and CPAT 791 version 2.0.0 (Wang, et al. 2013).
LGC could be used in a cross-species manner and 792 the algorithm was applied directly to bumblebees, while CPAT requires high-quality 793 training data to build a species-specific model. Considering bumblebees do not have 794 enough high-quality "coding" and "non-coding" transcripts to build a model, the 795 prebuilt fly model in CPAT was used. All the predictions were performed on a Linux 796 platform. RNA transcripts were deemed to be non-coding if they were consistently 797 predicted to be non-coding by both LGC and CPAT. 798 Gene synteny analysis 799 MCScanX (Wang, et al. 2012) was used to identify syntenic blocks, defined as 800 regions with more than five collinear genes, between B. terrestris, a previously 801 published bumblebee genome (Sadd, et al. 2015), and each of the newly sequenced 802 bumblebees with default parameters to infer synteny contiguity. 803 De novo identification and annotation of transposable elements (TEs) 804 Methods based on TE structure 805 LTR retrotransposons of the bumblebee genomes were de novo identified and 806 annotated by LTRharvest and LTRdigest (Ellinghaus, et al. 2008;Steinbiss, et al. 807 2009). The identified LTR retrotransposons were further classified with the PASTEC 808 module of the REPET package (Hoede, et al. 2014). When identifying LTR 809 retrotransposons, TSD length was set to 4-6 bp and the minimum similarity of LTRs 810 was set to 85%; the four-nucleotide termini of each LTR retrotransposon was set as 811 TG…CA. LTR length was set to 100-6000 bp. For the post-processing of LTRdigest, 812 pptlength was set to 10-30 bp, pbsoffset to 0-5 bp, and trans to Dm-tRNAs.fa. 813 pHMMs were used to define protein domains taken from the Pfam database. sequence repeats) and "noCat" (which means no classification was found).

TE landscapes in the bumblebee genomes 831
First, CD-HIT-EST (version 4.6.6) ( applying a neutral substitution rate of 3.6 × 10 -9 for bumblebee (Liu, et al. 2017). 882 The genomic distribution of TEs in bumblebees 883 The genomic coordinates of TEs in each species were compared with the coordinates 884 of protein-coding genes in the same species to identify TEs that resided within or near 885 predicted genes. Only when there were > 50 bp of overlap between a TE and 886 predicted CDS was a TE considered to be overlapping with a coding region. In B. tree as implemented in IQTree (Minh, et al. 2020a). 932 The quartet-based species tree reconstruction program ASTRAL (Zhang, et al. 2018), 933 which can account for ILS, was also used for building the species phylogeny. The 934 ggtree R package was used to visualize trees (Yu, et al. 2017).

935
Estimate of ancestral genome sizes 936 The genome assemblies produced in this study were highly complete (Additional file 937 2: Figure S1), and genome assembly sizes do not correlate with assembly contiguity 938 (p = 0.973; Additional file 2: Figure S23). Thus, smaller genome size estimates are 939 unlikely to be artifacts of incomplete genome assembly, and quality control during 940 assembly ensured that larger genomes were not due to extrinsic DNA contamination. 941 Therefore, the genome assembly sizes should reflect true differences across 942 bumblebees. Genome assembly sizes of the 19 sequenced bumblebees and four 943 honeybees were obtained from the current study and published genome assemblies: B. phylogenetic tree estimated in this study ( Figure 1A), and ancestral genome sizes of 949 bumblebees were estimated using parsimony ancestral state reconstruction in 950 Mesquite 3.51 (http://www.mesquiteproject.org), with honeybee genome sizes serving 951 as the outgroup. 952 Hi-C library construction, sequencing, and assembly 953 For B. turneri, library preparation was performed by Annoroad Gene Technology 954 (http://en.annoroad.com) and mainly followed a protocol described previously 955 (Belton, et al. 2012 the in situ Digestion-ligation-only Hi-C protocol was employed to generate Hi-C 995 reads as described (Lin, et al. 2018). In brief, for each species, brain tissue of wild-996 caught workers was ground into homogenate. Treated the samples and filtered the 997 precipitated cells. Cells were double cross-linked with formaldehyde with EGS 998 (Thermo) and 1% formaldehyde (Sigma). After that, the remaining formaldehyde was 999 sequestered with glycine. The cross-linked cells were subsequently lysed in lysis 1000 buffer and incubated at 50 °C for 5min, placed on ice immediately. After incubation, 1001 the nuclei were digested by MseI (NEB, 100 units/µl). After restriction enzyme 1002 digestion, MseI biotin linkers were ligated to the digested chromatin respectively. 1003 Made the nuclei fragment-end phosphorylation. Next, added T4 DNA ligase 1004 (Thermo) to reaction complexes. Ligation was performed at 20 °C for 2h with rotation 1005 at 15 r.p.m. Then, purifying the proximity ligation DNA. The purified products were 1006 digested by MmeI at 37 °C for 1 h. The digested DNA sample was subjected to 1007 electrophoresis in native PAGE gels and the specific 80-bp DLO Hi-C DNA 1008 fragments were excised and purified. Next, Illumina sequencing adaptors were ligated 1009 to the 80-bp DLO Hi-C DNA fragments. After biotin incubation, the ligated DNA 1010 fragments were used as template and amplified by PCR (fewer than 13 cycle) to 1011 construct the Illumina sequencing libraries.

1012
Hi-C sequencing libraries were sequenced on the Illumina HiSeq X Ten platform, 1013 generating 150 bp reads. The length of the DNA constructs in the DLO Hi-C library is 1014 between 78 and 82 bp. The length of a full linker is 40 bp, and the lengths of the 1015 target DNA sequences on each side of the linker are 19-21 bp. A Java program was 1016 used to exclude the linker parts from the reads and the target DNA fragments were 1017 used for downstream analysis. The Juicer tool (Durand, et al. 2016) was applied to 1018 map obtained target sequences against the scaffold sequences of each species using 1019 the BWA algorithm (Heng, et al. 2010), selecting the ALN parameter (other 1020 parameters as default). Mapped reads with MAPQ quality scores ≥ 30 were chosen 1021 for the next analysis. Then, the 3D-DNA pipeline (Dudchenko, et al. 2017) was 1022 applied to assemble the scaffold sequences to the chromosome level. 1023 the significance of any gene family changes along a given branch. Those  used to obtain these rates. 1099 dN/dS ratio estimation for each orthologous group: To avoid biases related to 1100 duplication among lineages and out-paralog genes, only universal single-copy 1101 orthologous groups (scOGs) were used to estimate dN/dS ratios. Protein sequences of 1102 scOGs were aligned by MAFFT (Katoh, et al. 2002) and then used to inform CDS 1103 alignments to generate DNA codon alignments with the codon-aware PAL2NAL 1104 program (Suyama, et al. 2006). Next, the aligned CDSs were trimmed by Gblocks 1105 (Talavera and Castresana 2007), with "-t c" and other parameters as default. After 1106 trimming, only orthologs consisting of aligned sequences from all species with a 1107 minimum of 150 bp and less than 20% Ns were retained for downstream analysis, 1108 which are available on-line (ftp://download.big.ac.cn/bumblebee/bumblebee-single-1109 copy-orthologs.tar.gz). Then, based on trimmed alignments, Maximum Likelihood