Reference genome of the bicolored carpenter ant, Camponotus vicinus

Abstract
Carpenter ants in the genus Camponotus are large, conspicuous ants that are abundant and ecologically influential in many terrestrial ecosystems. The bicolored carpenter ant, Camponotus vicinus Mayr, is distributed across a wide range of elevations and latitudes in western North America, where it is a prominent scavenger and predator. Here, we present a high-quality genome assembly of C. vicinus from a sample collected in Sonoma County, California, near the type locality of the species. This genome assembly consists of 38 scaffolds spanning 302.74 Mb, with contig N50 of 15.9 Mb, scaffold N50 of 19.9 Mb, and BUSCO completeness of 99.2%. This genome sequence will be a valuable resource for exploring the evolutionary ecology of C. vicinus and carpenter ants generally. It also provides an important tool for clarifying cryptic diversity within the C. vicinus species complex, a genetically diverse set of populations, some of which are quite localized and of conservation interest.


Introduction
The ant tribe Camponotini contains almost 2,000 described species, of which a little more than half belong to Camponotus, the world's most widely distributed ant genus (Bolton 2023). Many species of Camponotus nest in rotting wood, earning them the common name "carpenter ants" (Hansen and Klotz 2005). All species of Camponotini harbor obligate, vertically inherited gut bacteria (Blochmannia) that provide important nutritional benefits and likely contribute to host survival under varying environmental conditions (Feldhaar et al. 2007; Williams and Wernegreen 2015). Some Camponotus ants are also common structural pests, causing costly damage as they excavate wooden structures.
Carpenter ants in the Camponotus vicinus species complex are prominent scavenging and predatory ants, occurring in all ecoregions of California except the Colorado and Sonoran Deserts. In higher elevation conifer forests of California, C. vicinus commonly nests in and around fallen, decomposing logs, and is one of the most abundant ground-dwelling arthropods (Fig. 1A). This complex includes two widespread species as well as several cryptic taxa with more limited distributions that are of conservation interest. The cryptic diversity in the C. vicinus complex includes an undescribed species endemic to the Channel Islands.
We report here a high-quality de novo reference genome assembly for C. vicinus collected near the type locality of this species at Calistoga, California (Mayr 1870). Existing genomic resources include an annotated reference genome for the relatively distantly related Camponotus floridanus (Bonasio et al. 2010; Shields et al. 2018), as well as more recent genome sequences from Camponotus pennsylvanicus (Faulk 2023) and several species collected in the American Southwest (including putative C. vicinus from Arizona) (Manthey et al. 2022). We also reconstruct a phylogeny using these C. vicinus genomes and several other Camponotus species from Manthey et al. (2022).

High molecular weight DNA extraction and nucleic acid library preparation
The flash-frozen male pupa was homogenized in 650 µl of homogenization buffer (10 mM Tris-HCl, pH 8.0, and 25 mM EDTA) using a TissueRuptor II (Qiagen, Germany; Cat # 9002755). Then 650 µl of lysis buffer (10 mM Tris, 25 mM EDTA, 200 mM NaCl, and 1% SDS) and proteinase K (100 µg ml⁻¹) were added to the homogenate, which was incubated overnight at room temperature. The lysate was treated with RNAse A (20 µg ml⁻¹) at 37 °C for 30 min and was cleaned with equal volumes of phenol/chloroform using phase-lock gels (Quantabio, Beverly, MA; Cat # 2302830). The DNA was precipitated by adding 0.4× volume of 5 M ammonium acetate and 3× volume of ice-cold ethanol. The DNA pellet was washed twice with 70% ethanol and resuspended in an elution buffer (10 mM Tris, pH 8.0). DNA was further cleaned with the Zymo gDNA clean and concentrator kit (Zymo Research, Irvine, CA; Cat # 4033). To retain large DNA fragments, columns from the large fragment DNA recovery kit (Zymo Research, Cat # D4045) were used during purification. Purity of the gDNA was assessed using a NanoDrop ND-1000 spectrophotometer, where a 260/280 ratio of 1.8 and a 260/230 ratio of 2.26 were observed. DNA was quantified with a Qubit 2.0 Fluorometer (Thermo Fisher Scientific, Waltham, MA) and a total yield of 1.5 µg was obtained. Integrity of the HMW gDNA was verified on a Femto Pulse system (Agilent Technologies, Santa Clara, CA), where 73% of the DNA was observed in fragments above 50 kb.
The HiFi SMRTbell library was constructed using the SMRTbell Express Template Prep Kit v2.0 (Pacific Biosciences-PacBio, Menlo Park, CA, Cat. #100-938-900) according to the manufacturer's instructions. HMW gDNA was sheared to a target DNA size distribution between 12 and 20 kb. The sheared gDNA was concentrated using 1.8× of AMPure PB beads (PacBio, Cat. #100-265-900) for the removal of single-strand overhangs at 37 °C for 15 min, followed by further enzymatic steps of DNA damage repair at 37 °C for 30 min, end repair and A-tailing at 20 °C for 10 min and 65 °C for 30 min, and ligation of overhang adapter v3 at 20 °C for 60 min. The SMRTbell library was purified and concentrated with 0.45× AMPure PB beads for size selection with 40% diluted AMPure PB beads (PacBio, Cat. #100-265-900) to remove short SMRTbell templates <3 kb. The 12 to 20 kb average HiFi SMRTbell library was sequenced at UC Davis DNA Technologies Core (Davis, CA) using two 8 M SMRT cells, Sequel II sequencing chemistry 2.0, and 30-h movies each on a PacBio Sequel II sequencer.
The Omni-C library was prepared using the Dovetail Omni-C Kit (Dovetail Genomics, Scotts Valley, CA) according to the manufacturer's protocol with slight modifications. First, specimen tissue (whole adult male, ID: PSW18465-M) was thoroughly ground with a mortar and pestle while cooled with liquid nitrogen. Subsequently, chromatin was fixed in place in the nucleus. The suspended chromatin solution was then passed through 100 μm and 40 μm cell strainers to remove large debris. Fixed chromatin was digested under various conditions of DNase I until a suitable fragment length distribution of DNA molecules was obtained. Chromatin ends were repaired and ligated to a biotinylated bridge adapter followed by proximity ligation of adapter-containing ends. After proximity ligation, crosslinks were reversed, and the DNA was purified from proteins. Purified DNA was treated to remove biotin that was not internal to ligated fragments. An NGS library was generated using an NEB Ultra II DNA Library Prep kit (NEB, Ipswich, MA) with an Illumina-compatible y-adaptor. Biotin-containing fragments were then captured using streptavidin beads. The post-capture product was split into two replicates prior to PCR enrichment to preserve library complexity, with each replicate receiving unique dual indices. The library was sequenced at Vincent J. Coates Genomics Sequencing Lab (Berkeley, CA) on an Illumina NovaSeq 6000 platform (Illumina, San Diego, CA) to generate approximately 100 million 2 × 150 bp read pairs per GB genome size.

Nuclear genome assembly
We assembled the genome of C. vicinus following the CCGP assembly pipeline Version 5.1, as outlined in Table 1, which lists the tools and non-default parameters used in the assembly. The pipeline uses PacBio HiFi reads and Omni-C data to produce high-quality and highly contiguous genome assemblies. First, we removed the remnant adapter sequences from the PacBio HiFi dataset using HiFiAdapterFilt (Sim et al. 2022) and generated the initial haploid assembly using HiFiasm (Cheng et al. 2021) with the filtered PacBio HiFi reads and the Omni-C dataset. This process generated multiple assemblies and we kept the output assembly tagged as haplotype 1 given the ploidy of the specimen. We then aligned the Omni-C data to the assembly following the Arima Genomics Mapping Pipeline (https://github.com/ArimaGenomics/mapping_pipeline) and then scaffolded it with SALSA (Ghurye et al. 2017, 2019).

Genome assembly assessment
We generated k-mer counts from the PacBio HiFi reads using meryl (https://github.com/marbl/meryl). The k-mer counts were then used in GenomeScope2.0 (Ranallo-Benavidez et al. 2020) to estimate genome features including genome size, heterozygosity, and repeat content. To obtain general contiguity metrics, we ran QUAST (Gurevich et al. 2013). To evaluate genome quality and functional completeness, we used BUSCO (Manni et al. 2021) with the Arthropoda ortholog database (arthropoda_odb10), which contains 1,013 genes. Assessment of base-level accuracy (QV) and k-mer completeness was performed using the previously generated meryl database and merqury (Rhie et al. 2020). We further estimated genome assembly accuracy via BUSCO gene set frameshift analysis using the pipeline described in Korlach et al. (2017). Measurements of the size of the phased blocks are based on the size of the contigs generated by HiFiasm. We follow the quality

Endosymbiont genome assembly
We used the genome of Blochmannia (NCBI:GCF_023585685.1; ASM2358568v1; Manthey et al. 2022) as a guide to assemble the endosymbiont genome present in our sample. We aligned the contigs that were removed from the nuclear genome in the contamination process to the ASM2358568v1 reference using lastz (Harris 2007) to verify the existence of the endosymbiont in the assembly. We aligned the adapter-trimmed PacBio HiFi reads to the Blochmannia sequence using minimap2 (Li 2018, 2021) and samtools (Danecek et al. 2021), and filtered out secondary alignments, unmapped reads, and reads that failed platform/vendor quality checks. We extracted the reads left from the alignment and used them to de novo assemble a Blochmannia genome with HiFiasm. Finally, we used bakta (Schwengers et al. 2021; https://bakta.computational.bio/) to generate a draft annotation of the bacterial genome to assess its completeness.
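The read-filtering step described above can be illustrated with the FLAG bits defined in the SAM specification. This is a minimal sketch, not the authors' command: the function names are hypothetical, and the exact samtools options used in the study are not stated in the text. The three bits below (unmapped, secondary, QC-fail) sum to 772, which is the value one would pass to an exclusion filter such as `samtools view -F`.

```python
# SAM FLAG bits, as defined in the SAM format specification.
UNMAPPED = 0x4      # read did not align
SECONDARY = 0x100   # secondary alignment
QC_FAIL = 0x200     # failed platform/vendor quality checks

# Combined exclusion mask (4 + 256 + 512 = 772).
EXCLUDE_MASK = UNMAPPED | SECONDARY | QC_FAIL


def keep_alignment(flag: int) -> bool:
    """Return True if none of the excluded FLAG bits are set."""
    return (flag & EXCLUDE_MASK) == 0


def filter_sam_lines(lines):
    """Yield header lines unchanged and alignment lines that pass the mask.

    Hypothetical helper for illustration: expects plain-text SAM records,
    where column 2 holds the integer FLAG.
    """
    for line in lines:
        if line.startswith("@"):
            yield line
            continue
        fields = line.split("\t")
        if keep_alignment(int(fields[1])):
            yield line
```

Note that supplementary alignments (bit 0x800) are not excluded by this mask, which matches the three categories named in the text.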

Phylogenetic analysis
Our dataset for phylogenetic analysis consisted of 17 whole-genome sequencing (WGS) samples described in Manthey et al. (2022), a C. pennsylvanicus reference genome, our assembled C. vicinus reference genome, and the C. floridanus reference genome, which served as our outgroup (NCBI BioProjects PRJNA839641, PRJNA820489, PRJNA874059, and PRJNA476946, respectively). We performed quality filtering and adapter trimming of the sequencing reads from the 17 WGS samples with the bbduk.sh script from the bbmap package (Bushnell 2014). We then aligned these samples to the C. floridanus reference genome with BWA-MEM. We used PicardTools (Broad Institute 2019) to sort the resulting SAM files and flag duplicates using the SortSam and MarkDuplicates commands. We also computed alignment metrics and read depth, and built BAM indexes, using the samtools (Li et al. 2009) flagstat, depth, and index commands. The assembled reference genomes were aligned to the C. floridanus reference genome using the MUMmer (Marçais et al. 2018) alignment tool. The resulting SAM files were reformatted using an in-house bash script to follow the proper input formatting for samtools. Finally, these files were first sorted by read group and then converted to BAM format using the samtools sort and samtools view -b commands.
We performed variant calling with BCFtools (Li 2011) for all samples using the mpileup and call commands. We then performed quality filtering with VCFtools (Danecek et al. 2011), removing sites with the following specifications: minor allele frequency (MAF) <0.05, missing in >25% of samples, quality score <30, and read depth <10 or >100. We converted our VCF file to phylip alignment format using the python script vcf2phylip.py (Ortiz 2019). We used RAxML (Stamatakis 2014) to generate our phylogenetic tree by performing a best tree search (option -f a) with 1000 rapid bootstrap replicates (option -x). We determined the "best-fit" model of nucleotide substitution to be GTR using jModelTest (Guindon and Gascuel 2003; Darriba 2012).
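The site-filtering thresholds listed above can be written as a single predicate. The sketch below is illustrative only: the `Site` record and `passes_filters` function are hypothetical names, the actual filtering was done with VCFtools, and whether the depth limits were applied per genotype or as a per-site mean is not stated in the text (a per-site mean is assumed here).

```python
from dataclasses import dataclass


@dataclass
class Site:
    """Hypothetical per-site summary; field names are illustrative."""
    maf: float               # minor allele frequency
    missing_fraction: float  # fraction of samples with no genotype call
    qual: float              # site quality score
    mean_depth: float        # assumed: mean read depth across samples


def passes_filters(site: Site,
                   min_maf: float = 0.05,
                   max_missing: float = 0.25,
                   min_qual: float = 30.0,
                   min_depth: float = 10.0,
                   max_depth: float = 100.0) -> bool:
    """Keep a site unless it meets any of the removal criteria in the text."""
    return (site.maf >= min_maf
            and site.missing_fraction <= max_missing
            and site.qual >= min_qual
            and min_depth <= site.mean_depth <= max_depth)
```

For example, `passes_filters(Site(maf=0.10, missing_fraction=0.10, qual=45, mean_depth=32))` keeps the site, whereas lowering `maf` to 0.01 removes it.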

Sequencing data
The Omni-C and PacBio HiFi sequencing libraries generated 18.29 million read pairs and 1.4 million reads, respectively. The latter yielded 52.19-fold coverage (N50 read length 12,799 bp; minimum read length 54 bp; mean read length 11,675 bp; maximum read length of 58,419 bp) based on the GenomeScope2.0 genome size estimate of 313.7 Mb. Based on PacBio HiFi reads, we estimated a sequencing error rate of 0.129%. The k-mer spectrum based on PacBio HiFi reads shows (Fig. 2A) a unimodal distribution with a single peak at ~51.
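As a rough check on the coverage figure, fold coverage is total sequenced bases divided by genome size. The sketch below uses the rounded values quoted in the text; the function name is hypothetical, and because the read count is rounded to 1.4 million, the result lands near, not exactly on, the reported 52.19×.

```python
def fold_coverage(n_reads: float, mean_read_length_bp: float,
                  genome_size_bp: float) -> float:
    """Total bases sequenced divided by the estimated genome size."""
    return n_reads * mean_read_length_bp / genome_size_bp


# Rounded values from the text: 1.4 million HiFi reads, mean length
# 11,675 bp, and a GenomeScope2.0 genome size estimate of 313.7 Mb.
coverage = fold_coverage(1.4e6, 11_675, 313.7e6)
print(f"approximate coverage: {coverage:.1f}x")
```

The small gap between this estimate (about 52.1×) and the reported value is consistent with rounding of the read count.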

Nuclear genome assembly
The final assembly (iyCamVici1) genome size is close to the estimated value from GenomeScope2.0 (Fig. 2A, Pflug et al. 2020). The assembly consists of 38 scaffolds (37 nuclear, 1 mitochondrial) spanning 302.74 Mb with contig N50 of 15.9 Mb, scaffold N50 of 19.9 Mb, longest contig of 22.35 Mb and largest scaffold of 39.41 Mb. Detailed assembly statistics are reported in tabular form in Table 3, and graphical representation for the assembly in Fig. 2B. The iyCamVici1 assembly has a BUSCO completeness score of 99.2% using the Arthropoda gene set, a per-base quality (QV) of 68.45, a k-mer completeness of 99.41 and a frameshift indel QV of 54.57.
During manual curation, we generated 8 breaks and 24 joins and we were able to close a total of 11. Finally, we filtered out 22 contigs from the assembly, with 21 corresponding to the endosymbiont, Blochmannia, and 1 corresponding to a mitochondrial contaminant. The Omni-C contact maps show that the assembly is highly contiguous (Fig. 2C). We have deposited the resulting assembly on NCBI (see Table 3 and Data Availability for details).
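For readers less familiar with the contiguity and accuracy metrics reported in this section, the sketch below shows the standard definitions of N50 and of a Phred-scaled quality value. It is an illustration of the definitions only, with hypothetical function names; the values in the text were produced by QUAST and merqury, not by this code.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half
    of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths supplied")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled QV to a per-base error probability."""
    return 10 ** (-qv / 10)


# Toy example: total length 30, half is 15; 10 + 8 = 18 reaches it.
print(n50([10, 8, 6, 4, 2]))
# A QV of 68.45 corresponds to roughly one error per several million bases.
print(qv_to_error_rate(68.45))
```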

Endosymbiont genome assembly
The final Blochmannia genome (ypCanBloch1_iyCamVici1.0) is a single gapless contig with a final size of 780,225 bp, which is close but not equal to the reference used as a guide (ASM2358568v1; genome size = 783,921 bp). The base composition of the final assembly version is A = 35.05%, C = 13.94%, G = 14.37%, T = 36.64%. The bacterial genome presented here consists of 624 coding sequences, 39 transfer RNAs, 1 transfer-messenger RNA, 3 ribosomal RNAs, and 2 non-coding RNAs.
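Base composition of this kind can be tallied directly from a sequence string. The following is a minimal sketch with hypothetical function names, not the tool used in the study. Summing the reported C and G percentages gives a GC content of 28.31% for this assembly.

```python
from collections import Counter


def base_composition(sequence: str) -> dict:
    """Percentage of A, C, G, and T among unambiguous bases."""
    counts = Counter(sequence.upper())
    total = sum(counts[base] for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return {base: 100.0 * counts[base] / total for base in "ACGT"}


def gc_percent(composition: dict) -> float:
    """GC content as the sum of the G and C percentages."""
    return composition["G"] + composition["C"]


# Percentages reported in the text for the Blochmannia assembly.
reported = {"A": 35.05, "C": 13.94, "G": 14.37, "T": 36.64}
print(f"GC content: {gc_percent(reported):.2f}%")
```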

Assembly comparisons
Genome metrics indicate that the bicolored carpenter ant assembly is highly contiguous (62 contigs, contig N50 of 15.9 Mb), with fewer contigs and a longer contig N50 than all currently available ant genomes (Fig. 1C, Supplementary Table S1).Although chromosome assignments were not determined for C. vicinus, 14 out of the 38 total scaffolds in the genome assembly approach sizes >15.1 Mb (MEAN ± SD = 21.6 ± 6.2 Mb), make up >99.6% of the genome assembly, and are comparable to the average chromosome sizes of genome assemblies from four representative ant species (MEAN ± SD = 16.5 ± 9.3 Mb, Fig. 1D, Supplementary Table S2).

Discussion
The high-quality bicolored carpenter ant (C. vicinus) genome assembly presented here will serve as a foundational reference for future evolutionary and population genomic studies in this and other related species. Our genome assembly is highly accurate, with coverage (52.19×) in range with other ant genome assemblies that include PacBio sequencing methods (coverage range: 45 to 245×, median coverage: 87×, Supplementary Table S1), and its BUSCO genome completeness (99.2%, compared with Arthropoda) slightly exceeds the median BUSCO value of other ant genome assemblies compared with the same BUSCO dataset (median BUSCO: 98.3%, BUSCO range: 68.0% to 99.6%, Supplementary Table S1). In comparison with other ant genome assemblies, the bicolored carpenter ant assembly is the most contiguous (contig-level) assembly of all currently available ant genomes (Fig. 1C, Supplementary Table S1). Additionally, the 14 largest C. vicinus scaffolds compose 99.7% of the genome assembly, matching the predicted chromosome number of n = 14 for C. vicinus, based on the reported karyotypes of the related species C. ligniperda and C. japonicus (Imai 1966; Hauschteck-Jungen and Jungen 1983), and are similar to the chromosome sizes of genome assemblies from four representative ant species (Fig. 1D, Supplementary Table S2). Taken together, these results indicate that our C. vicinus genome is a chromosome-level assembly.
In comparison to other Camponotus ant genome assemblies available for the Florida carpenter ant (C. floridanus, Shields et al. 2018) and the black carpenter ant (C. pennsylvanicus, Faulk 2023), our bicolored carpenter ant nuclear genome assembly is similar in size (302.7 Mb) to the black carpenter ant assemblies (306.4 Mb, haplotype 1; and 305.9 Mb, haplotype 2), which are respectively 6.6%, 7.9%, and 7.7% larger than the Florida carpenter ant genome assembly (284.0 Mb). Additionally, the mitochondrial genome assembly of the bicolored carpenter ant (16,542 bp) is nearly identical in size to that of the black carpenter ant (16,536 bp). We also assembled the Blochmannia bacterial endosymbiont for C. vicinus (780,225 bp), whose size falls in range with assemblies of Blochmannia floridanus (705,557 bp, isolated from C. floridanus, Gil et al. 2003) and Blochmannia pennsylvanicus (791,499 to 791,654 bp, isolated from C. pennsylvanicus, Degnan et al. 2005; Faulk 2023). Lastly, phylogenetic analysis of the C. vicinus reference genome, in comparison to recently published whole genome sequences representing nine Camponotus species (Manthey et al. 2022; Shields et al. 2018; Faulk 2023), revealed that C. vicinus (California, this study) is sister to a clade containing C. vicinus (Arizona) and C. sp. 2-JDM (Fig. 1B). This analysis suggests that further investigation is needed to resolve the species assignment and implied monophyly or paraphyly of these representative samples.
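The relative size differences quoted above follow from a simple percentage calculation against the 284.0 Mb C. floridanus assembly. The sketch below reproduces that arithmetic from the sizes given in the text; the function name and dictionary labels are hypothetical.

```python
def percent_larger(size_mb: float, reference_mb: float) -> float:
    """Percentage by which size_mb exceeds reference_mb."""
    return 100.0 * (size_mb - reference_mb) / reference_mb


FLORIDANUS_MB = 284.0  # C. floridanus assembly size from the text

# Assembly sizes (Mb) quoted in the text.
assemblies = {
    "C. vicinus": 302.7,
    "C. pennsylvanicus haplotype 1": 306.4,
    "C. pennsylvanicus haplotype 2": 305.9,
}

for name, size in assemblies.items():
    print(f"{name}: {percent_larger(size, FLORIDANUS_MB):.1f}% larger")
```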
The reference genome of bicolored carpenter ant, C. vicinus, will allow us to better understand the genetic basis of adaptations, track evolutionary changes, and assess genomic variation that may impact survival and speciation.Furthermore, the bicolored carpenter ant reference genome serves as a powerful tool for both evolutionary and conservation biologists to better understand the genetic makeup of the C. vicinus species complex, which can inform taxonomic studies of this group and contribute to efforts of the California Conservation Genomics Project (CCGP) (Shaffer et al. 2022).It fills an important phylogenetic gap in our genomic understanding of California biodiversity (Toffelmier et al. 2022).Future work comparing multiple genomes of C. vicinus across California will additionally help identify regions that are associated with species resilience and biodiversity, and aid in development of effective conservation and management strategies accordingly (Fiedler et al. 2022).
sequencing platforms at the Vincent J. Coates Genomics Sequencing Laboratory at UC Berkeley, supported by NIH S10 OD018174 Instrumentation Grant.We thank the staff at the UC Davis DNA Technologies and Expression Analysis Cores and the UC Santa Cruz Paleogenomics Laboratory for their diligence and dedication to generating high-quality sequence data.

Fig. 2. Visual overview of genome assembly metrics. A) K-mer spectra output generated from PacBio HiFi data without adapters using GenomeScope2.0. The unimodal pattern observed corresponds to a haploid genome. B) Omni-C contact map for the genome assembly generated with PretextSnapshot. The Omni-C contact map translates proximity of genomic regions in 3-D space to contiguous linear organization. Each cell in the contact map corresponds to sequencing data supporting the linkage (or join) between two such regions. Scaffolds are separated by black lines and higher density corresponds to higher levels of fragmentation. C) BlobToolKit Snail plot showing a graphical representation of the quality metrics presented in Table 3 for the C. vicinus primary assembly. The plot circle represents the full size of the assembly. From the inside to the outside, the central plot covers length-related metrics. The red line represents the size of the longest scaffold; all other scaffolds are arranged in size order moving clockwise around the plot and drawn in gray starting from the outside of the central plot. Dark and light orange arcs show the scaffold N50 and scaffold N90 values. The central light gray spiral shows the cumulative scaffold count with a white line at each order of magnitude. White regions in this area reflect the proportion of Ns in the assembly. The dark vs. light blue area around it shows mean, maximum and minimum GC vs. AT content at 0.1% intervals.

Table 2
Species, GenBank accession numbers, and references used in chromosome-level assembly comparisons.

Table 3
Sequencing and assembly statistics, and accession numbers.