High-quality chromosome-level genome assembly and full-length transcriptome analysis of the pharaoh ant Monomorium pharaonis

Abstract Background Ants with complex societies have fascinated scientists for centuries. Comparative genomic and transcriptomic analyses across ant species and castes have revealed important insights into the molecular mechanisms underlying ant caste differentiation. However, most current ant genomes and transcriptomes are highly fragmented and incomplete, which hinders our understanding of the molecular basis for complex ant societies. Findings By hybridizing Illumina, Pacific Biosciences, and Hi-C sequencing technologies, we de novo assembled a chromosome-level genome for Monomorium pharaonis, with a scaffold N50 of 27.2 Mb. Our new assembly provides better resolution for the discovery of genome rearrangement events at the chromosome level. Analysis of full-length isoform sequencing (ISO-seq) suggested that ∼15 Gb of ISO-seq data were sufficient to cover most expressed genes, but the number of transcript isoforms steadily increased with sequencing data coverage. Our high-depth ISO-seq data greatly improved the quality of gene annotation and enabled the accurate detection of alternative splicing isoforms in different castes of M. pharaonis. Comparative transcriptome analysis across castes based on the ISO-seq data revealed an unprecedented number of transcript isoforms, including many caste-specific isoforms. We also identified a number of conserved long non-coding RNAs that evolved specifically in ant lineages and several that were conserved across insect lineages. Conclusions We produced a high-quality chromosome-level genome for M. pharaonis, which significantly improved previous short-read assemblies. Together with full-length transcriptomes for all castes, we generated a highly accurate annotation for this ant species. These long-read sequencing results provide a useful resource for future functional studies on the genetic mechanisms underlying the evolution of social behaviors and organization in ants.


Background
Ants are an ecologically diverse and extraordinarily successful animal group, which occupy almost all terrestrial ecological niches [1]. As social insects, ants live in colonies composed of up to millions of individuals, which develop into different social castes with remarkable division of labor and substantial variations in morphology, physiology, and behavior [2]. The sexual castes, including reproductively active queens, gynes (virgin queens), and males, are specialized for sexual reproduction, whereas the worker caste, which can be divided into distinct sub-castes in some species, are specialized for non-reproductive support roles, such as constructing, maintaining, and defending the nest, collecting food, and rearing the brood [3].
Understanding the genetic mechanisms underlying caste development and differentiation processes has been the major focus of recent studies on social insects. Such research has indicated that caste differentiation involves the regulation of both genetic and epigenetic factors [4]. Comparative genome and transcriptome studies have identified several key genes that show differential expression patterns among castes and may contribute to caste-specific phenotypes, e.g., vitellogenin, foraging, arrestin, and insulin/insulin-like growth factor signaling [1,[5][6][7][8][9][10][11][12]. Recent studies also suggest that alternative splicing Alternative splicing (AS), which can increase genetic regulatory complexity, may contribute to phenotypic plasticity in eusocial insects [11,[13][14][15][16]. Additionally, epigenetic mechanisms, such as long noncoding RNAs (lncRNAs), may also participate in gene expression regulation during caste differentiation [17,18]. Particularly, comparative genomic studies cross multiple ant lineages have identified many conserved lncRNAs that may play potential roles in the evolution of the caste system in ants [19,20].
However, most previous genome studies have relied on short-read sequencing technology [1,21]. This has resulted in fragmented assemblies with many sequencing gaps, which is primarily due to high GC content or repeat regions failing to sequence. Additionally, short-read-based RNA-seq also often fails to resolve complex Alternative splicing (AS) isoforms, which are ubiquitously present in eukaryotes [22]. Single-molecule real-time (SMRT) long-read sequencing overcomes these limitations by generating ultra-long reads and offering different solutions to solve genome assembly problems, including complex regions with repeated elements or segmental duplications or regions with high GC content [23]. Long-read sequencing is also beneficial in transcriptomics by providing full-length reads that span the entire transcript isoform, thereby eliminating the need for transcript reconstruction and inference. Thus, full-length isoform sequencing (ISO-seq) can substantially improve annotations of reference genomes, characterize isoforms in important genes, capture alternative splice variants, and identify lncRNAs. Currently, only 27 ant genomes have been published, most of which are limited in their quality [21]. Therefore, high-quality genomes and full-length transcriptomes of ant species are needed to understand the molecular mechanisms involved in caste differentiation and the reproductive division of labor.
The pharaoh ant Monomorium pharaonis (NCBI: txid307658; Fig. 1A) is an emerging model animal for genomic and molecular studies of caste differentiation in social insects. Unlike most ant species, the pharaoh ant has a very short life span, is easy to rear, and can mate and reproduce within the colony, which makes them a perfect model organism for genetic studies. The first draft pharaoh ant genome was assembled on the basis of short reads [24], resulting in a very fragmented assembly with a scaffold N50 length of only 75.38 kb.
In this study, using Pacific Biosciences (PacBio) SMRT DNA Sequencing and ISO-seq technology combined with Illumina short reads and Hi-C (high-throughput chromosome conformation capture) data, we produced a high-quality chromosomelevel reference genome and high-quality transcriptome for the pharaoh ant. Using these data, we further analyzed the proteincoding genes, AS isoforms, and lncRNAs. This study should help enhance our understanding of the genetic and epigenetic mechanisms of complex ant societies.

Genome assembly, assessment, and gene prediction
Following routine 17-mer analysis [25] with short-read sequencing, the genome of M. pharaonis was estimated to be 342 Mb ( Supplementary Fig. S1, Table S1). Using other k-mer sizes produced similar estimations. We generated 33 Gb (∼103×) of Illumina short-read sequencing data and >31 Gb (∼96×) of PacBio sequencing data, resulting in 4,151,307 total reads (Supplementary Table S2). Genome of M. pharaonis was assembled into contigs by Canu using the PacBio Sequel sequencing data [26] and was scaffolded using the SSPACE longRead scaffolder [27]. The assembled scaffolds were gap-filled using the PBJelly program [28], and polished with the PacBio data and short sequencing reads using Quiver and Pilon, respectively (Supplementary Table S3, see Methods for details). By BLAST searching against the Bacteria and Virus databases, we identified 151,589 bp of contaminated sequences, mainly from insect endosymbionts, such as Wolbachia, Bacillus, Acinetobacter, and Candidatus. Duplicated haplotigs and artifacts were identified and removed using the purge haplotigs pipeline [29]. After removal of the contaminated sequences, duplicated haplotigs, and artifacts, the final assembly from the PacBio reads was 313 Mb with a scaffold N50 length of 3.85 Mb (193 scaffolds) and contig N50 length of 2.77 Mb (301 contigs) ( Table 1). The Phred quality value (QV) of the whole genome was calculated as QV = 50 (99.999% accuracy), which suggests that the assembly was of high quality [30,31].
Hi-C uses high-throughput sequencing to map genome-wide chromatin contacts and has been widely used as a scaffolding method in genome assembly [32]. We generated 14.82 Gb of Hi-C sequencing data and mapped them to the polished pharaoh ant genome using Juicer software [33] after filtering low-quality data with Hic-Pro [34] to improve the connection integrity of the contigs. The locations and directions of contigs were determined by 3D de novo assembly (3d-DNA) software [35] with default parameters, after which the contigs were successfully clustered and anchored to 11 linkage groups (Fig. 1B, Supplementary Table S4), which covered 94% of the pharaoh ant assembled sequences. Last, we obtained a high-quality chromosome-level pharaoh ant genome with a contig N50 length of 2.5 Mb and scaffold N50 length of 27.2 Mb (Table 1). This final assembly produced a shorter N50 than before Hi-C linkage because some artificial links introduced by SSPACE were further removed during the Hi-C assembly process if the links were not supported by Hi-C data or violated Hi-C links. Comparison of scaffold N50s and contig N50s of 27 short-read-assembled (pink filled circles) and 4 long-read-assembled (blue filled triangles) ant genomes. Previous short-read assembly for M. pharaonis is marked on the plot. (D) Genome collinearity of short-read and PacBio long-read assemblies shows that PacBio assembly exhibits better coverage of high GC content regions and repeat sequences. Blue indicates genes assembled by both sequencing methods; red indicates genes that are incomplete in short-read assembly but complete in PacBio assembly. Compared with the other 27 published ant genomes, which were mostly sequenced and assembled using short-read sequencing, the pharaoh ant genome assembly showed a significantly higher contiguity level (Fig. 1C, Supplementary Table S5). Our genome assembly with PacBio reads was also more complete than other published ant genomes, with gaps only accounting for 0.0867% of the new assembly compared to an average of 3.75% for other ant genomes with short-read assembly. Specifically, we compared genomic regions and found that a large number of regions with high GC content were missed in previous short-read assemblies of the pharaoh ant genome [24] but are covered in the new assembly (Fig. 1D). Specifically, 9.76% of genes with >70% GC content (4 of 41) and 11.30% (52 of 460) of genes with 60%-70% GC content were missing in previ-ous short-read assemblies but were recovered in our assembly, thereby indicating that the PacBio-assembled genome had significant advantages for high GC content genes (Supplementary  Table S6). Furthermore, the completeness of our PacBio assembly was assessed by BUSCO, which indicated that 98.4% of the 4,415 expected Hymenoptera conserved genes were identified as complete (Table 1).
Gene prediction was first performed by combining homology-, de novo-, and transcriptome-RNA-seq-based searching and identification methods. The ISO-seq data were then used to further improve the gene models predicted in the previous steps (see Methods for details), including the annotation of untranslated regions (UTRs), introduction of new coding exons, modification of incorrect gene models,

High-frequency chromosome recombination in ant genome
Chromosome-level assembly can provide improved resolution to construct ancestral karyotypes and detect genome rearrangement events during speciation [36]. To demonstrate the advance in chromosome-level assemblies, we performed genome collinearity analyses between the chromosome-levelassembled pharaoh ant genome (2n = 22) [37,38] and the clonal raider ant (Ooceraea biroi) genome (2n = 28) [39]. The synteny map spanned 14 O. biroi chromosomes and 11 M. pharaonis chromosomes, covering 94% of the M. pharaonis genome ( Fig. 2A). The longest syntenic block spanned 530 genes in the pharaoh ant. On average, only 3.17 genes were maintained in the same syntenic block between the 2 species, implying a high frequency of rearrangement in the 2 genomes. Furthermore, we detected ∼150 fissions/fusions at the interchromosomal level with >500 kb block resolution between the 2 species. This represents 2.04 chromosome breakpoints per Mb per MY, a faster rate than that reported for some insect groups such as the Drosophila genus [40]. To detail the micro-synteny evolutionary pattern across ant lineages, we investigated the orthologs of genes upstream and downstream of feminizer (fem) and complementary sex determiner (csd) across 11 ant species using their recent PacBio genome assemblies from the Global Ant Genomics Alliance (GAGA) and across 2 wasp species downloaded from the NCBI. Csd is the primary sex-determining signal in most eusocial Hymenoptera and arose from the duplication of the fem gene, which plays a key role in sex determination [41,42]. On the basis of synteny analysis of fem and csd and neighbor genes, fem was present in all investigated species; however, its synteny with neighbor genes underwent several translocation and recombination events during the diversification of ant lineages (Fig. 2B). In contrast, not all ant species possessed the csd homolog and genomic locations differed among ant species. These results thus indicate that csd and fem may function differently in each lineage.

ISO-seq significantly improves gene annotation
Transcriptome data allow us to identify all expressed genes and provide important evidence for gene annotation. Currently, most published genomes have been annotated using RNA-seq data by either mapping short reads or pre-assembled transcripts with short reads onto reference genomes [43]. Single-molecule long-read sequencing produces a full-length transcript of up to 10 kb, which can be readily used for gene prediction without the need of assembly. In principle, therefore, this can significantly improve gene annotation. To provide insight into how gene annotation can be improved with long-read ISO-seq, we sequenced total RNA from the whole bodies of M. pharaonis workers, gynes, queens, and males using 2 sequencing platforms, i.e., PacBio SMRT for long reads and BGI-seq for short reads. We compared the performance of these 2 datasets in gene prediction and isoform annotation. In total, we obtained 62 Gb of long-read transcriptome data (Supplementary Table S7) and 236 Gb of RNA-seq data (Supplementary Table S8). We then generated 2 annotations for M. pharaonis using the ISO-seq and RNA-seq data separately and compared gene model predictions, AS events, UTR annotations, and predicted gene completeness (Supplementary Table S9). The ISO-seq annotation identified 186,499 transcripts on 10,626 protein-coding gene loci, with an average of 5.37 exons per transcript. On the basis of the analysis, the ISO-seq annotation improved upon the RNA-seq annotation in several ways. First, the UTRs of 10,004 genes annotated in the ISO-seq version were missed in the RNAseq annotation (Fig. 3A). Second, RNA-seq annotation missed ≥1 exon in 2,093 genes, which were identified in the ISO-seq annotation (Fig. 3B). Third, the ISO-seq annotation also corrected the models of 58 genes falsely annotated into multiple genes ( Fig. 3C) and 99 genes mistakenly merged with neighbor genes (Fig. 3D) in the RNA-seq annotation. Although high-depth RNA-seq data should, in principle, provide single-base resolution for transcriptome profiling, we found 279 genes in the ISOseq annotation that were missing in the RNA-seq annotation. Among them, >18% were high GC content genes and 38% had >200 bp repeat sequences, further demonstrating that PacBio is better for sequencing high GC content genes and repeat sequences. In total, the coding area of 15.86% of genes was refined with the ISO-seq data, thus highlighting the power of long-read sequencing in gene annotation. Altogether, we annotated 15,327 genes in the pharaoh ant after merging the ISO-seq and RNA-seq annotations.

AS landscape of M. pharaonis
To identify the AS transcripts, we first clustered all high-quality long reads into final polished isoforms. More than 97.95% of the consensus transcripts were mapped to the reference genome using GMAP [44] (Supplementary Table S10), again indicating the high completeness of the reference genome. We next collapsed the redundant isoforms into 186,499 isoforms, covering 11,499 genes expressed in ≥1 caste. Splice junctions (SJs) were detected according to the 2 pairs of dinucleotides presented at the beginning and end of the introns encompassed by the junctions. The SJs were dominated by the canonical GT-AG form, which accounted for >94.72% of total SJs. More than 99% of the SJs with the GT-AG form identified from ISO-seq were also supported by the RNA-seq data (Supplementary Table S11). These findings suggest high accuracy of the exon-intron boundary structure based on long reads and strongly support the validity of the alternative-spliced isoform detection.
A practical question in transcriptome sequencing is at what sequencing depth the data can provide sufficient signals for AS event detection and comparison. The high-coverage ISO-seq data generated here allowed us to address this question by per-  forming saturation analyses with subtractive samples. We evaluated the effects of the amount of sequencing data on the number of consensus transcripts, genome coverage, total number of isoforms, detectable genes, AS events, and detectable genes with AS ( Fig. 4A). By mapping the high-depth RNA-seq reads (236 Gb) onto the M. pharaonis genome, we estimated that 140.22 Mb of genomic regions could be transcribed in ≥1 caste. Furthermore, from the 59 Gb of raw ISO-seq long-read transcripts produced for all samples, we detected 129 Mb of expressed regions that covered 92% of potential transcribed regions detected by RNA-seq. Indeed, we found that the number of consensus transcripts, size of expressed regions, and total number of isoforms increased with the amount of ISO-seq data and only reached saturation at 50 Gb. This indicates an overabundance of transcripts in the M. pharaonis transcriptome and suggests that some low-abundance or rare transcript isoforms remain to be discovered with more sequencing data. An alternative explanation is that ISO-seq may produce artificial isoforms, and thus more novel isoforms could appear with the increase in sequencing data. To confirm this, we used RNA-seq data to validate the unique AS events for each isoform and found that ∼2% of isoforms detected in ISO-seq were not supported by RNA-seq. Nevertheless, we found that the numbers of expressed genes and genes with AS events, as well as the total number of AS events, had already reached their saturation at ∼10 Gb of ISOseq data. With this amount of data, we detected 9,656 expressed genes, covering ≥93.54% of genes from RNA-seq transcription evidence (see example shown in Supplementary Fig. S2). These results suggest that the sequencing data obtained for each caste (∼23.0, 14.0, 15.3, and 10.4 Gb for workers, gynes, queens, and males, respectively) were sufficient for covering most expressed genes and AS events.
To obtain the overall AS pattern in M. pharaonis, all ISO-seq data were pooled for AS event and gene isoform detection. Re- coverage, total number of isoforms, detectable genes, alternative splicing (AS) events, and detectable genes with AS. Consensus transcripts were yielded from multiple full-length non-chimeric reads in a single zero-mode waveguide by transcript clustering analysis. Because many isoforms could not be mapped to the reference genome owing to either sequencing errors or artificial transcripts, the total number of isoforms, which represents isoforms finally confirmed by mapping to the reference genome, was lower than the consensus transcripts. sults showed that >87% of expressed genes had ≥2 isoforms and, on average, each gene expressed 9 isoforms in all castes, indicating the complex nature of the ant transcriptome. Similar to that reported in humans and many other eukaryotic species [45], intron retention was the most dominant AS form, accounting for 48.77% of all AS events in the pharaoh ant. This ratio was also observed across caste samples (Fig. 4B, Supplementary Table S12). Of note, 654 genes had >50 isoforms. The most extreme case was the mitochondrial NADH-ubiquinone oxidoreductase gene, which was transcribed into 894 isoforms. These isoform-rich genes were enriched in many biological processes involved in cell signal transduction, including signal transduction (GO:0007165), cell communication (GO:0007154), signaling (GO:0023052), cation channel activity (GO:0005261), ion channel activity (GO:0005216), and potassium channel activity (GO:0005267) (Supplementary Table S13). The increasing transcription abundance of these genes through AS might enhance cellular responses to environmental stimuli.

Characterization of caste-specific AS isoforms
AS is an important mechanism in defining tissue specificity based on tissue-specific expression of transcripts of the same gene. Previous studies have shown that AS is associated with phenotypic variation in eusocial insects, where a single genome is able to encode for numerous caste phenotypes [11,13,15,16,46]. Thus, we investigated isoform specificity and commonality among the 4 castes. Because of the high cost of ISO-seq, we used pooled samples for each caste instead of biological replications to mitigate the variation across individuals/colonies to produce high-coverage sequencing data that ensures that we discovered isoforms with low expression levels. Owing to the lack of biological replications, our results might only be representative for the colony. However, the AS isoforms produced from this in-depth investigation can also be valuable as a reference for further studies. Among all expressed genes, 5,359 tran-scribed ≥1 caste-specific isoform that was only presented in 1 caste. These results suggest that AS has had pervasive effects on genome-wide protein-coding genes with diverse functions that may contribute to caste differentiation. Following KEGG analysis, we identified many genes with caste-specific isoforms related to the insulin and mTOR signaling pathways, which play key roles in regulating caste differentiation on morphology and longevity [12,47,48] (Supplementary Tables S14-17).
To further characterize the caste-specific AS isoforms, we highlighted some functionally important genes that may play important roles in ant sex determination and caste differentiation. Feminizer (fem) functions as a binary switch gene participating in sex determination and sexual differentiation in Hymenoptera [41,42,49]. In the pharaoh ant, fem consisted of 8 coding exons. However, full-length transcripts with all coding exons were only expressed in the female castes ( Supplementary Fig.  S3), with the male caste just expressing the first 2 coding exons. These results suggest that they have different functions according to their differences in protein domains. The sex-based differences in the AS pattern of fem seem to be conserved, similar to the transformer gene, across different insects [49]. Moreover, we found that the female castes expressed diverse transcript isoforms of this gene, with many isoforms possibly functioning as lncRNAs.
By choosing the highest-expressed isoform for each caste, we screened out the genes with dominant AS isoforms for each caste and selected 267 genes with caste-specific dominant AS isoforms (Supplementary Table S18). We reasoned that these genes might be potential candidates involved in ant caste differentiation via AS. On the basis of GO term analysis, we revealed that these genes were significantly enriched in phosphotransferase activity (GO:0016773), carbohydrate derivative binding (GO:0097367), and neurotransmitters (GO:0005328) (Supplementary Table S19). For example, cytokine receptor-like factor 3 (crlf3), which is a neuroprotective erythropoietin receptor in beetle (Tribolium castaneum) and locust (Locusta migratoria) neurons and emerged with the evolution of the eumetazoan nervous system [50,51], had several caste-specific isoforms and different dominant expressed isoforms in each pharaoh ant caste (Supplementary Fig. S4). The worker-specific isoform of this gene showed the highest expression level in workers. Queens also mainly expressed their caste-specific isoform. Considering the key role of the nervous system in caste differentiation, these results further indicate that the caste-specific dominant AS isoform of crlf3 may influence caste differentiation in pharaoh ants.

Identification and comparative analysis of lncRNAs
LncRNAs are a group of RNA molecules (>200 nt) that are not translated into proteins but that play important roles in a variety of biological processes [52]. The detection of lncRNAs has been restricted by short-read RNA sequencing technology because short-read sequencing fails to capture the full length of extremely long lncRNAs. Therefore, the number of previously detected lncRNAs is likely to be underestimated and should be improved by ISO-seq. Here, we detected 1,225 long transcripts that likely function as lncRNAs on the basis of their lack of open reading frames (see Methods). The lengths of these lncRNAs varied from 923 to 30,849 bp, which are far longer than those of lncRNAs predicted in other ant species, such as Camponotus floridanus, Harpegnathos saltator, and O. biroi, using RNA-seq (Fig. 5A) [20,53]. Using their relative positions to the annotated genome, pharaoh ant lncRNAs could be classified into 4 categories: i.e., antisense, overlapping with coding sequences, intronic, and intergenic ( Fig. 5B) [54]. Most lncRNAs were located in the intergenic region (64.33%), as observed in other organisms, and probably function as transcription regulators [55,56]. The number of lncRNAs in M. pharaonis varied among castes. Gynes had the highest number of lncRNAs, whereas males had the lowest number (Table 3). Queens had the longest lncRNAs among the 4 castes (Wilcoxon test, P < 0.01), with an average length of 5,675 bp, whereas males had the shortest lncRNAs among the 4 castes (Wilcoxon test, P < 0.01), with an average of 3,456 bp (Table 3).
Although investigating how lncRNAs work is challenging because of their relatively weak expression, cell/tissue-specificity, and variable functions, some lncRNAs exhibit high conservation in either sequence or secondary structure across species, thus providing a way to detect evolutionary signals for functional importance. By genomic comparison of the 4 ant genomes sequenced with long reads, we identified genomic regions showing extremely low mutation rates with high conservation across all detected species. We found 961 (78%) lncRNAs that contained ≥1 highly conserved genomic element across all ant species, which likely experienced strong purifying selection during ant evolution. Based on orthologous analysis of lncR-NAs between ants and parasitoid wasps (Nasonia vitripennis), ants and bees (Apis mellifera), and ants and flies (Drosophila melanogaster), we obtained a set of insect-conserved lncRNAs. Among the ant-conserved lncRNAs, 33 were conserved between ants and bees, 12 were conserved between ants and parasitoid wasps, and 6 were conserved between ants and flies, thus demonstrating that most of these lncRNAs were ant-specific.
We further identified conserved lncRNAs showing differential expression between worker and gyne by using the brain RNA-seq data produced by Qiu et al. [57]. We detected the differently expressed lncRNAs using DESeq2 and then classified lncRNA transcripts as differentially expressed between castes when the false discovery rate (FDR)-adjusted P-value was ≤0.05. By doing so, we detected 32 ant conserved lncRNAs that were differentially expressed between worker and gyne in M. pharaonis ( Fig. 5C and D). For example, the single-exon lncRNA, PB.5100.1, which is located between ACHE1 (Acetylcholinesterase 1) and RCBTB1 (RCC1 and BTB domain-containing protein 1), was highly expressed in gyne samples (Fig. 5D).

Conclusions
Our study provided a high-quality chromosome-level genome assembly and full-length transcriptomes for all 4 castes of the pharaoh ant. Our newly assembly genome showed markedly improved quality compared with previous short-read sequencing assemblies [24]. Our comparison demonstrated the importance of using long-read sequencing to cover genomic assembly of both repeat and high GC content regions, particularly for the latter, which often span genomic elements with regulatory functions (e.g., promoters). By combining PacBio assembly and Hi-C data, our study presented an efficient way in which to produce a chromosome-level assembly for the ant genome. This has now been adapted as a standard genomic sequencing and assembly pipeline for the GAGA, which aims to generate high-quality assemblies for ∼200 ant species representing broad diversity [21]. Furthermore, our ISO-seq not only produced a high-quality genome annotation for M. pharaonis but also highlighted the complexity and diversity of the ant transcriptome, which may be associated with caste differentiation. Our study also identified many protein-coding genes with caste-specific isoforms and a core set of lncRNAs that may play conserved roles in ant caste differentiation over the long evolutionary process of ants. These datasets will be valuable for downstream functional studies to reveal the genetic mechanisms underlying caste differentiation in ants.

Sample collection
M. pharaonis were collected from a house in Mengla, Xishuangbanna district, Yunnan Province, China. The colony was brought back to the laboratory and reared under constant conditions, i.e., temperature of 27 • C, relative humidity (RH) of 65%, and light: dark cycle of 12 h:12 h (light period 08:00-20:00, dark period 20:00-08:00). The queens and workers used in this study were from the starting colony (MP-MQ-018). Gyne and male samples were obtained from a newly developed colony, which was isolated from the starting colony with only eggs, larvae, and workers. For DNA and RNA sequencing, ants were collected and flashfrozen in liquid nitrogen and stored at −80 • C for later extraction. The collection procedures were in accordance with protocols approved by the Animal Care and Use Committee of the Kunming Institute of Zoology, China.

DNA and RNA extraction
Because of their small size, genomic DNA from pools of worker samples was extracted via an insect sodium dodecyl sulfate DNA extraction protocol provided by the Novogene Corporation (Nanjing, China). Total RNA was extracted from the pooled individuals of each caste (male, worker, gyne, and queen) for PacBio fulllength isoform sequencing (ISO-seq) using a Trizol Extraction Kit according to the manufacturer's instructions. All samples were collected from sub-colonies developed from the same starting colony to reduce biological variation of the data. There were 3 major purposes for performing ISO-seq analyses in the present study: (i) to assist in annotation, (ii) to identify AS forms, and (iii) to identify lncRNAs. Because these analyses rely on the full coverage of expressed transcripts, especially those with low levels of expression, the sequencing depth of the ISO-seq data was more important than biological replications. Quantification analyses were performed using the RNA-seq data (with each having 5 biological replicates) produced in our previous study on the same ant castes [57]. Male brains were dissected in cold diethyl pyrocarbonate (DEPC)-treated phosphate-buffered saline (PBS). Five replicates of pooled male brains (n = 20 males/pool) were extracted using the RNA Trizol Extraction Kit (Invitrogen, USA).
DNA and RNA quality were checked by Qubit (Life Technologies, USA). DNA and RNA integrity were examined by agarose gel electrophoresis.

PacBio ISO-seq library construction and sequencing
The full-length ISO-seq libraries were constructed using total RNA. First-strand complementary DNA (cDNA) was synthesized using a ClontechSMARTer PCR cDNA Synthesis Kit (ClonTech, Takara Bio Inc., Japan) with anchored oligo(dT) as the primer. Double-stranded cDNA was generated by large-scale PCR using an optimized PCR cycle number. Separation of different cDNA fractions by length was generated using the BluePippin Size Selection System. Once double-stranded cDNA was prepared, the SMRTbell libraries were constructed using the PacBio SMRTbell Template Prep Kit 1.0 (Pacific Biosciences, USA) following the vendor's protocols. Three SMRT RNA libraries, 1-2 k, 2-3 k, and 3-6 k, were prepared for worker and gyne samples. Mixed libraries without size selection were prepared for queen and male samples as the protocol improved. The SMRTbell libraries were then sequenced on the PacBio Sequel platform (PacBio Sequel System, RRID:SCR 017989).

RNA library construction and sequencing
In parallel, RNA sequencing of male brains was performed by constructing a Micro-Tn5 Transposon Library following the methods described by Zhu et al. [58] and sequencing on the BGISEQ-500 PE100 platform (BGISEQ-500, RRID:SCR 017979). The RNA-seq data from workers, gynes, and queens were requested from Qiu et al. [57].

Genome sequencing
To achieve a high-quality pharaoh ant genome assembly, we adopted a combination of sequencing methods including Illumina and PacBio sequencing.
For Illumina sequencing, 3 short-insert-sized DNA libraries (250, 500, and 800 bp) were constructed using an Illumina TruSeq Nano DNA LibraryPrep Kit (Illumina, USA) following the manufacturer's instructions, and then sequenced on an Illumina Hiseq 2000 instrument using a whole-genome shotgun sequencing strategy at BGI-Shenzhen (Shenzhen, China). We obtained a total of 33 Gb of clean data with ∼103-fold sequencing depth.
For PacBio sequencing, the BluePippin Size-Selection System was used to perform size selection. In total, DNA was sheared to a ∼20-kb targeted size using ultrasonication (Covais, Woburn, MA, USA), with a final 20-kb DNA fragment retained to construct the libraries. The constructed libraries were sequenced using the PacBio Sequel system at Novogene (Tianjin, China), and a total of 12 SMRT cells were used to yield 31 Gb of sequencing subreads with an average length of 7.5 kb and N50 of 11.6 kb.

Genome size estimation
We estimated the size of the pharaoh ant genome using routine 17-mer frequency analysis [25]. The genome size was estimated according to the formula: genome size = No. k-mers/peak of depth.

Genome assembly by PacBio long reads
We used an in-house pipeline to perform genome assembly, which included 5 steps: (1) Contig construction Canu (Canu, RRID:SCR 015880) v1.5 was used for 96-fold PacBio Sequel read assembly with default parameters and the complete Canu pipeline. For the Canu assembly, contig N50 was 1.26 Mb and total assembly size was 323 Mb. (2) Linking contigs to scaffold Scaffolding was performed using the SSPACE long-read scaffolder. The SSPACE-LongRead uses the BLASR aligner, which aligned the long-read set to the Canu contig assembly. We improved assembly contiguity and acquired a larger scaffold N50 than that obtained via the Canu contig assembly. (3) Filling gaps within scaffolds After scaffolding, PBJelly (PBJelly, RRID:SCR 012091) was used to fill the gaps within the scaffold using the PacBio sequences. The running parameters were as follows: -minMatch 8 -sdpTupleSize 8 -minPctIdentity 75 -bestn 1 -nCandidates 10 -maxScore -500 -nproc 13 -noSplitSubreads. Most gaps were filled in this step. This resulted in an assembly of 325 Mb, with a scaffold N50 of 3.63 Mb, contig N50 of 2.63 Mb, and number of undetermined bases (Ns) of 284 kb (0.08% of total genome assembly). Thus, contig N50 showed marked improvement (2 times) compared with the Canu contig assembly.
(4) Two rounds of genome assembly polishing Because the PacBio raw reads have high sequencing error rates, we performed 2 rounds of genome assembly polishing. In the first round, Arrow software was used to map the PacBio sequences to the genome assembly. Small insertions/deletions (indels) and substitutions were then corrected, and consensus sequences were obtained. We performed the second round of polishing using high-quality Illumina paired-end short reads. First, the Illumina short reads were mapped to the assembly using BWA (BWA, RR ID:SCR 010910), after which Pilon (Pilon, RRID:SCR 014731) was used to correct the sequences by input BAM alignments and assembly sequences. The parameters were as follows: "-changes -vcf -diploid -fix bases -mindepth 8." Results showed that Pilon corrected 14,680 substitutions, 46,245 small insertions, and 9 ,410 small deletions for the raw read PacBio genome assembly. (5) Removal of contaminated sequences, duplicated haplotigs, and artifacts By aligning the genome sequences against the Bacteria and Virus databases using BLAST (-e 1e−5), we obtained a total length of 2,424,757 bp contaminated sequences, which accounted for 0.75% of the genome sequences. The most frequently aligned bacteria were endosymbionts of insects, such as Wolbachia, Bacillus, and Candidatus. We filtered out the contaminated contigs with contaminated sequences ≥20%. We did not find virus contamination in the genome sequences. Altogether, we filtered out 151,589 bp of contaminated sequences.
Purge haplotigs was used to resolve duplicated haplotigs and artifacts in the genome assembly. First, we remapped the PacBio long reads to the genome assembly using minimap2. Purge haplotigs was then used to calculate sequencing depth based on BEDtools (BEDTools, RRID:SCR 006646) [59] and generate a read-depth histogram. We chose 3 cut-offs (depths of 10, 25, and 85) to capture potential duplicated regions and haplotypefused regions. Finally, purge haplotigs filtered out 12,469,513 bp of haplotigs and artifacts. The final PacBio read-based genome assembly of the pharaoh ant was 312,903,204 bp.

In situ Hi-C library preparation and chromosome assembly
To establish the chromosome-level reference genome, worker pharaoh ants were used to construct a Hi-C library by modifying the protocol in Rao et al. [60]. The library was sequenced on the BGISEQ-500 platform under 100 paired-end mode. We used HiC-Pro (HiC-Pro, RRID:SCR 017643) to filter invalid read pairs, such as self-ligation, non-ligation, start-nearRsite, PCR amplification, random break, largeSmallFragments, and ExtremeFragments. The valid read pairs were mapped to the polished pharaoh ant genome. The contact count between contigs was calculated and normalized by restriction sites in sequences. We successfully produced 11 chromosomes, which occupied 94% of the genome, using the 3D-DNA pipeline. The 11 chromosomes were consistent with previous karyotype analyses of the pharaoh ant [37].

Genome assembly evaluation
To assess base quality of the whole-genome assembly, we first aligned the high-quality Illumina short reads to the final base error-corrected assembly. The percentage of total mapped reads was 97%. We then used the variant detector FreeBayes to calculate the homozygous variant ratio by inputting the BWA alignments. We detected the homozygous variants with parameters "-C 2 -0 -O -q 20 -z 0.10 -E 0 -X -u -p 2 -F 0.6," as per Jain et al. [31]. The homozygous variations were derived from base-calling errors as the genome is diploid. The error rate was calculated as 0.001%, indicating a base QV of 50. The QV and identity were calculated using the algorithm in Jain et al. [31]. For protein-coding gene regions, we ran BUSCO (BUSCO, RRID:SCR 015008) on the genome mode to search for conserved genes in Hymenoptera species.

Combined homology-, de novo-, and RNA-seq-based gene predictions
Combined homology-, de novo-, and transcriptome-RNA-seqbased gene predictions were used to annotate the proteincoding sequences in the pharaoh ant genome, as used in our previous study on the leopard gecko [63]. For the homology-based method, reference gene sets of D. melanogaster, A. mellifera, Linepithema humile, N. vitripennis, Solenopsis invicta, and M. pharaonis from the Ensembl and NCBI databases were used. We used the same parameters and methods as for the leopard gecko [63].
For de novo prediction of the pharaoh ant genome, methods and parameters were the same as used for the leopard gecko [63].
The transcriptome-RNA-seq-based method was performed using the pharaoh ant RNA-seq data from the brains of different castes and other tissues downloaded from the NCBI database (NCBI accession Nos. DRR032044-DRR032266). TopHat v1.3.3 (TopHat, RRID:SCR 013035) was used to identify splice junctions (SJs) by aligning the RNA-seq reads to the pharaoh ant genome. Cufflinks v2.2.1 (Cufflinks, RRID:SCR 014597) was applied to assemble transcripts using the aligned RNA-seq reads. After that, we built non-redundant reference gene sets based on a priority order of transcriptome-based evidence > homologybased evidence > de novo-based evidence to combine gene evidence using the in-house script from Xiong et al. [63]. At this step, a total of 15,576 non-redundant protein-coding genes were annotated.

ISO-seq isoforms improve gene model prediction
The ISO-seq approach can improve gene annotations in eukaryotic genomes. We incorporated the ISO-seq data to improve the gene model predicted in the previous step. We first compared the location of PacBio isoforms with the reference gene location using gffcompare. The overlapping PacBio isoforms on the same strand as the reference gene loci were used to refine the gene models, introduce AS events, and update the annotations of UTRs. We modified incorrect gene models caused by incorrect gene prediction. To further investigate missing or incomplete protein-coding gene models, a Markov model was estimated with 1,000 high-quality genes using the trainGlimmerHMM tool included in the GlimmerHMM software package. The putative protein-coding sequence of each PacBio isoform was identified using the Markov model. Finally, by comparing the gene models to the reference genome, we generated 15,327 protein-coding genes, which was the final predicted gene set.

Transcriptome analysis pipeline for ISO-seq
We ran ISO-seq analysis using SMRT Link v5.0 [64] on the command line via pbsmrtpipe [65] to obtain the high-quality PacBio isoform dataset. Analysis included the following 4 steps: (1) Circular consensus sequence (CCS) identification CCSs were created from the raw subreads of PacBio sequences using CCS software (v3.0.0) within the pbsmrtpipe package. The CCS software takes multiple reads of the same SMRTbell sequence and combines them using a statistical model to produce 1 high-quality consensus sequence. (2) Classification of CCSs to full-length reads CCSs were classified as full-length non-chimeric and nonfull-length reads. This was done by identifying the 5 and 3 adapters used in the library preparation, as well as the poly(A) tail. A read was considered full length if both primers were detected at the ends with a poly(A) tail signal of ≥12 consecutive "A's preceding the 3" primer. This step also removed primers and polyA/T tails accordingly. (3) Clustering of sequences based on similarity Isoform-level clustering was performed by using the Iterative Clustering and Error correction [11] algorithm and clustering the classified transcript sequences on the basis of similarity. For each cluster, the consensus transcripts were obtained. (4) Error correction polishing of isoforms The error correction Arrow software in the pbsmrtpipe package was used to polish the consensus sequences generated from the transcript clustering step. Arrow mapped PacBio raw reads to obtain the consensus and variant calls. This output polished high-quality (predicted accuracy ≥99%) full-length isoform consensus sequences, as well as low-quality isoform consensus sequences. (5) Alignment of isoforms to reference genome We used GMAP to align the isoform consensus sequences to the genome assembly with parameters "-f samse -n 0". A Python script from the PacBio repository [66] was then used to predict the transcript structure and remove redundant transcripts. Each isoform was compared with the refer-ence annotation by gffcompare and the isoforms were further classified into 8 groups based on their exon structures.

Rarefaction analysis of ISO-seq data
To investigate whether the sequencing depth of those data was sufficient to capture most of the transcriptome of interest, we performed rarefaction analysis on all data from the 4 caste sample libraries. We first pooled all sequencing data of the caste samples to reach a total of 58.8 Gb subreads. We then randomly selected 10%, 20%, 30%, . . . , 100% of total subreads to perform similar ISO-seq analyses to measure the (i) number of consensus transcripts, (ii) genome coverage, (iii) number of total isoforms, (iv) number of detected expressed genes, (v) AS events, and (vi) detectable genes with AS. All saturation curves were plotted using ggplot2 in the R package.

Identification of AS events
To verify the PacBio transcript isoforms, we analyzed the isoforms in relation to their SJs. The SJs could be divided into canonical and non-canonical according to the 2 pairs of dinucleotides present at the beginning and end of the introns encompassed by the junctions. The canonical SJs (GT-AG) accounted for ∼95% of all introns of the pharaoh ant PacBio isoforms. We also investigated the consistency of SJs between the RNA-seq and ISO-seq data. STAR (v2.4.0) was used to map the RNA-seq data to the reference genome, and all SJs were detected. We used a Python script (alternative splice.py) to detect AS events following Wang et al. [67]. This method was specifically designed to determine AS events using ISO-seq data and has been used in various ISO-seq studies [67][68][69]. The script uniquely designates all possible splicing patterns as an AS code according to the relative position of the alternative splice sites involved in the splicing variation. Five main modes of AS (intron retention, exon skipping, alternative 3 -acceptor, alternative 5 -donor, and alternative position [both 5 -donor and 3 -acceptor]) were identified. We visualized AS types using SVG implemented in Perl. We then compared the AS type variation among the 4 castes using a custom script.

Discovery of caste-specific AS isoforms among 4 castes
To investigate differential AS isoforms from the PacBio isoforms among the 4 castes, we used the scripts from the Cupcake package [66] to chain the isoforms together across the caste samples with default parameters. The isoforms from different caste samples that had an exact match for every exon boundary were chained together. Caste-specific isoforms were defined if the isoforms only existed in a unique caste sample. The caste-specific isoforms were compared with the AS isoform dataset, and those containing AS events were defined as caste-specific AS isoforms.

LncRNA identification from PacBio sequences
We identified lncRNAs from PacBio ISO-seq datasets using a customized pipeline comprising 4 steps: (i) The PacBio isoforms were aligned to gene models in the pharaoh ant genome. Isoforms that could not be aligned were considered novel sequences. We extracted the loci of novel sequences that did not overlap with the reference annotation or overlapped with the reference annotation but on the opposite strand. (ii) To filter out the potential coding sequences, we used BLAST to screen the sequences for homology with pharaoh ant proteins, and proteins from the functional database (UniProt). (iii) The CPC, PLEK, and CPAT programs were used to discriminate non-coding sequences from protein-coding genes. Sequences predicted as non-coding by all 3 software packages were deemed candidate lncRNAs. (iv) To eliminate the possible effects of transcription or splicing noise on the identification of lncRNAs, we filtered out those lncRNAs that were supported by <2 full-length PacBio sequencing reads.

Identification and characterization of conserved lncRNAs
Conserved lncRNAs within ants were identified by screening the annotated lncRNAs in highly conserved non-coding elements (CNEs) between ant genomes. The identification method was as follows: (i) We performed pair-wise whole-genome alignment using Lastz between the pharaoh ant genome and 3 published PacBio genomes (C. floridanus, H. saltator, and O. biroi) downloaded from the NCBI. Multiple alignments of the 4 ant genomes were then generated using Multiz, with the pharaoh ant as the reference. (ii) We used PhaseCons to estimate the genome conservation index and then identified the highly conserved elements (HCEs). Briefly, we used phyloFit to estimate an initial neutral phylogenetic model. We then ran PhastCons twice, first for estimation of conserved and non-conserved models and then for prediction of conserved elements. Finally, we identified 408,113 ant HCEs, covering 56 Mb of the pharaoh ant genome. (iii) We filtered the HCEs located in the protein-coding regions, resulting in 323,193 CNEs covering 32 Mb of the pharaoh ant genome. (iv) Finally, the annotated lncRNAs located in the CNEs were considered to be ant-conserved lncRNAs. Our analysis revealed a total of 961 conserved ant lncRNAs.
We also performed orthologous analysis of lncRNAs between ants and parasitoid wasps (N. vitripennis), ants and bees (A. mellifera), and ants and flies (D. melanogaster). We first performed whole-genome alignment among these genomes. The bee, parasitoid wasp, and fly genomes were each aligned to the pharaoh ant genome using Lastz. We then used liftOver (liftOver, RRID:SC R 018160) to compare the genome coordinates of ant-conserved lncRNAs to the wasp, bee, and fly genomes according to the "chain" alignment blocks, which are "chained" based on their location in both genomes. We used liftOver with default parameters. The ant lncRNAs located within or overlapping with conserved bee/wasp/fly genome regions were considered to be conserved insect lncRNAs.

Identification of differentially expressed lncRNAs between worker and gyne using Illumina RNA-seq data
The Illumina RNA-seq data from brain tissues of worker and gyne in Qiu et al. [57] were generated from the same batch, with each caste comprising 5 biological replicates. These data allowed us to identify differentially expressed lncRNAs between the 2 castes. First, ISO-seq transcriptome quantifications were performed with the Salmon pipeline (version 1.3.0) using RNAseq data of the 10 samples, respectively. In brief, RNA-seq data from the brain samples of worker and gyne were quasi-mapped to the ISO-seq transcriptome, after which bias correction options were turned on to account for GC bias and sequence-specific bias. Isoform expression level was estimated as transcripts per million. DESeq2 version 1.16.1 (DESeq2, RRID:SCR 015687) was subsequently used to determine the differentially expressed isoforms between worker and gyne samples. The lncRNA transcripts were classified as differentially expressed between castes when the FDR-adjusted P-value was ≤0.05. Finally, we identified 32 conserved lncRNAs from the differentially expressed isoforms, in which caste affects the lncRNA expression level significantly.