A high-quality reference genome for the fission yeast Schizosaccharomyces osmophilus

Abstract Fission yeasts are an ancient group of fungal species that diverged from each other from tens to hundreds of million years ago. Among them is the preeminent model organism Schizosaccharomyces pombe, which has significantly contributed to our understandings of molecular mechanisms underlying fundamental cellular processes. The availability of the genomes of S. pombe and 3 other fission yeast species S. japonicus, S. octosporus, and S. cryophilus has enabled cross-species comparisons that provide insights into the evolution of genes, pathways, and genomes. Here, we performed genome sequencing on the type strain of the recently identified fission yeast species S. osmophilus and obtained a complete mitochondrial genome and a nuclear genome assembly with gaps only at rRNA gene arrays. A total of 5,098 protein-coding nuclear genes were annotated and orthologs for more than 95% of them were identified. Genome-based phylogenetic analysis showed that S. osmophilus is most closely related to S. octosporus and these 2 species diverged around 16 million years ago. To demonstrate the utility of this S. osmophilus reference genome, we conducted cross-species comparative analyses of centromeres, telomeres, transposons, the mating-type region, Cbp1 family proteins, and mitochondrial genomes. These analyses revealed conservation of repeat arrangements and sequence motifs in centromere cores, identified telomeric sequences composed of 2 types of repeats, delineated relationships among Tf1/sushi group retrotransposons, characterized the evolutionary origins and trajectories of Cbp1 family domesticated transposases, and discovered signs of interspecific transfer of 2 types of mitochondrial selfish elements.


Introduction
Fission yeasts (genus Schizosaccharomyces) are an ancient group of fungal species that originated more than 200 million years ago (Rhind et al. 2011;Shen et al. 2020). They belong to the Taphrinomycotina subphylum of Ascomycota (Liu et al. 2009). There are 5 currently recognized species of fission yeasts: S. pombe, S. japonicus, S. octosporus, S. cryophilus, and S. osmophilus (Helston et al. 2010;Vaughan-Martini and Martini 2011;Brysch-Herzberg et al. 2019). S. pombe was adopted for biological research more than 70 years ago (Leupold 1950;Fantes and Hoffman 2016) and is one of the most prominent model organisms (Hoffman et al. 2015;Hayles and Nurse 2018;Vyas et al. 2021). In the last decade, S. japonicus has emerged as a powerful new experimental model for biologists, especially those interested in evolutionary cell biology (Klar 2013;Aoki et al. 2017;Rutherford et al. 2022). Genetic tools have also been developed for S. octosporus (Seike and Niki 2017), adding another fission yeast species to the list of experimentally amenable organisms.
The value of fission yeast species as research subjects relies to a great extent on the availability of well-annotated genomes.
S. pombe is the sixth eukaryote to have its genome fully sequenced (Wood et al. 2002) and has one of the most completely annotated genomes . The genomes of S. japonicus, S. octosporus, and S. cryophilus were published in 2011 (Rhind et al. 2011). More recently, long-read sequencing has been used to generate more contiguous genome assemblies of S. octosporus and S. cryophilus (Tong et al. 2019). These genomes have not only served as critical resources for molecular studies using these species but also enabled cross-species comparative genomic analyses, which have shed light on how centromere structure, transposon diversity, the roles of RNAi, and carbon metabolism have evolved in the past tens to hundreds of million years (Rhind et al. 2011;Tong et al. 2019).
S. osmophilus is the newest fission yeast species to be identified. It was initially discovered in bee bread of solitary bee species belonging to the Megachilinae subfamily (Brysch-Herzberg et al. 2019). Bee bread, which is a mixture of pollen and nectar, is the food source for growing bee larvae (Fig. 1a). An extensive fission yeast isolation study revealed that bee bread of solitary bees is the only type of habitat where S. osmophilus can be frequently isolated (Brysch-Herzberg et al. 2022). Like other fission yeast species, S. osmophilus cells propagate vegetatively by medial fission (Fig. 1b). During sexual reproduction, S. osmophilus usually form asci containing 8 spores (Fig. 1c). Analyses of the sequences of rRNA genes and the rrk1 gene showed that S. osmophilus is more closely related to S. cryophilus and S. octosporus than to S. pombe or S. japonicus (Brysch-Herzberg et al. 2019). Unique among fission yeast species, S. osmophilus is an obligate osmophilic organism that requires a high-osmolarity environment for growth, likely due to its long-term evolution in high-osmolarity habitats such as bee bread, nectar, and honey (Brysch-Herzberg et al. 2019).
In this study, we generated a high quality genome assembly for the type strain of S. osmophilus and comprehensively annotated the genes in this genome assembly. We performed a series of comparative analyses using this reference genome of S. osmophilus and obtained new insights into the evolution of centromeres, Two nest chambers can be seen in the photo. A pupa occupies the left chamber, where little bee bread is remaining. In the right chamber, a dead egg is visible, and the chamber is full of bee bread. Bee bread of solitary bees is the type of substrate from which S. osmophilus has most frequently been isolated. (b) An image of vegetative cells of the S. osmophilus type strain CBS 15793 T grown in liquid YES medium containing 30% glucose at 25°C. (c) An image of an ascus of the S. osmophilus type strain CBS 15793 T undergoing sporulation in liquid PMG medium containing 30% glucose at 25°C. (d ) The sizes of the contigs in the genome assembly of the S. osmophilus type strain CBS 15793 T . (e) Circos diagram illustrating the features of the 3 chromosomal contigs. From outer to inner: protein coding sequence density (bin size = 20 kb, range = 0-1); tRNA genes; 5S rRNA genes; full-length and intact retrotransposons (arrowheads) and LTRs (lines); GC content (bin size = 5 kb, range = 25-50%, the boundary between the backgrounds represents the genome-wide average GC content, 38.27%); centromeres, telomeres (not drawn to scale), rDNA arrays, and the donor mating type region; mappability calculated by running GenMap using 100 bp k-mers with up to 2 mismatches (Pockrandt et al. 2020). (f ) Time-calibrated phylogeny of fission yeast species. This timetree is based on a maximum likelihood tree inferred using a concatenation super-matrix of 1,060 "complete and single-copy" BUSCO genes present in 5 fission yeast species and the outgroup species S. complicate ( Supplementary Fig. 5a). The 3 time calibration nodes are shown as empty circles. The divergence time between S. octosporus and S. osmophilus was calculated using RelTime and the corresponding node is shown as a filled circle. (g-j) Plots showing synteny conservation between the nuclear genome of S. osmophilus (top of each plot) and the nuclear genomes of the other 4 fission yeast species (bottom of each plot). The orientations of the chromosomes of S. osmophilus are clockwise and the orientations of the chromosomes of other species are counterclockwise. The midpoints of centromeres are indicated by vertical lines on the outer ring. Colored lines connect single-copy orthologous genes of the 2 species shown in each plot. The color of a line is based on which chromosome the S. osmophilus gene is on. The S. octosporus and S. cryophilus genomes used in this analysis are those reported in (Tong et al. 2019). The S. pombe and S. japonicus genomes used in this analysis are the reference genomes.
telomeres, retrotransposons, Cbp1 family proteins, and mitochondrial genomes in the fission yeast lineage.

Genome sequencing, assembly, and quality assessment
S. osmophilus cells were grown in the modified YES medium at 25°C and cells in the logarithmic phase of vegetative growth were collected and frozen using liquid nitrogen. For long-read sequencing library construction, frozen cells were sent to Frasergen Bioinformatics Co., Ltd. (Wuhan, China) for genomic DNA extraction, sequencing library construction, and long-read sequencing on the Sequel II platform in the circular consensus sequencing (CCS) mode. A total of 430.07 Mb of PacBio HiFi read data was generated using CCS (v4.0, https://github.com/PacificBiosciences/ ccs). The HiFi reads have been deposited at NCBI SRA under the accession number SRR21149833.
For short-read sequencing library construction, genomic DNA was extracted using the MasterPure Yeast DNA Purification Kit (Epicentre). An Illumina sequencing library was prepared using home-made Tn5 transposase as previously described (Tao et al. 2019). Paired-end sequencing was performed on the Illumina NovaSeq 6000 System (2 × 150 bp read pairs) by Novogene Co., Ltd. (Beijing, China), and a total of 4.86 Gb of raw read data was generated. Low-quality reads were filtered or trimmed using fastp (v0.20.0, https://github.com/OpenGene/fastp) and 4.67 Gb of high-quality paired-end read data was retained (Chen et al. 2018). The short-read genomic DNA sequencing data has been deposited at NCBI SRA under the accession number SRR21149832.
One round of polishing with Illumina short reads was performed using Pilon v1.23 (Walker et al. 2014). The annotation of the mitogenome is described in a later section. The sequence and annotation of the S. osmophilus mitogenome have been deposited at GenBank under the accession number OP310968.
The copy number of rDNA repeat units was estimated as follows: Illumina sequencing reads were mapped to the nuclear genome assembly using BWA (Burrows-Wheeler Aligner) (Li and Durbin 2009). The average read depth in protein-coding genes was calculated using bedtools (Quinlan and Hall 2010) and taken as an approximation of the average read depth in single-copy regions. There are 16 full-length rDNA units and 5 partial rDNA units in the nuclear genome assembly. The average read depth in each rDNA unit was calculated using bedtools and divided by the average read depth in protein-coding genes to give a copy number estimate. A weighted sum of the copy number estimates of all 21 rDNA units was calculated, with the weight being the ratio of the length of an rDNA unit to the average length of full-length rDNA units.
To assess the quality of the genome assembly, raw HiFi reads and fastp-filtered Illumina paired-end reads were mapped to the genome assembly using minimap2 (v2.21, https://github.com/lh3/ minimap2) (Li 2018) and BWA (v 0.7.17-r1188, https://github.com/ lh3/bwa) (Li and Durbin 2009) Orthologous cnt regions are grouped together. tRNA genes are denoted using single-letter codes of cognate amino acids. Isoacceptor tRNAs that accept the same amino acid but have different anticodons are distinguished by the color of the letter (white: tRNA-ArgACG, tRNA-GluTTC, and tRNA-LeuCAA; black: tRNA-ArgTCG, tRNA-GluCTC, and tRNA-LeuAAG). Syntenic tRNA gene clusters are denoted by colored arrows. The orientations of tRNA gene clusters are defined as follows: among the letters representing the tRNA genes in a tRNA gene cluster, the letter occurring earliest in the alphabet is selected and the orientation of its corresponding tRNA gene is taken as the orientation of the tRNA gene cluster. The orientations of partial tRNA gene clusters (dashed line border) are set to be the same as the orientations of their corresponding full clusters. 5S rRNA genes are shown as red arrowheads. A cluster of tandemly arranged 5S rRNA genes is shown as an arrowhead in brackets, with the number of 5S rRNA genes in the cluster shown below. Features are not drawn to scale. (c) Diagrams depicting the arrangements of cnt repeats in the cnt regions of S. cryophilus, S. osmophilus, and S. octosporus. Orthologous cnt regions are grouped together as in (b). The orientations of the cnt regions of S. osmophilus are as in (a). The orientations of the cnt regions of the other 2 species are set in a way so as to best show the similarity of the arrangements of cnt repeats. cnt repeats are represented by arrows. The colors of the arrows denote species and the fill patterns of the arrows denote repeat types. The locations and orientations of the cCNT-S and cCNT-L repeats of S. cryophilus and the oCNT-S and oCNT-L repeats of S. octosporus are as defined previously (Tong et al. 2019). oCNT-S is split into oCNT-S1 and oCNT-S2. oCNT-N is a type of repeat identified in this study. Orientations of oCNT-N and the 4 types of cnt repeats of S. osmophilus are set in a way so as to best show the similarity of the arrangements of cnt repeats. The cnt regions and cnt repeats are shown to scale in these diagrams. The distribution of the MEME-identified 11-bp motif (arrowhead, not drawn to scale) is also shown in the diagrams. (d ) Sequence logo of the MEME-identified 11-bp motif. (e) The density of the 11-bp motif (number of motif occurrences per kb) within the cnt regions and across the entire nuclear genome of each of 4 fission yeast species. 2011), GATK (v4.2.0.0), and deepvariant (v1.1.0, https://github.com/ google/deepvariant) (Poplin et al. 2018). Homozygous variants that were called by at least 2 variant callers and passed a read depth cutoff (DP > 10) were retained. Variant calling based on the mapping result of HiFi reads was performed using deepvariant (v1.1.0, -model_type PACBIO) and homozygous variants that passed both a read depth cutoff (DP > 10) and a genotype quality cutoff (GQ > 30) were retained. Benchmarking Universal Single-Copy Orthologs (BUSCO, v3.0.2, https://busco.ezlab.org/) was used to assess the assembly completeness based on the presence/absence of 1,315 predefined single-copy orthologs of Ascomycota (ascomycota_odb9 gene set) (Simão et al. 2015).
Obtaining high-confidence transcript sequences S. osmophilus cells were grown in the modified YES medium at 25°C and cells in the logarithmic phase of vegetative growth were collected for total RNA extraction. Cells were washed once in DEPC (diethyl phosphorocyanidate) water at 4°C and subsequently resuspended with TES buffer (10 mM Tris pH 7.5, 10 mM EDTA pH 8.0, 0.5% SDS). One volume of acidic phenol-chloroform (1:1) was added and incubated at 65°C for 1 hour. Then the mixture was centrifuged at 4°C, and the aqueous phase was collected and treated with phenol-chloroform (1:1) and chloroform:isoamyl alcohol (24:1) sequentially. 1/10 volume of 3 M NaAc (pH 5.2) and 2 volumes of isopropanol were added to the aqueous phase, mixed thoroughly by inverting, and stored at −20°C overnight before centrifugation at 4°C. After centrifuging, the supernatant was removed, and the RNA pellet was washed with 70% ethanol twice. The RNA pellet was dissolved in DEPC water after air-drying and stored at −80°C.
For long-read cDNA sequencing using the Oxford Nanopore Technologies (ONT) platform, the total RNA obtained above was sent to Biomarker Technologies (Qingdao, China) for sequencing library preparation and ONT cDNA sequencing. A total of 2.15 Gb of raw read data was obtained. Reads were processed using pychopper (https://github.com/epi2me-labs/pychopper) and 1.65 Gb of full-length read data was obtained. The Full-Length Alternative Isoform analysis of RNA (FLAIR, v1.5, https://github.com/BrooksLabUCSC/flair) pipeline was used for downstream analysis (Tang et al. 2020). FLAIR is a pipeline designed to perform reads mapping, reads correcting, and isoform clustering for noisy long reads generated by ONT cDNA sequencing. It can also be run optionally with short-read RNA sequencing data to help increase the accuracy of splicing site identification. Briefly, the long reads were mapped to the S. osmophilus genome using "flair.py align" submodule with default parameters. The splicing junction information generated by short-read RNA sequencing was extracted using a FLAIR script "junctions_from_sam.py" from the reads mapping SAM file and then submitted to "flair.py correct" submodule. Finally, high-confidence transcript sequences were obtained by running a FLAIR submodule "flair.py collapse." The ONT long-read cDNA sequencing data has been deposited at NCBI SRA under the accession number SRR21149831.

Annotation of protein-coding genes
Protein-coding genes in the S. osmophilus nuclear genome were annotated using the MAKER pipeline (v3.01.04, https://www.yandelllab.org/software/maker.html) (Campbell et al. 2014). The MAKER pipeline utilizes the protein evidence, the EST evidence, and ab initio gene predictions for gene annotation. For the protein evidence, protein sequences encoded in the reference genomes of S. pombe, S. octosporus, S. cryophilus, and S. japonicus were downloaded from Ensembl Fungi (https://fungi.ensembl.org/) and combined into a single FASTA file. For the EST evidence, high-confidence transcript sequences obtained as described above were used. Two ab initio gene predictors were used: SNAP (v2006-07-28, https://github.com/ KorfLab/SNAP) (Korf 2004) and AUGUSTUS (v3.2.3, https://github. com/Gaius-Augustus/Augustus) (Stanke et al. 2008), both of which needed to be trained. Because the training of SNAP and AUGUSTUS requires pre-existing gene models as training data, the first round of MAKER annotation was carried out and a genome annotation was generated using only the protein evidence and the EST evidence (the protein2genome, est2genome, and correct_est_fusion options in the maker_opts.ctl control file were set to 1). The resulting gene models were utilized for the training of SNAP and AUGUSTUS. After SNAP and AUGUSTUS were trained, their predictions were used in the second round of MAKER annotation. In this round, the protein2genome and est2genome options were set to 0. The resulting gene models were used to train SNAP and AUGUSTUS again. The third round of MAKER annotation was conducted with the same settings as the second round and the resulting gene models were used to retrain SNAP and AUGUSTUS. Finally, the last round of MAKER annotation was conducted with the protein2genome, est2genome, and correct_est_fusion options set to 1. All resulting gene models were retained (the "keep_preds" option was set to 1) and recorded in a GFF3 format file for downstream analysis.
To further improve the quality of gene models, EVidenceModeler (EVM, v1.1.1, https://evidencemodeler.github. io/) was employed to generate weighted consensus gene predictions (Haas et al. 2008). We adopted the weight values used in a published protocol for annotating the genomes of S. cerevisiae isolates (https://github.com/yjx1217/LRSDAY) (Yue and Liti 2018). Finally, all resulting EVM gene models were loaded into the Integrative Genomics Viewer (IGV, v2.9.2, https://software. broadinstitute.org/software/igv/) (Robinson et al. 2011) together with protein evidence alignments extracted from the MAKER output and ONT cDNA-seq read alignments for visual inspection. Erroneously fused genes were manually corrected and unannotated genes suggested by orthogroup analysis (see below) were manually annotated. Systematic names containing a prefix of SOMG (for Schizosaccharomyces osmophilus genes) followed by a 5-digit number were assigned to annotated genes. The annotation and naming of wtf genes are described in a later section. The annotations were recorded in a GFF3 format file.

Functional annotation of protein-coding genes
Functional annotation of protein-coding genes in the S. osmophilus nuclear genome was conducted with the aid of the orthogroup information. For an S. osmophilus gene, if it has an S. pombe ortholog and the ortholog is not annotated as a "hypothetical protein," the gene name and the product description of the S. pombe ortholog were assigned to it. If it does not have an S. pombe ortholog, we examined whether it has an ortholog in another fission yeast species (in the order of S. japonicus, S. octosporus, and S. cryophilus), and if an ortholog is found and the ortholog is not annotated as a "hypothetical protein," the product description of the ortholog is assigned to it. All remaining genes not functionally annotated were submitted to eggNOG-mapper (online version, http://eggnog-mapper.embl.de/) (Cantalapiedra et al. 2021) for ortholog searching and functional annotation. Genes without reliable hits in eggNOG-mapper were annotated as "hypothetical protein." Based on the discrepancy reports generated during the submission of the genome to NCBI, we manually adjusted the annotations of some genes to conform with the requirement of NCBI.
The RelTime method in MEGA11 was employed to estimate the divergence time between S. osmophilus and S. octosporus using the branch lengths of the concatenation-based ML tree as input (Tamura et al. 2021). S. complicata was used as outgroup in the analysis. We used 3 calibration nodes based on a recently published time-calibrated phylogeny of Ascomycota: the S. japonicus-S. pombe split (207.2 million years ago), the S. pombe-S. octosporus split (108.2 million years ago), and the S. octosporus-S. cryophilus split (29.4 million years ago) . To assess the robustness of the analysis, we also carried out the analysis using only 2 calibration nodes (the S. japonicus-S. pombe split and the S. pombe-S. octosporus split) and only one calibration node (the S. japonicus-S. pombe split). Similar divergence times between S. octosporus and S. osmophilus were obtained (using 3 calibration nodes: 15.7 million years; using 2 calibration nodes: 15.5 million years; using 1 calibration node: 14.3 million years).

Synteny analysis of nuclear genomes
To visualize the syntenic relationship of the nuclear genomes of 2 species, single-copy orthologous gene pairs were extracted from the proteinortho output. Two genes of the same orthologous gene pair were linked using a colored line in Circos (v0.69, http://circos.ca/) (Krzywinski et al. 2009  (c) Maximum likelihood tree constructed using the amino acid sequences of the RT-IN region, which corresponds to amino acids 374-1256 of the Tf1 ORF. The tree was rooted using "Plants cluster" retrotransposons as outgroup. Branch labels are the SH-aLRT support value (%) and the UFBoot support value (%) calculated by IQ-TREE. The names of the species are colored as follows: blue for animal species, green for plant species, and purple for fungal species. Three major branches in the tree are highlighted by shades. The column with the header "CCHC" indicates whether a CCHC motif is present (+) or absent (−) in the Gag ORF of a retrotransposon. The column with the header "Gag-Pol" indicates which strategy is used to express the Gag-Pol fusion protein: −1 frameshift (−1 FS), in-frame stop codon (Stop), or a single ORF (1-ORF). We note that grasshopper was thought to use +1 frameshift (Gao et al. 2003), but our analysis showed that it uses −1 frameshift (see Materials and methods).

Centromere-related analyses
The centromeric regions in the S. osmophilus genome were identified based on the synteny of centromere-flanking protein-coding genes (Supplementary Table 13) (Ács-Szabó et al. 2018). Centromeric repeats were identified by all-to-all comparison of centromeric sequences using BLASTN and YASS (Yet Another Similarity Searcher) (Noé and Kucherov 2005) and manually annotated. The central core (cnt) region in each centromere was identified as the central region flanked by a pair of near-perfect inverted repeat (IR) sequences because this is a conserved characteristics of the cnt regions in S. pombe, S. octosporus, and S. cryophilus (Pidoux and Allshire 2004;Tong et al. 2019). The annotation of retrotransposons and LTRs was described in a section below. Plots depicting centromeric sequence features were generated using gggenomes (https://github.com/thackl/ gggenomes) (Hackl et al. 2021) and manually adjusted in Adobe Illustrator.
To identify conserved sequence motifs enriched in the cnt regions, de novo motif discovery was performed using the MEME (Multiple Em for Motif Elicitation) tool in the MEME Suite (version 5.4.1) with the cnt sequences from S. osmophilus, S. octosporus, and S. cryophilus as input (Bailey et al. 2009). The "Any Number of Repetitions (anr)" option was selected. We ran MEME multiple times, with the "maximum width (maxw)" parameter set at a different value each time. Based on the recommendation given by the authors of the MEME Suite (Bailey et al. 2009), we chose to set the maxw parameter at all integers from 10 to 20, and at 25, 30, 35, 40, 45, and 50. After manual inspection of all MEME output, a 11-bp motif was selected as the conserved cnt motif because it can be found in all 9 cnt sequences from the 3 species and has strong p-values (p < 1E−7) in multiple MEME runs. To determine the genomic distribution of this motif, we used the Find Individual Motif Occurrences (FIMO) tool of the MEME Suite to scan the genomes of S. osmophilus, S. octosporus, S. cryophilus, and S. pombe. The FIMO hits were filtered using the p-value cutoff of 3E−5 because all occurrences of the motif in the cnt sequences as reported by MEME can pass this cutoff.

Telomere-related analyses
The analysis of telomeric repeat units present in the HiFi reads was performed using the software TweenMotif (https:// download.cnet.com/TweenMotif/3000-2054_4-75325319.html) (Box et al. 2008). Incomplete repeat units at the end of the HiFi reads were filtered out.

Transposon-related analyses
Automated whole-genome de-novo transposable element (TE) annotation was performed using the Extensive de novo TE Annotator (EDTA) version 2.0.0 (Ou et al. 2019). Candidate transposons identified by EDTA were filtered by removing the ones overlapping with coding genes unrelated to transposons and the ones lacking any transposon-related protein domains that can be identified by TEsorter version 1.3 . No intact DNA transposons, nonlong terminal repeat (LTR) retrotransposons, or Ty1/ Copia superfamily LTR retrotransposons were found. The only intact transposons identified were 4 Ty3/Gypsy superfamily retrotransposons designated as Tosmo1-1, Tosmo1-2, Tosmo2-1, and Tosmo3-1 (Tosmo1-1 and Tosmo1-2 are identical in sequence) (Supplementary Table 18). The LTR sequences and internal sequences (INTs) of these intact retrotransposons were used as curated sequence input to perform another round of EDTA annotation. EDTA-annotated LTRs and INTs were filtered by removing the ones overlapping with coding genes unrelated to transposons and verified by TEsorter analysis and manual inspection of sequence alignments. For LTRs judged to be incomplete based on sequence alignment, we added flanking sequences, performed sequence alignment against, and manually evaluated where their borders can be extended. If an INT and a neighboring LTR were of the same orientation and the interval between them was no longer than 5 bp, they were merged into a retrotransposon. Retrotransposon sequences generated by the merging operation were verified by manual inspection of sequence alignments. The start and end coordinates of LTRs and retrotransposons are listed in Supplementary Table 18. In that table, LTRs are classified into 4 types (A, B, C, and D) based on a phylogenetic analysis ( Supplementary Fig. 9a). For brevity, we used the midpoint coordinates of LTRs and retrotransposons to denote their locations in Supplementary Fig. 9 (the same location information is also provided in Supplementary Table 18). The sequences of Tosmo1-1, Tosmo2-1, and Tosmo3-1 (including 50-bp upstream and 50-bp downstream flanking sequences) have been deposited at GenBank under accession numbers OP263985-OP263987.
The only copy of full-length Tcry1 retrotransposon in S. cryophilus, Tcry1-1, has an annotated length of 5,055 bp (Rhind et al. 2011;Tong et al. 2019). Rhind et al. 2011 annotated 2 identical LTRs of 374 bp for Tcry1-1. Close inspection indicated that the 40-bp sequence immediately downstream of the annotated Tcry1-1 sequence should be part of the 3′ LTR, and the lengths of the 5′ LTR and the 3′ LTR should be 406 and 414 bp respectively. The length difference between the 2 LTRs is due to a 8-bp sequence (ATTTTCCC, at positions 393-400 of the LTR) being tandemly duplicated in the 3′ LTR. Apart from this duplication, the 2 LTRs are identical in sequence. The currently annotated version of Tcry1-1 does not have a target site duplication (TSD), whereas the revised version of Tcry1-1 has a perfect TSD of 5 bp (TTTAA). In the currently annotated version of Tcry1-1, the IRs at the 2 ends of the LTR are 3-bp long (5′-TGT…ACA-3′), whereas in the revised version of Tcry1-1, the IRs are 5-bp long (5′-TGTCA…TGACA-3′) and are identical to the IRs in Tosmo1, Tosmo2, Tosmo3, and Tj1. The revised LTR annotation is also supported by the presence of a closely related solo LTR (NW_013185624.1 coordinates 2876278-2875872, minus strand) in the S. cryophilus genome. This solo LTR is 407 bp long and 94% identical to the 406-bp 5′ LTR of the revised version of Tcry1-1 and possesses a perfect TSD of 5 bp (TGAAT). The revised annotation of Tcry1-1 (including 50-bp upstream and 50-bp downstream flanking sequences) (ACQJ02000037.1 coordinates 13333-18527, minus strand) has been deposited at the third party annotation (TPA) section of GenBank under the accession number BK061829.
Retrotransposon sequences used for phylogenetic analysis were sequences of accession numbers given in the legend of Fig. 4. The accession numbers of most of the nonfission-yeast retrotransposons were obtained from GyDB (https://gydb.org/) (Llorens et al. 2011). The exceptions are PpatensLTR2, Tcn1, and grasshopper. The accession number of PpatensLTR2 (GQ294565) was from (Novikova et al. 2010). The same publication provided an accession number for Tcn1 (XM_571377). XM_571377 is the mRNA of the predicted gene CNF03140 in the genome of Cryptococcus neoformans var. neoformans strain JEC21 (Loftus et al. 2005). The CNF03140 gene encompasses the coding sequences of Gag and Pol of Tcn1 but contains a wrongly annotated intron, presumably due to the inability of the gene prediction procedure to accommodate the programed frameshift between the coding sequences of Gag and Pol. We manually annotated the Tcn1 sequence at the CNF03140 locus. The annotation of this copy of Tcn1 (including 50-bp upstream and 50-bp downstream flanking sequences) (AE017346.1 coordinates 926449-934423, minus strand) has been deposited at the TPA section of GenBank under the accession number BK061830.
The grasshopper retrotransposon was first discovered in strains of the phytopathogenic fungus Magnaporthe grisea that infect members of the plant genus Eleusine (Dobinson et al. 1993). Based on current taxonomy, these Eleusine-infecting strains should belong to the fungal species Pyricularia oryzae ( 5,233-bp sequence of grasshopper at GenBank (M77661.1), which contains the 5′ LTR and the coding sequences of Gag and Pol but lacks the 3′ LTR. The length of grasshopper was reported to be approximately 8 kb (Dobinson et al. 1993). Thus, M77661.1 is missing more than 2 kb of the sequence of grasshopper. We performed BLASTN search using M77661.1 as query and identified 72 highscoring hits (query length coverage = 100% and identity > 99.5%) in the genome of the Eleusine-infecting Pyricularia oryzae strain MZ5-1-6 (GenBank assembly accession GCA_004346965.1) (Gómez Luciano et al. 2019). Compared to M77661.1, these BLASTN hits all have an extra nucleotide after the 33rd nucleotide downstream of the stop codon of the Gag open reading frame (ORF). As a result, unlike the situation of M77661.1, where the Pol ORF is in the +1 frame relative to the Gag ORF, in these BLASTN hits, the Pol ORF is in the −1 frame relative to the Gag ORF. It is likely that the missing nucleotide in M77661.1 is due to a sequence error. Manual inspection of the aligned sequences of the 6 top BLASTN hits, which are identical to each other and differ from M77661.1 by 5 substitutions and 1 indel, and their flanking sequences showed that these 6 BLASTN hits are each part of a fulllength and intact copy of grasshopper with identical LTRs at 5′ and 3′ ends. The 6 copies have nearly identical sequences (lengths ranging from 7,614 to 7,629 bp and all pairwise identities ≥99.8%). Five of the 6 copies have perfect 5-bp TSDs. We deposited the annotation of a representative copy (7,615 bp long, TSD = TAAAT) with 50-bp upstream and 50-bp downstream flanking sequences (CP034205.1 coordinates 4282303-4290017, minus strand) at the TPA section of GenBank under the accession number BK061831.
Alignment of the amino acid sequences of the reverse transcriptase (RT) and the integrase (IN) was performed using MAFFT v7.149b with the E-INS-i algorithm (Katoh and Standley 2013). Aligned sequences were examined manually using Jalview version 2.11.2.3 (Waterhouse et al. 2009). Maximum likelihood trees were calculated using IQ-TREE version 2.0.3 (Minh et al. 2020). The best-fitted substitution model was selected using ModelFinder (Kalyaanamoorthy et al. 2017) as implemented in IQ-TREE. Ten independent IQ-TREE runs were performed and the tree with the best log-likelihood value was chosen. 1,000 ultrafast bootstraps (UFBoot) were performed using the "-bb" command and 1,000 Shimodaira-Hasegawa approximate likelihood ratio test (SH-aLRT) replicates were performed using the "-alrt" command (Guindon et al. 2010;Minh et al. 2013). Rooting and visualization of trees were performed using FigTree version 1.4.3 (http://tree.bio.ed.ac.uk/software/figtree/).
Transcriptional start sites in the 5′ LTRs of Tosmo1-1, Tosmo2-1, and Tosmo3-1, and Tcry1-1 were predicted based on multiple sequence alignment that includes the LTRs of Tf1 and Tf2. The "pretzel" structures were identified by visual inspection of the RNA secondary structures predicted using the MXfold2 web server (http://www.dna.bio.keio.ac.jp/mxfold2/) (Sato et al. 2021). The alignment showing the conserved sequence context of the in-frame stop codons and the alignment showing the conservation of sequences upstream of PPTs were visualized using ggmsa (Zhou et al. 2022) and manually adjusted in Adobe Illustrator.
The sliding-window analysis was conducted using SimPlot++ v1.3 (Samson et al. 2022). The following parameters were used: Distance model = identity, Window length = 40, Step = 5, Strip gap = No Strip Gap, and Plot refresh rate = every window. For the comparison between Tosmo1, Tosmo2, and Tosmo3, an MAFFT-generated alignment of their sequences was given as input to SimPlot++ and 3 pairwise comparison plots were created. For the comparison between Tf1 and Tf2, an MAFFT-generated alignment of their sequences was given as input to SimPlot++ and a pairwise comparison plot was created. The plots generated by SimPlot++ were manually adjusted in Adobe Illustrator.
wtf genes-related analyses wtf genes in another S. osmophilus strain CBS 15792 have been identified and annotated in our recently published study (De Carvalho et al. 2022). We calculated the average nucleotide identity (ANI) between the genomes of CBS 15792 and the type strain CBS 15793 T using OrthoANIu (Yoon et al. 2017). As a comparison we calculated ANIs between genomes of 6 representative purelineage S. pombe strains (JB22, JB760, JB869, JB758, JB837, and JB864) (https://db.cngb.org/search/project/CNP0001878/) (Tusso et al. 2022). OrthoANIu analysis results are shown in Supplementary Table 19. The synteny between the nuclear genome assemblies of CBS 15792 and CBS 15793 T was analyzed using nucmer in mummer-4.0.0 (Marçais et al. 2018). The smallest of the 11 nuclear contigs in the CBS 15792 genome assembly, tig00007777_pilon_x4, which has a length of 12,408 bp and does not contain any wtf genes, was not aligned to the CBS 15793 T genome by nucmer. wtf genes in the CBS 15793 T genome that are syntenic to wtf genes in the CBS 15792 genome were identified as follows: for each wtf gene in the CBS 15792 genome, its flanking nonrepetitive protein-coding genes were extracted, and their counterparts in the CBS 15793 T genome were identified by BLASTP (Basic Local Alignment Search Tool for Proteins) analysis of the amino acid sequences of gene products. The thus defined genomic regions in the CBS 15793 T genome were inspected for the presence of wtf genes. Annotated protein-coding genes falling into these regions and sharing sequence similarities with wtf genes in CBS 15792 were classified as wtf genes. For regions lacking any annotated protein-coding genes resembling wtf genes, BLASTN analysis was performed to identify wtf pseudogenes. wtf pseudogenes not annotated by the MAKER pipeline were given systematic names but were not included in the protein-coding genes listed in Supplementary Table 4. The coordinates of wtf genes (including wtf pseudogenes) in CBS 15793 T are listed in Supplementary Table 20. Pairwise nucleotide identities were calculated using Sequence Demarcation Tool Version 1.2 (SDTv1.2) (Muhire et al. 2014) and the default Needleman-Wunsch algorithm (as implemented in MUSCLE) was used.

Mating-type region-related analyses
Sequences of mating-type regions of S. pombe, S. octosporus, S. cryophilus, and S. osmophilus (provided as GenBank format files in Supplementary Files 1-4) were obtained as follows: the S. pombe sequence was derived from the reference genome sequence by replacing the chromosome 2 region spanning coordinates 2129208-2137121 with the sequence of the "mating type region" contig (GenBank accession FP565355) and by replacing the M cassette sequence at the mat1 locus with the P cassette sequence; the S. octosporus sequence was from the Tong et al. genome assembly (http://bifx-core.bio.ed.ac.uk/∼ptong/genome_assembly/) (Tong et al. 2019); the S. cryophilus sequence was derived from the Tong et al. genome assembly by replacing the M cassette sequence at the mat1 locus with the P cassette sequence and by changing a nucleotide in the Pc-coding sequence that causes a premature stop codon to the nucleotide in the reference S. cryophilus genome; the S. osmophilus sequence was from the genome assembly generated in this study. For analyses of the mating-type loci of S. japonicus, we used sequences of PCR-cloned mat1 locus and donor region (GenBank accessions JQ735907 and JQ735908) (Yu et al. 2013). We note that the coding sequence of S. japonicus Pi in these sequences differs from the coding sequence of Pi in the reference S. japonicus genome. A 1-bp deletion in the reference S. japonicus genome causes a frameshift in the coding sequence of Pi.
The synteny plot was generated using clinker and clustermap.js (Gilchrist and Chooi 2021). Pairwise amino acid identities were calculated using Sequence Demarcation Tool Version 1.2 (SDTv1.2) (Muhire et al. 2014) and the default Needleman-Wunsch algorithm (as implemented in MUSCLE) was used. Amino acid sequence alignments of cassette-encoded proteins and nucleotide sequence alignment of the cassettes and the SAS region were generated using MAFFT (Katoh and Standley 2013) and visualized using Jalview (Waterhouse et al. 2009).
Annotations of H1, H2, H3, and IRs in the S. pombe sequence are according to GenBank accession FP565355. Annotations of H1, H2, and H3 in the S. japonicus sequences are according to GenBank accession JQ735908. H1, H2, H3, and IRs in S. octosporus, S. cryophilus, and S. osmophilus sequences were annotated by manually inspecting the self-to-self sequence comparison results generated by YASS (Noé and Kucherov 2005). The structures of the Pi-Mi complexes were predicted using AlphaFold-Multimer with default parameters (Evans et al. 2022). The predicted structures were visualized using the Mol* Viewer (https://molstar.org/) (Sehnal et al. 2021). Among the structures predicted, we chose the one with the highest confidence score. The interface area of each complex was calculated using the PDBePisa web server (https://www. ebi.ac.uk/pdbe/pisa/) (Krissinel and Henrick 2007).

Cbp1 family-related analyses
TBLASTN and BLASTP searches using 21 fission yeast Cbp1 family proteins as queries were performed using the BLAST (Basic Local Alignment Search Tool) web server at NCBI (https://blast.ncbi. nlm.nih.gov/Blast.cgi). Terminal IRs (TIRs), a common characteristic of DNA transposons, were searched by manually inspecting the self-to-self sequence comparison results generated by YASS (Noé and Kucherov 2005). Sequence alignment and maximum likelihood tree construction were performed as described above. Pairwise amino acid identities were calculated using Sequence Demarcation Tool Version 1.2 (SDTv1.2) (Muhire et al. 2014) and the default Needleman-Wunsch algorithm (as implemented in MUSCLE) was used. Local synteny plots were generated using clinker and clustermap.js (Gilchrist and Chooi 2021).

Mitogenome-related analyses
Protein-coding genes in the mitogenome were annotated using the MFannot mitogenome annotator based on genetic code 4 (https://megasun.bch.umontreal.ca/apps/mfannot/) (Prince et al. 2022). Mitochondrial introns were classified using RNAweasel (https://megasun.bch.umontreal.ca/apps/rnaweasel/) (Prince et al. 2022). Mitochondrial tRNA genes were annotated based on the predictions of tRNAscan-SE using the sequence source option "Other mitochondrial" (http://trna.ucsc.edu/tRNAscan-SE/) (Lowe and Chan 2016) and the predictions of RNAweasel. Initiator formylmethionyl-tRNA gene and elongator methionyl-tRNA gene were distinguished based on similarity to tRNA genes in annotated mitogenomes of other fission yeast species. rRNA genes were annotated based on cross-species conservation revealed by BLASTN analysis. The rnpB (RNase P RNA) gene was annotated based on the result of RNAweasel analysis and manually adjusted based on homology to S. octosporus rnpB. Gene boundaries and exon-intron junctions were manually examined and adjusted when necessary. The sequence and annotation of the S. osmophilus mitogenome have been deposited at GenBank under the accession number OP310968.
The circular graphic map of the mitogenome was generated as follows: The coordinates of genes and double-hairpin elements (DHEs) were transformed into a tab-delimited text file. The GC content of sliding windows with window size = 35 bp and step size = 5 bp was calculated using bedtools nuc v2.30.0 (https:// bedtools.readthedocs.io/en/latest/) (Quinlan and Hall 2010) and the calculation result was transformed into a tab-delimited text file. These text files were given as input to Circos v0.69 (Krzywinski et al. 2009). The backgrounds representing the GC content ranges of 0-70% and 70-100% were defined in the "plot.conf" file of Circos. The resulting Circos plot was manually modified in Adobe Illustrator. Phylogenetic analysis was performed in the same manner as described earlier for the nuclear genome. Codon usage was analyzed using Geneious Prime (Dotmatics).
Plots showing the synteny between the mitogenomes of fission yeast species were generated using the graphics module of the JCVI utility libraries (https://github.com/tanghaibao/jcvi/wiki/ MCscan-%28Python-version%29) (Tang et al. 2015). As input for JCVI, we manually constructed a BED (Browser Extensible Data) format file containing the coordinates of the genes and a "blocks" file containing the matching relationships between genes in different species. The plots were merged and adjusted using Adobe Illustrator. Local synteny plots for the mitogenome region containing trnI(cau) and the nuclear genome region containing til1 were generated using clinker and clustermap.js (Gilchrist and Chooi 2021) and manually adjusted using Adobe Illustrator. Pairwise amino acid identities were calculated using Sequence Demarcation Tool Version 1.2 (SDTv1.2) (Muhire et al. 2014) and the default Needleman-Wunsch algorithm (as implemented in MUSCLE) was used. The sliding-window analysis was conducted using SimPlot++ v1.3 (Samson et al. 2022). The following parameters were used: Distance model = identity, Window length = 40, Step = 5, Strip gap = No Strip Gap, and Plot refresh rate = every window.
Double-hairpin elements (DHEs) were identified using a GC scanning approach based on the GC-rich characteristics of S. octosporus DHEs. GC-rich regions were found using bedtools nuc (window size = 35 bp, step size = 1 bp, GC content > 75%) and then inspected visually for the potential to form double-hairpin structures. The sequence alignment of DHEs and the accompanying arc diagram were generated using ggmsa (Zhou et al. 2022) and manually adjusted using Adobe Illustrator.

Results and discussion
Assembly of the genome of the S. osmophilus type strain CBS 15793 T To assemble the genome of the S. osmophilus type strain CBS 15793 T , we generated 430.07 Mb (∼35× coverage assuming a 12 Mb genome) of PacBio HiFi read data using the PacBio Sequel II platform, and 4.86 Gb (∼405× coverage assuming a 12 Mb genome) of Illumina paired-end read data using the Illumina NovaSeq 6000 platform (Supplementary Table 1). After contig assembly, polishing, and filtering ( Supplementary Fig. 1a), we obtained a total of 6 contigs ( Fig. 1d; Supplementary Table 2). One of the six contigs is the mitochondrial genome, which will be described in detail in a later section. Among the other contigs are 3 megabase-sized contigs (5.57 Mb, 3.50 Mb, and 2.35 Mb respectively). The total length of these 3 contigs (11.42 Mb) is similar to the total lengths of nuclear genome contigs in the reference genomes of the other 4 fission yeast species (12.57 Mb,11.27 Mb,11.52 Mb,and 11.13 Mb for S. pombe, S. octosporus, S. cryophilus, and S. japonicus, respectively) (Wood et al. 2002;Rhind et al. 2011). The 6 ends of these 3 contigs are either telomeric repeat sequences (described in detail in a later section) or sequences of 18S-5.8S-25S rRNA genes (rDNA) (Fig. 1e), which are the types of sequences found at or next to chromosome ends in other fission yeast species. Within each contig, 1 centromere was identified based on synteny of centromere-flanking genes (Fig. 1e). These observations suggest that like other fission yeast species, S. osmophilus has 3 chromosomes and these 3 contigs correspond to the 3 chromosomes. By convention, we numbered the 3 chromosomes from the longest to the shortest and designated the short arm of each chromosome as the left arm.
The rDNA repeat arrays are not fully assembled in the reference genomes of S. pombe, S. octosporus, S. cryophilus, or S. japonicus. Cutting-edge sequencing technologies available today still do not allow the full assembly of long arrays of rDNA repeats. In the recently published telomere-to-telomere human genome assembly, the 3 longest rDNA arrays are not fully assembled and have to be modeled instead (Nurk et al. 2022). In our S. osmophilus genome assembly, there are approximately 2 copies, 5 copies, and 1 copy of rDNA repeat unit at the left end of chromosome 2 contig, the left end of chromosome 3 contig, and the right end of chromosome 3 contig, respectively (Fig. 1e), suggesting that like the situation in S. octosporus and S. cryophilus (Tong et al. 2019), S. osmophilus has 3 rDNA arrays. None of the rDNA arrays is fully assembled as no telomeric repeats are present at their outer ends. Based on Illumina read depth coverage, we estimated that there are a total of about 185 copies of rDNA repeat units in the S. osmophilus nuclear genome.
The remaining 2 contigs in our S. osmophilus assembly are less than 100 kb long. They both contain a few copies of rDNA repeat units at one end and telomeric repeats at the other end. They share a nearly identical (99.98% identity) 15.3-kb sequence between rDNA and telomeric repeats but differ in where the rDNA/non-rDNA junction falls within an rDNA repeat unit ( Supplementary Fig. 1b). We named them rDNA-distal contig 1 and 2 (Fig. 1d). The Illumina read depth coverage of the 2 types of rDNA/non-rDNA junctions showed an approximately 2:1 ratio ( Supplementary Fig. 1c), suggesting that rDNA-distal contig 1 may correspond to sequences located outside of the distal ends of 2 rDNA arrays and rDNA-distal contig 2 may correspond to the sequence distal to the other rDNA array.
The quality of our S. osmophilus genome assembly was assessed in multiple ways. First, we found that 99.45% of PacBio HiFi reads and 99.62% of cleaned Illumina reads could be mapped onto the genome assembly, indicative of its completeness. Second, mapping results using both PacBio HiFi reads and Illumina reads showed even depth coverage across chromosomes, with the only exception being the rDNA repeats ( Supplementary Fig. 2a), suggesting that there are no copy number errors of repeat regions outside of the rDNA arrays. Third, variant calling using HiFi reads only identified 3 SNPs and 32 indels and variant calling using Illumina reads only identified 1 SNP and 12 indels. The 2 variant sets overlap by 1 SNP and 6 indels and all of them are in rDNA repeats, likely due to sequence variations in the unassembled rDNA repeat units. Thus, the genome assembly is virtually error-free at the base level. Fourth, we analyzed the presence or absence of 1,315 Benchmarking Universal Single Copy Orthologs (BUSCO) (ascomy-cota_odb9 gene set) (Simão et al. 2015) in our S. osmophilus genome assembly and found that 1,222 (92.9%) of them are classified as "complete and single copy" and 29 (2.2%) of them are classified as "complete and duplicated" (Supplementary Table 3). This level of BUSCO completeness is similar to those of published genome assemblies of the other 4 fission yeast species (Supplementary  Table 3 and Supplementary Fig. 2b). Together, these results demonstrate that we have obtained a highly contiguous, complete, and correct genome assembly of the type strain of S. osmophilus.

Gene annotation and orthogroup analysis
To facilitate gene annotation, we obtained transcriptome data of vegetatively growing cells of the S. osmophilus type strain CBS 15793 T using both Illumina short read-based RNA-seq and longread cDNA sequencing on the ONT platform (Supplementary Table 1). High-confidence transcript sequences were generated from these transcriptome data using the FLAIR workflow (Tang et al. 2020), and were provided to the annotation pipeline MAKER as EST evidence (Campbell et al. 2014). By incorporating additional evidence including ab initio prediction and annotated protein sequences of other 4 fission yeast species, protein-coding genes of S. osmophilus were predicted ( Supplementary Fig. 3a). Upon further manual inspection and adjustment, a total of 5,098 proteincoding genes were annotated in the 3 chromosomal contigs (Supplementary Fig. 3c and Supplementary Table 4). We also annotated nuclear genes encoding functional noncoding RNAs including rRNAs, tRNAs, snRNAs, snoRNAs, and 4 other functional RNAs (srp7, mrp1, rrk1, and ter1) ( Supplementary Fig. 3b-c and Supplementary Tables 5-9).
To establish evolutionary relationships between protein-coding genes of different fission yeast species, we inferred orthologous gene groups (orthogroups, OGs) encompassing nuclear-encoded protein-coding genes of 5 fission yeast species using proteinortho, a tool that takes into account both similarity and synteny (Lechner et al. 2011(Lechner et al. , 2014. 4,873 (95.6%) of the 5,098 S. osmophilus genes fall into 4,864 orthogroups that contain genes of at least one other fission yeast species (Supplementary Table 10 and Supplementary Fig. 4a). 4,122 (80.9%) S. osmophilus genes fall into 4,120 orthogroups that contain genes in all 5 fission yeast species (Supplementary Table 10 and Supplementary Fig. 4a). We hypothesized that the inability of S. osmophilus to grow in low-osmolarity media may be due to a gene loss. Based on the orthogroup analysis results, there are 13 genes present in S. octosporus and S. cryophilus but absent in S. osmophilus (Supplementary Table 10 and Supplementary Fig. 4a). However, none of them provides an obvious explanation to the obligate osmophilic nature of S. osmophilus.
We assigned gene product descriptions to S. osmophilus proteincoding genes by transferring annotations from orthologs in other fission yeast species, giving S. pombe higher precedence because it is actively curated (Supplementary Fig. 4b). For genes without proteinortho-detected orthologs and for genes whose orthologs all have the uninformative gene product description "hypothetical protein," we resorted to the tool eggNOG-mapper (Cantalapiedra et al. 2021), which provided informative gene product annotations for 77 genes (Supplementary Fig. 4b). In total, informative gene product descriptions were obtained for 4,868 (95.5%) S. osmophilus protein-coding genes (Supplementary Table 4).

Species phylogeny and evolutionary rate difference
We used the assembled nuclear genome of the S. osmophilus type strain to examine the phylogenetic position of S. osmophilus. First, we performed a maximum likelihood analysis using the amino acid sequences of 1,060 "complete and single-copy" BUSCO genes present in all 5 fission yeast species as well as the outgroup species Saitoella complicata, which also belongs to the Taphrinomycotina subphylum of Ascomycota (Liu et al. 2009) (Supplementary Fig. 5a and Supplementary Table 11). The sequences of these genes were generated by BUSCO analysis (Simão et al. 2015), which only uses genome sequences as input and is independent of genome annotations. The maximum likelihood tree shows that S. osmophilus shares a more recent common ancestor with S. octosporus than with other fission yeast species (Supplementary Fig. 5a).
We used the BUSCO-gene-inferred phylogeny to estimate the divergence time between S. osmophilus and S. octosporus (Fig. 1f). The divergence times of the S. japonicus-S. pombe split (207.2 million years ago), the S. pombe-S. octosporus split (108.2 million years ago), and the S. octosporus-S. cryophilus split (29.4 million years ago) reported in a recently published timetree of Ascomycota were used as time calibration nodes . This analysis showed that S. osmophilus and S. octosporus diverged about 15.7 millions years ago. This divergence time is similar to the divergence time between humans (genus Homo) and orangutans (genus Pongo) (Kumar et al. 2017).
To corroborate the BUSCO-based results, we calculated pairwise amino acid identities for each pair of orthologs from the 4,085 orthogroups with single-copy orthologs (1:1:1:1:1) in all 5 fission yeast species (Supplementary Table 12). Mean pairwise identities were calculated for all species pairs ( Supplementary Fig. 5b). Consistent with the BUSCO-based species tree, the orthologs in S. osmophilus and S. octosporus have the highest mean pairwise identity of 90.52%.
Interestingly, the maximum likelihood tree shows that the evolutionary rate of S. octosporus is 51% higher than that of S. osmophilus since their divergence from each other (0.0622 substitutions per site for S. octosporus vs 0.0412 substitutions per site for S. osmophilus) (Supplementary Fig. 5a). Consistent with this observation, the mean pairwise identity between S. octosporus and S. cryophilus orthologs (85.51%) is lower than that between S. osmophilus and S. cryophilus orthologs (87.06%) (Supplementary Fig. 5b). We plotted histograms of pairwise identities of the 4,085 orthogroups and observed unimodal distributions of identity values, with the distribution for the S. octosporus-S. cryophilus comparison slightly shifted to smaller values compared to that for the S. osmophilus-S. cryophilus comparison (Supplementary Fig. 5c). To directly analyze the differences in identity values for the 2 comparisons, we calculated the differences in pairwise identities for orthologs belonging to the same orthogroup and plotted histograms of the differences ( Supplementary Fig. 5d). The difference values show unimodal distributions not centered at zero, confirming overall differences in identity values. For 74% of the 4,085 orthogroups, the pairwise identity between S. osmophilus and S. cryophilus orthologs is larger than that between S. octosporus and S. cryophilus orthologs (Supplementary Table 12).
It is known that evolutionary rates of species are inversely correlated with their generation times (Wu and Li 1985;Smith and Donoghue 2008;Thomas et al. 2010). It is possible that S. octosporus has a shorter generation time or spends less time in quiescence than S. osmophilus in native habitats and therefore has evolved faster. The second possibility is consistent with the known natural habitats of these 2 species (Brysch-Herzberg et al. 2022): the primary habitat of S. octosporus is honeybee honey, which can be available all year round in a honeycomb, whereas the main habitat of S. osmophilus is bee bread of solitary bees, which only lasts for a maximum of a few weeks before being fully consumed by the larva in a nest chamber.
During evolution, both gene sequence divergence and changes in gene order accumulate over time. To visualize how the gene order in the nuclear genome of S. osmophilus differs from those in other fission yeast species, we generated pairwise synteny plots based on single-copy orthologs (Fig. 1g-j). Consistent with the species phylogeny inferred from sequence divergence, S. osmophilus shares the highest level of synteny with S. octosporus. There are only a few large-scale chromosomal rearrangements between these 2 species. One of these rearrangements is a reciprocal translocation with breakpoints falling into 2 centromeres (described in more detail below).  Table 13). For cen1, cen2, and cen3 of S. osmophilus, the lengths of the regions between the 2 closest flanking protein-coding genes are approximately 56, 57, and 72 kb, respectively (Fig. 2a), together accounting for 1.6% of the total length of the chromosomal contigs. These regions are the most prominent low GC content regions in the nuclear genome and are also the regions with the highest densities of tRNA genes (Fig. 1e).

Centromeres of S. osmophilus and their comparison with centromeres of other fission yeasts
To investigate the organization of S. osmophilus centromeres, we performed all-to-all comparison of the sequences of the 3 centromeric regions and found that like the situation in other fission yeast species (Rhind et al. 2011;Tong et al. 2019), these regions are extensively occupied by repetitive sequences (Fig. 2a; Supplementary Table 14). Similar to the centromeres of S. octosporus, S. cryophilus, and S. pombe (Tong et al. 2019), in each S. osmophilus centromere, a central core (cnt, blue rectangles in Fig. 2a) is flanked by a pair of IRs (thick gray arrows in Fig. 2a). The lengths of cnt1, cnt2, and cnt3 are 10.1 kb, 3.1 kb, and 9.7 kb, respectively, and the lengths of IRs flanking them are 1.5 kb, 11.1 kb, and 21.4 kb, respectively. Sequence similarity exists not only within the same centromere, but also between different centromeres. We identified 20 types of repeats present in more than 1 centromere (colored arrows in Fig. 2a). Four types locate exclusively within cnt regions and are designated CntA-CntD (Fig. 2a and Table 14). We will refer to these repeats outside of cnt regions as peri-core repeats.
There are 13 5S rRNA genes in the centromeres of S. osmophilus (Fig. 2a), accounting for 13% of the 5S rRNA genes in the nuclear genome (Supplementary Table 5). S. octosporus and S. cryophilus have higher number of 5S rRNA genes in their centromeres (25 and 20, respectively), and they each have multiple types of 5S rRNA gene-associated repeats (termed Five-S-Associated Repeats, FSARs), some of which harbor protein-coding sequences (Tong et al. 2019). In S. osmophilus, there is only one type of centromeric repeats associated with 5S rRNA genes: the CenA repeats. CenA repeats are usually flanked by 2 tandemly arranged 5S rRNA genes (Fig. 2a) and likely result from the homogenizing effect of 5S rRNA gene-mediated recombination. CenA repeats do not share sequence similarity with FSARs in S. octosporus and S. cryophilus and do not contain protein-coding sequences.
There are 96 tRNA genes in the centromeres of S. osmophilus (Fig. 2a), accounting for 33% of the tRNA genes in the nuclear genome (Supplementary Table 6). This high level of tRNA gene enrichment in the centromeres is similar to the situations in S. octosporus, S. cryophilus, and S. pombe (Tong et al. 2019). Two types of peri-core repeats are flanked on both sides by tRNA genes (CenB and CenD) and many other types of peri-core repeats are flanked on one side by tRNA genes (Fig. 2a), suggesting that these repeats may have arisen through tRNA gene-mediated recombination. There are 2 degenerated full-length retrotransposons and 4 solo LTRs in the centromeres of S. osmophilus (Fig. 2a) (see a later section for a detailed description of S. osmophilus retrotransposons). In S. pombe, S. octosporus, and S. cryophilus, tRNA genes or LTRs demarcate the transitions between CENP-A Cnp1 chromatin assembled at cnt regions and H3K9me2 heterochromatin assembled at outer centromeric repeats (Tong et al. 2019). We speculate that similar situations may occur in S. osmophilus.
It has been shown that S. octosporus and S. cryophilus, but not S. pombe, share synteny of tRNA genes and 5S rRNA genes in centromeres (Tong et al. 2019). We compared centromeric tRNA genes and 5S rRNA genes of S. osmophilus, S. octosporus, and S. cryophilus (Fig. 2b). The number, types, order, and orientation of centromeric tRNA genes are strongly conserved among these 3 species. The small number of breakdowns of tRNA gene synteny appears to be mainly caused by intra-centromeric rearrangements. For example, tRNA gene order difference between S. osmophilus cen1 and its orthologous centromere in S. cryophilus can be explained by a single inversion event that may have occurred through recombination between 5S rRNA genes. Centromeric 5S rRNA genes show weaker synteny than centromeric tRNA genes and appear to have undergone gain and loss events. No obvious cross-species sequence similarity can be detected in intervening sequences between syntenic tRNA genes and 5S rRNA genes.
S. octosporus and S. pombe, but not S. cryophilus, share perfect synteny of protein-coding genes flanking the 3 centromeres (Tong et al. 2019). Thus, it is thought that the situation in S. octosporus and S. pombe represents the ancestral state (Tong et al. 2019). However, our analysis shows that S. osmophilus and S. cryophilus share identical synteny of centromere-flanking genes (Fig. 2b), suggesting that the situation in these 2 species may represent the state in the common ancestor of S. osmophilus, S. octosporus, and S. cryophilus. Regardless of which situation is the ancestral centromere arrangement of these 3 species, the break of synteny can be attributed to an inter-centromeric translocation, with 1 breakpoint at or near a syntenic tRNA gene cluster (the RIA cluster) situated between cnt3 and emc5 in S. osmophilus and the other breakpoint at or near another RIA cluster situated between cnt2 and chk1 in S. osmophilus (Fig. 2b).
To date, no inter-species sequence conservation has been detected between the cnt regions of different fission yeast species. However, there appear to be conserved arrangements of repeats in cnt regions of S. octosporus and S. cryophilus, as both species contain cnt repeats of different lengths (termed oCNT-L and oCNT-S for the long and short S. octosporus cnt repeats respectively and cCNT-L and cCNT-S for the long and short S. cryophilus cnt repeats respectively) and they are arranged in similar ways in orthologous centromeres (Tong et al. 2019). BLASTN analysis of the sequences of the cnt regions of S. osmophilus, S. octosporus, and S. cryophilus detected no inter-species sequence similarity. However, we identified previously undetected sequence similarity between the 3 cnt regions of S. octosporus. No cnt repeats were previously known to exist in cnt1 of S. octosporus. We found that a repeat of approximately 480 bp, which we named oCNT-N (N for new), is present in cnt1 and cnt2 of S. octosporus ( Fig. 2c; Supplementary Fig. 6). In addition, a part of oCNT-S is also present in cnt1. Therefore, we split oCNT-S into 2 parts: oCNT-S1 and oCNT-S2. oCNT-S1 is present in cnt2 and cnt3, and oCNT-S2 in present in all 3 cnt regions ( Fig. 2c; Supplementary Fig. 6).
There are 4 types of cnt repeats in S. osmophilus: CntA-D ( Fig. 2c; Supplementary Fig. 7 and Supplementary Table 14). They are arranged in a highly similar manner to the 4 types of cnt repeats in S. octosporus: oCNT-N, oCNT-S2, oCNT-S1, and oCNT-L (Fig. 2c). In Fig. 2c, orthologous cnt regions in S. cryophilus, S. osmophilus, and S. octosporus are grouped together to highlight inter-species similarities in cnt repeat distribution and positioning. The conserved cnt repeat arrangements suggest that inter-cnt sequence similarities have been selected for during evolution and can be maintained long after inter-species cnt sequence similarity is no longer detectable.
Even though no cross-species sequence similarities in cnt regions were found by BLASTN analysis, we wondered whether it is possible that short conserved sequence motifs, which cannot be detected by BLASTN, may be enriched in cnt regions. To test this possibility, we applied the motif discovery tool MEME (Multiple Em for Motif Elicitation) (Bailey et al. 2009). A 11-bp motif was found in the cnt regions of S. osmophilus, S. octosporus, and S. cryophilus (Fig. 2d). For each of these 3 species, the density of the occurrences of the 11-bp motif in the cnt regions is about 7 times of the average density across the nuclear genome ( Fig. 2e; Supplementary Table 15). This motif is present in all 9 cnt regions of these 3 species (Fig. 2c), but is absent in the cnt regions of S. pombe ( Fig. 2e; Supplementary Table 15). It is possible that this motif may be of functional relevance.

Telomeres of S. osmophilus and their comparison with telomeres of other fission yeasts
Telomeres are composed of tandem arrays of short repeats synthesized by the reverse transcription enzyme telomerase, which contains a RNA subunit that serves as the template for telomeric repeat synthesis. We investigated telomeres of S. osmophilus and compared them to those of other fission yeast species.
In the S. osmophilus genome assembly, telomeric repeats were found at the 3 rDNA-free ends of the chromosomal contigs (Tel1L, Tel1R, and Tel2R) ( Fig. 3a and b). For the other 3 chromosome ends that contain rDNA repeats and are not fully assembled, telomeres are expected to locate at the far ends of rDNA-distal sequences (Tel2L, Tel3L, and Tel3R) (Fig. 3a) and should correspond to the telomeric repeats in the 2 rDNA-distal contigs (Fig. 3b).
The telomeres in S. octosporus and S. cryophilus are composed of near-perfect arrays of the same type of 9-bp telomeric repeat unit: TTGGGTTAC (Rhind et al. 2011;Kannan et al. 2015) (Supplementary Fig. 8a and b). Manual inspection of the telomeric sequences in the S. osmophilus genome assembly revealed that there are 2 main types of telomeric repeat units, 9-bp TTGGGTTAC and 11-bp ACTTGGGTTAC (Fig. 3b). To exclude the possibility of assembly artifacts, we examined the HiFi reads and confirmed that these 2 types of repeat units indeed co-exist in the same telomeric sequence ( Fig. 3c; Supplementary  Table 16). The 9-bp repeat unit is identical to the telomeric repeat unit in S. octosporus and S. cryophilus. We hereafter refer to the 9-bp repeat unit as the conserved repeat and the 11-bp repeat unit as the nonconserved repeat. We identified a total of 87 HiFi reads containing telomeric repeats (Supplementary Table 16). The lengths of telomeric repeats in these HiFi reads show a normal distribution centered around approximately 200 bp (Fig. 3d). The conserved repeat and the nonconserved repeat account for about 59 and 36% of the repeat units in the HiFi reads, respectively, and other types of repeat units together only account for about 5% of the repeat units ( Fig. 3e; Supplementary Table 17).
To understand why S. osmophilus differs from S. octosporus and S. cryophilus in the telomeric repeat units, we compared the telomerase RNA genes (ter1) of these 3 species and paid special attention to the template regions of telomerase RNAs ( Fig. 3f; Supplementary Fig. 8c and d). The template region in a telomerase RNA usually consists of 2 parts: a "templating sequence" corresponding to a full telomeric repeat unit and serving as the template for reverse transcription, and on its 3′ side an "annealing sequence" that is identical to the 5′ part of the "templating sequence" and base pairs with the 3′ end of a telomeric DNA that has just finished 1 round of reverse transcription (Peska et al. 2020(Peska et al. , 2021Fajkus et al. 2021). The telomerase RNA genes of S. osmophilus, S. octosporus and S. cryophilus are well conserved ( Fig. 3f; Supplementary Fig. 8d). S. octosporus and S. cryophilus have the same 9-nucleotide templating sequence (3′-AACCCAAUG-5′). This sequence is strictly conserved in S. osmophilus (Fig. 3f). However, there is a notable difference in the annealing sequence. S. octosporus and S. cryophilus have the same 3-nucleotide annealing sequence 3′-AUG-5′, which is identical to the 5′ trinucleotide of the templating sequence. In S. osmophilus, this sequence is 3′-GUG-5′. Consequently, the templating sequence of S. osmophilus telomerase RNA is flanked on its 3′ side by 2 tandem copies of the dinucleotide sequence 3′-UG-5′ (Fig. 3f). We propose that either of these two 3′-UG-5′ sequences can base pair with the 3′-most dinucleotide of a telomeric DNA (Fig. 3g). If the templating-sequence-proximal 3′-UG-5′ forms the base pairs, a 9-bp conserved telomeric DNA repeat will be synthesized in the ensuing round of reverse transcription; on the other hand, if the distal 3′-UG-5′ forms the base pairs, the proximal 3′-UG-5′ becomes part of the templating sequence and a 11-bp nonconserved repeat will be synthesized in the next round of reverse transcription. Consistent with this model, in the HiFi reads containing telomeric repeats, except for the few most centromere-proximal repeats, the two types of repeats exhibit a random arrangement (Supplementary Table 16).

Retrotransposons in S. osmophilus and their evolutionary relationships with other retrotransposons
We surveyed the transposon content of the S. osmophilus genome using the Extensive de-novo TE Annotator (EDTA) pipeline (Ou et al. 2019). Similar to the situation in other fission yeast species (Bowen et al. 2003;Rhind et al. 2011;Tusso et al. 2022), the only type of intact transposons found in the S. osmophilus genome is Ty3/Gypsy superfamily LTR retrotransposons (see Materials and methods). We identified a total of 4 full-length and intact retrotransposons ( Fig. 1e; Supplementary Table 18). They each contain 2 identical LTRs at their 5′ and 3′ ends, indicative of recent transposition. Two of these retrotransposons are identical in sequence, and we named them Tosmo1-1 and Tosmo1-2. The other 2 were named Tosmo2-1 and Tosmo3-1. Tosmo1-1, Tosmo1-2, and Tosmo2-1 are flanked by perfect 5-bp TSDs (Supplementary  Table 18). Using the sequences of these 4 retrotransposons to guide a new round of EDTA-based analysis, we were able to identify 15 degenerated retrotransposons and 63 solo LTRs in the S. osmophilus genome (Supplementary Table 18). Phylogenetic analysis classified the LTRs in the S. osmophilus genome into 4 types ( Supplementary Fig. 9a). Tosmo1 and Tosmo3 have type-A LTRs and Tosmo2 has type-B LTRs. Retrotransposons and solo LTRs in the S. osmophilus genome are not strongly enriched in any particular genomic regions ( Fig. 1e; Supplementary Fig. 9b). In S. pombe, retrotransposons and solo-LTRs are underrepresented in convergent intergenic regions due to an insertion preference for intergenic regions containing promoters (Bowen et al. 2003). This is not the case in S. osmophilus. Among the 78 retrotransposons and solo LTRs that can be assigned to intergenic regions (the other 4 are in centromeres), about 24% are in convergent intergenic regions (Supplementary Table 18). However, convergent intergenic regions only account for about 15% of the total length of intergenic regions.
We performed comparative analysis of fission yeast retrotransposons, including the 3 types of S. osmophilus retrotransposons identified here and S. pombe Tf1 and Tf2, S. cryophilus Tcry1, and S. japonicus Tj1 (Fig. 4a). Tf1 and Tf2 are extensively characterized retrotransposons (Levin et al. 1990;Weaver et al. 1993;Esnault and Levin 2015;Maxwell 2020). A survey of retrotransposons in a diverse set of S. pombe isolates found that full-length elements are mostly Tf1 or Tf2 and the rest are hybrid elements resulting from recombination between Tf1 and Tf2 (Tusso et al. 2022). Tcry1 is the only type of full-length retrotransposon known to exist in S. cryophilus. There is a single copy of Tcry1 in the genome of the only available strain of S. cryophilus (Rhind et al. 2011;Tong et al. 2019). We noticed that the downstream boundary of that copy of Tcry1 has been mis-annotated and its length should be 5,095 bp, 40 bp longer than previously reported (see Materials and methods). S. japonicus has a large variety of retrotransposons (Rhind et al. 2011;Chapman et al. 2022). However, the retrotransposons in the S. japonicus reference genome tend to have sequence errors (Rhind et al. 2011). Thus, we included in our analysis only one type of S. japonicus retrotransposon, Tj1, because a reliable sequence of a full-length Tj1 retrotransposon is available (Guo et al. 2015), even though there is no full-length Tj1 in the S. japonicus reference genome (Rhind et al. 2011). No full-length retrotransposons have been found in S. octosporus (Rhind et al. 2011;Tong et al. 2019).
Tf1, Tf2, Tosmo1, Tosmo2, Tosmo3, Tcry1, and Tj1 each encode 4 proteins: the structural protein Gag, the protease (PR), the reverse transcriptase (RT), and the integrase (IN) (Fig. 4a). In Tf1 and Tf2, all proteins are encoded within a single ORF . In contrast, in Tosmo1, Tosmo2, Tosmo3, Tcry1, and Tj1, an in-frame stop codon separates the coding sequence for Gag and the coding sequence for the Pol polyprotein that encompasses PR, RT, and IN (Fig. 4a). Such an in-frame stop codon is a known strategy employed by a number of retroviruses and retrotransposons to express a Gag-Pol fusion protein through programed stop codon readthrough and to ensure that the level of the Gag-Pol fusion protein is lower than that of the Gag protein (Yoshinaka et al. 1985;Matthews et al. 1997;Leng et al. 1998;Gao et al. 2003;Forbes et al. 2007;Guo et al. 2015). The in-frame stop codon in Tj1 has been shown to be essential for its transposition activity (Guo et al. 2015).
Interestingly, a 8-bp sequence, which includes the stop codon (TGA), the preceding codon (GAA for glutamate), and the first 2 nucleotides of the following codon (CTA or CTG for leucine), is strictly conserved, suggesting not only a common ancestry, but also selective constraints preventing divergence. TGA is known to be the most readthrough-permissive stop codon (Firoozan et al. 1991;Jungreis et al. 2011;Cridge et al. 2018). The dinucleotide AA immediately upstream of a stop codon promotes readthrough in Saccharomyces cerevisiae (Tork et al. 2004). Several types of sequences immediately downstream of the TGA stop codon are known to favor readthrough in various eukaryotic organisms, including the nucleotide C (Li and Rice 1993;Bonetti et al. 1995;Jungreis et al. 2011;Cridge et al. 2018), the dinucleotide CT (Stiebler et al. 2014), and the tetranucleotide CTAG (Loughran et al. 2014). Thus, Tosmo1, Tosmo2, Tosmo3, Tcry1, and Tj1 appear to exploit readthrough mechanisms conserved in eukaryotic organisms to regulate the expression of the Gag-Pol fusion protein.
Ty3/Gypsy retrotransposons have been phylogenetically classified into 2 main branches: chromovirus and nonchromovirus (Marín and Lloréns 2000;Llorens et al. 2009Llorens et al. , 2011Neumann et al. 2019). Chromovirus retrotransposons usually contain a chromodomain at the C-terminus of the integrase, whereas nonchromovirus retrotransposons lack the chromodomain. Tf1, Tf2, Tosmo1, Tosmo2, Tosmo3, Tcry1, and Tj1 all contain a chromodomain ( Supplementary Fig. 10a) and are thus chromovirus retrotransposons. Tf1 is the founding member of a sublineage of chromovirus retrotransposons, termed the Tf1/sushi group (Butler et al. 2001;Goodwin and Poulter 2001;Neumann et al. 2019). A common feature of Tf1/sushi retrotransposons is that they initiate reverse transcription using a self-priming mechanism first discovered in studies of Tf1 (Levin 1995(Levin , 1996Lin and Levin 1997b;Butler et al. 2001;Goodwin and Poulter 2001). The Tf1/sushi group is equivalent to the "Fungi/Vertebrates cluster" of chromovirus retrotransposons in the classification system of the Gypsy Database of Mobile Genetic Elements (GyDB, https:// gydb.org/) (Llorens et al. 2009(Llorens et al. , 2011. The term "Fungi/ Vertebrates cluster" is a misnomer as Tf1/sushi retrotransposons are present not only in fungi and vertebrates, but also in a number of nonseed plant species (Novikova et al. 2010;Neumann et al. 2019). Previously published phylogenetic analyses have shown that Tf1/sushi retrotransposons form a monophyletic group and the closest relatives of Tf1/sushi retrotransposons are plant chromovirus retrotransposons that use host tRNAs to prime reverse transcription (Marín and Lloréns 2000;Goodwin and Poulter 2001;Levin 2007;Llorens et al. 2009;Novikova et al. 2010;Neumann et al. 2019). These plant chromovirus retrotransposons are referred to as the "Plants cluster" of chromovirus retrotransposons in GyDB (Llorens et al. 2009(Llorens et al. , 2011. We performed a phylogenetic analysis of Tf1, Tf2, Tosmo1, Tosmo2, Tosmo3, Tcry1, and Tj1 together with other representative chromovirus retrotransposons using amino acid sequences encompassing the 2 most conserved proteins encoded by retrotransposons, RT and IN (excluding the less conserved chromodomain at the C-terminus of IN) (Fig. 4c). This analysis showed that Tf1, Tf2, Tosmo1, Tosmo2, Tosmo3, Tcry1, and Tj1 form a highly supported monophyletic clade, which is sister to all other Tf1/sushi retrotransposons, suggesting that these fission yeast retrotransposons descent from an ancestral retrotransposon existing in the common ancestor of fission yeasts. Supporting their shared origin is the fact that the Gag proteins of these 7 fission yeast retrotransposons do not contain a Cys-X 2 -Cys-X 4 -His-X 4 -Cys (CCHC) motif, which is widely present in the Gag proteins of retrotransposons and retroviruses (Covey 1986;Llorens et al. 2008), including the Gag proteins of other chromovirus retrotransposons ( Fig. 4c;  Supplementary Fig. 11). As the common ancestor of fission yeasts existed more than 200 million years ago (Rhind et al. 2011;Shen et al. 2020), these fission yeast Tf1/sushi retrotransposons may have co-evolved with fission yeasts for over 200 million years.
Intriguingly, the RT-IN phylogeny is incongruent with the species tree of fission yeasts, as retrotransposons in S. osmophilus and S. cryophilus share a more recent common ancestor with S. japonicus Tj1 than with S. pombe Tf1 and Tf2. On the other hand, this phylogeny is consistent with the fact that Tf1 and Tf2 have a single ORF whereas Tosmo1, Tosmo2, Tosmo3, Tcry1, and Tj1 have an in-frame stop codon at the Gag-Pol junction. Tf1/sushi retrotransposons of nonfission-yeast species predominantly express the Gag-Pol fusion protein through programed −1 frameshifting at the Gag-Pol junction, whereas the outgroup "Plants cluster" retrotransposons usually have a single ORF (Fig. 4c) (Gao et al. 2003). We speculate that the ancestor of the 7 fission yeast retrotransposons may be a single-ORF retrotransposon. The in-frame-stop-codon retrotransposons would have arisen from this ancestor before the divergence of fission yeast species and would later have been lost in the lineage leading to S. pombe. The single-ORF retrotransposons would have been lost in the lineage leading to S. osmophilus and S. cryophilus. The situation in S. japonicus awaits further characterization of its retrotransposon repertoire.
The self-priming of Tf1 requires a 3-duplex RNA structure formed by the 5′ untranslated region (UTR) of the Tf1 mRNA Levin 1997a, 1998) (Fig. 4d). The central duplex is formed between the self-primer at the 5′ end of the mRNA (shown in green in Fig. 4d) and the primer-binding site (PBS) located downstream of the 5′ LTR (shown in red in Fig. 4d). The other 2 duplexes are each formed between a PBS-adjacent sequence and a 5′ LTR sequence located downstream of the self-primer. This type of RNA structure, dubbed the "pretzel" structure (Butler et al. 2001), can also be formed by the mRNAs of Tf2 and several other Tf1/sushi retrotransposons Levin 1997a, 1997b;Butler et al. 2001;Goodwin and Poulter 2001). We inspected the sequences of Tosmo1, Tosmo2, Tosmo3, Tcry1, and Tj1 and found that their mRNAs all have the potential to form the "pretzel" structure ( Fig. 4d), suggesting that they share a similar self-priming mechanism with Tf1.
During reverse transcription of retrotransposons and retroviral mRNAs, the synthesis of plus-strand cDNA is initiated from an RNA primer that corresponds to a purine-rich sequence located upstream and adjacent to the 3′ UTR (Wilhelm and Wilhelm 2001;Rausch and Le Grice 2004;Rausch et al. 2017). This sequence is termed the polypurine tract (PPT) (Majors and Varmus 1981). We inspected PPT sequences and PPT-flanking sequences of Tf1, Tf2, Tosmo1, Tosmo2, Tosmo3, Tcry1, and Tj1 (Fig. 4e). Interestingly, even though their PPT sequences do not show strong conservation, the 4 nucleotides immediately upstream of the PPT are exclusively TCTT in these fission yeast retrotransposons. This feature is not conserved in other Tf1/sushi retrotransposons ( Supplementary Fig. 10b). Thus, the PPT-adjacent TCTT sequence is a characteristic feature of these fission yeast retrotransposons. As the plus-strand primer of Tf1 is known to begin at the third nucleotide of the TCTT sequence (Atwood-Moore et al. 2005;Esnault and Levin 2015), the TCTT sequence may help define the 5′ end of the plus-strand primer. In addition, a T-rich sequence immediately upstream of the PPT, termed T-box or U-box, is important for plus-strand priming of several retroviruses and Saccharomyces cerevisiae retrotransposon Ty1 (Ilyinskii and Desrosiers 1998;Noad et al. 1998;Robson and Telesnitsky 1999;Wilhelm et al. 1999b). It is possible that the TCTT sequence is similarly important for plusstrand priming of these fission yeast retrotransposons.
To more comprehensively understand the relationship between Tosmo1, Tosmo2, and Tosmo3, we performed a slidingwindow analysis of nucleotide identity using SimPlot++ (Fig. 4f) (Samson et al. 2022). Consistent with the RT-IN tree, Tsomo1, Tosmo2, and Tosmo3 are very similar to each other in the Pol coding sequence (overall nucleotide identity > 96%). However, strikingly high sequence diversity exists outside of the Pol coding sequence. High sequence diversity is located in 2 separate regions: first, Tosmo2 strongly differs from Tosom1 and Tosmo3 in an approximately 140-bp segment in the U3 region of the LTR, with the downstream boundary of the segment only 9 bp away from the U3-R junction ( Supplementary Fig. 10c); and second, Tosmo1, Tosmo2, and Tosmo3 differ substantially from each other in an approximately 780-bp region beginning from the 5′ UTR sequence downstream of the PBS and extending to the C-terminal region of the Gag coding sequence. The second diversity region extends about 140 bp further downstream to the beginning of the Pol coding sequence for the Tosmo1-Tosmo2 comparison and the Tosmo1-Tosmo3 comparison. At the amino acid level, the identities of Gag proteins are 34%, 32%, and 55% for Tosmo1-Tosmo2, Tosmo1-Tosmo3, and Tosmo2-Tosmo3 comparisons, respectively ( Supplementary Fig. 11). Such mosaic patterns of diversity are remarkably similar to the situation of S. pombe Tf1 and Tf2, which share 98% nucleotide identity in the Pol coding sequence but are highly different from each other in a segment of the U3 region and in the Gag coding sequence and the adjacent 5′ UTR sequence (Fig. 4f, Supplementary Figs. 10d and 11) Kelly and Levin 2005). We hypothesize that this similarity may not be coincidental but rather due to nonrandom formation of diversity regions and/or selective advantages conferred by certain diversity patterns.
As has been proposed before for Tf1 and Tf2 (Kelly and Levin 2005), the mosaic patterns of diversity between Tsomo1, Tosmo2, and Tosmo3 likely result from inter-element recombination between divergent retrotransposons. Inter-element recombination may happen either through recombination between chromosomally integrated retrotransposons, or through extrachromosomal reverse-transcription-related recombination (Kelly and Levin 2005;Drost and Sanchez 2019). The recombination breakpoint near the U3-R junction may be a reflection of reverse-transcription-related recombination happening at the minus-strand transfer step of reverse transcription, when the minus-strand cDNA that has been synthesized to the 5′-end of the retrotransposon RNA, i.e. the beginning of the R region in the 5′ LTR, is transferred to base pair with the R region in the 3′ portion of either the same RNA molecule or another RNA molecule co-packaged in the same virus-like particle (Wilhelm et al. 1999a;Drost and Sanchez 2019).
The U3 region of the LTR serves as the promoter for the transcription of the retrotransposon mRNA and contains cisregulatory sequences such as the TATA box and transcription factor-binding sites (Bilanchone et al. 1993;Benachenhou et al. 2013). In S. pombe, a nucleosome positioned at the U3 region of the Tf2 retrotransposon by chromatin remodelers Fft2 and Fft3 prevents the transcription of full-length Tf2 transcript in unstressed cells (Persson et al. 2016). The diversification of the U3 sequence in the retrotransposons in S. pombe and S. osmophilus may widen transcriptional regulatory mechanisms and thus confer adaptability. The Gag proteins of retroviruses and retrotransposons are known to be the target of host defense mechanisms and self-restriction mechanisms (Sanz-Ramos and Stoye 2013; Saha et al. 2015;Cottee et al. 2021). The diversification of the Gag sequence in the retrotransposons in S. pombe and S. osmophilus may be beneficial to the evolutionary survival of these retrotransposons.

Comparisons of the wtf genes in CBS 15792 and CBS 15793 T
wtf (with Tf LTRs) genes, which represent the largest gene family in S. pombe, are selfish spore killer genes (Hu et al. 2017;Nuckolls et al. 2017Nuckolls et al. , 2022. Our recently published study showed that wtf genes also exist in S. octosporus, S. osmophilus, and S. cryophilus (De Carvalho et al. 2022). That study identified 42 wtf genes (including 11 pseudogenes) in a draft genome assembly of the S. osmophilus strain CBS 15792. CBS 15792 and CBS 15793 T share an average nucleotide identity (ANI) of 99.73% (Supplementary  Table 19). As a comparison, ANIs between REF-clade S. pombe strains estimated to have diverged within the last few thousand years are around 99.86%, and ANIs between NONREF-clade S. pombe strains are around 99.38% (Supplementary Table 19) (Tao et al. 2019). Consistent with the high nucleotide identity between CBS 15792 and CBS 15793 T , 41 of the 42 wtf genes in CBS 15792 have syntenic counterparts in CBS 15793 T (Supplementary Table 20 and Supplementary Fig. 12a). The wtf31 gene in CBS 15792 does not have a syntenic counterpart in CBS 15793 T . This presence-absence polymorphism is likely owing to recombination mediated by tandemly arranged 5S rRNA genes flanking the wtf31 gene in CBS 15792 (Supplementary Fig. 12b).
Among the 41 pairs of syntenic wtf genes in CBS 15792 and CBS 15793 T , 2 pairs show inter-strain differences in whether a gene is an active gene or a pseudogene (Supplementary Table 20). wtf7 in CBS 15792 is a pseudogene with a premature stop codon in exon 4, whereas its syntenic counterpart, wtf41 in CBS 15793 T , does not have the premature stop codon and is an active gene ( Supplementary Fig. 12c). Conversely, wtf33 in CBS 15792 is an active gene, but its syntenic counterpart, wtf26 in CBS 15793 T , is a pseudogene owing to a frameshifting 1-bp deletion in exon 2 ( Supplementary Fig. 12d).
The within-pair nucleotide identities of 22 (54%) syntenic pairs are 100% and the within-pair nucleotide identities of 15 (37%) pairs are lower than 100% but higher than 98% (Supplementary  Table 20). Manual inspection of the other 4 pairs whose withinpair nucleotide identities fall below 98% showed that ectopic recombination is the likely cause of their higher levels of inter-strain divergence. Supplementary Fig. 12e shows 1 of these 4 pairs, wtf13 in CBS 15793 T and its syntenic counterpart wtf22 in CBS 15792. They share an overall nucleotide identity of 96.60%. However, within an approximately 250-bp region, the nucleotide identity between these 2 genes is only 85.36%. In this region, wtf13 in CBS 15793 T is highly similar to wtf15 in CBS 15793 T (97.98% identity), and wtf22 in CBS 15792 is highly similar to wtf32 in CBS 15793 T (98.33% identity) ( Supplementary Fig. 12e). Thus, either wtf13 in CBS 15793 T or wtf22 in CBS 15792 may have undergone ectopic recombination in this region. These results suggest that ectopic recombination drives fast divergence of wtf genes in S. osmophilus.

Cross-species comparisons of the mating-type region
Haploid fission yeast cells exist in 1 of 2 mating types, P (plus) or M (minus). Whether the mating type is P or M is determined by which of 2 cassettes, P or M, is present at the mating-type locus mat1 (Klar et al. 2014). Homothallic fission yeast strains can switch mating type by replacing the cassette at mat1 with a different cassette residing at a donor locus. There are 2 heterochromatin-silenced donor loci, mat2-P and mat3-M, which are adjacent to each other and are a few genes away from mat1 in S. pombe, S. octosporus, and S. cryophilus (Rhind et al. 2011). The cassettes in the 3 mat loci are all flanked by homology boxes (H1 box on one side and H2 box on the other side), which serve as recombination substrates during mating type switching.
We examined the synteny of the mating-type region in different fission yeast species (Fig. 5a). We did not include S. japonicus in this analysis because the mating-type region in the reference S. japonicus genome is misassembled (Yu et al. 2013). For ease of comparison, we chose to use sequences in which the cassette at mat1 is the P cassette (GenBank format files of sequences shown in Fig. 5a are provided as Supplementary Files 1-4). Sequences are oriented such that mat1 is on the left and the donor region is on the right. Genes flanking the mat1 locus show strong synteny across S. pombe, S. octosporus, S. cryophilus, and S. osmophilus. Notably, nvj2 and sui1, which are the third and fourth genes on the left side of mat1 in all 4 species, are among the genes that show conserved proximity to the mating-type locus across the Ascomycota phylum (Riley et al. 2016;Hanson and Wolfe 2017). With the exception of SPBC23G7.10c, all genes locating in the 17-kb interval between mat1 and mat2-P (the L-region) in S. pombe are syntenically conserved in S. octosporus, S. cryophilus, and S. osmophilus. In S. pombe, mat2-P and mat3-M have the same orientation (H2 on the left and H1 on the right) and are separated by a 11-kb interval (the K-region). In contrast, in S. octosporus, S. cryophilus, and S. osmophilus, mat2-P and mat3-M are in opposite orientations, with their H1 boxes facing each other, and are only about 600 bp apart (Fig. 5a). This is likely the ancestral configuration as mat2-P and mat3-M of S. japonicus are also arranged in this way (Yu et al. 2013). In all 4 species, the donor region is separated from the closest neighboring genes by IRs, which have been shown to be the boundaries separating the donor region heterochromatin from the neighboring euchromatin in S. pombe, S. octosporus, and S. cryophilus (Noma et al. 2001;Tong et al. 2019). On the right side of the donor region, gene synteny is maintained between S. octosporus, S. cryophilus, and S. osmophilus but is broken between these 3 species and S. pombe.
In S. pombe, the P cassette encodes 2 proteins Pc and Pi and the M cassette encodes 2 proteins Mc and Mi (Kelly et al. 1988). Genes encoding Pc, Pi, and Mc but not Mi are annotated in the reference genomes of S. octosporus, S. cryophilus, and S. japonicus ( Rhind et al. 2011). An Mi-encoding gene has been annotated in the PCR-cloned mating-type donor region of a nonreference strain of S. japonicus (GenBank: JQ735908.1) (Yu et al. 2013). Our inspection showed that Mi-encoding genes exist in S. osmophilus, S. octosporus and S. cryophilus (Fig. 5a). Pairwise comparisons and multiple sequence alignments of the amino acid sequences of Pc, Mc, Pi, and Mi showed that all 4 proteins are highly conserved among S. osmophilus,S. octosporus,and S. cryophilus (Fig. 5b;Supplementary Fig. 13ad). Pc and Mc of these 3 species are moderately similar to their counterparts in S. pombe and S. japonicus, whereas Pi and Mi of these 3 species show rather low overall similarities to their homologs in S. pombe and S. japonicus ( Fig. 5b; Supplementary Fig. 13a-d).
In S. pombe, Pi and Mi interact with each other in the zygote (Vještica et al. 2018). We predicted a heterodimeric structure of the S. pombe Pi-Mi complex using AlphaFold-Multimer (PDB format file provided as Supplementary File 5) (Evans et al. 2022). Pi contains a homeodomain in its C-terminal region (Kelly et al. 1988). In the predicted structure of the Pi-Mi heterodimer, the N-terminal region of Pi (amino acids 1-82), which consists of 3 α-helices, forms an intermolecular 4-helix bundle with the 42-amino-acid Mi, which folds into a single α-helix ( Supplementary Fig. 13e). Analysis of the predicted structure using the SOCKET2 web application showed that the hydrophobic core of the 4-helix bundle forms knobs-in-holes packing typical of coiled coils (Kumar and Woolfson 2021). Pi and Mi of S. osmophilus are predicted by AlphaFold-Multimer to form a heterodimer with a structure resembling that of the S. pombe Pi-Mi complex (Supplementary Fig. 13f and g; PDB format file provided as Supplemenatary File 6). The Pi-Mi interface areas are 1,335 Å 2 and 1,297 Å 2 for the S. pombe complex and the S. osmophilus complex, respectively. Residues at the Pi-Mi interfaces are mostly hydrophobic, but also include a few charged residues that form interhelical salt bridges. These interface residues are substantially conserved in S. osmophilus, S. octosporus, S. cryophilus, and S. pombe, but not in S. japonicus (Supplementary Fig. 13c and d). AlphaFold-Multimer did not predict a confident heterodimeric structure for Pi and Mi of S. japonicus. Thus, Pi and Mi of S. japonicus may not interact or may interact in a different manner.
We also examined the conservation of cis-acting sequences at the 3 mating-type loci using BLASTN analysis and multiple sequence alignment. The H1 boxes in all 5 fission yeast species share similar sequences ( Supplementary Fig. 14a and Supplementary  Table 21) (Rhind et al. 2011;Yu et al. 2013). The H2 boxes in S. osmophilus, S. octosporus, S. cryophilus, and S. pombe are highly similar in the most cassette-proximal 75 nucleotides, which partially overlap with the 3′ coding sequences of Pc and Mi (Supplementary Fig. 14b and Supplementary Table 22). As a result of this overlap, Pc and Mi of S. osmophilus, S. octosporus, S. cryophilus share the same 12 amino acids in their C termini (Supplementary Fig. 13a and d). The H2 box in S. japonicus shares no obvious similarity with the H2 boxes in other species. For the H3 boxes, which locate next to the H2 boxes in the donor region, cross-species similarity is only detectable between S. osmophilus and S. octosporus (Supplementary  Table 23). No obvious cross-species similarity was detected for the IRs flanking the donor regions (Supplementary Table 24). Two cis-acting sequences called switch-activating sites (SAS1 and SAS2) are present in a 110-bp L-region segment immediately adjacent to H1 in S. pombe (Arcangioli and Klar 1991). Sequence alignment suggests that this L-region segment is conserved across 5 fission yeast species (Supplementary Fig. 14a) (Rhind et al. 2011;Yu et al. 2013). SAS1 in S. pombe harbors 3 TA(A/G)CG motifs, which are binding sites for the trans-acting factor Sap1 conserved in all fission yeast species (OG-761 in Supplementary Table 10) (Ghazvini et al. 1995). SAS1 counterparts in S. osmophilus, S. japonicus, S. octosporus, and S. cryophilus contain 1, 1, 2, and 2 TA(A/G)CG motifs, respectively ( Supplementary Fig. 14a).
In S. pombe, two 8-bp Atf1-binding sites (ATGACGTA) located in the donor region, called s1 and s2, act redundantly with the RNAi pathway to promote the establishment of the donor region heterochromatin (Thon et al. 1999;Nickels et al. 2022). Our inspection of the donor region sequences of the other 4 fission yeast species showed that there are 2, 2, and 4 ATGACGTA sequences in the donor regions of S. cryophilus, S. osmophilus, and S. octosporus, respectively (Fig. 5a). Interestingly, the ATGACGTA sequences closest to the donor cassettes in these 3 species share similar locations and surrounding sequences ( Fig. 5a; Supplementary Fig. 14c), suggesting that they have a common evolutionary origin and are maintained by purifying selection. The donor region of S. japonicus lacks such Atf1-binding sites, presumably because its positioning inside a centromere allows it to depend on neighboring centromeric sequences for heterochromatin establishment (Yu et al. 2013).
The donor regions of S. octosporus and S. cryophilus are known to harbor transposon remnants (Rhind et al. 2011;Tong et al. 2019). We found that transposon-related sequences also exist in the donor region of S. osmophilus (Fig. 5a). We used the LTRs and the INTs of full-length retrotransposons Tf1, Tf2, Tosmo1, Tosmo2, Tosmo3, Tcry1, and Tj1 as queries to perform BLASTN analysis of the donor regions of these 3 species. Based on the BLASTN results, we separately annotated sequences with similarities to LTRs and sequences with similarities to INTs ( Fig. 5a; Supplementary Table 25 and Supplementary Files 1-4). The annotated transposon-related sequences in the donor regions of S. cryophilus, S. octosporus, and S. osmophilus are most closely related to Tcry1, Tosmo1, Tosmo2, or Tosmo3 (Supplementary  Table 25). Multiple sequence alignment showed that INT2 and INT1 of S. cryophilus are substantially more similar to the INT of Tcry1 than to INTs of Tosmo1, Tosmo2, and Tosmo3 and are likely to be 2 fragments of a retrotransposon that has suffered an internal deletion during degeneration (Fig. 5c). INT4 of S. osmophilus, which is a nearly full-length INT, is more similar to Tosmo2 than to other retrotransposons (Fig. 5c). The close relationships of these INT sequences to full-length retrotransposons present in the same fission yeast species suggest that they result from retrotransposon insertions occurring after species divergence.

The evolutionary trajectories of Cbp1 family proteins
Cbp1 family proteins are transposase-derived proteins known to be present in S. pombe, S. octosporus, and S. cryophilus, but not S. japonicus (Casola et al. 2007;Rhind et al. 2011;Upadhyay et al. 2017). There are 3 Cbp1 family proteins in S. pombe: Cbp1 (also known as Abp1), Cbh1, and Cbh2 (Irelan et al. 2001). They bind to the LTRs of retrotransposons in the S. pombe genome and repress the transcription of retrotransposons (Cam et al. 2008). In their absence, replication forks stall and collapse at LTRs (Zaratiegui et al. 2011). It has been proposed that the advent of Cbp1 family proteins in the common ancestor of S. pombe, S. octosporus, and S. cryophilus resulted in lower diversities of retrotransposons in these three fission yeast species compared to S. japonicus and a switch from a retrotransposon-rich centromere structure to a retrotransposon-free or -poor centromere structure (Rhind et al. 2011). Intrigued by their unique evolutionary origins and important roles in genome defense and maintenance, we decided to take advantage of the S. osmophilus genome to better understand the evolutionary origins and dynamics of Cbp1 family proteins.
S. octosporus and S. cryophilus each have 6 Cbp1 family proteins, encoded by 6 genes sharing synteny between these 2 species (Upadhyay et al. 2017) (Fig. 6a; Supplemenarty Fig. 15a-e). We found that S. osmophilus also has 6 Cbp1 family proteins and that their coding genes share synteny with genes encoding Cbp1 family proteins in S. octosporus and S. cryophilus (Fig. 6a;. We named these 6 proteins Cbt1-Cbt6 (Cbt stands for Cbp1 family domesticated transposase). They mostly share the same domain organization with S. pombe Cbp1 family proteins (Fig. 6b). The exception is Cbt2, which lacks the N-terminal tandem helix-turn-helix domains. For all 3 S. pombe Cbp1 family proteins, the eponymous catalytic triads of their DDE/D transposase domains are degenerated, with at least one of the 3 acidic residues mutated to a noncharged residue (Fig. 6b). Interestingly, in 4 of the 6 Cbt proteins (Cbt2, Cbt3, Cbt4, and Cbt5), the catalytic triad residues are either DDE or DDD (Fig. 6b and c), suggesting the possibility that these proteins may still retain some catalytic activities. Notable examples of domesticated transposases retaining catalytic activities include domesticated piggyBac transposases acting in programed genome rearrangements in ciliates (Baudry et al. 2009;Cheng et al. 2010), vertebrate RAG proteins involved in V(D)J recombination (Huang et al. 2016), and 2 domesticated transposases involved in mating-type switching in Kluyveromyces lactis (Barsoum et al. 2010;Rajaei et al. 2014).
Cbp1 family proteins originated from DNA transposons related to the pogo element of Drosophila melanogaster (Casola et al. 2007). The domestication of pogo-related transposons has occurred independently many times during evolution, with the most famous example being the emergence of the centromere-associated protein CENP-B in the common ancestor of mammals (Tudor et al. 1992;Casola et al. 2007;Mateo and González 2014;Dupeyron et al. 2020;Gao et al. 2020). Traditionally, pogo-related transposons are classified as a family belonging to the Tc1/mariner superfamily of DNA transposons (Plasterk et al. 1999). Two recent studies have elevated pogo-related transposons to the status of a superfamily sister to the Tc1/mariner superfamily (Dupeyron et al. 2020;Gao et al. 2020). The pogo superfamily has been divided into 6 main families: Passer, Tigger, pogoR, Lemi, Mover, and Fot/Fot-like (Gao et al. 2020). In vertebrates, 4 domesticated genes (CENPB, TIGD3, TIGD4, and TIGD6) are derived from pogoR family transposons, 6 domesticated genes (CENPBD1, JRK, JRKL, TIGD2, TIGD5, and TIGD7) are derived from Tigger family transposons, and 2 domesticated genes (POGK and POGZ) are derived from Passer family transposons (Gao et al. 2020). The origins of domesticated pogo superfamily transposons in fungi have not received the same level of scrutiny.
We used the 21 fission yeast Cbp1 family proteins as queries to perform TBLASTN searches against a collection of 1,014 representative nucleotide sequences of pogo superfamily transposons curated by Gao et al. (2020). The top 10 hits for each query overlap extensively across the 21 queries. There are a total of 20 transposons in the top 10 hits for all queries (Supplementary Table 26). They are exclusively Lemi family transposons present in fungal species, including 1 Saccharomycotina species (Lipomyces starkeyi) and more than 10 Pezizomycotina species belonging to the classes of Eurotiomycetes, Dothideomycetes, and Leotiomycetes. The amino acid identities between queries and top hits are in the range of 32-39% and similarities are found throughout entire query sequences (Supplementary Table 26). Thus, we conclude that Cbp1 family proteins likely originated from Lemi family transposons.
We also used the 21 fission yeast Cbp1 family proteins as queries to perform BLASTP searches against the NCBI nr database excluding Schizosaccharomyces. Consistent with the TBLASTN results described above, transposases of Lemi family transposons are among the high-scoring hits (Supplementary Table 27). Interestingly, for a majority of the queries, the highest scoring hit is either KAF5117802.1 or KAF7497890.1, 2 nearly identical proteins (amino acid identity 99.4%) from the Saccharomycotina species Geotrichum candidum (teleomorph Galactomyces candidus) (Supplementary Table 27). These 2 proteins are from 2 different strains (LMA-317 and LMA-40) of G. candidum (Perkins et al. 2020). BLASTP searches using these 2 G. candidum proteins as queries to search against the NCBI nr database excluding G. candidum identified fission yeast Cbp1 family proteins as the highest scoring hits and transposases of known Lemi family transposons as lowerranking hits. Thus, fission yeast Cbp1 family proteins and these 2 G. candidum proteins share higher similarities with each other than they do with transposases of known Lemi family transposons.
To determine whether KAF5117802.1 and KAF7497890.1 are proteins encoded by transposons, we retrieved their coding sequences and flanking regions from the genome assemblies of LMA-317 and LMA-40. We could not find TIRs, the hallmark of DNA transposons, in nearby flanking sequences. Sequence alignment showed that in LMA-317 and LMA-40, not only the coding sequences but also the flanking regions containing multiple neighboring genes are nearly identical (nucleotide identity 97.8% in a 24-kb region). Thus, these 2 proteins are encoded by the same gene locus in the 2 G. candidum strains. No additional copies of this gene could be found in the 2 genome assemblies. The lack of TIRs, the existence as a single-copy gene, and the identical genome locations in 2 divergent strains all suggest that this gene is a domesticated transposon.  Using KAF5117802.1 as query to perform TBLASTN search against the NCBI whole-genome shotgun contigs (wgs) database led to the identification of closely related homologs in 4 fungal species belonging to the same family (Dipodascaceae) as G. candidum: Dipodascus albidus, D. geniculatus, D. ghanensis, and G. reessii. Similarities between the query and hits were found throughout the length of the query and the amino acid identities ranged from 62 to 85%. The local synteny is largely conserved between G. candidum and these 4 species (Supplementary Fig. 15f). D. albidus and D. geniculatus are estimated to have diverged about 112 million years ago from G. candidum . Thus, the gene encoding KAF5117802.1 has been residing at the same locus for over 100 million years. Together, our observations suggest that this gene originated from a transposon domestication event occurring more than 100 million years ago. Because it encodes a protein with high similarity to fission yeast Cbp1 family proteins, we named this gene CBT7.
We performed a phylogenetic analysis using the amino acid sequences of 21 fission yeast Cbp1 family proteins, 5 Cbt7 proteins from Dipodascaceae species, human CENP-B, and 21 transposases of representative transposons belonging to Lemi and pogoR families (Fig. 6c). Consistent with the BLAST results, fission yeast Cbp1 family proteins and Dipodascaceae Cbt7 proteins form a highly supported monophyletic clade, which we hereafter refer to as the Cbp1 family clade. Transposases of Lemi family transposons from the Saccharomycotina species Lipomyces starkeyi and several Pezizomycotina species form a sister clade to the Cbp1 family clade. These 2 clades are nested within a larger monophyletic clade that includes transposases of other Lemi family transposons. Thus, phylogenetic analysis lends support to the idea that Cbp1 family proteins are derived from Lemi family transposases. As expected, pogoR family transposases, which form a monophyletic group including human CENP-B, are sister to Lemi family transposases and Cbp1 family proteins.
Within the Cbp1 family clade, the 18 proteins from S. osmophilus, S. octosporus, and S. cryophilus form 6 highly supported monophyletic clusters, corresponding to Cbt1-Cbt6 (Fig. 6c). The phylogenetic relationships between the 3 proteins in each of these 6 clusters are always congruent with the species tree. The catalytic triad residues and the number of amino acids between the last 2 residues of the catalytic triad are strictly conserved within each cluster (Fig. 6c). Pairwise identities of proteins within each cluster are in the range of 86-96% for the S. osmophilus-S. octosporus comparison, in the range of 82-92% for the S. osmophilus-S. cryophilus comparison, and in the range of 79-91% for the S. octosporus-S. cryophilus comparison (Fig. 6d). These values are largely in line with those of single-copy orthologs conserved across all 5 fission yeasts species (Supplementary Fig. 5c), suggesting that since the divergence of S. osmophilus, S. octosporus, and S. cryophilus about 29 million years ago, Cbt1-Cbt6 have evolved at rates similar to conserved proteins under purifying selection. Together, these observations indicate that the domestication event(s) leading to the emergence of Cbt1-Cbt6 must have occurred earlier than the divergence of these 3 species. Furthermore, the conservation of Cbt1-Cbt6 proteins suggests that the presence of full-length retrotransposons in S. osmophilus and S. cryophilus but not S. octosporus cannot be attributed to interspecific differences in Cbp1 family proteins.
Cbt7 proteins from the 5 Dipodascaceae species also form a monophyletic cluster (Fig. 6c). Branches within the Cbt7 cluster have longer lengths than branches within Cbt1-Cbt6 clusters, consistent with longer species divergence time (112 million years vs 29 million years). The 7 Cbt clusters and the 3 S. pombe proteins reside on 10 long branches indicative of substantial divergence. Indeed, pairwise identities between any 2 proteins not affiliated with the same long branch are below 50% (Fig. 6d). Intriguingly, fission yeast Cbp1 family proteins form a paraphyletic group, with the Dipodascaceae Cbt7 cluster nested in it. The fission yeast Cbt6 proteins are sister to the Dipodascaceae Cbt7 proteins. Such phylogenetic relationships suggest that horizontal evolutionary events are likely responsible for the presence of Cbp1 family proteins in both Dipodascaceae species and Schizosaccharomyces species, which are estimated to have diverged about 563 million years ago .
The 3 S. pombe Cbp1 family proteins do not group together by themselves. Instead, Cbp1 is sister to the Cbt1 cluster and Cbh2 is sister to a group including Cbp1 and Cbt1-Cbt3 clusters. Cbh1 is sister to the Cbt5 cluster but this grouping has poor support. Because genes encoding the 3 S. pombe Cbp1 family proteins do not share synteny with genes encoding Cbt1-Cbt6 proteins (Upadhyay et al. 2017) (Fig. 6e; Supplemenarty Fig. 15g-h), the relationships between Cbp1 and Cbt1 and between Cbh1 and Cbt5 are unlikely to be orthologous. The intermingling of S. pombe Cbp1 family proteins with Cbt1-Cbt5 clusters suggests that, if the 3 S. pombe Cbp1 family proteins arose through gene duplications, the duplication events occurred before the divergence of S. pombe from the common ancestor of S. osmophilus, S. octosporus, and S. cryophilus about 108 million years ago. Similarly, if Cbt1-Cbt6 proteins resulted from gene duplications, most duplications events probably occurred in the common ancestor of the 4 fission yeast species.
We propose 2 possible evolutionary scenarios to explain the phylogenetic pattern of the Cbp1 family clade. In the first scenario, all proteins belonging to this clade originated from a single transposon domestication event occurring more than 100 million years ago in the common ancestor of the 4 Schizosaccharomyces species. A series of gene duplications resulted in 9 or more paralogs dispersed in the genome. One paralog underwent horizontal transfer into the common ancestor of the 5 Dipodascaceae species. Subfunctionalization or neofunctionalization of the paralogs was accompanied by rapid evolution (Lynch and Conery 2000;Scannell and Wolfe 2008), contributing to the long branch lengths. After the divergence of S. pombe from the common ancestor of S. osmophilus, S. octosporus, and S. cryophilus, independent gene loss events in the 2 lineages resulted in the remaining paralogs being completely nonoverlapping between the 2 lineages. In the second scenario, multiple related but nonidentical Lemi family transposons invaded an ancestral Schizosaccharomyces species more than 100 million years ago, and one of these transposons also invaded an ancestral Dipodascaceae species at around the same time. Independent domestication events happened in the 2 ancestral species. In the ancestral Schizosaccharomyces species, multiple domestication events led to dispersed paralogous genes. Later, gene loss events occurred as in the first scenario, resulting in the phylogenetic pattern we see today.

Mitogenome of S. osmophilus and comparative analysis of fission yeast mitogenomes
Mitochondria possess a genome termed the mitogenome. In S. pombe, the mitogenome is indispensable for viability in the wildtype genetic background (Chiron et al. 2007;Li et al. 2019). The analysis of intraspecific mitogenome diversity of S. pombe has offered insights into the evolutionary history of this species (Tao et al. 2019). An interspecific comparison of the mitogenomes of S. pombe, S. octosporus, and S. japonicus has been conducted (Bullerwell et al. 2003). The mitogenome of S. cryophilus has been sequenced but has not been subjected to in-depth analysis (Rhind et al. 2011). Here, we assembled the complete circular-mapping mitogenome of the S. osmophilus type strain CBS 15793 T (Fig. 7a; Supplementary Table 28). We annotated it and performed comparative analysis of the mitogenomes of 5 fission yeast species.
The mitogenome of the S. osmophilus type strain is 46,134 bp long and contains 8 protein-coding genes (atp6, atp8, atp9, cob, cox1, cox2, cox3, and rps3), the RNaseP RNA gene (rnpB), 2 rRNA genes (rnl and rns), and 24 tRNA genes (Fig. 7a). All 35 genes are encoded on the same DNA strand. Such gene content and gene orientations are exactly the same as those of the reference mitogenomes of S. octosporus and S. cryophilus (Fig. 7b). Moreover, the 35 genes show perfect synteny between these 3 species (Fig. 7b). The size differences between the mitogenomes of these 3 species are mainly due to differences in the lengths of introns and intergenic regions ( Supplementary Fig. 16a). Partial gene synteny conservation can be observed between these 3 species and S. pombe, whereas the S. japonicus mitogenome shares little synteny with the mitogenomes of the other 4 fission yeast species (Fig. 7b). A phylogenetic tree generated using mitochondrial genes has the same topology as the species tree generated using nuclear genes, with S. octosporus being the closest relative of S. osmophilus ( Supplementary Fig. 16b).
We analyzed the codon usage of protein-coding genes in the mitogenomes of 5 fission yeast species (Supplementary Fig. 16c). As has been observed before for many Ascomycota species (Bullerwell et al. 2003;Kamatani and Yamamoto 2007;Procházka et al. 2012), in all 5 species, there exists a strong preference for A and T at the third position of redundant codons. The exceptions are the phenylalanine codons and the tryptophan codons. The 2 phenylalanine codons TTT and TTC are used at roughly equal frequencies. This has been explained by a tendency to avoid frameshift-prone polyU in mRNAs (Bullerwell et al. 2003). Among the 2 tryptophan codons TGG and TGA, TGG is strongly favored, perhaps due to the lack of a tRNA that can efficiently decode the TGA codon (Bullerwell et al. 2003).
Similar to the situation in S. octosporus (Bullerwell et al. 2003), we found that nonintronic protein-coding genes in the mitogenomes of S. osmophilus and S. cryophilus lack the ATA codon for isoleucine ( Supplementary Fig. 16c). In contrast, the ATA codon is used many times by nonintronic protein-coding genes in the mitogenomes of S. pombe and S. japonicus. In these 2 species, the ATA codon is decoded by a tRNA encoded by the mitochondrial tRNA gene trnI(cau) (abbreviated as I 2 in Fig. 7a-c). This tRNA is only functional when the cytidine at the wobble position is converted to lysidine (2-lysyl cytidine), which can base pair with adenosine. This tRNA modification is catalyzed by the enzyme tRNA(Ile)-lysidine synthase (Xavier et al. 2012), which is encoded by the nuclear gene til1 in S. pombe and S. japonicus. Interestingly, we found that S. octosporus, S. osmophilus, and S. cryophilus all lack the trnI(cau) gene in their mitogenomes and the til1 gene in their nuclear genomes ( Fig. 7c and d). Thus, both genes were probably lost before the divergence of these 3 species about 29 million years ago. The til1 gene is an essential gene in S. pombe (Kim et al. 2010;Hayles et al. 2013). One possible evolutionary scenario is as follows: the til1 gene in the ancestral nuclear genome was partially inactivated by mutation. Mitochondrial functions were compromised due to the insufficient modification of the tRNA that decodes the ATA codon. Selective pressure led to the replacement of the ATA codon in the mitogenome by other isoleucine codons. The til1 gene and the trnI(cau) gene thus became dispensable and eventually lost.
Mitochondrial introns are mobile elements and some of them harbor one or more ORFs for intron-encoded proteins (IEPs) (Lang et al. 2007). There are 6 introns in the mitogenome of the S. osmophilus type strain, including 4 group I introns and 2 group II introns (Fig. 7a). They all reside in the cox1 gene. We named them using a recently proposed nomenclature that denotes the intron type with a uppercase letter (P for group I introns and S for group II introns) and denotes the location in the host gene with a number, which is the corresponding position in the equivalent gene of the Pezizomycotina species Tolypocladium inflatum . This nomenclature is designed to facilitate comparisons between different mitogenomes. Thus, we also applied this nomenclature to mitochondrial introns of the other 4 fission yeast species (Supplementary Table 29 and Supplementary Fig. 16d) (Bullerwell et al. 2003;Tao et al. 2019). Five of the six S. osmophilus cox1 introns (S206, P386, P731, P1107, and S1138), which all encode proteins, reside at locations where introns have been found in other fission yeast species (Supplementary Fig. 16d). The other S. osmophilus cox1 intron (P951) does not encode a protein and is at a location where introns are not known to exist in other fission yeast species ( Supplementary Fig. 13d).
Inspection of all known fission yeast mitochondrial introns revealed that introns at the same location but in different species not only share the same intron type (group I or group II), but also mostly share the same subgroup affiliation for group I introns ( Supplementary Fig. 16d). In addition, IEPs encoded by introns at the same location are more similar to each other than to IEPs encoded by introns at other locations (Supplementary Fig. 16e and f). These observations suggest that fission yeast mitochondrial introns at the same location tend to share a common evolutionary origin. This is consistent with the "position class" concept proposed based on an analysis of cox1 group I introns in fungi and other eukaryotes (Férandon et al. 2010).
Unexpectedly, S. osmophilus cox1P386 IEP shares substantially higher homology to S. cryophilus cox1P386 IEP (pairwise identity 80.2%) than to S. octosporus cox1P386 IEP (pairwise identity 53.8%) ( Supplementary Fig. 16e). The pairwise identities between cox1P731 IEPs are also incongruent with the species tree, with the S. octosporus protein being more similar to the S. cryophilus protein than to the S. osmophilus protein ( Supplementary Fig. 16e). We hypothesized that such patterns may result from interspecific introgression or horizontal transfer. Sliding-window analysis showed that sequences that may have undergone introgression or horizontal transfer are limited to within the cox1P386 intron and the cox1P731 intron ( Supplementary Fig. 17a).
All proteins encoded by fission yeast group I introns are doublemotif LAGLIDADG homing endonucleases (LHEs), which contain 2 LAGLIDADG domains. Previously we showed that half (7/14) of the LHEs in S. pombe, S. octosporus, S. cryophilus, and S. japonicus suffer inactivating mutations at catalytic residues and thus no longer possess the ability to generate DNA double-strand breaks that initiate intron mobilization (Tao et al. 2019). All 3 LHEs in S. osmophilus contain inactivating mutations at catalytic residues ( Supplementary Fig. 17b). Interestingly, the prevalence of this type of LHE degeneration varies depending on intron location, with cox1P731 introns nearly always harbor 1 active LHE and cox1P720 introns never harbor an active LHE ( Supplementary  Fig. 17b). Such a position dependence of the extent of LHE degeneration suggests that introns descending from different origins tend to have different evolutionary trajectories.
Double-hairpin elements (DHEs) are potentially mobile elements first found in the mitogenomes of several Chytridiomycota fungal species (Paquin and Lang 1996;Paquin et al. 1997Paquin et al. , 2000. The only Ascomycota species where DHEs have been found is S. octosporus (Bullerwell et al. 2003). There are 5 DHEs in the S. octosporus mitogenome. Four of them are located in intergenic regions and one of them resides in the rnl gene (Bullerwell et al. 2003). Our inspection of the mitogenomes of S. osmophilus and S. cryophilus led to the identification of 14 DHEs in S. osmophilus and 1 DHE in S. cryophilus ( Fig. 7a and e). Except for S. osmophilus DHE11, which is located in the rns gene, the other DHEs of S. osmophilus and S. cryophilus are located in intergenic regions. The DHEs of S. osmophilus and S. cryophilus have the potential to form similar doublehairpin structures as the 5 S. octosporus DHEs (Fig. 7e).
Remarkably, S. osmophilus DHE1 and DHE2 are identical in sequence to S. octosporus DHE3 and DHE4, S. osmophilus DHE3 and DHE12 are identical in sequence to S. octosporus DHE2 and DHE5, and S. osmophilus DHE7, DHE9, DHE13, and DHE14 only differ from S. octosporus DHE1 by 2 nucleotides in a loop region (Fig. 7e). Such cross-species conservation patterns suggest that either the sequences of DHEs have been strictly maintained by purifying selection after the divergence of S. octosporus and S. osmophilus about 16 million years ago, or perhaps more likely, DHEs have recently undergone introgression or horizontal transfer across species barriers. As S. osmophilus has the highest abundance of DHEs among fission yeast species, it is an attractive model for further studying the evolution and activities of DHEs.

Conclusion
In this study, we assembled a high-quality chromosome-level genome for the type strain of the fission yeast species S. osmophilus and comprehensively annotated the genes in this genome. Cross-species comparative analyses reported here demonstrate the usefulness of this genome for understanding how fission yeast genomes have evolved. We expect that this reference genome for S. osmophilus will facilitate future research using this species and serve as a valuable resource to support comparative genomic and population genomic studies.

Data availability
Sequencing data have been deposited at NCBI SRA (BioProject: PRJNA870673). Genome assemblies have been deposited at GenBank (accession number OP310968 for the mitogenome and accession numbers CP115611, CP115612, CP115613, OQ263023, and OQ263024 for the nuclear genome). Supplemental material available at G3 online.