Shared Features Underlying Compact Genomes and Extreme Habitat Use in Chironomid Midges

Abstract Nonbiting midges (family Chironomidae) are found throughout the world in a diverse array of aquatic and terrestrial habitats, can often tolerate harsh conditions such as hypoxia or desiccation, and have consistently compact genomes. Yet we know little about the shared molecular basis for these attributes and how they have evolved across the family. Here, we address these questions by first creating high-quality, annotated reference assemblies for Tanytarsus gracilentus (subfamily Chironominae, tribe Tanytarsini) and Parochlus steinenii (subfamily Podonominae). Using these and other publicly available assemblies, we created a time-calibrated phylogenomic tree for family Chironomidae with outgroups from order Diptera. We used this phylogeny to test for features associated with compact genomes, as well as examining patterns of gene family evolution and positive selection that may underlie chironomid habitat tolerances. Our results suggest that compact genomes evolved in the common ancestor of Chironomidae and Ceratopogonidae and that this occurred mainly through reductions in noncoding regions (introns, intergenic sequences, and repeat elements). Significantly expanded gene families in Chironomidae included biological processes that may relate to tolerance of stressful environments, such as temperature homeostasis, carbohydrate transport, melanization defense response, and trehalose transport. We identified several positively selected genes in Chironomidae, notably sulfonylurea receptor, CREB-binding protein, and protein kinase D. Our results improve our understanding of the evolution of small genomes and extreme habitat use in this widely distributed group.

(Chironomus riparius, C. sonorensis, P. steinenii, and T. gracilentus), and (5) repeat libraries for all species.The data we did not generate ourselves we collected from GenBank, Sequence Read Archive (SRA), VectorBase (Amos et al. 2022;Giraldo-Calderón et al. 2015), or InsectBase (Mei et al. 2022).In the next sections, we describe our data collection in order of use, from sample collection to functional annotations.We then describe how we used these data to construct a time-calibrated phylogeny and analyze genomic features associated with compact genomes and extreme physiology in Chironomidae.

Tanytarsus gracilentus samples, extractions, and sequencing
We collected all Tanytarsus gracilentus from sites at Lake Mývatn, Iceland (65.5852º, −16.9816º) (fig.S4).DNA samples were stored in 95% ethanol and RNA samples in RNAlater, and all were kept at −20ºC in Iceland.During and after transport to the United States, DNA samples were stored at −80ºC, while RNA samples remained at −20ºC until extraction.
For long-read Oxford Nanopore Technologies (ONT) sequencing, we extracted highmolecular-weight DNA from a single adult male using a previously published protocol (Quick 2018).ONT library preparation and sequencing were performed by the Roy J. Carver Biotechnology Center, University of Illinois Urbana-Champaign.They used 50ng size-selected DNA to create a PCR library with the EXP-PBC001 and LSK-109 kits (Oxford Nanopore Technologies, Oxford, UK) and sequenced the library on one R9.4.1 FLO-MIN106 RevD SpotON flowcell for 72 hours on a MinION II sequencer.
We also used pooled short-read DNAseq for assembly polishing (Steward et al. 2021) and RNAseq to inform gene predictions.Extractions, library preparation, and sequencing were conducted by the University of Wisconsin-Madison Biotechnology Center, with sequencing on an Illumina NovaSeq 6000 (2 × 150bp reads on an S4 flow cell).DNA was extracted from a pool of 7 adult males using a QIAcube HT from QIAGEN (Hilden, Germany), and DNA concentration was verified using the Qubit dsDNA HS Assay Kit (Life Technologies, Grand Island, NY).The sample was then prepared according to the Celero PCR Workflow with Enzymatic Fragmentation (Tecan Genomics, Redwood City, CA).Quality and quantity of the finished library was assessed using an Agilent TapeStation (Agilent, Santa Clara, CA) and Qubit dsDNA HS Assay Kit, respectively.DNAseq libraries were sequenced at ≥ 100× coverage.Two RNAseq libraries were also generated, one from a pool of 10 adult males and another from a pool of 155 first-instar juveniles.Samples were homogenized using a QIAGEN TissueLyser, then RNA was extracted using organic extractions followed by on-column DNase treatment with QIAGEN RNeasy Mini Kit.RNA quality and quantity were assessed using a NanoDrop (Thermo Fisher Scientific, Waltham, MA) and Agilent BioAnalyzer.Libraries were created using the Illumina TruSeq Stranded mRNA library preparation kit and were sequenced to obtain ≥ 100 million reads per sample.

Parochlus steinenii reads and genome assembly
We generated a new reference assembly for Parochlus steinenii based on previously published sequencing data (Shin et al. 2019).This species is especially useful because of its place within the subfamily Podonominae, which represents an outgroup with respect to T. gracilentus and the other chironomid species with reference assemblies.We used ONT reads (SRA accession: SRR8180978) and paired-end Illumina DNAseq reads (SRR3951280) for genome assembly, and paired-end Illumina RNAseq reads (SRR3951285, SRR3951284, and SRR3951283) for the annotation.Trimming of Illumina reads followed the methods described for T. gracilentus.We generated an initial assembly based on ONT reads using NextDenovo v2.5.0.We polished the assembly using 6 rounds of Racon with ONT reads, then using 6 rounds of NextPolish with Illumina DNAseq reads.

Culicoides sonorensis RNAseq reads
As a member of family Ceratopogonidae, the most closely related family to Chironomidae, Culicoides sonorensis is especially informative for understanding the evolutionary history of Chironomidae.To generate gene predictions for C. sonorensis, we used paired-end Illumina RNAseq reads from a project designed to boost gene expression under many different environmental conditions (Morales-Hojas et al. 2018).We accessed these reads from the SRA using the following accession codes: ERR637904, ERR637905, ERR637906, ERR637907, ERR637908, ERR637909, ERR637910, ERR637911, ERR637912, ERR637913, and ERR637914.

Repeat elements
We described repeat elements for all 14 species' assemblies by first creating a de novo library of repetitive elements for each using RepeatModeler v2.0.3 (Flynn et al. 2020) with the LTR structural finder enabled, then identifying many of the unclassified elements using TEClass v2.1.3(Abrusán et al. 2009).Next, we combined this custom library with dipteran repeats from the RepBase RepeatMasker edition library v20181026 (Bao et al. 2015).We then soft-masked the reference assembly based on this combined library using RepeatMasker v4.1.2(Smit et al. 2013).In the species for which we created new gene predictions, we used this soft-masked assembly for all analyses downstream, but for the others, we continued to use the original assemblies that were already soft masked.For all species, we used the annotation (*.out) file from RepeatMasker to summarize repeat elements by class.Next, we calculated repeat element divergences (CpG-adjusted Kimura substitution level) using RepeatMasker's calcDivergenceFromAlign.pl script that takes a RepeatMasker align file as input.We manually parsed the resulting divsum files to create repeat landscape plots.

Phylogeny construction
To construct the phylogenomic tree, we first used BUSCO v5.4.3 to extract the amino acid sequences for the 1,436 diptera_odb10 proteins present and with only a single copy in each assembly.We removed non-homologous characters from unaligned amino acid sequences separately for each protein present in all assemblies using PREQUAL v1.02 (Whelan et al. 2018), then aligned sequences using MAFFT v7.515 (Katoh & Standley 2013).Next, we trimmed alignments using trimAl v1.4.1 (Capella-Gutiérrez et al. 2009) using heuristic selection of the automatic method.RAxML-NG v1.1.0(Kozlov et al. 2019) was then used to generate a maximum likelihood (ML) tree with Musca domestica as outgroup.We partitioned by gene and used ModelTest-NG v0.1.7 to find the best-fit substitution model (based on AIC) for each partition (Darriba et al. 2020).We assessed branch support using both Felsenstein's (1985) and transfer (Lemoine et al. 2018) bootstrap, and we used the MRE-based bootstopping test (Pattengale et al. 2010) every 50 replicates to stop bootstrapping when it converged; our bootstrapping converged after 100 replicates.
We created a time-calibrated, ultrametric tree by combining the ML tree with fossil records (queried from paleobiodb.org on 5 July 2022) using the Bayesian MCMCTree program in the PAML v4.10.6 (Yang 2007) package.Before running MCMCTree, we used (1) PAML's CODEML program to estimate the overall substitution rate and (2) the "data-driven birth-death" (ddBD) method (Tao et al. 2021) to calculate priors for the speciation birth-death process.We ran CODEML on the ML tree using three calibration point estimates, which were simplifications compared to those used in MCMCTree (described below): 232.7 million years ago (Ma) for family Chironomidae (Cranston et al. 2012), 137.573Ma for subfamily Chironominae (Cranston et al. 2012), and 76 Ma separating genera Belgica and Clunio (timetree.org).The ddBD method was run on the tree output from RelTime-ML (Mello et al. 2017) in MEGA-CC v11.0.13-1 (Tamura et al. 2021) that took the initial ML tree and concatenated protein sequence alignments as inputs.We used the LG model (the most common best-fit model in the ML tree) and a single concatenated alignment for both CODEML and RelTime-ML.
In MCMCTree, we used an autocorrelated clock, and we used ModelFinder (Kalyaanamoorthy et al. 2017) to merge partitions and to find the best substitution model for each partition.We used ModelFinder here because MCMCTree had difficulty converging without merging partitions, but we used ModelTest-NG for our ML tree because it generated a better-fitting tree as indicated by AIC.We ran the MCMC for 250,000 iterations with a sample frequency of 5, after 2,000 iterations of burn-in.For MCMCTree's rate-drift parameter priors, we used a shape parameter of 1 and scale parameter equal to the mean root age (Inoue et al. 2010); we approximated mean root age by using the middle point of the uniform portion of the distribution used for the priors on root age (i.e., (  +   ) / 2 in Table S9).To check for convergence, we ran the program 4 times and calculated the effective sample size across all parameters and runs using effectiveSize in the R package coda v0.19.4 (Plummer et al. 2006).
For MCMCTree's substitution rate priors, we used a shape parameter of 2.0 for a relatively diffuse prior and a rate value that caused the Gamma distribution mean to equal the substitution rate estimated by CODEML (8.7005 × 10 −10 substitutions per year).We used both fossil evidence and previous time estimates (Benton et al. 2009;Hedges & Kumar 2009;Cranston et al. 2012Cranston et al. , 2010) ) to inform the parameters defining the divergence time sampling distributions in MCMCTree.We defined five calibration points: (a) 238.5 Ma minimum and 295.4 Ma maximum for the root (Benton et al. 2009), (b) 242.0 Ma minimum for the superfamily Chironomoidea (Lukashevich et al. 2010), (c) 201.3 Ma minimum for family Chironomidae (Krzeminski & Jarzembowski 1999), (d) 93.5 Ma minimum for subfamily Chironominae (Giłka et al. 2022), and (e) 33.9 Ma minimum for the portion of subfamily Orthocladiinae containing genera Belgica and Clunio (Zelentsov et al. 2012).All constraints were soft, and we adjusted the probability of sampling a time that violates the constraint based on our perceived reliability of the bounds.MCMCTree samples minimum constraints with a truncated Cauchy distribution, and we defined the Cauchy parameters to make the mode near our best estimate and the variance proportional to our confidence in this estimate (see table S9 for details on parameter values).

Finding orthogroups
We used OrthoFinder v2.5.4 (Emms & Kelly 2019) to identify phylogenetic Hierarchical Orthologous Groups (HOGs) for use in the analyses for intron sizes, gene family evolution, and positive selection.Using OrthoFinder for generating HOGs and OrthoDB for the phylogeny allowed us to generate a species tree using the more highly curated OrthoDB protein sequences while identifying more orthologs that span a greater number of biological functions for our gene family and selection analyses.We first filtered all species' protein sequences to only include the longest isoform for each protein-coding gene.We input these filtered protein sets and the timecalibrated species tree from MCMCTree into OrthoFinder and used the set of HOGs for the root of the phylogeny for downstream analyses.We removed Clunio marinus from any analyses using HOGs because only 58.5% of its genes were assigned to an orthogroup by OrthoFinder, compared to >90% for all other species.

Features associated with genome size
We looked for associations between genome size and four genomic features in chironomids: protein-coding genes, intergenic sequences, introns, and repeat elements.We estimated the number of protein-coding genes and total intergenic sequence length from the assemblies' *.gff files, using the longest isoform for each gene for intergenic sequences.Repeat elements were extracted from all species' RepeatMasker annotation (*.out) files, and we aggregated all non-TE elements (rolling-circles, small RNA, satellites, simple repeats, and low complexity) into one category for analysis.We calculated average intron length based only on genes matching to HOGs that had all species represented and that contained < 4 genes per species (Martín-Durán et al. 2021).We log10-transformed these variables because they were highly right skewed, with intron lengths being transformed before calculating the mean.
We first tested for significant differences in each feature (including genome size) between chironomids and other dipterans to assess whether any associations likely pertain to the specific evolutionary history of chironomids.For these tests, we grouped Culicoides sonorensis (family Ceratopogonidae) with chironomids because exploratory plots made it clear that it shared similar genomic traits with its most recent common ancestors in our phylogeny (family Chironomidae) and because grouping C. sonorensis with Chironomidae increased all models' log-likelihoods compared to not grouping it with Chironomidae.We used phylogenetic linear regressions and tested for significance of a binary variable indicating whether the species belonged to families Ceratopogonidae or Chironomidae.We used the phylolm v2.6.2 (Ho & Ane 2014) package in R v4.3.2 (R Core Team 2023) and ran models using an Ornstein-Uhlenbeck process with a fixed state estimated at the root of the phylogeny.We next tested for whether each feature was correlated with genome size to ascertain whether any changes that occurred likely contributed to genome compaction in Chironomidae.We used the cor_phylo function in the R package phyr v1.1.2(Li et al. 2020) to compute the Pearson correlation coefficient between features and genome size while accounting for phylogenetic covariance.We constrained the phylogenetic signal to be between 0 and 1 when not doing so resulted in the covariance matrix not being positive definite, which prevented bootstrapping.For both sets of analyses, we used parametric bootstrapping (2,000 repetitions) to compute 95% confidence intervals for the parameters of interest, and we scaled the phylogeny to a maximum depth of 1 for easier interpretation because α scales with phylogeny depth (Ives & Garland 2010).

Gene family evolution
We used CAFE v5.0.0 (Mendes et al. 2020) to analyze gene family evolution along our phylogeny, looking specifically for families that expanded for Chironomidae.After CAFE removed families not present at the root of the phylogeny, we analyzed 9,057 total HOGs across our 13 species.We used a Poisson root frequency distribution and a Gamma model with 8 categories; we used 8 because from testing, this is lowest value where the changes in the negative log likelihood began to subside.From the CAFE output, we filtered for HOGs that significantly expanded (P < 0.001) at the node separating Chironomidae from their most recent common ancestor.To better understand the functions of these expanded HOGs, we connected Gene Ontology (GO) terms from our functional annotations to each HOG based on all genes from all species that matched to it.We then used the enricher function in the R package clusterProfiler v4.8.2 (Wu et al. 2021) to find enriched GO terms in our set of HOGs, using a Benjamini-Hochberg adjusted P-value cutoff of 0.1 and minimum gene size of 1. Next, we used the R package rrvgo v1.12.2 (Sayols 2023)-based on the online tool REVIGO (Supek et al. 2011)-to reduce the redundancy of the set of enriched "biological process" GO terms based on the Drosophila melanogaster-derived database org.Dm.eg.db v3.17.0 (Carlson 2023b) and to summarize these GO terms using a treemap.

Positive selection
We used HyPhy v2.5.52 (Murrell et al. 2015;Wertheim et al. 2015) to test for positive selection in Chironomidae.We first filtered for HOGs that contained exactly one copy per species and that were associated with at least one GO term (including all offspring) from our a priori list of terms associated with tolerance to extreme habitats (table S5).We identified offspring GO terms using GO.db v3.17.0 (Carlson 2023a).Because HyPhy requires codon-aware sequence alignments, we extracted the coding sequence (CDS) for the longest isoform per gene and used HyPhy's codon-msa tool (https://github.com/veg/hyphy-analyses)and MAFFT for the alignments for each HOG.For 12 HOGs, alignments failed in one or more species, so these HOGs were removed from our analyses.Next, we labelled Chironomidae on our tree using phylotree.js(Shank et al. 2018) so that we could use chironomids as the test branches (with the others being background) for our analyses.For each HOG, we tested (1) whether positive selection occurred for any chironomids using HyPhy's BUSTED method (Murrell et al. 2015) and ( 2) whether selection intensified or relaxed for chironomids compared to outgroups using HyPhy's RELAX method (Wertheim et al. 2015).The resulting set of P-values for each test across all HOGs were corrected for multiple comparisons using the Benjamini-Yekutieli procedure (Benjamini & Yekutieli 2001) under arbitrary dependence assumptions and using a false discovery rate of 0.10.(Benton et al. 2009) to compare protein sequences to five different databases: Pfam (El-Gebali et al. 2019), KOfam (Aramaki et al. 2020), eggNOG (Huerta-Cepas et al. 2019), NCBI's protein family models (NPFM) (Lu et al. 2020), and the Transporter classification database (TCDB) (Saier et al. 2021).Databases were accessed on 12 July 2022.

Figure S2 .
Figure S2.Maximum likelihood phylogenomic tree showing relationships among chironomids and outgroups within the order Diptera.Numbers near nodes indicate percent support using Felsenstein's bootstrap (top) and transfer bootstrap (bottom).

Figure S3 .
Figure S3.Distributions of repeat element divergences calculated using the CpG-adjusted Kimura substitution model (right panel) for all species in our phylogeny, where color indicates the repeat class.The phylogeny (left panel) has chironomids in black and others in gray.

Figure S4 .
Figure S4.Map and descriptions of sequenced samples.(A) Sampled Tanytarsus gracilentus were collected from one of three locations at Lake Mývatn, Iceland.(B) Descriptions of the four sequencing libraries used in this study, including number of T. gracilentus individuals in the pool, life stage of the individuals, date collected, location of collection (names coincide with part A), type of reads (long vs 2×150bp, DNA vs RNA), sequencing platform, and step(s) for which they were used.

Table S1 .
Summary statistics for genome annotations.Included is completeness of coding sequences (CDS) calculated using BUSCO and the diptera_odb10 protein database.Also included are the number of genes with tags for Kyoto Encyclopedia of Genes and Genomes (KEGG) Orthology (KO) terms, Gene Ontology (GO) terms, Clusters of Orthologous Genes (COG) categories, Enzyme Commission (EC) numbers, and KEGG enzyme pathways.
* This species only had functional annotation created here so did not have CDS completeness scored

Table S2 .
Summary statistics for BUSCO matching and resulting amino acid sequence alignments.Shown is the number of proteins from OrthoDB's diptera_odb10 database that match a protein in the assembly of each species (with exactly one copy).Also shown is the percent of missing data (before trimming) in each species' protein alignments across all 1,437 genes that were present and single-copy in all species.

Table S3 .
Estimates and parametric-bootstrapped 95% confidence interval bounds for Pearson correlations between genome size and genome features computed using cor_phylo.

Table S4 .
List of the nine HOG gene families that significantly expanded at the node separating Chironomidae from their most recent common ancestor in our phylogeny.Listed are the gene(s) for Culex quinquefasciatus (Cquinq) and Musca domestica (Mdomes) that fall within the HOG and the descriptions for these genes from each species' genome annotation.

Table S5 .
Targeted Gene Ontology (GO) terms used in the analyses for positive selection.

Table S6 .
(Shin et al. 2019)ome assemblies and annotations used in this study.Unless otherwise noted, each annotation source provides a set of both gene predictions and functional annotations with GO terms.Families are abbreviated as follows: Ch = Chironomidae, Ce = Ceratopogonidae, Cu = Culicidae, and Mu = Muscidae.Functional annotation with GO terms was not found on GenBank, so it was generated in this study † Based on reads from(Shin et al. 2019)

Table S7 .
Summary statistics for Tanytarsus gracilentus assemblies.In the "program" column, ">" indicates that this program was run on the output from the previous step, and the quickmerge assembly shown is the best permutation (see tableS7).

Table S8 .
Summary statistics for two rounds of quickmerge on initial T. gracilentus assemblies.The permutation column indicates the identity and order of assemblies input to quickmerge, and the lower set of permutations are the second round that use the four best first-round permutations.

Table S9 .
Parameters defining the statistical distributions used to sample divergence times in MCMCTree, for calibrations defining (A) lower and upper and (B) just lower bounds.All values use 100 million years ago as the time unit because MCMCTree requires that all times be 0.01-10.(A) Parameter   is the lower age bound,   is the upper age bound,   is the probability that the real age is less than   , and   is the probability that the real age is greater than   .Upper and lower constraint specifications take the form 'B(  ,   ,   ,   )' in MCMCTree.(B) Parameter  is the offset,  is the scale parameter.Lower constraint commands take the form 'L(  , , ,   )'.In the Cauchy distribution used for sampling, the location parameter is   [1 + ], and the scale parameter is    .When a distribution mode was based on a reference, this means that we adjusted parameter  so that the distribution mode matched the estimate in that reference.