Runaway GC Evolution in Gerbil Genomes

Abstract Recombination increases the local GC-content in genomic regions through GC-biased gene conversion (gBGC). The recent discovery of a large genomic region with extreme GC-content in the fat sand rat Psammomys obesus provides a model to study the effects of gBGC on chromosome evolution. Here, we compare the GC-content and GC-to-AT substitution patterns across protein-coding genes of four gerbil species and two murine rodents (mouse and rat). We find that the known high-GC region is present in all the gerbils, and is characterized by high substitution rates for all mutational categories (AT-to-GC, GC-to-AT, and GC-conservative) both at synonymous and nonsynonymous sites. A higher AT-to-GC than GC-to-AT rate is consistent with the high GC-content. Additionally, we find more than 300 genes outside the known region with outlying values of AT-to-GC synonymous substitution rates in gerbils. Of these, over 30% are organized into at least 17 large clusters observable at the megabase-scale. The unusual GC-skewed substitution pattern suggests the evolution of genomic regions with very high recombination rates in the gerbil lineage, which can lead to a runaway increase in GC-content. Our results imply that rapid evolution of GC-content is possible in mammals, with gerbil species providing a powerful model to study the mechanisms of gBGC.


Introduction
The base composition of DNA shows considerable variation within genomes and between species.For example, genes and genomes can differ strikingly in GC-content, defined as the proportion of G and C nucleotides in a sequence of DNA.An important topic in evolutionary genetics is understanding the processes that lead to changes in GC-content, particularly those with direct effects on functional genetic elements (Marais et al. 2001;Galtier and Duret 2007) or on chromatin structure (Vinogradov 2003).
Broadly speaking, the GC-content of a species is thought to result from the balance between mutation, which is generally AT-biased (Lynch 2007;Long et al. 2018), and a type of homologous recombination known as gene conversion, which is generally GC-biased (Lamb 1984;Galtier et al. 2001;Galtier and Duret 2007;Duret and Galtier 2009;Odenthal-Hesse et al. 2014).Gene conversion takes place during meiosis, through the repair of double-strand breaks.Meiotic double-strand breaks occur on a chromosome at sites where it meets its homologous chromosome, either resulting in crossovers or in noncrossovers (Arnheim et al. 2007;Cole et al.Downloaded from https://academic.oup.com/mbe/advance-article-abstract/doi/10.1093/molbev/msaa072/5805392 by guest on 17 March 2020 2010).Their repair involves the use of the homologous chromosome as a template for resynthesis and replacement of a short stretch of DNA on the broken chromosome.In the cases where the double-strand break occurs near a heterozygous site, the broken chromosome can be thus be 'converted', i.e., it can receive the allele of the homologous chromosome and lose the original allelic variant.Importantly, at GC:AT heterozygous sites the gene conversion of AT alleles to GC alleles occurs more often than the opposite conversion of GC alleles to AT alleles (Lamb 1984;Galtier et al. 2001;Galtier and Duret 2007;Duret and Galtier 2009;Smeds et al. 2016).Indeed, AT ('weak') to GC ('strong') conversions have been shown to occur in 68% of observable gene conversions of GC:AT heterozygous sites in humans and mice (Odenthal-Hesse et al. 2014;Williams et al. 2015;Halldorsson et al. 2016;Li et al. 2019).This non-random transmission of GC alleles between homologous chromosomes is known as GC-biased gene conversion (gBGC).
In genomic regions with high recombination rates -i.e., those with a high density of doublestrand break hotspots -gBGC causes the frequency of G and C alleles to increase over time (Eyre-Walker 1999;Webster and Smith 2004;Spencer et al. 2006;Katzman et al. 2011;Auton et al. 2012).As a consequence, a correlation is expected between the rate of recombination and the rate of weak-to-strong (AT-to-GC) substitutions between species (Duret and Arndt 2008).
Studying the genomes of recently diverged species, such as human and chimpanzee, shows that there is a correlation between the weak-to-strong (AT-to-GC) substitution rate and both the current and the reconstructed ancestral rates of recombination (Dreszer et al. 2007;Capra et al. 2013;Munch et al. 2014).In birds, where karyotypes and recombination rates are exceptionally conserved (Singhal et al. 2015), the correlation has been detected even at longer phylogenetic distances (Mugal et al. 2013;Botero-Castro et al. 2017;Corcoran et al. 2017;Bolívar et al. 2019;Rousselle et al. 2019).Additionally, GC-content has been shown to be correlated with indirect proxies of recombination rate, for instance chromosome length, with species with smaller chromosomes having on average a higher recombination rate and therefore higher GC-content (Romiguier et al. 2010;Nabholz et al. 2011;Pessia et al. 2012;Figuet et al. 2014).An implication of this gBGC hypothesis is that lineage-specific changes to GC-content are likely to have been caused by lineage-specific changes to the recombination landscape (Romiguier et al. 2010).
Thus, lineages that are affected by extreme levels of GC-biased evolution are likely to be useful models in the study of the evolution of recombination.
One such species is the gerbil Psammomys obesus, the fat sand rat.A recent study has shown that the genome of this species has an unusual region with an extremely high GC-content (Hargreaves et al. 2017).This region contains at least 88 genes, including several that are highly Downloaded from https://academic.oup.com/mbe/advance-article-abstract/doi/10.1093/molbev/msaa072/5805392 by guest on 17 March 2020 conserved across mammal species, such as Brca2, Cdk8, Insr and the ParaHox cluster (Gsx1, Pdx1, Cdx2), and is syntenic to the sub-telomeric region of chromosome 12 in rat.The region had previously eluded study because of the difficulty of sequencing DNA sequences with very high GC-content when the rest of a sample is not GC-rich (Chen et al. 2013;Botero-Castro et al. 2017).
Using transcriptome sequencing, Hargreaves et al. (2017) assembled 52 out of 88 known genes in the region and showed that a subset of at least 30 had a GC-content greater than their homologues in the mouse and rat, to which gerbils are closely related.By focusing on the Pdx1 gene, they also showed that the region is affected by extreme nonsynonymous evolution: the 60 amino-acid homeodomain of PDX1 is 100% conserved across all previously studied mammals, yet in P. obesus it differs by 15 amino acids.Because of the considerable level of conservation of this protein across vertebrates, it is likely that most of these changes were deleterious when they originated in P. obesus (Hargreaves et al. 2017;Dai and Holland 2019).The implication is that the region has evolved to this state not because of selection for high GC-content, but through a process that increases GC-content regardless of deleterious effects of the G or C alleles (Berglund et al. 2009;Galtier et al. 2009;Ratnakumar et al. 2010;Kostka et al. 2012;Dai and Holland 2019).The suggestion is that an expansive genomic region has been affected by anomalously stable and intense gBGC, causing extreme sequence divergence in some genes in a short period of time (the common ancestor between mice and gerbils lived between 20.6 to 22.5 million years ago; Steppan et al. 2004).We reasoned that this rapid evolution of localised GCcontent affords a powerful opportunity to explore the molecular drivers underpinning GC evolution in mammalian genomes.
In this study, we test whether gBGC was responsible for the increase in GC-content by measuring the substitution rates for weak-to-strong (AT-to-GC), strong-to-weak (GC-to-AT) and GCconservative mutations in the fat sand rat Psammomys obesus, as well as in three additional species from the gerbil clade (subfamily Gerbillinae).We hypothesised that the region evolved on the gerbil stem lineage, hence we measured the substitution rates from the point of Gerbillinae-Murinae divergence, and compared rates to those measured for two murine species: the mouse Mus musculus and the rat Rattus norvegicus.We tested whether the high-GC region is affected by a high rate of weak-to-strong (AT-to-GC) substitutions, but not of other types of substitution, as expected if the high GC in the region results from gBGC alone.Finally, we explore whether other parts of the gerbil genome are affected by similar GC-skewed evolution, as predicted if there was a global change in recombination.

All mutational categories have an increased substitution rate in the known high-GC region
Our first aim was to test whether a previously described high-GC region is found in the genomes of gerbil species other than the sand rat P. obesus.We sequenced and assembled transcriptomes from three gerbil species (Meriones unguiculatus, Meriones libycus and Meriones shawi, fig.1A).
From each species, we identified orthologues of the protein-coding genes located in the known high-GC region of the sand rat, and compared them to orthologous genes in P. obesus and in two murine species, Mus musculus and Rattus norvegicus.This gave a subset of 27 genes with representative sequences in at least one of the 4 gerbil species (20 genes had a sequence in all 4 gerbil species).For each gene and species, we measured GC in the third codon position (GC3).
Consistent with previous observations, the genes in the region had an extremely high GC3 in P. obesus, ranging from 81% to 100%, compared to the 42% to 77% in M. musculus and 45% to 78% in R. norvegicus (fig.1B).In the three other gerbil species, GC3 was also extremely high, ranging from 73% to 100% in M. unguiculatus, 74% to 100% in M. libycus and 75% to 99% in M. shawi (fig.1B).The gene sequences of the gerbil species represented in each of the 27 genes had, in all cases, a higher GC3 value than the sequences of the two murine species.This result supports the hypothesis that the region of high GC is not unique to the fat sand rat, but evolved before the divergence between the four gerbil species represented in our samples.
A hypothesis that could explain the high GC3 in the region is that the fixation rate of G or C alleles is higher than that of A and T alleles, through a process such as GC-biased gene conversion.GC-biased gene conversion is expected to directly increase the rate of AT-to-GC substitutions (weak-to-strong) yet to have no direct effect on the rate of GC-to-AT (strong-to-weak) or GCconservative (strong-to-strong and weak-to-weak) substitutions (Duret and Arndt 2008).We tested whether the genes in the high-GC region of gerbils have encountered an increase in weakto-strong substitution rate by estimating the rate of synonymous substitution (dS) for different mutational categories: weak-to-strong (dSWS), strong-to-weak (dSSW), strong-to-strong (dSSS) and weak-to-weak (dSWW).We measured these values from the node of the tree representing the murine-gerbil divergence to the tips representing each of the four species (fig.1A), thus measuring the divergence between the two groups of species (20.6 to 22.5 million years of divergence).The value of dSWS was greater than 1 for all genes in all four gerbil species, compared to a range of just 0.03 to 0.4 in M. musculus and 0.04 to 0.32 in R. norvegicus.Surprisingly, however, the other three mutational categories also showed a higher dS in the gerbil Downloaded from https://academic.oup.com/mbe/advance-article-abstract/doi/10.1093/molbev/msaa072/5805392 by guest on 17 March 2020 species than in the murine species (fig.1C) in all but three genes (Arpc1b, Pet100 and Cers4).
Indeed, 20 of the 27 genes had dS greater than 1 for all mutational categories in all gerbil species.These dS values observed in the gerbil species -and not in the murine species -indicate that, for most genes, the substitution rates of each mutational category are extreme and high enough to have reached saturation.Importantly, dSWS was higher than dSSW for all genes in the gerbil species (fig.1D), consistent with the overall increase in GC for the genes in the region.These results are mirrored in the substitutions causing nonsynonymous changes.In 26 of the 27 genes, GC at the first and second codon position (GC12) had a higher value in the four gerbil species than in the two murine species (fig.2A), despite both groups having overlapping ranges (45% to 72% in the gerbil species and 43% to 61% in the murine species).We measured the rate of nonsynonymous substitution (dN) for the four mutational categories: dNWS, dNSW, dNSS and dNWW (fig.2B).The main difference between dS and dN is that dN was never greater than one Downloaded from https://academic.oup.com/mbe/advance-article-abstract/doi/10.1093/molbev/msaa072/5805392 by guest on 17 March 2020 for any of the categories.Otherwise, we detect the same patterns for dN and for dS.First, all 27 genes had greater dNWS in all four gerbil species than in the two murine species (range 0.01 -0.46 in the gerbil species and 0 -0.07 in the murine species).Second, the three other categories also have elevated dN in the gerbil species relative to the mouse species in most genes (18 out of 27).Lastly, dNWS was higher than dNSW for all genes in the gerbil species (fig.2C).
In summary, we found that the known region of high GC in gerbil species is characterised by an increase in dS and dN for all mutational categories, but with a higher rate of weak-to-strong substitutions than strong-to-weak substitutions.

GC skew affects other genes in the gerbil genome
Our second aim was to test whether other genes in the genome of gerbils are affected by GCskewed evolution, or whether the known GC-rich region is a unique peculiarity within gerbil genomes.Based on the results above, we sought to address this by asking three questions.
(1) Are there genes with outlying dSWS values in other parts of gerbil genomes?(2) If such genes exist, do they also have high dS values for the other three mutational categories?(3) Do these genes have a higher dSWS than dSSW, thus being affected by GC-skewed evolution?
To answer these questions, we identified 8,809 orthologous genes in a set of ten rodent species and in the outgroup Homo sapiens, excluding any genes from the known high-GC region.These groups of orthologous genes have a single copy in each of two gerbil species (P.obesus and M. unguiculatus) and two murine species (M.musculus and R. norvegicus).We measured the synonymous substitution rates of the different mutational categories (dSWS, dSSW, dSWS and dSWS) from the node of the tree representing the murine-gerbil divergence to the tips representing each of the four species (fig.3A).To control for different average evolutionary rates in each of the lineages, we divided each rate measurement by the average rate for the category and the species.
For each of the mutational categories in each species, we considered genes with dS greater than 2.5 times the respective average as outliers, a threshold chosen to capture the tail of the dS distribution (supplementary fig. 1 and 2).
The most striking difference between the murine and the gerbil species is the excess of genes with high dSWS in the two gerbil species (fig.3B and supplementary fig.3).Respectively 4.1% and 4.4% of genes of P. obesus and M. unguiculatus were outliers in dSWS (360 and 387 out of 8,809, respectively; supplementary table 1), of which 323 were outliers in both species (fig.3C).By comparison, only 0.7% of genes in M. musculus and R. norvegicus were outliers in dSWS (64 and 65 out of 8,809, respectively).Thus, we conclude that the genomes of gerbil species include a large number of genes with outlying dSWS values, located outside the known high-GC region.
We then asked whether these genes also have a higher dS value for the other mutational categories.Comparing the values of dSSW, dSSS and dSWW between outlier and non-outlier genes shows that, in all species, the outlier genes have higher substitution rates than non-outlier genes for these three mutational categories (one-tailed Wilcoxon rank-sum test, p < 0.05 for all categories in all species, supplementary table 2; supplementary fig.4).It is important to note, however, that this difference is not extreme: the groups of genes with outlying dSWS did not have a disproportionately large number of genes that also had an outlying value for the other three Downloaded from https://academic.oup.com/mbe/advance-article-abstract/doi/10.1093/molbev/msaa072/5805392 by guest on 17 March 2020 mutational categories (chi-squared test for each species, p > 0.05; supplementary fig.4).In other words, although the genes that are classed as outliers based on dSWS have on average higher values for the other three mutational categories, some other genes in the genome can be classed as outliers based on these three mutational categories.The difference between outliers and nonoutliers was least strong for dSSW than for dSWW or dSSS (supplementary table 2).For instance, in P. obesus the difference in median normalised dSSW between outliers and non-outliers was only 0.14, compared to a difference in median normalised dSSS of 0.77.
Despite the average increase in all mutational categories in the genes with outlying dSWS, are these genes affected by a GC skew?We found that, in our four focal species, more than 90% of genes with outlying dSWS had higher dSWS than its opposing rate, dSSW (99% in P. obesus, 98% in M. unguiculatus, 95% in M. musculus and 92% in R. norvegicus; fig.3B).This effect is particularly striking for the two gerbil species.As seen in figure 3B, the relationship between dSWS and dSSW in these two species has a "chimney" shape, indicating a large excess of dSWS.In other words, the two gerbil species had a large number of genes with an outlying dSWS, of which virtually all had higher dSWS than its opposing rate.We thus conclude that gerbils are affected by GCskewed evolution in protein-coding genes outside the known high-GC region.
Is this GC skew large enough to have affected the GC-content of gerbil genes since their divergence from the murines?To study this, we measured GC3 of the representative sequences for the four focal species in our analysis for each of the 8,809 groups of orthologous genes (excluding genes for which the coding sequence annotation length was not a multiple of three: 12 genes in M. musculus, 5 in M. unguiculatus, 2 in R. norvegicus and 0 in P. obesus).The GC3 distributions of each of the four species were different from one another (Kruskal-Wallis rank sum test, χ 2 = 136.02,d.f.= 3, p < 1 ⨉ 10 -28 ; Table 1; supplementary fig.5).Pairwise comparisons between the species show a significant difference only between the murine and the gerbil species (Dunn's Kruskal-Wallis multiple comparisons with the Benjamini-Hochberg multiple testing correction, p < 0.001), with no difference between M. musculus and R. norvegicus nor between P. obesus and M. unguiculatus.The difference between the two clades is seen as bias towards high GC3 in the gerbil species (supplementary fig.6).Is this bias explained by the process of GCskewed substitution seen in the gerbil genomes? Figure 3D shows that the genes with an outlying value of dSWS in P. obesus are also those with the largest difference in GC3 between this species and M. musculus, a result that is mirrored in all gerbil-murine comparisons (supplementary fig.7).
Thus, we conclude that the genes in the gerbil lineage underwent a GC-skewed substitution process that has increased their GC3 relative to their orthologs in the murine lineage.Genes under GC-skewed evolution in other clades have been documented to have a relatively high load of deleterious mutations (Berglund et al. 2009;Galtier et al. 2009;Necşulea et al. 2011).
We determined whether the outliers in dSWS are affected by putatively deleterious substitutions by testing whether these genes have an elevated rate of nonsynonymous changes relative to the other genes.As expected for both gerbil species, the group of outlier genes had higher dNWS, dNSS and dNWW -but not dNSW -than the non-outlier genes (one-tailed Wilcoxon rank sum test, p < 0.05 for each comparison, supplementary table 3; supplementary fig.8).The difference was strongest for the WS mutational category, normalised dNWS having a median value of 1.59 and 1.54 for the outliers of P. obesus and M. unguiculatus, compared to 0.65 and 0.64, respectively, for the non-outliers of each species.These results suggest that the outlier genes have a higher load of deleterious mutations than non-outlier genes.Nevertheless, it is important to note that the difference between the two groups of genes was modest for all mutational categories, with a large overlap in their ranges (supplementary fig.8).

GC-skewed genes are organised in clusters
Above, we have shown that more than 300 of the genes in the 8,809 orthology groups have an outlying dSWS in gerbils (fig.3C), representative of a GC-skewing process specific to the gerbil lineage.An important question is whether these genes are positioned randomly across the genome, or whether they are organised into clusters in a similar way to the previously described high-GC region of P. obesus.Since gerbil genomes have not been assembled to sufficient contiguity to test this directly, we mapped each gene to the mouse reference assembly.In figure 4A, it can be clearly seen that genes with high dSWS in the gerbil species tend to be placed in close proximity to each other, forming peaks in the spatial distribution (see supplementary fig. 9 for the other mutational categories and supplementary fig. 10 for the distribution of GC12 and GC3).These groups of genes form peaks present on every chromosome except for the X chromosome.11).The fact the outlier genes tend to be organised into clusters also implies that the gerbil species have large tracts of chromosome without any outliers in the GC skew process (fig.4A).
How many regions in the genome include clusters of outlying genes?The difficulty of defining the exact boundaries of such regions is that runs of outlier genes are interspersed by genes with low dSWS.We divided the genome into sliding windows of 1 Mbp overlapping with a step of 0.25 Mbp.
We identified any window where the median dSSW is greater than 2.5 times the average dSSW for each of the species, which we defined as "outlying regions".Overlapping windows that passed this threshold were collapsed into single regions.We found 17 such regions each for P. obesus Studies of GC-biased substitutions since the divergence between human and chimpanzee have found that GC-biased substitutions occur at a higher density in sub-telomeric regions in these species (Dreszer et al. 2007;Auton et al. 2012;Capra et al. 2013;Munch et al. 2014).To test whether a similar pattern is seen in the gerbil genomes, we determined whether there is an excess of dSWS outliers mapped to the 5 Mbp region at the start and at the end of each chromosome in the mouse reference assembly.We found that approximately 20% of the outlier genes in P. obesus and M. unguiculatus are located in these sub-telomeric regions, compared to only 6% of the non-outlier genes (χ 2 test, p < 0.05, supplementary table 6).Indeed, for each of the gerbil species, the inferred sub-telomeric regions included respectively 9 of the 17 and 8 of the 17 "outlying regions" identified in the sliding window analysis above (supplementary table 7).
However, the gerbil and murid lineages are known to have different karyotypes, so it is likely that not all sub-telomeric regions in the gerbil species are sub-telomeric in mouse and vice versa.To place the outlier genes onto a gerbil karyotype, we used the genetic map of M. unguiculatus produced by Brekke et al (2019).This map includes only 1,720 out of the 8,809 (20%) sets of orthologous genes used in our analyses (and 70 out of 387 of those that were outliers in dSWS in M. unguiculatus, 18%), the remaining being located in unmapped genomic scaffolds.Thus, we could not assess whether the sub-telomeric regions of M. unguiculatus are represented in the genetic map and we were therefore unable to test whether these regions are enriched for dSWS outliers.Nevertheless, 52 of the 70 genes present in the genetic map that were dSWS outliers in M. unguiculatus were not mapped to the first or last 2.4 centimorgan of any of the linkage groups (i.e. they are very likely not sub-telomeric).This set of genes includes all 10 outlier genes located in the sub-telomeric regions of the M. musculus assembly that could be placed on the M. unguiculatus genetic map.Thus, we conclude that there are dSWS outliers located outside subtelomeric regions, including some genes that are sub-telomeric in M. musculus.

Many of the GC-skewed genes have the highest GC3 in their gene families
Above, we have shown that gerbil species carry many genes with outlying GC-skewed substitution rates relative to their orthologs in two other rodents.Despite our observation that these outlying genes tend to have higher GC-content than their orthologs in the murine species, it is not clear whether GC-skewed evolution of these genes has increased their GC-content compared to a wider phylogenetic level.To measure the extent of GC accumulation in gerbils, we compared the gerbil sequences in each orthologous group to the wider gene family to which the group belongs.For this, we assigned each orthologous group to a Hierarchical Orthology Group (HOG), defined as all orthologous and paralogous genes with a single common ancestor, as identified in the OMA database (Altenhoff et al. 2013;Altenhoff et al. 2018).We assigned 6,735 of the orthologous groups to HOGs.For each, we determined the rank of the GC3 value for the two gerbil and the two murine species relative to the sequences in the deuterostome species in each HOG (we excluded M. musculus and R. norvegicus from the HOG database to avoid doublecounting).We found that gerbil species had an enrichment of genes with a high GC3 rank in their HOGs relative to the murine species.For instance, 183 and 208 of the genes in P. obesus and M. unguiculatus, respectively, were in the top three highest GC3 of their HOGs (2.7% and 3.1% of 6,735 genes), compared to only 58 and 65 in M. musculus and R. norvegicus, respectively (0.9% and 1.0% of 6,735 genes; χ 2 test, χ 2 = 145.13,d.f.= 3, p < 1 ⨉ 10 -30 ).For all species, genes classed as dSWS outliers had a higher rank position (high GC3 to low GC3) than those not classed as such (one-sided Wilcoxon rank sum test for each species, p < 0.05, supplementary fig.14).
This pattern was most pronounced in the gerbil species.For instance, 90 genes of the 183 genes in the top 3 highest GC3 of their HOGs (49%) in P. obesus were classed as dSWS outliers, contrasting to only 6 of 58 (10%) in M. musculus (Fisher's exact test comparing the four species, p < 10 -19 ; supplementary table 8).In summary, we found that GC-skewed evolution in the dSWS outliers of the gerbil species has allowed many of these genes to evolve a higher GC3 than most of their homologues across the animal kingdom.

Discussion
Previous work identified a genomic region containing genes with very high GC-content in the genome of a gerbil (Hargreaves et al. 2017).The extreme nature of this region and its relatively recent origin in the gerbil lineage raises important questions about the nature of GC skew in genomes, its mechanism of origin, its phylogenetic distribution and its impact on the evolution of genes.Here we explore these issues and make some surprising findings concerning the evolution of GC-content in animal genomes.
Our study identifies two patterns of GC skew in gerbil genomes.The first type of skew is seen in the previously described high-GC region (Hargreaves et al. 2017).We find this region has been affected by extremely high substitution rates of all types (fig.1).For P. obesus and M. unguiculatus, all but one gene in this region had substitution rates within the top percentile of the genome-wide distribution, as measured from the 8,809 groups of orthologous genes in the rest of the genome (supplementary table 9 and supplementary fig.15).In the gerbil species, all genes in the region had a higher weak-to-strong than strong-to-weak substitution rate, consistent with an overall dramatic increase in GC-content.These results are not restricted to 'silent' sites but are also evident in mutations that change amino acids: dN values reveal evidence for considerable evolution at the amino-acid level driven by GC skew (fig.2).A second type of GC skew is observed across the rest of the genome, where we see at least 17 large clusters of genes with outlying values of weak-to-strong dS (fig.3 and fig.4).These values are considerably higher than generally seen in mouse and rat (fig.3B), yet they are not as extreme as seen in the previously described region.These genome outliers show only a modest increase in the other three substitution rate categories for dS, and a modest increase in the weak-to-strong dN.The consequence in these genomic regions is a small increase in GC-content at particular regions.musculus reference genome assembly.The horizontal grey line represents the dS threshold chosen to define outlier genes (dS larger than 2.5 times the average for the respective species and mutational category).The dark blue horizontal bars represent the location of the previously known high-GC region.mobilising recombination machinery (specifically the protein SPO11) to PRDM9-binding sites across each chromosome.These sites undergo rapid turnover (Myers et al. 2010;Brick et al. 2012;Smagulova et al. 2016;Latrille et al. 2017), creating transient hotspots of recombination.
The second pathway, which is preferentially used by females in mice, involves the mobilisation of the recombination machinery to specific histone modification marks near promoters in gene-rich regions of the genome (Brick et al. 2012;Brick et al. 2018).In theory, a change in either of the pathways could affect the process of gBGC.An example is seen in canids, where loss of the Prdm9 gene has caused recombination hotspots to be focused in certain small regions of the genome, which are consequently affected by strong gBGC (Axelsson et al. 2012).Additionally, changes in the action of gBGC can be caused by changes to the larger-scale mechanisms that control the distribution of the recombination hotspots (Coop and Przeworski 2007;Stapley et al. 2017).For example, studies of the human and chimpanzee genomes have shown that average recombination rates -and the strength of gBGC -are conserved between these two species over large genomic windows, with recombination rates being particularly high in sub-telomeric regions (Dreszer et al. 2007;Auton et al. 2012;Capra et al. 2013;Munch et al. 2014).The implication is that species with similar recombination landscapes are affected by gBGC in similar regions, whereas changes to recombination landscapes are expected to cause divergence in the effect of gBGC.This hypothesis is supported by studies of GC-content at broad evolutionary levels, which show a reverse correlation between GC-content and chromosome size in several metazoan clades (Romiguier et al. 2010;Galtier et al. 2018) Given that smaller chromosomes tend to have higher rates of recombination, these studies support the idea that karyotype is a key determinant of the location of recombination hotspots and of the strength of gBGC.
Our study presents several lines of evidence supporting the hypothesis that changes to the recombination rate in the gerbil lineage caused the GC skew we observed.First, the outliers in weak-to-strong dS were affected by a modest increase of dS for the three other mutational categories, which could reflect the mutagenic effect of recombination itself, a phenomenon also seen in birds (Bolívar et al. 2016;Rousselle et al. 2019) and mammals (Pratto et al. 2014;Arbeithuber et al. 2015;Smith et al. 2018;Rousselle et al. 2019).Second, the weak-to-strong outliers were also affected by a higher rate of nonsynonymous substitutions, consistent with a relatively high load of deleterious mutations as seen in genes affected by gBGC in other species (Berglund et al. 2009;Galtier et al. 2009;Necşulea et al. 2011).Third, the gerbil weak-to-strong outliers were clustered in the genome, as expected if they were caused by clusters of recombination hotspots (Dreszer et al. 2007;Auton et al. 2012;Capra et al. 2013;Munch et 16 and 17).This result suggests that the recombination landscapes of gerbil and murine species are divergent, and is consistent with the low levels of correlation in the recombination rate found even within murine species (Jensen-Seaman et al. 2004).The fact that there were very few weakto-strong dS outliers in the genomes of M. musculus and R. norvegicus suggests that the gerbil species have recombination rates higher than the murine species.It is important to note that our data do not allow us to infer whether the regions with putatively high recombination rates in gerbils have much higher rates than similar regions in mammals other than mouse and rat.However, the fact that the GC-content of the outlier genes had a high rank compared to their wider gene families in deuterostomes suggests that this is a likely hypothesis.
Interestingly, the recombination rate of sub-telomeric regions of M. musculus and R. norvegicus is known not to be as high relative to the rest of the genome as seen in primates (Jensen-Seaman et al. 2004).This may explain how approximately half of the weak-to-strong outliers in gerbils can map to sub-telomeric regions in the M. musculus genome assembly yet not be outliers in this species.Nevertheless, high rates of karyotypic evolution have been documented in muroid rodents (Ferguson-Smith and Trifonov 2007), with gerbil species differing in chromosome number relative to mice and carrying interstitial telomere sites, evidence of several large structural changes (de la Fuente et al. 2014).This level of karyotypic change in murids implies that subtelomeric regions of the mouse genome may not be sub-telomeric in gerbils, and vice versa.
Future work would therefore require the production of more highly contiguous gerbil genome assemblies such that regions of recombination and GC skew can be mapped at a chromosomal level.In addition to karyotypic changes, changes to chromatin states during meiosis can also modulate local rates of double-strand breaks (Saccone et al. 2002).For instance, genes and transposable elements that are active during meiosis are marked with the histone modification that initiates double-strand breaks in the PRDM9-independent pathway, in a process controlled by DNA methylation (Zamudio et al. 2015).The clusters of GC-skewed outlier genes could thus represent regions of the chromosome with high accessibility to recombination machinery.It has been suggested that the previously known high-GC region is exceptionally repetitive (Hargreaves et al. 2017).A tantalising hypothesis to explain the extreme substitution rates that we observed in this region is that its chromatin state is particularly accessible during meiosis.
In summary, we have shown that the genomes of gerbil species carry protein-coding genes with outlying levels of GC-skewed substitution.We propose that changes to the recombination landscape in this lineage seems a likely explanation for the existence of multiple regions of GC skew.Studying the mechanisms that control the large scale evolution of recombination rates would give us a better understanding of the evolution of GC-content, including the evolution and maintenance of isochore structures in vertebrates (Costantini et al. 2009).We expect that the GCskewed evolution that we characterised in the gerbil lineage can be used as a model for the study of these processes.

Transcriptomic sequencing and assembly
Many of the genes in the high-GC region of the fat sand rat P. obesus are not present in the genome assembly because of biases in the sequencing and assembly of high-GC sequences.
Previously, Hargreaves et al (2017) determined the sequence of several of these genes by producing transcriptome assemblies of several tissues.We repeated this analysis for three additional species, Meriones unguiculatus, Meriones shawi and Meriones lybicus.For each species, we sequenced the transcriptomes of kidney and liver (all species) and duodenum (M.unguiculatus and M. libycus only).We also sequenced pancreas RNA of M. libycus and generated additional transcriptomic data for the duodenum, kidney and testis of P. obesus.Animal handling was in accordance with European Union and UK Home Office animal care regulations, and approved by local animal welfare and ethical review boards.Tissues were snap-frozen on dry ice except for pancreas, which was homogenised immediately in TRIreagent; total RNA was extracted and purified by using TRIreagent followed by DNAse I treatment and re-precipitation.
mRNA was prepared for sequencing using the TruSeq stranded mRNA sample preparation kit (Illumina) with polyA selection.All libraries were then pooled and sequenced using 75bp pairedend reads across 2 lanes of the Illumina HiSeq4000 platform.The quality of all sequencing data was assessed using FastQC (www.bioinformatics.babraham.ac.uk/projects/fastqc/).Adapter contamination was removed from raw sequencing reads using Trimmomatic (Bolger et al. 2014) and subsequently quality trimmed using Sickle (Joshi and Fass 2011).Reads were then pooled per species and assembled with Trinity (Grabherr et al. 2011) using default parameters.The newly generated data for P. obesus were combined with previously generated transcriptomic data for pancreatic islets and liver (Hargreaves et al. 2017) prior to assembly.Putative transcript coding sequences were first identified using Blast+ (Camacho et al. 2009) using query amino acid sequences corresponding to orthologous genes located within the high-GC region from three In the whole-genome dataset, many alignments include partial gene sequences (supplementary fig.18).To remove the worst affected alignments, we first eliminated those for which the longest gene was smaller than 400 bp.For each alignment, we then masked any species sequence with gaps representing more than 40% of the non-gap size of the longest sequence in the alignment.
After these filters, we allowed a maximum of two species to have a masked or missing gene (except for H. sapiens, M. musculus, R. norvegicus, P. obesus and M. unguiculatus).After filtering, the whole-genome dataset included 8,815 aligned orthogroups.We removed the 6 orthogroups with genes located in the known high-GC region, resulting in 8,809 orthogroups (the remaining genes in the known high-GC region were either not represented in the genome assemblies of P. obesus or M. unguiculatus, or otherwise not represented in the orthogroups).

Substitution rates
We estimated rates of synonymous and nonsynonymous substitution (dS and dN, respectively) for different mutational categories for each nucleotide alignment using the BppSuite (Guéguen et al. 2013).For each alignment, we trimmed the tree in figure 1A to include only the species in the alignment.We used the BppML subprogramme version 2.3.1 of BppSuite to optimise branch lengths for each alignment, using the YN98 (F3X4) model (Yang and Nielsen 1998) and the parameters available online (script 1, Pracana and Hargreaves 2019).We then estimated dS and dN for different mutational categories (weak-to-strong, strong-to-weak, weak-to-weak and strongto-strong) using BppML subprogramme MapNH version 1.1.1(Romiguier et al. 2012) with the parameter "map.type= Combination(reg1=dNdS, reg2=SW)" (script 2, Pracana and Hargreaves 2019).For the high-GC region dataset, the programme could not be run successfully for 5 of the alignments, as the gerbil sequences were too short.For each of the two datasets, we retrieved the dS and dN value for M. musculus, R. norvegicus and the gerbil species by summing, for each species, the branch lengths from the Muridae node to the tree tip.We used an R script to measure GC, GC12 and GC3 for the original non-trimmed sequence of each species for each alignment.

dSWS outlier clusters
We performed 1 million permutations for each of the focal species to test whether dSWS outliers are closer to each other in the M. musculus reference genome assembly GRCm38.p6(Ensembl release 95) than would be expected if they were randomly distributed among the genes in our analysis.In each permutation, we assigned the number of dSWS outliers in the species to random genes.We then measured the inferred distance between these genes and the number that fell in runs of more than two neighbouring genes.Additionally, we identified genomic regions for each species where genes have high average dSWS.For this, we divided the genome into overlapping sliding windows (1 Mbp with a 0.25Mbp step and 0.25 Mbp with a 0.1 Mbp step along the mouse reference assembly) and determined whether the average normalised dSWS among the genes of each species was larger than 2.5 times the average dSWS among the 8,809 genes for that species, ignoring windows with less than four mapped groups of orthologous genes.We defined subtelomeric regions as those located within 5 Mbp of the start and end of each M. musculus chromosome.To position genes onto the M. unguiculatus linkage groups, we used the genetic map produced by Brekke et al (2019), which gives the average centimorgan position of each mapped scaffold in the M. unguiculatus genome assembly.We considered any scaffold mapped within 2.85 centimorgans of linkage group starts or ends as sub-telomeric, a threshold approximately equivalent to 5 Mbp considering the mouse average recombination rate of 0.57 centimorgan per Mbp (Cox et al. 2009).

Hierarchical Orthology Groups
We downloaded the Hierarchical Orthology Groups (HOG) database (version Jun 2018) from OMA orthology database (Altenhoff et al. 2013;Altenhoff et al. 2018).For each of the orthogroups (above), we used the mouse transcript ID to assign the orthogroup to a HOG.For each of the resulting 6,736 HOGs, we retrieved the sequences of all genes for all represented deuterostome species, excluding any sequences from M. musculus and R. norvegicus, and used an R script to measure their GC3.For each orthogroup, we then independently compared the rank of the GC3 for each of the four focal species (M.musculus, R. norvegicus, P. obesus and M. unguiculatus) relative to the HOG GC3 measurements.

FIG. 1 .
FIG. 1. GC-skewed synonymous evolution in protein-coding genes located in the known high-GC region.(A) topology of the phylogenetic relationship between the species analysed in this study; (B) GC-content in the third codon position (GC3) per species per gene; (C) the rate of synonymous substitution (dS) per mutational category: weak-to-strong (WS), strong-to-weak (SW), weak-to-weak (WW) and strong-to-strong (SS); (D) comparison between weak-to-strong dS (dSWS) and strong-to-weak dS (dSSW).

FIG. 3 .
FIG. 3.Outliers in the rate of synonymous substitution (dS) measured from the gerbil-murine split for 8,809 groups of orthologous genes outside the known high-GC region.(A) Topography of the tree of the species included in the alignments of each group of orthologous genes; our focal species include two gerbil species (in pink) and two murine species (in green); dS was measured in the coloured branches, i.e., from the point of gerbil-murine divergence to the tip of each of the four focal species (B) Comparison between the rate of weak-to-strong synonymous substitution (dSWS) and the strong-to-weak synonymous substitution (dSSW) in the two gerbil and the two murine species, normalised by dividing each value by the average rate of the respective species and mutation category; coloured points represent genes above the dS threshold chosen to define outliers (dS larger than 2.5 times the average for the respective species and mutational category); only values under 6 are shown.(C) Euler diagram of the dSWS outliers, where the area of each segment is approximately proportional to the number of overlapping outlier genes per species (only gene numbers greater than 1 included), shown also as a percentage of the 524 genes that were outliers in at least one species.(D) Pairwise comparisons of the GC-content at the third codon position (GC3) between the gerbil P. obesus and the mouse M. musculus for 8,797 groups of orthologous genes (out of 8,809) for which both species have a GC3 measurement.

FIG. 4 .
FIG. 4. Clustering of outliers in the weak-to-strong dS.(A) Normalised rate of synonymous substitution (dS) for two mutational categories, strong-to-weak (SW) and weak-to-strong (WS), for 8,809 genes; (B) average normalised weak-to-strong dS (dSWS) in sliding windows of 1 Mbp with a step of 0.25 Mbp, showing only windows with more than 3 genes, and with outlying regions marked with a vertical grey bar.The genes are mapped by row to the chromosomes of the M.
Downloaded from https://academic.oup.com/mbe/advance-article-abstract/doi/10.1093/molbev/msaa072/5805392 by guest on 17 March 2020We performed permutations to test whether outlier genes in figure3Bare placed closer to each other than what would be expected if they were distributed randomly.The observed distance between dSWS outliers (5.39 Mbp in P. obesus and 5.14 Mbp in M. unguiculatus) was smaller than M. unguiculatus; P ≈ 0 for both species; supplementary table 4; supplementary fig.