Gv1, a Zinc Finger Gene Controlling Endogenous MLV Expression

Abstract
The genomes of inbred mice harbor around 50 endogenous murine leukemia virus (MLV) loci, although the specific complement varies greatly between strains. The Gv1 locus is known to control the transcription of endogenous MLVs and to be the dominant determinant of cell-surface presentation of MLV envelope, the GIX antigen. Here, we identify a single Krüppel-associated box zinc finger protein (ZFP) gene, Zfp998, as Gv1 and show it to be necessary and sufficient to determine the GIX+ phenotype. By long-read sequencing of bacterial artificial chromosome clones from 129 mice, the prototypic GIX+ strain, we reveal the source of sufficiency and deficiency as splice-acceptor variations and highlight the varying origins of the chromosomal region encompassing Gv1. Zfp998 becomes the second identified ZFP gene responsible for epigenetic suppression of endogenous MLVs in mice and further highlights the prominent role of this gene family in control of endogenous retroviruses.

Both 129-GIX− and B6-GIX+ strains have previously been generated through serial backcrossing but have unfortunately been lost. Extending recent work on Sgp3 (Treger et al. 2019), we have now sought to formally identify the gene corresponding to Gv1. We determine Zfp998 (2410141K09Rik) as necessary and sufficient to control the GIX phenotype and identify the source of deficiency at the locus in the prototypic GIX+ strain, 129.
linkage at 13:66.58 Mb (fig. 1a), within a large cluster of ZFPs (Kauzlaric et al. 2017). ZFP clusters form through local gene duplication events (Huntley et al. 2006) and extensive self-homology complicates their analysis (fig. S3); indeed, although many variations have been reported within this area (Henrichsen et al. 2009; Quinlan et al. 2010; Keane et al. 2011; Pezer et al. 2015), no two studies identified shared differences or grouped inbred strains consistently. Similarly, Mouse Genomes Project variant calls (Yalcin et al. 2011) revealed no consistent differences between GIX+ and GIX− mouse strains or between Sgp3+ and Sgp3− strains within the candidate region.
Thus, the origin of the chromosomal region encompassing Gv1 differs between the prototypic GIX− strain, B6/J, and the majority of other laboratory strains. Additionally, several C57-lineage strains displayed large, unreported deletions within the candidate region.

Zfp998 is Necessary and Sufficient to Determine GIX Status
As the deletion within B6/N (fig. 1b) fell within the area defined by the fine mapping of Gv1 (fig. 1a), we sought to determine the GIX status of this strain. Reflecting GIX's historical usage as a T-cell marker [...] This placed Gv1 within the ~810 kb deletion in B6/N, within which only two genes encoding KRAB-containing ZFPs are annotated: Zfp997 and Zfp998, with 19 and 14 zinc fingers, respectively (fig. 1d). These genes also formed the candidates for Sgp3 (Treger et al. 2019) and, although explicit identification was precluded by lack of separation in knockout animals, Zfp998 alone displayed specificity for the MLV LTR in electrophoretic mobility shift assays and was thus suggested as the most likely candidate. As Zfp998 was also the closest annotated protein-coding ZFP gene to the predicted point of peak linkage for Gv1 (fig. 1a), falling only 142 kb proximally, we chose to confirm its ability to control the expression of reporter-bearing MLV proviruses in a cell-based system, which was not previously assessed. Transient expression of Zfp998 significantly reduced numbers of mCherry+ cells in cultures bearing integrated viral genomes driving the reporter from either a Moloney-MLV or a consensus pMLV LTR (20.5% and 18.3% versus untreated controls, respectively) (fig. 2d). This was equivalent to the difference in MLV SU staining observed for the CD8+ and CD4+CD8+ thymocyte populations (15.3% and 29.0%, respectively) (fig. 2a) and similarly representative of the 26-59% reductions seen upon Zfp708 treatment of RMER19B LTRs (Seah et al. 2019).
To next determine if individual disruption of Zfp998 was sufficient to recreate the GIX+ phenotype in B6/J, we created knockout mice using CRISPR/Cas9 targeting. Two independent disruptions of the gene were confirmed by copy-number, split-read, and split-mate analyses of 10× PE150 whole genome sequencing (WGS) data, one truncating exon 4 (containing the zinc fingers) and one removing exons 1-3 (fig. S7). Importantly, no disruption of Zfp997 was noted in either case. Subsequent investigations of GIX phenotype were conducted using levels of MLV SU cell-surface staining and qRT-PCR. Within the control groups, as before, significantly higher levels of MLV SU were observed for B6/N than for B6/J mice (fig. 2e), which was reflected in elevated xMLV and pMLV expression (fig. 2f). Consistent with Gv1's semidominant mode of action (Stockert et al. 1971), nonsignificant increases were also visible in B6/J × B6/N F1s. In comparison to this latter group, however, B6/J-Zfp998− × B6/N F1s fully recapitulated the GIX+ phenotype of B6/N, both by MLV SU staining and qRT-PCR. Identical results were observed for both forms of the knockout, which are presented together.

Gv1 gene. doi:10.1093/molbev/msab039 MBE

Resolving the Source of Zfp998 Insufficiency
Together, these data confirmed that loss of Zfp998 was singularly sufficient to confer a GIX+ phenotype within the B6 background and that its expression reduced expression of MLV proviruses in a biological system. Nevertheless, its chromosomal context differs substantially between C57- and non-C57-lineage laboratory mice and, amongst these, no groupings were visible that could define GIX status (fig. 1c, fig. S4). Similarly, manual assessments of mouse genealogies (Beck et al. 2000) revealed no links in the origins of GIX+ or GIX− strains. We therefore determined to clarify the status of the Gv1 locus within the prototypic GIX+ strain, 129, which displays amongst the highest levels of cell-surface MLV SU (Stockert et al. 1971) and for which the AB2.2 ES cell line-derived bMQ bacterial artificial chromosome (BAC) library provides an excellent resource for comparative genetics (Adams et al. 2005). Sixteen BAC clones predicted to span the region were purchased, amplified in culture, and isolated using techniques to maintain their full lengths for subsequent Oxford Nanopore MinION long-read sequencing. This allowed for high coverages (median 2494×) to be achieved using only ≥1 kb reads and the full and contiguous assembly of all 16 BACs (fig. S8). Upon removal of pBACe3.6 vector sequences, 15 BAC sequences could be joined into eight scaffolds that aligned to the region (fig. 3a). Although a single scaffold could not be resolved, precluding orientation in relation to the B6/J-based GRCm38 reference and determination of the quantity of missing sequence between the areas assembled, large areas of sequence duplication (multiple X-Y paths occurring over the same Y axis ranges) were visible, as well as areas of unique sequence (X axis ranges with no corresponding X-Y path).
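The read-length cut-off and median coverage figure described above can be illustrated with a short calculation. This is a minimal sketch and not the authors' pipeline: it assumes alignments are available as 0-based, half-open `(start, end)` intervals on a single BAC, and it uses aligned length as a stand-in for read length. The function names and toy intervals are hypothetical.

```python
from statistics import median

MIN_READ_LENGTH = 1_000  # reads of at least 1 kb, as stated in the text


def per_base_depth(intervals, reference_length):
    """Return read depth at every reference position (difference-array method)."""
    delta = [0] * (reference_length + 1)
    for start, end in intervals:  # 0-based, half-open intervals
        delta[start] += 1
        delta[end] -= 1
    depth, running = [], 0
    for position in range(reference_length):
        running += delta[position]
        depth.append(running)
    return depth


def median_coverage(intervals, reference_length, min_length=MIN_READ_LENGTH):
    """Median depth after discarding alignments shorter than min_length."""
    # Assumption: aligned length approximates read length.
    kept = [(s, e) for s, e in intervals if e - s >= min_length]
    return median(per_base_depth(kept, reference_length))


if __name__ == "__main__":
    # Toy data only; real BAC inserts are far longer and far more deeply covered.
    toy = [(0, 3000), (0, 1500), (1000, 3000), (200, 700)]
    print(median_coverage(toy, reference_length=3000))
```

The difference-array approach keeps the calculation linear in the number of reads plus the reference length, which matters at the very high depths reported here.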
[...] fig. S4. All non-C57-lineage strains group with ZALENDE/EiJ, whereas the C57-lineage group separately, and C57L/J with MOLF/EiJ. Coloring is according to (c), with wild-derived inbred strains in purple. (d) View of the genes within the area deleted within B6/N mice. Pseudogenes are colored gray, lincRNAs in purple, and protein coding genes in maroon. The protein coding ZFPs are annotated.

Surprisingly, Zfp998 was not only present but apparently existed as two independent copies within the scaffolds (fig. 3a), both displaying 100% identity to the coding region of the Zfp998 reference. However, both exhibited defects at the splice acceptor for the terminal, zinc-finger-containing exon, lacking the consensus mammalian splice acceptor sequence [...] Zfp998 reference sequence are classified as "high" impact splice acceptor variants by the Ensembl Variant Effect Predictor (McLaren et al. 2016) and would be predicted to occlude canonical splicing and production of functional protein. Full characterization of expression and splicing patterns for the identified loci would require complete, contiguous assembly of the alternate chromosomal region derived from M. m. domesticus, as the level of variation observed (fig. 1b, fig. S4) otherwise precludes the non-spurious mapping of RNA-seq data from non-C57-lineage strains to the B6/J-based mouse reference across this region.
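The splice acceptor defect described above can be checked directly on the three aligned sequences shown in FIG. 3b. The following is a minimal sketch under stated assumptions: the sequences are transcribed from the figure, the labels for the two 129-derived copies are hypothetical, and the intron is assumed to end after alignment column 9, so that columns 8 and 9 carry the acceptor dinucleotide (canonically AG).

```python
# Aligned sequences transcribed from FIG. 3b (reverse complement of
# GRCm38 13:66432223-66432240); '-' is an alignment gap in the reference row.
ALIGNMENT = {
    "Zfp998_reference": "TCATTCCAG-GCATGTCAG",
    "scaffold_copy_1": "TCATTCCGGTGCATGTCAG",  # hypothetical label
    "scaffold_copy_2": "TCATTCCGGTGCATGTCAG",  # hypothetical label
}

# Assumption: the intron ends after alignment column 9 (1-based), so the
# acceptor dinucleotide occupies columns 8-9, i.e. Python slice [7:9].
ACCEPTOR_COLUMNS = slice(7, 9)


def acceptor_dinucleotide(aligned):
    """Return the two bases at the assumed acceptor position."""
    return aligned[ACCEPTOR_COLUMNS]


def has_canonical_acceptor(aligned):
    """True if the assumed acceptor position carries the canonical AG."""
    return acceptor_dinucleotide(aligned) == "AG"


def differences(reference, other):
    """List (1-based column, reference base, other base) for each mismatch."""
    return [
        (column, ref_base, alt_base)
        for column, (ref_base, alt_base) in enumerate(zip(reference, other), start=1)
        if ref_base != alt_base
    ]


if __name__ == "__main__":
    reference = ALIGNMENT["Zfp998_reference"]
    for name, sequence in ALIGNMENT.items():
        print(name, acceptor_dinucleotide(sequence),
              has_canonical_acceptor(sequence), differences(reference, sequence))
```

Under these assumptions, both 129-derived copies differ from the reference at two alignment columns: a substitution within the acceptor dinucleotide and a one-base insertion immediately after it.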
Together, these data account for the prototypic GIX positivity and negativity of 129 and B6/J, respectively. Likely through tandem duplication, the majority of laboratory mice contain two copies of the area in which Zfp998 exists within B6/J, although neither harbors functional copies of the

Discussion
The means by which hosts control the expression and, ultimately, the mobility and pathogenicity of endogenous repetitive elements is of great interest. Work detailing the complement and expression patterns of endogenous MLVs stems from the original creation of inbred lines by and for mouse fanciers, where the first detailed record-keeping allowed investigation of so-called "heritable cancers" (Russell 1985). Following characterization of the GIX phenotype in the 1960s, the Gv1 locus was identified in 1971 (Stockert et al. 1971), although early mapping attempts were contradictory (Stockert et al. 1976) and the locus was only later localized to chromosome 13 (Oliver and Stoye 1999). An analogous locus, Sgp3, was recently identified as either Zfp998 (dubbed Snerv1) or Zfp997 (Snerv2) (Treger et al. 2019) and we now positively identify Zfp998 as Gv1, making it highly probable that Gv1 and Sgp3 are differing phenotypic readouts of the same underlying gene, related to the specific mouse strains in which they have been studied. As such, it may be concluded that previously reported distinctions in the subclasses of MLVs regulated by the two loci likely reflect the differing complements of proviruses between strains; indeed, only ~1/3 of MLV loci are commonly shared between any two backgrounds (Frankel et al. 1990), making this highly plausible. We also note that our data support a wider action for Gv1 than previously reported (Oliver and Stoye 1999).

FIG. 3. Nanopore sequencing allows assembly of the Gv1 gene region from 129 mice. (a) Alignment of the eight contigs resulting from assembly of the BACs against Chr 13. Diagonal lines represent only those areas ≥1 kb with ≥98% identity to the GRCm38 reference and are colored according to orientation (purple, forward; green, reverse). Sequence names detail the scaffolding order and final sizes are shown above the alignment sections. Red highlighting shows the single position of Zfp998 in C57BL/6J (horizontal) intersecting with two independent copies within the BAC assemblies (vertical). (b) Comparison of the splice acceptor sequence (black box, with arrow marking point of splicing) for exon 4 of Zfp998 in comparison to the equivalent gene regions from the two assembled scaffolds. Area shown is the reverse complement of GRCm38 13:66432223-66432240.
    Zfp998 >  TCATTCCAG-GCATGTCAG
              TCATTCCGGTGCATGTCAG
              TCATTCCGGTGCATGTCAG
    intron / exon 4

Our work predicts that Zfp998 insufficiency is shared amongst non-C57-lineage mice and explains why the majority exhibit GIX positivity, albeit to varying degrees (Stockert et al. 1971). Amongst the few non-C57-lineage strains historically genotyped as GIX−, it is further possible that negativity results from a lack of expression-competent proviruses and/or suitable transcriptional milieus, rather than from Zfp998 sufficiency. An exception must be seen in BALB/c mice, however, which also display linkage to the Gv1 locus in crosses to 129 (Oliver and Stoye 1999), suggesting that further work with the locus may yield additional insights. Additional research may also be required to clarify the mechanism of Zfp998-based control. In contrast to the previous identification of PBS Gln1 as the binding site for Zfp998 (Treger et al. 2019), our data suggest that the promoter activities of LTRs bearing either PBS Gln1 or PBS Pro are repressed to equivalent, physiological extents.
This research highlights the difficulty of studying genes influencing multiple, insertionally-polymorphic loci and underscores the necessity of working on defined backgrounds. High levels of homology within ZFP clusters hinder application of established wet- and dry-lab techniques and there is strong potential that further uncharacterized differences have significant bearings on inter-strain diversity. We note that even within the exceptionally well-characterized B6 substrains, we have revealed variations completely unexpected in magnitude that have eluded previous bioinformatic studies of the same WGS datasets (Simon et al. 2013). Overcoming these problems, we confirm another ZFP-based control of MLV expression alongside Zfp809 (Wolf and Goff 2009), further highlighting the importance of epigenetic controls in the establishment of ERV repression.

Materials and Methods
Full methods are available as Supplementary Information.

Data Availability
All raw sequencing data generated in this study have been submitted to the European Nucleotide Archive under accessions PRJEB40145 and PRJEB40276.