Giant viruses of the Megavirinae subfamily possess biosynthetic pathways to produce rare bacterial-like sugars in a clade-specific manner

Abstract The recent discovery that giant viruses encode proteins related to sugar synthesis and processing paved the way for the study of their glycosylation machinery. We focused on the proposed Megavirinae subfamily, for which glycan-related genes were proposed to code for proteins involved in glycosylation of the layer of fibrils surrounding their icosahedral capsids. We compared sugar compositions and corresponding biosynthetic pathways among clade members using a combination of chemical and bioinformatics approaches. We first demonstrated that Megavirinae glycosylation differs in many aspects from what was previously reported for viruses, as they have complex glycosylation gene clusters made of six to 33 genes to synthesize their fibril glycans (biosynthetic pathways for nucleotide-sugars and glycosyltransferases). Second, they synthesize rare amino-sugars, usually restricted to bacteria and absent from their eukaryotic host. Finally, we showed that Megavirinae glycosylation is clade-specific and that Moumouvirus australiensis, a B-clade outsider, shares key features with Cotonvirus japonicus (clade E) and Tupanviruses (clade D). The existence of a glycosylation toolbox in this family could represent an advantageous strategy to survive in an environment where members of the same family are competing for the same amoeba host. This study expands the field of viral glycobiology and raises questions on how Megavirinae evolved such versatile glycosylation machinery.


Introduction
The general perception of viruses as small and simple entities has been challenged by the discovery of giant viruses (La Scola et al. 2003, Abergel et al. 2015). Giant viruses are endowed with dsDNA genomes of up to 2.5 Mb that can encode up to 1500 proteins, while more conventional viruses have much smaller genomes and sometimes just a handful of genes (Lu et al. 2020). Giant virus capsids are so large (up to 2 μm) that they can easily be seen by light microscopy. These viruses are larger than the smallest bacteria (Mycoplasma genitalium, <0.3 μm) and archaea (Nanoarchaeum equitans, 0.4 μm) and contain more genes than the smallest parasitic eukaryote, Encephalitozoon (Philippe et al. 2013). They all infect unicellular eukaryotes and are major players in the environment (Suttle 2005, Brussaard et al. 2008), where they regulate protist populations. Given their genomic complexity, they contain genes never encountered before in viruses (La Scola et al. 2003, Renesto et al. 2006), such as those related to protein translation (Abergel et al. 2007, Raoult 2004, Jeudy et al. 2012) and glycan synthesis (Parakkottil Chothi et al. 2010, Piacente et al. 2012, 2014a,b, 2017a). Currently, several families of giant viruses have been discovered, such as the Mimiviridae and the proposed Pandoraviridae, Molliviridae and Pithoviridae (Abergel et al. 2015). The present study is focused on the Mimiviridae family and in particular on the proposed Megavirinae subfamily (Gallot-Lavallée et al. 2017), which encompasses five clades, all infecting Acanthamoeba sp. (Fig. 1): Mimiviruses (A-clade), Moumouviruses (B-clade), Megaviruses (C-clade), Tupanviruses (D-clade) (Abrahão et al. 2018) and the recently isolated Cotonvirus japonicus (E-clade) (Takahashi et al. 2021). As of today, all members of the proposed Megavirinae are characterized by a fibril layer that differs in thickness and length among the clades, as shown by negative staining transmission electron microscopy (NS-TEM) (Fig. 1).
Interestingly, NS-TEM images of Moumouvirus australiensis (Fig. 1c) and Moumouvirus maliensis (Fig. 1b) also presented marked differences in their fibril layer thickness while supposedly belonging to the same clade.
Fig. 2. A hallmark of the members of the family is the presence of a fibril layer surrounding the icosahedral capsids, with evidence for some of the associated sugars. The glycan structure of Mimivirus (A-clade) has been elucidated: poly_1 has a linear repeating unit made of 3)-α-l-Rha-(1→3)-β-d-GlcNAc-(1→, with GlcNAc modified by pyruvic acid at position 4,6; poly_2 has a branched repeating unit made of 2)-α-l-Rha-(1→3)-β-d-GlcNAc-(1→ in the linear backbone; Rha is further branched with a terminal β-d-2OMeVio4NAc, which is 75% methylated. The biosynthetic pathways for UDP-l-rhamnose (UDP-l-Rha), UDP-d-N-acetyl-glucosamine (UDP-d-GlcNAc) and UDP-d-4N-acetyl-viosamine (UDP-d-Vio4NAc) have been predicted and experimentally validated. For Megavirus chilensis (C-clade), amino-sugars are found on the fibrils, but the glycan structure is still unknown. The biosynthetic pathways for UDP-l-rhamnosamine (UDP-l-Rha2NAc) and UDP-l-quinovosamine (UDP-l-Qui2NAc, green box) were predicted, but experimental validation was only performed for the UDP-l-Rha2NAc pathway. No data are available for the other clades.

Recent studies on Mimivirus (A-clade) revealed that complex glycans with unique structures made of two large polysaccharides were attached to the fibrils (Fig. 2). These carbohydrate polymers are made of up to 20 units of sugars that are not synthesized by the amoeba host. This result challenged the common belief that eukaryotic viruses divert the host machinery to decorate their envelope proteins with small oligosaccharides containing two to 10 sugar units (Bagdonaite and Wandall 2018). For Megavirus chilensis (C-clade) virions, preliminary analysis revealed that its fibrils were also composed of rare amino-sugars not synthesized by the amoeba host (Piacente et al. 2014b).
Similarly, in PBCV-1, a large dsDNA virus of the Phycodnaviridae family, the major capsid protein was shown to be decorated with an oligosaccharide that was different from those found in the three domains of life (De Castro et al. 2013). The unique nature of giant virus glycans compared with those of their hosts is made possible by the presence of genes involved in glycosylation in their complex genomes. In PBCV-1, two biosynthetic pathways for sugars activated as nucleotides (donor-sugars) have been found (Piacente et al. 2015), along with a great number of predicted and experimentally validated glycosyltransferases (Noel et al. 2021, Speciale et al. 2020). Similarly, for the proposed Megavirinae subfamily, biosynthetic pathways for nucleotide-sugars along with glycosyltransferases (Fig. 2) have been discovered for Mimivirus and Megavirus chilensis (Parakkottil Chothi et al. 2010, Piacente et al. 2012, 2014a,b, 2017a). Functional pathways for UDP-l-rhamnose (UDP-l-Rha), UDP-d-N-acetyl-glucosamine (UDP-d-GlcNAc) and UDP-d-4N-acetyl-viosamine (UDP-d-Vio4NAc) in Mimivirus (Parakkottil Chothi et al. 2010, Piacente et al. 2012, 2014a,b, 2017a) and a synthesis pathway for UDP-l-rhamnosamine (UDP-l-Rha2NAc) in Megavirus chilensis (Piacente et al. 2014b) have been experimentally validated. However, none of the predicted glycosyltransferases (GTs) have been experimentally characterized, and there are no data available on glycans and glycogenes for the other clades of the family (Fig. 2).
A global view of glycosylation in the proposed Megavirinae subfamily is a prerequisite to make progress in the nascent field of viral glycosylation and increase our knowledge on carbohydrates and their biosynthesis. Key questions include (i) What is the viral machinery (glycogenes) used to synthesize these complex sugars in the different clades of the proposed Megavirinae subfamily? (ii) What is the nature of the glycans synthesized by other members of this subfamily? (iii) Do these glycans share a common architecture, as seen in Chloroviruses (De Castro et al. 2016), or is their nature and architecture clade specific? To answer these questions, the present study aimed at comparing glycosylation of the giant viruses of the proposed Megavirinae subfamily by exploring the composition of their glycans and searching their genomes to identify possible genes involved in their biosynthesis.
To do this, we combined carbohydrate chemistry with bioinformatic analyses for members of each clade. Mimivirus (La Scola et al. 2003) and Megavirus chilensis (Arslan et al. 2011) were used as prototypes for the A- and C-clades, respectively, while we used both Moumouvirus australiensis and Moumouvirus maliensis for the B-clade. For each, we chemically characterized their glycan composition and performed an in silico comparative analysis to evaluate similarities and differences within each clade and within the entire family. As dense fibrils also decorate Tupanviruses (D-clade) and Cotonvirus japonicus (E-clade), we searched their genomes for putative glycogenes and included these data to provide a comprehensive comparative analysis covering all members of the family.

Phylogeny of the proposed Megavirinae subfamily
The phylogenetic tree of the proposed Megavirinae subfamily is based on concatenation and alignment of the following seven protein markers: Asp-Synthase, Helicase, mRNA Capping Enzyme, MutS, Packaging ATPase, PolyA polymerase and VLTF3. Fifty-five genomes of members of the different clades were used to generate the phylogenetic tree with the following command-line tools: mafft (v7.307) for the alignment, trimal (v1.4.rev22) for the gap filtering and iqtree (v1.6.9) for tree inference, with default settings at each step.
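The three-step workflow described above can be sketched as a small command builder. This is only an illustration under stated assumptions: the input file name, the output prefix and the helper name `phylogeny_commands` are hypothetical, the markers are assumed to be already concatenated into one FASTA file, and only the basic input/output options of each tool are passed because the text specifies default settings and no further flags.

```python
import shlex


def phylogeny_commands(concatenated_fasta, prefix="megavirinae"):
    """Build shell commands for a mafft -> trimal -> iqtree run.

    File names are hypothetical; no option beyond input/output is set,
    since the Methods only state that default settings were used.
    """
    aligned = f"{prefix}.aln.fasta"
    trimmed = f"{prefix}.trimmed.fasta"
    q = shlex.quote
    return [
        # mafft writes the alignment to standard output
        f"mafft {q(concatenated_fasta)} > {q(aligned)}",
        # trimal reads the alignment with -in and writes it with -out
        f"trimal -in {q(aligned)} -out {q(trimmed)}",
        # iqtree takes the alignment file with -s
        f"iqtree -s {q(trimmed)}",
    ]


if __name__ == "__main__":
    for command in phylogeny_commands("markers_concatenated.fasta"):
        print(command)
```

The commands are only printed, not executed, so the sketch can be inspected without the three programs being installed.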

Production and purification of the virions
All viruses described here as prototypes of the different clades have been isolated by our laboratory. The protocol adopted to propagate and purify the viral particles is the same for all members of the different clades. Briefly, viral particles were propagated using A. castellanii (Douglas) Neff (American Type Culture Collection 30010™) cells, which were infected with viruses at a multiplicity of infection of 0.25. After 2 days of incubation at 32 °C, the infection was complete and led to cell lysis and release of viral particles into the culture medium. Viral particles were then purified by removing cell debris by centrifugation at 500 g for 10 min at 20 °C. Subsequently, the supernatant containing viral particles was spun at 6800 g for 45 min at 20 °C. The pellet containing the virions was washed with water, resuspended in CsCl of density 1.2, deposited on a discontinuous CsCl gradient made of successive layers of 1.3/1.4/1.5 densities (g/ml) and finally spun for 20 h at 100 000 g. The white band corresponding to viral particles was recovered with a syringe and washed three times with water. Purified virions were imaged by light microscopy (ZEISS) and the virion concentration was estimated on a spectrophotometer (Eppendorf) from the optical density at 600 nm.

NS-TEM of the virions
Viral particles were visualized by NS-TEM, as reported previously. Briefly, viral particles were fixed in glutaraldehyde (2.5% v/v in water) for 1 h at room temperature. Samples were centrifuged at 5000 g for 10 min and pellets were washed twice with water. The structure of the fibrils was visualized by NS-TEM using methyl cellulose (M6385 Sigma) and uranyl acetate (2% v/v in water), as reported. Viral particles were observed by TEM on a TECNAI G2 200 kV.

Sugar composition
Monosaccharide composition analysis (as acetylated methyl glycosides) and determination of absolute configuration (as octyl-glycoside derivatives) were performed on 1 × 10^10 viral particles, as previously reported (De Castro et al. 2010). Gas chromatography-mass spectrometry (GC-MS) analyses were performed on an Agilent instrument (GC instrument Agilent 6850 coupled to MS Agilent 5973) equipped with an SPB-5 capillary column (Supelco, 30 m × 0.25 mm i.d.; flow rate, 0.8 mL/min) and He as the carrier gas. Electron impact mass spectra were recorded with an ionization energy of 70 eV and an ionizing current of 0.2 mA. The temperature program used for analyses was as follows: 150 °C for 5 min, 150 to 280 °C at 3 °C/min, 300 °C for 5 min. Interpretation of these data is based on the following concept: each sugar that is derivatized as acetylated methyl-glycoside or octyl-glycoside is eluted in a specific range of the chromatogram, and each peak corresponds to a specific fragmentation pattern, which enables identification of the monosaccharide (Lönngren and Svensson 1974). Elution time is compared with those of standard monosaccharides and allows discrimination between sugars of the same class (e.g. glucose and mannose) and between sugars with different absolute configuration (D or L).

Identification of new genes encoding proteins involved in fibril glycosylation for each clade
We used an in silico approach to search for genes coding for potential proteins involved in glycosylation, hereafter referred to as glycogenes. Because giant virus genetics was not yet available, we could not validate these predictions by mutagenesis. Instead, we considered that identification of predicted sugars by chemical analysis of the viral particles validated the presence of a functional pathway encoded by the virus, because the amoeba host does not synthesize these rare bacteria-like sugars. The general approach used to attribute a specific function to a gene consisted in the comparison of the translated sequences with reference annotated protein sequences by a multiple alignment based on structural information, using the Expresso Server (Armougom et al. 2006) (http://tcoffee.crg.cat/apps/tcoffee/do:expresso). Multiple alignments were submitted to the ESPript server (http://espript.ibcp.fr/ESPript/ESPript/) to illustrate sequence similarities and display secondary structure elements (Gouet et al. 2003).
Multiple alignments allowed us to verify whether catalytic residues were conserved in candidate proteins compared with the reference protein for which the function and structure were known. Conservation of the catalytic site was a prerequisite to attribute a specific function. In addition, we used the HHpred server (Hildebrand et al. 2009) for remote homology detection to annotate hypothetical proteins. We only kept those for which the confidence level was above 99%.
To understand whether the identified glycogenes were recently acquired from other microorganisms, such as bacteria on which the amoeba feeds, we compared the GC content of these genes with that of the seven marker genes (Asp Synthase, Helicase, mRNA Capping Enzyme, MutS, Packaging ATPase, PolyA polymerase and VLTF3), taken as representatives of whole-genome GC content. GC content was computed using geecee (Emboss package v6.6.0, https://www.bioinformatics.nl/cgi-bin/emboss/geecee).
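The GC-content comparison described above can be illustrated with a minimal sketch. It is not the geecee implementation: the function names are hypothetical, the sequences in the usage lines are toy strings, and ambiguous bases are simply ignored, which is an assumption of this sketch.

```python
def gc_content(sequence):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return gc / acgt


def gc_offset(candidate_gene, marker_genes):
    """GC content of a candidate gene minus the mean GC content of the markers.

    A large positive or negative offset would flag a gene whose composition
    departs from the marker-gene baseline.
    """
    baseline = sum(gc_content(m) for m in marker_genes) / len(marker_genes)
    return gc_content(candidate_gene) - baseline


if __name__ == "__main__":
    markers = ["ATATATGC", "ATTTAAGC"]   # toy stand-ins for the seven markers
    print(gc_offset("GGCCATGC", markers))
```

No threshold is applied here because the text does not state one; the offset is simply reported per gene.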

Conservation of the proteins involved in fibril glycosylation in the proposed Megavirinae
Tblastn was used to assess the presence and conservation level of all proteins possibly involved in fibril glycosylation in all members of the three clades. We restricted the study to those for which complete genome sequences were available in the NCBI database (Table S1). Results are presented as conservation heatmaps, which were built using in-house software. Construction of the heatmap was based on the following steps: (1) tblastn of N proteins (protein file) against M genomes (genome file), so that each protein sequence (query) is compared with the six-frame translations of nucleotide genome sequences; (2) for each protein, recovery of the best scoring ORF in each genome, with normalization by the size of the protein (score/autoscore), which defines the 'conservation score' in heatmaps; (3) generation of the matrix (N best scores) × (M genomes) and plot of the heatmap (library pheatmap in R, with default parameters).
NCBI accession numbers of complete genome sequences and of protein sequences (query) are reported in Tables S1 and S2, respectively.
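The heatmap construction described above (best score per protein and genome, normalized as score/autoscore) can be sketched as follows. The in-house software itself is not described in the text, so this is only an assumed reading: "autoscore" is taken to be the score of a protein aligned against itself, a missing hit is given a score of 0, and all names and numbers are hypothetical.

```python
def conservation_matrix(best_scores, self_scores, proteins, genomes):
    """Build the N x M matrix of conservation scores.

    best_scores: {(protein, genome): best tblastn score in that genome}
    self_scores: {protein: score of the protein against itself (autoscore)}
    A protein with no hit in a genome gets a conservation score of 0.
    """
    matrix = []
    for protein in proteins:
        row = []
        for genome in genomes:
            score = best_scores.get((protein, genome), 0.0)
            row.append(score / self_scores[protein])
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    # Toy values only; real inputs would come from parsed tblastn output.
    best = {("P1", "G1"): 200.0, ("P1", "G2"): 50.0, ("P2", "G1"): 300.0}
    auto = {"P1": 200.0, "P2": 400.0}
    for row in conservation_matrix(best, auto, ["P1", "P2"], ["G1", "G2"]):
        print(row)
```

The resulting list of rows could then be written to a table and plotted with any heatmap tool, such as the pheatmap library mentioned in the text.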

Sugar composition varies in fibrils from viruses in different clades
Mimivirus (A-clade), Moumouvirus australiensis and maliensis (B-clade) and Megavirus chilensis (C-clade) were used as prototypes to investigate the glycan composition of their fibrils and identify conserved and clade-specific features. The nature of two complex polysaccharides (Fig. 2) composing Mimivirus fibrils has recently been characterized, and they involve three different sugar moieties: rhamnose (Rha), N-acetyl-glucosamine (GlcNAc) and N-acetyl-viosamine (Vio4NAc) (Fig. 3a). Here, we characterized all sugar moieties constituting the fibrils of representative members of the B- and C-clades.
First, chemical characterization of B-clade Moumouvirus australiensis and maliensis fibrils revealed different sugar compositions. Fibrils from Moumouvirus australiensis (Jeudy et al. 2020) contained glucosamine (GlcN), quinovosamine (Qui2N) and bacillosamine (diNAcBac) as major components and glucose (Glc) as a minor component (Fig. 3b). The peak at 19 min was identified as diNAcBac by applying the fragmentation rules of acetylated methyl glycoside derivatives (Lönngren and Svensson 1974); indeed, the EI-MS spectrum contained a fragment at m/z 271 consistent with the oxonium ion of a 6-deoxy-sugar with two amino functions (Fig. S1).
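The assignment of m/z 271 can be checked with simple nominal-mass arithmetic. The sketch below rests on an assumed structure for the derivative, which the text does not spell out: the methyl glycoside of bacillosamine (2,4-diamino-2,4,6-trideoxyhexose, C6H14N2O3) carrying two N-acetyl groups and one O-acetyl group at position 3, with the oxonium ion formed by loss of the methoxy group. The helper names are hypothetical and the electron mass is neglected.

```python
# Nominal (integer) atomic masses, adequate for unit-resolution EI-MS.
NOMINAL_MASS = {"C": 12, "H": 1, "N": 14, "O": 16}

BACILLOSAMINE = {"C": 6, "H": 14, "N": 2, "O": 3}
ACETYLATION = {"C": 2, "H": 2, "O": 1}       # net gain when H is replaced by COCH3
METHYL_GLYCOSIDE = {"C": 1, "H": 2}          # net gain when OH becomes OCH3
METHOXY = {"C": 1, "H": 3, "O": 1}           # group lost to give the oxonium ion


def combine(formula, delta, times=1):
    """Return a new formula with `delta` added `times` times (negative removes)."""
    result = dict(formula)
    for element, count in delta.items():
        result[element] = result.get(element, 0) + times * count
    return result


def nominal_mass(formula):
    return sum(NOMINAL_MASS[element] * count for element, count in formula.items())


def dinacbac_masses():
    """Nominal masses of the assumed derivative and of its oxonium ion."""
    derivative = combine(BACILLOSAMINE, ACETYLATION, times=2)  # two N-acetyl groups
    derivative = combine(derivative, METHYL_GLYCOSIDE)         # methyl glycoside at C-1
    derivative = combine(derivative, ACETYLATION)              # O-acetyl at position 3
    oxonium = combine(derivative, METHOXY, times=-1)           # loss of OCH3
    return nominal_mass(derivative), nominal_mass(oxonium)


if __name__ == "__main__":
    print(dinacbac_masses())  # (302, 271)
```

Under these assumptions the derivative has a nominal mass of 302 and its oxonium ion falls at m/z 271, in line with the fragment reported above.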
Previous analysis performed on intact virions from Megavirus chilensis, the prototype of C-clade, had revealed the presence of GlcN, rhamnosamine (RhaN) and RhaN methylated at position 4 (4-OMe-RhaN) as the main components of the fibrils together with an unidentified rhamnosamine epimer (Piacente et al. 2014b). Here, we completed this analysis and identified the nature of this component as Qui2N, in agreement with previous results revealing the biosynthetic pathways encoded by the virus (Piacente et al. 2014b). In addition, our current analysis confirmed the presence of GlcN, RhaN and 4-OMe-RhaN (Fig. 3d).
We determined absolute configurations for most sugars for members of all clades. The absolute configuration of GlcN was experimentally confirmed to be d for Mimivirus (Fig. S2), in agreement with the presence of a biosynthetic pathway for UDP-d-GlcNAc (Piacente et al. 2014a). Similarly, we showed that glycans in Moumouvirus australiensis and Megavirus chilensis contained d-GlcNAc (Fig. S2), and that Qui2N was in the l configuration. By contrast, we determined that Qui2N and Fuc2N were both in the d configuration for Moumouvirus maliensis (Fig. S2). We could not determine the absolute configuration of Rha2N and diNAcBac due to the lack of appropriate standards. However, it is likely that the absolute configuration of Rha2N is l, because the UDP-l-Rha2NAc biosynthetic pathway was validated in vitro for Megavirus chilensis (Piacente et al. 2014b). In bacteria, diNAcBac is in the d configuration (Morrison and Imperiali 2014), suggesting that this sugar could adopt the same configuration in Moumouvirus australiensis glycans.

Filling the gap between experimental and genomic data for Mimivirus (A-clade)
The elucidation of the glycan structures of Mimivirus fibrils (Fig. 2) prompted the search for genes coding for proteins that could be involved in the biosynthesis of these polysaccharides. A previous study had proposed a nine-gene cluster (Piacente et al. 2012) that includes genes encoding proteins necessary for the biosynthesis of Vio4NAc (R141, L136, L142), a gene annotated as pyruvyltransferase (L143), several genes encoding glycosyltransferases (L137, L138, R139, L140 and the C-ter of L142) and a gene encoding a GMC-type oxidoreductase (R135). However, a gene coding for a Vio4NAc methyltransferase was missing, raising the question of the origin of this sugar modification. Viosamine is not produced by the amoeba host and is only encountered in bacteria. For example, Pseudomonas syringae possesses a Vio-island including all the genes involved in viosamine production, methylation and transfer (Yamamoto et al. 2011). Thus, we hypothesized that a similar organization could exist in the Mimivirus genome. We narrowed the search to within 2 kbp of the nine-gene cluster (Piacente et al. 2012, 2017b) and identified R132 as a promising candidate. The R132 gene is predicted to encode a 221-amino-acid protein belonging to class I SAM-dependent methyltransferases. Its closest homologs are bacterial methyltransferases from Rhizobiales (WP_112557168.1), Lelliottia (WP_107702876.1) and Pantoea dispersa (WP_021509887.1), which share 35 to 39% sequence identity with R132 over the entire length of the protein sequence. The R132 protein was modeled using Phyre2 with 100% confidence based on several sugar-methyltransferase structures, and the model was confirmed by AlphaFold (Jumper et al. 2021). The best ranked model was obtained using the C-terminal catalytic domain of MycE as template (residues 161-399, 20% sequence identity). This SAM- and metal-dependent methyltransferase domain is responsible for methylation of the 6-deoxyallose sugar moiety of mycinamicins (Akey et al. 2011). A multiple alignment of R132 and its orthologs in A-clade members with the C-terminal domain of MycE revealed that all catalytic residues are conserved, suggesting that R132 could encode a functional methyltransferase (Fig. 4a). Further support for this prediction comes from two additional observations. First, previous studies showed that timing and expression level of R132 were comparable with those of the nine genes belonging to the glycan formation cluster (Legendre et al. 2010). Second, the R132 gene is absent from the genome of Mimivirus M4. This virus is the result of a 150-times subculture of Mimivirus in a germ-free amoeba host that led to the dramatic reduction of its genome from 1.20 to 0.993 Mbp due to two large deletions, mainly at the two extremities of the genome. One of these deletions includes genes encoding proteins known to be involved in sugar biosynthesis (Boyer et al. 2011a, Piacente et al. 2012) as well as structural proteins composing the fibrils, such as GMC-oxidoreductase R135 (Klose et al. 2015). Taken together, our results suggest a possible role for R132 in O-2 methylation of Vio4NAc. We thus propose that R132 is part of the gene cluster encoding a biosynthetic pathway for the sugars of structural proteins making Mimivirus fibrils.
As proposed in a previous study, the methylation reaction should occur after the acetylation of the amino function at position 4 (Fig. 3b), due to the steric hindrance caused by the 2OMe on the viosamine moiety that would impair acetylation by the L142 enzyme (Piacente et al. 2017b). We still do not know whether methylation occurs on the UDP-sugar, as reported in Fig. 4, or on viosamine inside the polysaccharide chain; this point can only be addressed experimentally.

Moumouvirus australiensis (B-clade)
Here, we identified the monosaccharides composing Moumouvirus australiensis fibrils (Fig. 3b) as GlcN, Qui2N and diNAcBac. To determine whether its genome encodes the proteins necessary for biosynthesis of these sugars as activated nucleotides, we searched for genes and proteins similar to those already reported for the Mimivirus and Megavirus chilensis biosynthetic pathways for GlcN (Piacente et al. 2014a) and Qui2N (Piacente et al. 2014b), and to those of Campylobacter jejuni for the biosynthesis of diNAcBac (Olivier and Imperiali 2008, Morrison and Imperiali 2014, Riegert et al. 2015). Sequence identity values with reference sequences are reported in Table 1. Structural multiple alignments were performed to assess conservation of the catalytic sites for each enzyme of the pathway and to infer whether Moumouvirus australiensis candidate enzymes could be functional.
Ma467 is conserved in all B-clade members (Table 1), while Ma465 and Ma466 have no homolog inside the B-clade. Surprisingly, their closest homologs are found in the Tupanvirus strains (D-clade) and Cotonvirus japonicus (E-clade) (Table 1). By contrast, lower sequence identities were found with the corresponding enzymes of C. jejuni (Table 1) (Riegert et al. 2017). Our in silico analyses of Ma467, Ma465 and Ma466 revealed that all catalytic residues are conserved (Figs S3-S5), suggesting that these enzymes are functional and responsible for the biosynthesis of UDP-d-diNAcBac in Moumouvirus australiensis, in agreement with the presence of diNAcBac in its fibrils (Fig. 3b). In addition, the identified biosynthetic pathway supports a d-configured bacillosamine. Finally, in Moumouvirus gp464 and Moumouvirus Monve mvR525 proteins, which are homologs of Ma467, one of the catalytic aspartates is replaced by an asparagine (Fig. S3). This change has been associated with a loss of activity in the C. jejuni D396N-PglF mutant (Riegert et al. 2017), suggesting that these strains (and the B-clade) could be in the process of losing the pathway.

UDP-l-N-acetyl-quinovosamine pathway (UDP-l-Qui2NAc)
Previous studies had shown that two Megavirus chilensis proteins (Mg534, Mg535) were involved in UDP-l-Rha2NAc biosynthesis (Fig. 5b) and had predicted that a third protein (Mg536) could convert Rha2NAc into Qui2NAc (Piacente et al. 2014b). Although the function of Mg536 was not experimentally validated, the bioinformatic prediction is based on the identification of Qui2NAc as a component of the glycans decorating Megavirus chilensis fibrils (Fig. 3d).
Ma458 is also present in [...] present in Cotonvirus and in KNV1 (Table 1). Therefore, these results also support a functional pathway for Qui2NAc in Cotonvirus and KNV1 (Table 1). By contrast, CTV1 seems to have only the first two enzymes of the pathway and is consequently restricted to Rha2N production.

Table caption: For each pathway, we report proteins used as references, homologs in Moumouvirus maliensis and their orthologs in the proposed Megavirinae subfamily and in Pseudomonas aeruginosa.

UDP-d-N-acetyl-glucosamine pathway (UDP-d-GlcNAc)
Although the amoeba host produces GlcNAc using typical eukaryotic enzymes, Mimivirus and Megavirus chilensis also possess the enzymes necessary to synthesize this sugar (Piacente et al. 2014a).
By using the A. thermoaerophilus 4-reductase (King et al. 2009) as a reference gene, we identified two putative 4-reductase enzymes (Mm419 and Mm421), which both have the typical consensus sequence for NADP binding and the catalytic triad common to all 4-reductase enzymes (Fig. S8, Table 2). At this stage, it is not possible to discriminate between the two 4-reductases to reach a complete understanding of the pathways for d-Fuc2N and d-Qui2N, as the stereospecificity of the two enzymes can only be assessed experimentally. In addition to the Fuc2N/Qui2N pathway, all the genes responsible for GlcN production are also present and conserved compared with experimentally validated enzymes (Table 2). The low amount of GlcN revealed by our chemical analysis of Moumouvirus maliensis could possibly be due to conversion of most GlcN into Fuc2N and Qui2N.

Biosynthetic pathways for nucleotide-sugars in Cotonvirus japonicus and Tupanviruses
In contrast to other clades (Fig. 3), the presence of sugars associated with the fibrils of Cotonvirus japonicus and Tupanviruses has not been experimentally characterized. Here, we searched their genomes for genes encoding possible biosynthetic pathways for nucleotide-sugars. We used protein sequences of enzymes in- [...] Fig. 5a), with conservation of all catalytic residues suggesting that all enzymes can be functional (Figs S3, S4 and S5). In addition, Tupanviruses possess the biosynthetic pathway for UDP-d-Qui2N/Fuc2N (Figs 6 and S7, Table 2) conserved in members of the B-clade, except in Moumouvirus australiensis. The first enzyme of the pathway (L515) is also the first enzyme in the diNAcBac pathway and is conserved in Moumouvirus australiensis (Fig. S7). However, in contrast to Moumouvirus australiensis, Tupanviruses possess the second enzyme (L518) corresponding to the 4-reductase in Moumouvirus maliensis (Mm421). The catalytic triad of this enzyme is conserved (Table 2, Fig. S8), suggesting that the entire pathway should be functional in Tupanviruses.
In the vicinity of the genes encoding the diNAcBac pathway in Tupanviruses, we identified the R520 gene as encoding a UDP-glucose-6-dehydrogenase enzyme. This result suggests that Tupanvirus could convert UDP-d-glucose (UDP-d-Glc) into UDP-d-glucuronic acid (UDP-d-GlcA) (Fig. S9a). The closest homologs of R520 are found in KNV1 and Cafeteria roenbergensis virus (CroV), which also belongs to the Mimiviridae family (Fischer et al. 2010). Using HHpred for remote homology and structure prediction (Hildebrand et al. 2009), we found that Burkholderia cepacia UDP-glucose-6-dehydrogenase (PDB: 2Y0C) was homologous to R520 (30% identity and 100% confidence). Comparison of their sequences revealed that all catalytic residues were conserved in R520, suggesting that it is a functional enzyme (Fig. S9). In contrast to the other clades that use UDP-d-Glc from their host, Tupanviruses possess the L502 gene encoding a 634-amino-acid protein with a predicted glucose-1P-uridyltransferase N-terminal domain (confidence 99%). This enzyme uses glucose-1P to synthesize UDP-d-Glc, allowing Tupanviruses to be completely independent from the host for glycosylation. The biosynthetic pathway for UDP-d-Glc is present in the amoeba host and is expressed during the late stage of the infection, as is seen for the viral pathway. Therefore, we cannot exclude that viruses could also recruit the host UDP-d-Glc to build their own glycans.

Genomic organization of glyco-related genes in the different clades
Our analysis of the genomes of prototypical viruses from different clades revealed that most genes responsible for the synthesis of nucleotide-sugars, along with others likely involved in the glycosylation process, are located within the same region of the genome, namely in gene clusters (Figs. 7 and 8, Tables 3, 4 and 5).
For Moumouvirus australiensis (B-clade), we identified gene clusters responsible for l-Qui2NAc (ma458, ma459, ma460) and d-diNAcBac (ma465, ma466, ma467) synthesis. Next to these genes, we found others that are also related to glycosylation, allowing us to define a larger cluster containing 12 genes (Fig. 7b, Table 3). Two of these genes encode proteins involved in sulfate metabolism (ma457, ma461), with Ma457 predicted as a bifunctional sulfate adenylyltransferase/adenosine-5'-phosphosulfate kinase and Ma461 as a sulfotransferase. This result suggests that sugars could be further modified by a sulfate group, but experimental validation is needed. In the same region, we found two predicted papain-like proteins (Ma462 and Ma464) that are not strictly related to glycosylation and a predicted glycosyltransferase (Ma463), which is related to glycosyltransferases (GTs) from Streptococcus sanguinis (5V4A) and S. parasanguinis (4PHR). In addition, ma468 encodes a 1666-amino-acid protein with four GT domains. Moumouvirus australiensis thus contains the genes necessary for its nucleotide-sugar production as well as the GTs responsible for their assembly into oligosaccharides or polysaccharides.
For Megavirus chilensis (C-clade), we had also identified a six-gene cluster (Piacente et al. 2014b), which encodes proteins involved in Rha2NAc (Mg534, Mg535) and l-Qui2NAc (Mg534, Mg535, Mg536) production, as well as a protein with three GT domains (Mg539), a hypothetical protein (Mg537) and a pyruvyltransferase (Mg538) which could play the same role as Mimivirus L143 (Fig. 7d, Table 3). We found that the closest homologs of Mg537 are Streptococcus GTs (PDB codes: 4PHR and 5V4A, 99% confidence), also identified above for Ma463. These results suggest that Mg537 and Mg539 could correspond to GTs involved in the C-clade glycosylation pathway.
In accordance with the complexity of their decorated capsids and tails, Tupanviruses (D-clade) also have a bigger and more complex glycosylation gene cluster covering a region of 49 Kb with up to 33 genes (Fig. 8a, Table 4). This cluster includes several biosynthetic pathways for nucleotide-sugars such as UDPd-diNAcBac (L515, L509 and L514), UDP-d-Qui2N/Fuc2N (L515 and L518) and UDP-d-glucuronic acid (R520), suggesting that fibrils decorating the capsids and tails may also be glycosylated. Their sugars could be further modified by sugar methyltransferases (L506 and L517) present in the cluster (Fig. 8a). We also found a SAT/APS kinase homolog of Ma457, but the homolog of the sulfotransferase (Ma461) is missing. As a result, experimental validation of these modifications is needed. Compared with other clades (Fig. 7), Tupanviruses present an increased number of GTs (Fig. 8, Table 4), with 11 genes encoding a single GT domain and two genes (R496, L510) encoding four and two GT domains, respectively. Final glycan structures could be very complex and heterogeneous reflecting the presence of fibrils decorating both capsids and tails. Two additional enzymes (L521 and L523) clearly related to RNA metabolism (see Table 4 for details) were identified in the cluster as well as ORFans proteins (R497, R500, R522, R524, L525 and L526). Their function remains to be determined. For Cotonvirus japonicus (E-clade), biosynthetic pathways for UDP-l-Qui2NAc (ORFs 708, 709 and 710) and UDP-d-diNAcBac (ORFs 718, 719 and 720) are conserved and next to other glycogenes. This allowed us to define a 21-gene cluster in Cotonvirus (Fig. 8b, Table 5). This cluster includes seven GTs with some homology to Moumouvirus australiensis GTs. For instance, GT 713 and 717 share 58 and 40% identities with Ma468 with a coverage of 70 and 95%, respectively, while GT 722 and 712 share 72 and 29% identity with Ma463 on the entire protein sequences. 
In addition, two enzymes involved in sulfate metabolism were identified as ORFs 724 and 725, which correspond to the Ma461 and Ma457 enzymes, with identities up to 60% along the entire protein sequences (Table 5). Inside the cluster, there are genes not related to glycosylation such as two predicted papain-like proteins (ORFs 721 and 723), a permease, MutS and an RNA polymerase subunit (ORFs 714, 726 and 727) (Table 5).
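The identity and coverage figures quoted above come from pairwise protein comparisons. As a minimal sketch of how such numbers can be derived from a gapped alignment, the function below assumes a BLAST-like convention (identity over the alignment length, coverage as the fraction of the query that is aligned); the exact convention used in this study is not stated here, and the sequences are hypothetical toy strings.

```python
def identity_and_coverage(aligned_query: str, aligned_subject: str, query_length: int):
    """Return (percent identity, percent query coverage) for a gapped pairwise alignment.

    Assumed convention (not taken from the paper): identity is the number of
    identical columns divided by the alignment length, and coverage is the
    number of aligned query residues divided by the full query length.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if query_length <= 0 or not aligned_query:
        raise ValueError("empty alignment or invalid query length")

    # Count columns where both sequences carry the same residue (gaps never match).
    matches = sum(
        1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != "-"
    )
    # Query residues present in the alignment (gap characters excluded).
    query_residues = sum(1 for q in aligned_query if q != "-")

    identity = 100.0 * matches / len(aligned_query)
    coverage = 100.0 * query_residues / query_length
    return identity, coverage


# Hypothetical toy alignment: a 10-residue query of which 6 residues are aligned.
toy_identity, toy_coverage = identity_and_coverage("MKT-AYL", "MKSLAYL", query_length=10)
```

Reporting both values matters because a high identity over a short aligned stretch (low coverage) is much weaker evidence of homology than the same identity over the entire protein.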

Except for Mimivirus, gene clusters involved in proposed Megavirinae glycosylation are located between a conserved helicase and a thioredoxin-like gene (Fig. S10). The number of genes in this region ranges from six genes for B- and C-clades to 12 genes for Moumouvirus australiensis and up to 33 genes for Tupanviruses. In Mimivirus, this region is only 3.5 kb long and only includes one GT (R363; Fig. S10), while the 12-gene cluster (Fig. 7a) is located in a different genomic region between a putative transcription factor gene (R131) and an orfan gene (L144). In Cotonvirus japonicus, only seven glycogenes of the glycosylation gene cluster (Fig. 8b) lie between the helicase and thioredoxin-like genes (Fig. S10). This part of the cluster presents strong homology with the Megavirus chilensis cluster and includes the l-Qui2N pathway and one GT. After the thioredoxin-like gene (715), there are the remaining 14 glycogenes that are part of the 21-gene cluster (Figs 8b and S10).

Table 4. Glycosylation gene clusters for Tupanviruses (Tupanvirus deep ocean and Tupanvirus soda lake). The NCBI Accession number of the complete genomes and the names of the genes in each cluster are reported as well as the Accession number of the encoded protein, its length and expected function.
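The flanking-gene criterion used above (counting the genes lying between the conserved helicase and the thioredoxin-like gene) can be sketched as a simple slice over an ordered gene list. This is only an illustration: the gene names in `toy_locus` are hypothetical placeholders, not real annotations, and the function assumes each flanking gene occurs exactly once in the list.

```python
def genes_between(gene_order, left_flank, right_flank):
    """Return the genes strictly between two flanking genes, in genome order.

    Assumes both flanking genes are present once in gene_order; list.index
    raises ValueError if either one is missing.
    """
    i = gene_order.index(left_flank)
    j = gene_order.index(right_flank)
    lo, hi = sorted((i, j))  # tolerate flanks given in either order
    return gene_order[lo + 1:hi]


# Hypothetical toy locus (placeholder names only).
toy_locus = ["helicase", "gtA", "gtB", "epimerase", "thioredoxin_like", "orfX"]
region = genes_between(toy_locus, "helicase", "thioredoxin_like")
region_size = len(region)
```

Applied to real annotations, the length of the returned list would correspond to the per-clade gene counts discussed in the text.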

Expanding the glycosylation feature of each prototype to the clade
To gain further insights into the evolution of the glycosylation machinery in the proposed Megavirinae subfamily, we examined conservation of the gene clusters (Figs 7 and 8) within and among clades. We only used clade members for which the genome sequence was complete (Figs 9, 10, 11 and S11; Table S1). For the A-clade, our analysis shows that enzymes for UDP-l-Rha (R141, L780) and UDP-d-2OMe-Vio4NAc (R132, L136, R141 and L142) biosynthesis are conserved (Fig. 9). While L780 (YP_003987312; 289 aa) does not belong to the gene cluster (Fig. 7a), it was included because it is the second enzyme in the Rha biosynthesis pathway (Parakkottil Chothi et al. 2010). The highest divergence was obtained for L142 (Table 3), which is a bifunctional enzyme with an N-terminal N-acetyltransferase domain responsible for viosamine acetylation and a C-terminal putative GT domain, which could transfer Vio4NAc onto its acceptor (Piacente et al. 2017a). This divergence can be explained by the split (or fusion) of the L142 gene leading to two genes in Mimivirus M4 (L142a and L142b) and in Hirudovirus strain sangsue (HIUR_S825 and HIUR_S826), where the first ones correspond to the N-acetyltransferase domain and the second ones to the GT domain. Moreover, all GTs in the Mimivirus gene cluster (L137, L138, L139, R140 and the C-ter L142), predicted as type 2 GTs in the Carbohydrate-Active enZYmes database (CAZY) (Cantarel et al. 2009), are specific to the A-clade and are likely involved in building the two polysaccharides constituting Mimivirus fibrils (Fig. 2). They share a high level of sequence identity (Table S3), as expected because the repeated units of the two glycans include Rha and GlcN linked in a different way (Fig. 2). R133 and L134 are hypothetical proteins that seem to be specific to the A-clade and at this stage it is not possible to conclude on their possible contribution to fibril glycosylation or maturation. 
Inside the A-clade, the laboratory strain Mimivirus M4, which lacks fibrils (Boyer et al. 2011a), is the only one without a glycosylation cluster (Fig. 9, yellow arrow) and without the proteins constituting Mimivirus fibrils.
In line with the A-clade, the Moumouvirus maliensis six-gene cluster is conserved in B-clade except for Moumouvirus australiensis (Fig. 10). Moumouvirus australiensis, Cotonvirus japonicus and the two Tupanvirus strains share with the B-clade the 4,6-dehydratase enzyme (Mm422) involved both in diNAcBac and Qui2N/Fuc2N production (Fig. 11) as well as the non-functional Mm419 protein.

Table 5. Glycosylation gene cluster for Cotonvirus japonicus. The NCBI Accession number of the complete genome and the name of the genes in the cluster are reported as well as the Accession number of the encoded protein, its length and expected function.

Tupanviruses share the UDP-d-Qui2NAc/Fuc2NAc biosynthetic pathway with the B-clade, as evidenced by the presence of a 4-reductase enzyme homologous to Moumouvirus maliensis Mm421, as well as a glycosyltransferase homologous to Mm420. Despite clearly belonging to the B-clade in the phylogenetic tree based on conserved core genes (Fig. 1), Moumouvirus australiensis appears as an outsider of the B-clade for its glycosylation genes (Figs 10 and 11). Enzymes involved in UDP-l-Qui2NAc biosynthesis (Ma458, Ma459 and Ma460) and glycosyltransferases (Ma463 and Ma468) are completely absent from the B-clade.
Even more surprisingly in terms of evolution, Moumouvirus australiensis shares the UDP-l-Qui2N biosynthesis pathway with the entire C-clade and Cotonvirus japonicus (Fig. 11), while it shares the UDP-d-diNAcBac biosynthesis pathway with Cotonvirus japonicus and the more distant Tupanvirus strains (Fig. 11). Enzymes involved in sulfate modification of sugars (Ma457 and Ma461) are both conserved in Cotonvirus japonicus. By contrast, only one enzyme remains in Tupanviruses (Ma457) (Fig. 11). The Moumouvirus australiensis gene cluster is completely conserved in Cotonvirus japonicus (Fig. 11), but the gene cluster of Cotonvirus counts 21 genes instead of 12 and includes several additional glycosyltransferases and hypothetical proteins (Fig. 8b, Table 5). These data allowed us to address the evolutionary position of Moumouvirus australiensis between the B-clade, Cotonvirus japonicus and Tupanviruses, to which it appears closely related in terms of glycosylation (Fig. 11).
This analysis confirmed our initial hypothesis that the glycosylation machinery is clade-specific and suggests that Moumouvirus australiensis likely represents an intermediate prototype for the evolution of a new glycosylation cassette.
Finally, the UDP-d-GlcNAc biosynthesis pathway is conserved in the five clades and the corresponding genes are not arranged into clusters (Fig. S11). In addition, the A-clade, C-clade and Cotonvirus japonicus share a putative pyruvyltransferase (Fig. S11) that could be involved in GlcNAc modification with pyruvic acid, as is observed in Mimivirus polysaccharides (Fig. 2). Even if GlcNAc is present in the amoeba host, all giant DNA viruses encode their own proteins to synthesize this sugar, which is also the precursor of other 6-deoxy-amino-sugars (Rha2NAc, Qui2NAc, diNAcBac and Fuc2NAc) that constitute the capsid fibrils of the different members of the three clades.
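The distribution of nucleotide-sugar pathways described in this section can be summarized as one set per group, from which shared and group-specific pathways follow by set operations. The sketch below is a simplified transcription of the text, not a reproduction of Fig. 11: the pathway labels are shortened, Moumouvirus australiensis is kept separate from the rest of the B-clade, and the helper names are our own.

```python
# Pathways per group as described in the text (labels are simplified assumptions).
PATHWAYS = {
    "A-clade": {"GlcNAc", "l-Rha", "d-Vio4NAc"},
    "B-clade (without M. australiensis)": {"GlcNAc", "d-Qui2NAc/Fuc2NAc"},
    "Moumouvirus australiensis": {"GlcNAc", "l-Qui2NAc", "d-diNAcBac"},
    "C-clade": {"GlcNAc", "Rha2NAc", "l-Qui2NAc"},
    "Tupanviruses": {"GlcNAc", "d-Qui2NAc/Fuc2NAc", "d-diNAcBac", "d-GlcA"},
    "Cotonvirus japonicus": {"GlcNAc", "l-Qui2NAc", "d-diNAcBac"},
}


def shared_by_all(pathways):
    """Pathways present in every group."""
    return set.intersection(*pathways.values())


def unique_to(group, pathways):
    """Pathways found in `group` and in no other group."""
    others = set().union(*(v for k, v in pathways.items() if k != group))
    return pathways[group] - others


core = shared_by_all(PATHWAYS)
a_clade_markers = unique_to("A-clade", PATHWAYS)
australiensis_vs_coton = (
    PATHWAYS["Moumouvirus australiensis"] == PATHWAYS["Cotonvirus japonicus"]
)
```

With these sets, GlcNAc is the only pathway common to all groups, Rha and Vio4NAc are specific to the A-clade, and Moumouvirus australiensis has the same pathway set as Cotonvirus japonicus rather than the other B-clade members.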

Discussion
Here, we show that complex gene clusters are involved in glycosylation of the fibril layer surrounding the viral particles in the proposed Megavirinae subfamily (Figs 7 and 8). However, fibril glycosylation occurs in a clade-specific manner, suggesting a high degree of variability in glycan structures for the different members of the family. Our analyses raise two important questions: (i) What are the possible evolutionary implications of organizing glycosylation genes into clusters? and (ii) What is the role played by sugars constituting the fibrils for the different clades? Answering these questions is essential for a clear understanding of glycosylation in giant viruses.

Implications of clustering glycogenes
Organization of glycogenes into clusters is reminiscent of what occurs in bacteria. For example, genes involved in biosynthesis of lipopolysaccharide (LPS), the main component of the Gram-negative bacteria outer membrane, are organized in a cluster controlled by a unique promoter defining the waa operon in Escherichia coli K12 (Gronow and Brade 2001). Similarly, genes responsible for the biosynthesis of the N-/O-glycans that decorate flagellin or pili are organized into operons (all controlled by the same promoter). Campylobacter jejuni presents a cluster of genes responsible for flagellin glycosylation, called pgl, which is further organized in two operons: operon I includes pglB, pglA, pglC and pglD, while operon II contains pglE, pglF and pglG (Szymanski et al. 1999). Interestingly, pglE and pglF are co-transcribed, while pglD transcription is regulated by another operon system (Szymanski et al. 1999), despite all being involved in UDP-d-diNAcBac biosynthesis. Because operons are often controlled by a single promoter, levels and timing of expression are comparable for all the genes.
It is currently unknown how transcription and expression of glycogenes occur in giant viruses. The transcriptomes of Mimivirus (Legendre et al. 2010) and Megavirus chilensis (Arslan et al. 2011) being the only ones available, they are used as reference for the entire family. We had previously shown that there are three main expression classes: early (from T = 0 to T = 3 h), intermediate (from T = 3 h to T = 6 h) and late (from T = 6 h to T = 12 h). Only early and late promoters have been identified, with a highly stringent early promoter (AAAATTGA) (Suhre et al. 2005) and a less stringent late promoter (Legendre et al. 2010). All glycogenes are expressed in the late stage of the infection cycle whether they are organized into a cluster or not. Mimivirus glycosylation genes are clustered (Fig. 7a) and all have their own late promoter, except R133, which encodes a hypothetical protein. All are expressed in the late phase (6-12 hours post infection) of the infectious cycle, supporting a role for this cluster in fibril glycosylation. On the contrary, for genes in the Megavirus chilensis glycosylation cluster (Fig. 7d), only mg535, mg536 and mg539 have a promoter. Given the low conservation of the late promoter, it could have been missed by the annotation, or these genes are expressed as polycistronic mRNAs, as shown previously for some Mimivirus genes (Byrne et al. 2009).
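To illustrate why a stringent promoter is easy to annotate while a degenerate one can be missed, the sketch below scans upstream regions for an exact occurrence of the early motif AAAATTGA quoted above. It is a toy exact-match search under stated assumptions: the upstream sequences are hypothetical, they are assumed to be given on the coding strand, and no window length is implied. A poorly conserved late promoter would not be found this way and would need a more tolerant model, which is not shown.

```python
EARLY_MOTIF = "AAAATTGA"  # early promoter motif quoted in the text


def has_motif(upstream: str, motif: str = EARLY_MOTIF) -> bool:
    """True if the motif occurs exactly in the upstream sequence (case-insensitive)."""
    return motif.upper() in upstream.upper()


def genes_with_motif(upstream_by_gene, motif: str = EARLY_MOTIF):
    """Sorted names of genes whose upstream region contains the exact motif."""
    return sorted(g for g, seq in upstream_by_gene.items() if has_motif(seq, motif))


# Hypothetical upstream regions (placeholder gene names and sequences).
toy_upstream = {
    "geneA": "ttgcAAAATTGAccat",  # exact motif present
    "geneB": "GGGCCCATATAT",      # motif absent
    "geneC": "ttAAAATTCAtt",      # one mismatch: missed by an exact search
}
hits = genes_with_motif(toy_upstream)
```

The third toy gene shows the limitation: a single substitution is enough for an exact search to report no promoter, which is consistent with the idea that weakly conserved promoters may escape annotation.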
The GTs are in red, while proteins responsible for sulfate metabolism are in dark yellow. The helicase Ma456 (in cyan) is used for reference of the conservation level between clades. The two papain-like proteins (Ma462, Ma464) are not included because they are not directly related to fibrils glycosylation. The conservation score for each protein in the different genomes ranges from 0 to 1 (low to high: blue to red).

Previous studies (Piacente et al. 2012, 2014a, 2015) and this one show that it is highly challenging to establish the origin of these glycosylation gene clusters. For the viosamine synthesis pathway, the patchwork of bacterial-like and eukaryotic-like genes composing the cluster precludes the hypothesis that they were all acquired at once from a single organism (Piacente et al. 2012). To detect putative horizontal gene transfer, we compared the GC content of glycogenes (Figs 7 and 8) with Mimiviridae core genes (Fig. S12). We showed that glycogenes are as AT-rich as the overall genome, suggesting that they could have been acquired early in evolution so that their GC content had time to evolve. Giant viruses possess up to five genes encoding transposases that might catalyze integrations of foreign DNA into their genomes, but these genes are far (more than 30 kb) from the genomic region containing the glycogenes. In addition, the fact that glycogenes have the typical viral promoter (Legendre et al. 2010) could also suggest that giant viruses have evolved their own glycosylation machinery prior to the radiation of eukaryotes.
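The GC-content comparison mentioned above can be sketched as follows. This is a minimal illustration, not the procedure used for Fig. S12: it assumes an unweighted mean of per-gene GC fractions, ignores ambiguous bases, and uses hypothetical toy sequences in place of real glycogenes and core genes.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T); 0.0 if none."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    return sum(1 for b in bases if b in "GC") / len(bases)


def mean_gc(seqs) -> float:
    """Unweighted mean of per-gene GC fractions (an assumption of this sketch)."""
    seqs = list(seqs)
    if not seqs:
        raise ValueError("no sequences given")
    return sum(gc_content(s) for s in seqs) / len(seqs)


# Hypothetical toy sequences standing in for glycogenes and core genes.
toy_glycogenes = ["ATATGC", "AATTAA"]
toy_core_genes = ["ATGCAT", "AAATTT"]
gc_difference = mean_gc(toy_glycogenes) - mean_gc(toy_core_genes)
```

A difference close to zero between the two gene sets is what one would expect if the glycogenes have the same base composition as the rest of the genome; a recently transferred gene would more likely stand out by a markedly different GC fraction.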
We showed that glycosylation gene clusters of the B-clade (Moumouviruses), C-clade (Megaviruses), D-clade (Tupanviruses) and E-clade (Cotonvirus japonicus) are all located between conserved helicase and thioredoxin-like genes (Fig. S10). This result suggests that giant viruses could have exchanged these glycogenes through homologous recombination events or horizontal gene transfer inside the amoeba host, by analogy with bacteria (Aydanian et al. 2011). However, further studies are required to experimentally prove that these genes can be exchanged between giant viruses belonging to different clades and eventually with bacteria.

Bacteria-like sugars as markers of proposed Megavirinae clades
From a structural point of view, the type of sugars found in giant viruses is drastically different from what was reported for eukaryotic viruses. For example, SARS-CoV-2 exhibits oligosaccharides of discrete size made of sugar units typical of the eukaryotic world, such as glucosamine, galactosamine, fucose and sialic acid (Zhao et al. 2021). By contrast, members of the proposed Megavirinae subfamily possess their own glycosylation machinery to synthesize and decorate their fibrils with rare amino sugars that are only encountered in bacteria and are absent from their amoeba host. The organization of these glycogenes into clusters also increases the evolvability of the glycosylation machinery, resulting in different sugars between clades and even inside the same clade, as was observed for Moumouvirus australiensis.
By comparing the conservation level of glyco-enzymes, we identified one or two sugars as markers of a specific clade. The A-clade is characterized by the rare sugar d-viosamine, which is a component of Pseudomonas syringae flagellin (Yamamoto et al. 2011) and has also been identified in the O-chain of several E. coli strains. In addition to viosamine, the A-clade is the only clade to have rhamnose, a deoxy-sugar found as a component of the O-antigen of several bacteria, such as S. enterica (Samuel and Reeves 2003), in the N-glycans of archaea (Kaminski and Eichler 2014) and in plant cell-wall polysaccharides. Interestingly, the A-clade follows the plant pathway rather than the microbial one to synthesize rhamnose (Parakkottil Chothi et al. 2010). The precursor for both these sugars is UDP-d-Glc, which is the only UDP-sugar for which the A-clade does not appear to have a dedicated biosynthetic machinery, thus relying on its host.
Except for Moumouvirus australiensis, B-clade members have two rare amino sugars, d-Qui2NAc and d-Fuc2NAc, whose biosynthetic pathways are closely interconnected. d-Fuc2NAc was found in the LPS of Pseudomonas aeruginosa O5 (Burrows et al. 1996) and in the O-chain of the marine bacterium Pseudoalteromonas agrivorans (Perepelov et al. 2000). d-Qui2NAc was also found in the O-chain of Gram-negative bacteria, such as Rhizobium etli CE3 (D'Haeze et al. 2007) and Pseudomonas aeruginosa O10 (Knirel et al. 1986).
Finally, Moumouvirus australiensis (outsider of the B-clade) along with Cotonvirus japonicus and Tupanviruses are characterized by another rare amino sugar, d-diNAcBac, which has also been found in pathogenic bacteria (Morrison and Imperiali 2014) as the reducing sugar of N-linked and O-linked glycoproteins of C. jejuni and Neisseria gonorrhoeae, respectively. It is also a component of the O-chain of Pseudomonas reactans and Vibrio cholerae, and of the CPS of Alteromonas (Perepelov et al. 2000, Aydanian et al. 2011).
The precursor of the rare amino-sugars characteristic of B- and C-clades, Cotonvirus japonicus and Tupanviruses is UDP-d-GlcNAc, for which all clades possess the biosynthetic pathway. Consequently, in contrast to the A-clade, they could be completely independent from their host for synthesis of their glycans, which could possibly extend the range of hosts that they can replicate in.
Chemical analyses of the fibrils in Tupanvirus and Cotonvirus japonicus were not performed and gene expression data are also lacking. However, our results showing conservation of the biosynthetic pathways for Qui2N and diNAcBac in Cotonvirus suggest that its fibrils could be decorated with the corresponding glycans, while those in Tupanviruses could be composed of diNAcBac, Fuc2N/Qui2N and GlcA. The UDP-GlcNAc biosynthetic pathway identified in the A-, B- and C-clades is also conserved in Cotonvirus japonicus and in Tupanviruses (Table 1). Thus, further studies are needed to determine whether their fibrils could be glycosylated using virus-encoded enzymes.
Why are giant viruses covered by rare amino sugars and why are these different depending on the clade? The presence of a highly glycosylated fibril layer for all members of the proposed Megavirinae isolated so far suggests an essential role played by these glycans in the environment. Regarding the first question, we think that the presence of bacterial-like sugars on the surface of giant viruses can be considered as a strategy to mimic the bacteria that the amoeba feeds on, thus facilitating competition with other parasites for the same host in the natural environment. We could also speculate that they play an important role in the physiology of these viruses. In fact, it has been established that glycans play a crucial role in the adhesion process to the host cell, while they do not dramatically affect viral replication even if they appear to have a fitness cost (Boyer et al. 2011b). In this context, amino sugars are essential because their physicochemical properties enable them to create a highly viscous or sticky surface (Salton 1965), which is suitable for adhesion on cell surfaces. In addition, glycans constitute a protective barrier both against the natural environment in which they must propagate and the enzymes of the host cell. Finally, these viruses are targets of viral infection by virophages, which stick to the giant virus fibrils (Duponchel and Fischer 2019; La Scola et al. 2008), allowing them to be transported along into the host cell. Some virophages, like Sputnik (La Scola et al. 2008), are deleterious to the giant virus, while others, like Zamilon (Jeudy et al. 2020), appear to be commensal. Evolving different sugar compositions for fibril glycosylation could thus impair virophage interaction with the fibrils and be advantageous against pathogenic virophages.
The answer to the second question comes from the organization of the glycogenes in hot-spot mutation areas that could be essential to introduce variability in glycan composition. Because fibril glycans play a pivotal role in the interaction with the host cell, an arms race is taking place between the host and the giant viruses, but also among giant viruses that compete for the same host in the natural environment. Having a flexible toolbox for glycan synthesis could reflect the need to continuously adjust the set of glycans decorating the capsids to trigger increased phagocytosis and outcompete other parasites. It is tempting to compare giant virus glycosylation clusters with antibiotic resistance cassettes in bacteria. The fact that this complex glycosylation machinery is lost in laboratory conditions, as was shown for M4 (Boyer et al. 2011a), also suggests that there must be a fitness cost for such complex glycan synthesis, again echoing what is observed for antibiotic resistance genes in bacteria.

Conclusions
This in-depth study of glycosylation in the proposed Megavirinae subfamily revealed that giant DNA viruses are different from other eukaryotic viruses in several aspects. First, they possess a complex glycosylation machinery consisting of clusters of six to 33 genes responsible for the formation of glycans constituting their fibrils. This results in glycosylated fibrils with sugars different from those found in the host and from those of other eukaryotic viruses. Giant viruses produce rare amino sugars, yet there is a clade-specific glycosylation trend, although some exceptions could occur, as is shown for Moumouvirus australiensis that could be in the process of evolving a new glycosylation cassette (losing and/or gaining new glycosylation genes). This variegated glycosylation between the different clades could be the result of adaptation to the host they infect and/or competition with other viruses and bacteria and could also affect their ability to be infected by virophages.
Studying the glycobiology of these giant DNA viruses is important as it may reveal new enzymes encoded by genes of unknown function located within these clusters and could provide clues to the origin of the various elements of their glycosylation machinery, the glycosylation machinery of their ancestor that may predate the radiation of eukaryotes, and ultimately lead to advances for the glycobiology discipline in general. Finally, this work could be considered as a pilot study, which can be extended to other giant and large DNA viruses, such as the atypical Pandoraviruses, Pithoviruses and Molliviruses.