G-quadruplexes in the evolution of hepatitis B virus

Abstract
Hepatitis B virus (HBV) is one of the most dangerous human pathogenic viruses and is found in all corners of the world. Recent sequencing of ancient HBV genomes revealed that these viruses have accompanied humanity for several millennia. As G-quadruplexes are considered to be potential therapeutic targets in virology, we examined G-quadruplex-forming sequences (PQS) in modern and ancient HBV genomes. Our analyses showed the presence of PQS in all 232 tested HBV genomes, with a total number of 1258 motifs and an average frequency of 1.69 PQS per kbp. Notably, the PQS with the highest G4Hunter score in the reference genome is the most highly conserved. Interestingly, the density of PQS motifs is lower in ancient HBV genomes than in their modern counterparts (1.5 and 1.9/kb, respectively). This modern frequency of 1.90 is very close to the PQS frequency of the human genome (1.93) obtained using identical parameters. This indicates that the PQS content in HBV increased over time to become closer to the PQS frequency in the human genome. No statistically significant differences were found between PQS densities in HBV lineages found on different continents. These results, which constitute the first paleogenomics analysis of G4 propensity, are in agreement with our hypothesis that, for viruses causing chronic infections, their PQS frequencies tend to converge evolutionarily with those of their hosts, as a kind of ‘genetic camouflage’ to both hijack host cell transcriptional regulatory systems and to avoid recognition as foreign material.


INTRODUCTION
Hepatitis B virus (HBV) belongs to the genus Orthohepadnavirus; its genome consists of double-stranded DNA. This virus causes hepatitis B, a highly contagious, potentially fatal disease that affects an estimated 257 million people worldwide, resulting in an estimated 820 000 deaths every year (https://www.who.int/news-room/fact-sheets/detail/hepatitis-b). This virus is particularly severe, as approximately one in five carriers die from cirrhosis and/or develop hepatocellular carcinoma. HBV is transmitted primarily through blood and body fluids and the incubation period is variable, usually between 30 and 180 days. During replication, HBV DNA forms a minichromosome in the nucleus of infected hepatocytes (1, 2) and its genome is replicated through a process of reverse transcription of the key intermediate pre-genomic RNA in hepatocytes, which is also an mRNA template for the HBV proteins (3, 4). Hepadnaviruses infecting other hosts have recently been identified, including bats, frogs, lizards, fish, and the capuchin monkey (5-8). Analyses of ancient genomes have revealed that the most recent common ancestor of all HBV lineages is estimated to have existed between ∼20 000 and 12 000 years ago, and the virus was found to be present in European and South American hunter-gatherers during the early Holocene period (9).
A large body of literature is devoted to the occurrence of G-quadruplexes in viruses, especially to the possibility of using these structures in therapy. Comprehensive bioinformatics analyses have traced putative G4-forming sequences in the genomes of almost all human viruses, showing that their distribution and presence are highly conserved. Therefore, these DNA or RNA structures may be suitable targets for therapy. Some G-quadruplex ligands have been shown to have antiviral activity, for example, against HIV (10), herpes simplex virus 1 (HSV-1) (11), SARS-CoV-2 (12) and others. A comprehensive analysis of all sequenced viruses that have a latent phase in their life cycle showed that their G-quadruplex content is correlated with that of the host (13). In contrast, viruses causing acute infections without a latent phase tend to eliminate G-quadruplex sequences, as they can become roadblocks during replication, transcription and/or reverse transcription (14).
More specifically, G4s have been found to be relevant in HBV infection, both at the DNA and RNA level (15). Chakraborty and Ghosh reported that an RNA sequence present in HBV RNA exhibited a sequence-independent trans-acting nuclease activity, and that this sequence adopts a G4 conformation (16). Biswas et al. analyzed a G4-prone motif (GGGAGTGGGAGCATTCGGGCCAGGG) that is highly conserved only in HBV genotype B, and was shown to adopt a hybrid structure (17). Interestingly, mutations disrupting this G-quadruplex in HBV genotype B constructs were associated with impaired virion secretion. The authors proposed that this G4 mediates enhancement of transcription and virion secretion in this HBV genotype. In a later review, they note that among viruses containing a G4 in their genome, those associated with cancer are over-represented, including HBV (18).
In contrast, some G4s tend to be conserved in all genotypes, as reported by Meier-Stephenson et al. (19) for a DNA sequence found in the pre-core promoter region (CTGGGAGGAGCTGGGGGAGGAGA). They demonstrated a role of this quadruplex in viral replication by comparing the wild-type motif to non-G4-forming mutants in vitro. Fleming et al. identified conserved potential G4 sequences in several viral genomes relevant to human health, and showed that these motifs can provide a framework for N6-methyladenosine (m6A) installation within the loops of RNA G4 sequences found in several viruses, including HBV (20). Somkuti et al. determined volume changes in three HBV G4 structures using biophysical approaches in vitro. They investigated three DNA sequences: GGCTGGGGCTTGGTCATGGGCCATCAG, GGGAGTGGGAGCATTCGGGCCAGGG and TTGGGTGGCTTTGGGGCATGGAC (21). The same group investigated one of these sequences in more detail (HepB; GGCTGGGGCTTGGTCATGGGCCATCAG, found in the coding region of the polymerase protein), and analyzed its interaction with G4 ligands in vitro (22). Finally, Sun et al. recently investigated the role of cellular G4s (motifs found in the host cell genome) in HBV infection, demonstrating that the DDX5 helicase, known to be capable of resolving RNA G4 structures, is a key regulator of the interferon (IFN) response against this virus. DDX5 downregulation is observed during HBV replication and in poor-prognosis HBV-related hepatocellular carcinoma (HCC) (23). All of these results point to links between viral (or host) DNA and RNA G4s and HBV.
In this paper, we have analyzed 232 HBV genomes from samples covering a history of more than ten thousand years for the presence of G-quadruplex-forming sequences. Our results show an evolutionary shift towards an increased number of G-quadruplexes in recent HBV viruses, pointing to the importance of G-quadruplexes in the HBV life-cycle in human liver cells.

Genomes
A total of 232 HBV alignments were downloaded from the supplementary materials of (9). Sequences were obtained for 122 modern genotypes and different groups of 110 ancient strains, divided into groups based on mPTP classification. As the reference genome, we took NC_003977.2 and, for analyses of phylogenetically related viruses with hosts other than human, we filtered reference genomes from Orthohepadnaviruses. In total, we downloaded 21 additional HBVs with a non-human host, infecting birds (7 genomes), bats (5 genomes), fish (2 genomes), other mammals (i.e. neither human nor bats; 6 genomes) and amphibians (1 genome, Tibetan frog hepatitis B virus). HBV G4 contents were compared to the gapless human genome, the new telomere-to-telomere assembly of the human genome (24), which was downloaded from NCBI (T2T-CHM13v2.0).

G4Hunter analyses
All sequences were analyzed using G4Hunter (http://bioinformatics.ibp.cz) to identify PQS sequences. G4Hunter's default parameters were used (25 nucleotides for window size and 1.2 for threshold). These settings have previously been shown to identify experimentally validated quadruplex structures. The list of all organisms tested and the results of the analyses were downloaded from the supplementary materials at (9).
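To illustrate what the window size and threshold mean in practice, the following is a minimal sketch of G4Hunter-style scoring, not the code of the web server used in this study. It assumes the commonly described scoring scheme (each G in a run scores the run length capped at 4, each C scores the negative of that, A and T score 0) and reports every 25-nt window whose mean score exceeds 1.2 in absolute value. The function names are hypothetical, and the merging of overlapping windows into a single PQS, which the real tool performs, is omitted.

```python
def g4hunter_base_scores(seq):
    """Per-base scores: G runs score +min(run, 4), C runs score -min(run, 4), A/T score 0."""
    seq = seq.upper()
    scores = []
    i = 0
    while i < len(seq):
        base = seq[i]
        j = i
        # Extend j to the end of the current run of identical bases.
        while j < len(seq) and seq[j] == base:
            j += 1
        run = j - i
        if base == "G":
            value = min(run, 4)
        elif base == "C":
            value = -min(run, 4)
        else:
            value = 0
        scores.extend([value] * run)
        i = j
    return scores


def windows_above_threshold(seq, window=25, threshold=1.2):
    """Return (start, mean_score) for each window whose |mean score| is above the threshold."""
    scores = g4hunter_base_scores(seq)
    hits = []
    for start in range(len(scores) - window + 1):
        mean = sum(scores[start:start + window]) / window
        # Strict comparison is an assumption, following the "score > 1.2" wording in the text.
        if abs(mean) > threshold:
            hits.append((start, mean))
    return hits


# The 25-nt genotype B motif quoted in the Introduction, scanned as a single window.
print(windows_above_threshold("GGGAGTGGGAGCATTCGGGCCAGGG"))
```

Negative mean scores correspond to C-rich windows, that is, a G-rich motif on the complementary strand, which is why the absolute value is compared with the threshold.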

Statistical evaluation
Data with G4Hunter results were merged in an Excel file for statistical evaluation. G4Hunter results, lengths, and GC content of analyzed sequences are accessible in Supplementary material 01. A scatter plot was generated in GraphPad Prism (v 8.0.1), and violin plots were constructed in R (v 4.2.0) with ggplot2. Statistical significance was tested using Student's t-test. Normality of data was determined using the Shapiro-Wilk test.
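As a reminder of what the two-sample comparison computes, the sketch below evaluates the pooled-variance Student's t statistic using only the Python standard library. It is an illustration under stated assumptions rather than the authors' workflow: the function name is hypothetical, the input values are invented toy numbers, and the P-value (which requires the t distribution, for example from a statistics package) is not computed here.

```python
from math import sqrt
from statistics import mean, variance


def student_t(a, b):
    """Two-sample Student's t statistic with pooled variance; returns (t, degrees of freedom)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    # Pooled variance assumes both groups share the same underlying variance.
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, df


# Invented PQS frequencies per kb for two small groups (not data from this study).
ancient_toy = [1.2, 1.5, 1.6, 1.4]
modern_toy = [1.8, 2.0, 1.9, 2.1]
print(student_t(ancient_toy, modern_toy))
```

A negative t value here simply means the first group has the lower mean.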

Construction of LOGO sequence
All sequences for ancient and modern HBV genomes were uploaded into UGENE software (25) and the locations of PQS sequences were extracted using ClustalW alignment. The LOGO sequence was generated from the aligned sequences using the WebLogo 3 tool (26).
*Complete telomere-to-telomere human genome (22 + X + Y chromosomes).
**The groups were defined by Kocher et al. (10) and are based on the multi-rate Poisson Tree Processes (mPTP), as a result of the number of genetic clusters inferred from a phylogenetic input tree (26).

RESULTS
We analyzed the presence of PQS in 232 HBV genomes (122 ancient and 110 recent) using G4Hunter. All human HBV genomes are similar in length, varying from 3180 to 3300 bp. Comparisons of ancient and recent samples show a slight and non-significant change in average length from 3217 to 3210 bp. On the other hand, these genomes vary in GC content, and we found G-quadruplex-forming sequences in all HBV genomes in the dataset. In total, we found 1258 PQS with a mean frequency of 1.69 PQS per kbp (Table 1). The mean frequency of PQS in ancient genomes is 1.50/kb, compared with a mean frequency of 1.90/kb in modern HBV genomes. For comparison, we also analyzed the newly published human gapless assembly. The PQS frequency in the human genome is 1.93, which is almost identical to the average PQS frequency of modern HBV genomes (Table 1). Although there is no significant change in the length of ancient and modern HBV genomes, comparison of PQS density shows that the modern viruses are substantially richer in PQS (Figure 1) (P-value = 3.6e-08). The modern HBV genomes not only have a higher PQS frequency, but also have a higher GC content. To evaluate whether the change in PQS frequency is statistically significant after taking into account GC content, we recalculated the PQS frequency according to GC content (Table 1, last column). Even after this correction, the PQS frequency/GC content is higher in modern than in ancient HBV genomes (3.88 versus 3.46 per thousand GC for the modern and ancient HBV genomes, respectively; P-value = 1.8e-03). We then further divided genomes according to different genotypes. While most current genotypes have an average PQS frequency higher than 2 PQS/kb, four of the five ancient genomes have PQS frequencies lower than this value. The highest PQS frequencies were found in modern genotypes B, E and H, the lowest in ancient American, Mesolithic and other ancient genotypes (Figure 2).
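The two normalizations used above can be sketched in a few lines. This is an illustrative calculation, not the authors' spreadsheet: the function names are hypothetical, "per thousand GC" is assumed to mean PQS count divided by the number of G or C bases times 1000, the example genome is invented, and the pooled check assumes a mean genome length of about 3213 bp (between the two reported averages).

```python
def pqs_per_kb(n_pqs, length_bp):
    """PQS frequency per 1000 bp of sequence."""
    return 1000 * n_pqs / length_bp


def pqs_per_thousand_gc(n_pqs, length_bp, gc_fraction):
    """PQS frequency per 1000 G or C bases (assumed meaning of the GC correction)."""
    return 1000 * n_pqs / (length_bp * gc_fraction)


# Invented example genome: 6 PQS in 3200 bp with 48% GC (not a genome from the dataset).
print(pqs_per_kb(6, 3200))
print(pqs_per_thousand_gc(6, 3200, 0.48))

# Rough consistency check of the pooled figure: 1258 PQS over 232 genomes of ~3213 bp each.
pooled = pqs_per_kb(1258, 232 * 3213)
print(round(pooled, 2))
```

Dividing by GC content separates a genuine enrichment in G4-prone motifs from a simple increase in the number of G and C bases available to form them.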
We present the PQS frequency per kb (for all motifs with a G4Hunter score > 1.2) and the length of the genome as a function of time for each ancient sequence (Supplementary material 02). As mentioned before, the longest sequence only differs from the shortest by 120 bp, and the length of the genome does not change significantly over time (Pearson r = −0.1387, P (two-tailed) = 1.6e-01). Unlike genome length, PQS frequencies per kbp were found to increase over time.
Nucleic Acids Research, 2023, Vol. 51, No. 14 7201
The whole genome of HBV is transcribed into one long pre-genomic RNA (mRNA-pgRNA) that encodes all HBV proteins, and pgRNA also acts as a template for reverse transcription. Analysis of PQS localization in the HBV reference genome (NC_003977.2) shows that all PQS are located in the vicinity of gene regions, which is not surprising considering that the HBV genome is small and the entire genome is used very effectively to produce the few proteins necessary for its function, such as DNA polymerase, transactivation and capsid proteins (Table 2).
The PQS with the highest G4Hunter score in the HBV reference genome is located around position 1722, in the region that codes for transactivation protein X. Interestingly, a PQS is present at this location in almost all HBV genomes (214 of 232, or 92.2% of the HBV genomes analyzed), making it the most conserved of all PQS in the reference HBV genome. Conservation implies that this sequence position has been maintained by selective pressure. Comparison of the LOGO sequence for this location in modern and ancient HBV genomes (Figure 3) demonstrates that a G-rich motif is preserved in all strains. Nevertheless, this 'G-richness' is even more striking in modern compared to ancient strains, with Gs becoming predominant at positions 8 and 11 (Figure 3, arrows), while ancient HBV genomes more often exhibit T/A or A at these positions. A higher, near 100%, prevalence of Gs is also visible at positions 14 and 15 (Figure 3) in modern genomes. Overall, while both ancient and modern consensus motifs are G4-prone, the modern sequence is more favorable than the ancient one, as shown by its higher G4Hunter score (2.11 versus 1.83, respectively).
We also divided HBV genomes according to the geographic place of sampling. Most of the ancient HBV samples were found in Europe, only one in Africa and none in Australia. Nonetheless, ancient genomes have a lower PQS frequency than modern HBV genomes, regardless of their continent of origin (Table 3; also see Graphical Abstract).

DISCUSSION
Pathogens evolve in response to human biological changes alongside sociocultural and technological developments (27). Ancient viral genomes provide information on the evolution of viruses over both time and space and provide insight into the changes that may have occurred in virulence and transmissibility (28). Current advanced techniques for isolation of nucleic acids and sequencing have allowed paleogenomic or 'archeovirology' investigations. The infamous 1918 'Spanish' influenza pandemic was the source of the first ancient pathogen genome (29) and archeovirology has been growing rapidly since then. Considering that two-thirds of all human pathogens are viruses (30), paleogenetics of viral genomes provides an interesting viewpoint on human history (31).
While several ancient viral genomes are available, the most comprehensive dataset deals with ancient HBV genomes (28). With more than 200 million people suffering from chronic HBV infection, HBV can be considered a common virus. HBV has a life-cycle that requires its double-stranded DNA genome to reach the host cell nucleus (32), in contrast with RNA viruses such as influenza or SARS-CoV-2 that cause only acute infections and contain RNA genomes that can be replicated and translated in the cytoplasm. Viruses causing acute infections tend to have a low PQS frequency, while G4s tend to be ubiquitous in most organisms.
G4s in the HBV genome are important structures that regulate transcription and virion secretion in HBV genotype B (15, 17). In this report, we analyzed the presence of PQS in multiple HBV genomes, from ancient to current HBV strains. Importantly, we found that PQS frequency is higher in recent compared to ancient strains. It was shown previously, for dsDNA viruses infecting Archaea, Bacteria and Eukaryota, that the G4 frequency of the virus correlates with the PQS frequency of the host (33). In agreement with those data, we found that the density of G4 motifs in modern HBV strains (dsDNA viruses that also experience a latent phase) tends to converge to the overall G4 density of the human genome. We propose that mutations which led to 'PQS integration' (new G4 motifs within the HBV genome) were evolutionarily preferred and fixed in HBV pathogenic strains during evolution. It seems that the opposite process may occur in viruses causing acute infections, as found for SARS-CoV-2, where the PQS frequency is extremely low compared to the PQS frequencies of other coronaviruses (34, 35).
Viruses have highly variable genomes and are prone to mutations, in contrast to cellular and especially multicellular organisms, as described repeatedly. This is especially true for RNA viruses, where the mutation rate is several orders of magnitude higher than in DNA-based genomes. A dramatic decrease in GC content has been described for several bacterial species (36) and for some plant species with holocentric chromosomes (37). Similarly, a rapid decrease in GC content over time was found for some viruses causing acute infections (and without latency connected to nuclear localization), including Nidovirales, influenza genomes and the contemporary SARS-CoV-2 outbreak (38). Comparison of SARS-CoV-2 genomes showed a strong preference for mutations in GC islands and C > U transitions, leading to a decrease in G4 propensity (39, 40). As a consequence, the PQS frequency for these viruses causing only acute infections is generally very low (0.03 for SARS-CoV-2 (35), 0.56 for influenza H1N1 genomes (41)).
The opposite trend is true for HBV, which exhibits a latent state and maintains its genome in the nucleus: modern HBV genomes have a significantly higher PQS frequency compared to ancient HBV genomes. Our results are in line with a broad study comparing PQS frequencies in viruses with predominantly persistent or acute types of infection (42). According to that study, viruses causing persistent infection are enriched in PQS compared to acutely infectious viruses. Importantly, this observation is also valid within viruses causing hepatitis: HAV (hepatitis A virus, causing acute infection) has a low PQS frequency, while the Hepadnaviridae that cause chronic infections (to which human HBV belongs) have a significantly higher PQS frequency (42).
The presence of G4s depends on guanine content in the genome, and one of the possible advantages of a GC-rich genome is to provide additional gene regulation opportunities. In this respect, G4s have been shown to be important for transcription in higher organisms. An increase in GC content has also been documented in plant species that can grow in seasonally cold climates, possibly indicating an advantage of GC-rich DNA during cell freezing, and genomic adaptations associated with changing GC content have been suggested for grass-dominated biomes during the Tertiary period (37). G4s are often overrepresented in the promoter regions of higher eukaryotes and have also been demonstrated to contribute to directed genome editing in nematodes. For viruses experiencing a latent phase in the nucleus, having a genome organization similar to that of the host is advantageous both to avoid recognition as unusual (foreign) DNA and to hijack the host regulatory machinery. In addition, the high GC content of HSV DNA is suggested to act as a protective feature against retrotransposon insertion (43).
As AT-rich regions in humans are mostly associated with condensed chromatin (44), the shift to GC-rich genomes could be important for viruses with a latent phase to have a better chance of being active in the future. HBV has a latent period; therefore, an increase in GC content and PQS presence could be evolutionarily favored. In this model, the original (non-human) pre-HBV host could have had a lower G4 frequency; adaptation to the human host may have led to a host-pathogen PQS convergence and a concomitant increase in G4 density in the virus.

CONCLUSION
We performed the first paleogenomic analysis of G4 propensity, applied here to the hepatitis B virus. We found that the density of PQS motifs increased over time, as it is higher in modern than in ancient HBV genomes. The frequency in modern viruses is now very close to that of the human genome. This study should pave the way to the paleogenomics analysis of G4 sequences (and other similar motifs) in other pathogens. Unfortunately, historical information about viral genomes is only rarely available: archeovirology is a nascent field (45), which faces the same obstacles as modern genomics, but with the additional problem of analyzing partially degraded DNA. As noted by the authors (45), HBV is 'an excellent target for the recovery of ancient sequences due to its relatively stable, partially double-stranded circular DNA genome, its high prevalence in the human population, and prolonged high viremia during chronic infection'. Double-stranded DNA is generally better preserved than single-stranded DNA or RNA. This may explain why HBV is currently the only instance in which sequence data are available for over 100 ancient viruses. In other situations, such as variola virus, only a few ancient genomes are available (there are only 4, 10 and 11 HSV-1, Parvoviruses and VARV ancient sequences available, respectively (28)). Analyses of HIV-1 or the 1918 influenza viruses are possible, but only over a limited period of time.
Archeovirology will be useful to identify polymorphisms important for human adaptation to pathogens, and vice versa, during the complex virus-host relationships crucial for continued viral prevalence (28). More paleogenomic data will be needed to test the hypothesis that, for viruses causing chronic infections, their PQS frequencies tend to converge evolutionarily with those of their host. In particular, given recent serious outbreaks, we hope that the analysis of ancient viral pathogens will provide critical knowledge about the nature of new viral diseases.

DATA AVAILABILITY
All data are available in the manuscript and supplementary files. HBV sequences were downloaded from the supplementary materials at (9).