The evolution of protein domain repertoires: Shedding light on the origins of the Herpesviridae family

Abstract Herpesviruses (HVs, Family: Herpesviridae) have large genomes that encode hundreds of proteins. Apart from amino acid mutations, protein domain acquisitions, duplications and losses are also common modes of evolution. HV domain repertoires differ across species, and only a core set is shared among all species, aspect that raises a question: How have HV domain repertoires diverged while keeping some similarities? To answer such question, we used profile Hidden Markov Models (HMMs) to search for domains in all possible translated open reading frames (ORFs) of fully sequenced HV genomes. With at least 274 domains being identified, we built a matrix of domain counts per species, and applied a parsimony method to reconstruct the ancestral states of these domains along the HV phylogeny. It revealed events of domain gain, duplication, and loss over more than 400 millions of years, where Alpha-, Beta-, and GammaHVs expanded and condensed their domain repertoires at distinct rates. Most of the acquired domains perform ‘Modulation and Control’, ‘Envelope’, or ‘Auxiliary’ functions, categories that showed high flexibility (number of domains) and redundancy (number of copies). Conversely, few gains and duplications were observed for domains involved in ‘Capsid assembly and structure’, and ‘DNA Replication, recombination and metabolism’. Among the forty-one primordial domains encoded by Herpesviridae ancestors, twenty-eight are still found in all present-day HVs. Because of their distinct evolutionary strategies, HV domain repertoires are very specific at the subfamily, genus and species levels. Differences in domain composition may not only explain HV host range and tissue tropism, but also provide hints to the origins of HVs.


Introduction
Viruses are known by their unorthodox modes of evolution and outstanding strategies of immune evasion and cell infection. Their biological origins have long been debated, with many theories being proposed, without a consensus (Wessner 2010;Nasir and Caetano-Anolles 2015;Harish et al. 2016), probably because there is no single answer for such complex question (Wessner 2010). When it comes to large DNA viruses, such as members of the family Herpesviridae (hereinafter referred simply as 'herpesviruses' or 'HVs'), answering this question becomes even harder, specially taking into consideration their genomic complexity.
Herpesviridae is the best-characterized family of large doublestranded DNA viruses (Davison 2007;Davison et al. 2009), being divided in three subfamilies: Alpha-; Beta-, and; Gammaherpesvirinae (Davison et al. 2009). Herpesviridae is related to other two families of viruses also classified in the order Herpesvirales, they are: Alloherpesviridae, which have fishes and frogs as hosts, and eight fully sequenced genomes; and Malacoherpesviridae, which have bivalves as hosts, and two fully sequenced genomes (Davison et al. 2009). Among the seventyfive fully sequenced genomes of Herpesviridae members, which range in size from 109 to 241 kbp, the total number of ORFs encoded by them varies from 69 to 223 (Brister et al. 2015). A set of at least forty-one core genes are shared among all members of the Herpesviridae family, which play essential roles in viral infection, such as assembly of virion structure, DNA replication, viral entry, and egress pathways (Mocarski 2007).
During infections, HVs are known to establish latency in host cells, where their genomes are circularized, packed with histones, and copied by the cellular machinery during mitosis (Grinde 2013); and at this stage, genetic elements from hosts can be captured by viruses via horizontal gene transfers (HGTs) (Holzerlandt 2002). Each viral or host protein is composed by one or more domains, functional units that play specific roles and evolve independently from each other (Vogel et al. 2004). The Pfam database (Finn et al. 2014) contains information about more than 16,700 domain families (Pfam-A), which are defined based on probabilistic models generated from high-quality position-specific amino acid alignments. When implemented, such models can efficiently identify remote homology between sequences belonging to the same domain family. By identifying the domains of all proteins encoded by viral genomes, their species-specific repertoires can be reconstructed, revealing not only their array of encoded domains, but also the set of molecular functions they are able to perform. Domain repertoires can evolve under three main events: domain gains (Raftery, Mü ller, and Schö nrich 2000;Holzerlandt 2002), losses (Albalat and Canestro 2016), and duplications (Gao et al. 2017).
In this study, we extracted all possible ORFs from 75 fully sequenced genomes of members of the Herpesviridae family, and identified at least 274 protein domains encoded by them. By using a time-calibrated phylogenetic tree of viral species, and a matrix of domain counts for each species, a parsimony model was applied to reconstruct the ancestral states (presence/absence) of domains at all internal nodes of the tree, including its root. This analysis allowed us to reconstruct parsimonious scenarios explaining the evolution of HV domain families by events of domain gain, loss, and duplication. It revealed that current herpesviral species encode a core set of twenty-eight domains inherited from their ancestors since $400 million years ago. By classifying the HV domains in functional groups, it was possible to determine that domains implied in virus-host interactions were the main elements acquired by these HVs, especially in early periods of their evolution. Our results were interpreted in light of the current hypothesis explaining genomic and viral evolution, and shed light on the possible origins of HVs.

The current domain repertoire of members of the Herpesviridae family
To circumvent the limitations of simple sequence comparisons, which fail at identifying homology between distantly related proteins (Mocarski 2007), in this study we applied HMM profiles generated from Pfam alignments to identify domains encoded by fully sequenced herpesviral genomes. To allow a good balance between sensitivity and selectivity, the per-sequence and per-domain (conditional) E-values were defined using a conservative cutoff of 0.001, as used in previous studies (Ekman, Bjö rklund, and Elofsson 2007;Moore and Bornberg-Bauer, 2012). Such robust approach allowed us to detect a non-redundant set of at least 274 domains (Supplementary Tables S1 and S2), which cluster mostly following the phylogeny of HVs (Fig. 1). Interestingly, domains belonging to distinct functional groups have shown to be unequally distributed across the Herpesviridae subfamilies and species (Fig. 2).
A total of forty-six domains-most of them involved in 'Capsid assembly and structure'; 'Envelope'; and 'DNA-related processes'-were found in all subfamilies of Herpesviridae (Fig. 3). Nevertheless, since some of these domains were lost or have been independently acquired along the evolution, only twenty-eight of them are encoded by all members of the Herpesviridae family, and since these HVs encode on average 75.6 6 8.4 domains, non-core domains compose nearly two thirds of the HV repertoires.  Interestingly, 203 out of 274 domains are specific to a single subfamily, with 73 of them being strict to only one or two HV species, an aspect that highlights a trend for molecular specialization. A few domains are shared between two subfamilies while being absent in a third one, and as expected due to their closer phylogenetic relationship, Beta-and GammaHVs share higher number of domains (n ¼ 14, five of them envelope proteins; Fig. 3). Finally, another group of forty-six domains (16.8%) was found to be species-specific, most of them performing auxiliary (enzymatic) functions, or playing a role in the modulation and control of gene expression.

The interplay among domain gains, losses, and duplications
By taking the current set of domains encoded by herpesviral genomes, we reconstructed a possible scenario explaining the origins of HV domain repertoires. To do that, we applied a parsimony reconstruction method that revealed an intense   Fig. 1, with heatmaps of distinct subfamilies clearly separated. Domains with similar function were grouped, and the horizontal order of domains (clockwise sense) was defined based on their absolute frequency (i.e. most conserved domains are shown at the first columns of each group). Please observe that the lightest color shade of each group indicates domains that are absent in some species. As observed, only a small set of domains (n ¼ 28, see arrowheads) was found across all species, and most domains (n ¼ 203) are restricted to a single subfamily. Supplementary Table S1 presents a matrix of domains counts per viral species. The circular map was designed using Circos (Krzywinski et al. 2009). dynamics of domain gains, losses, and duplications along more than 400 millions of years of herpesviral evolution (Fig. 4). In this study, an event was classified as a 'gain' when a new domain, not yet present in the parent of a certain node, was incorporated in the repertoire of a HV lineage. When an existing domain decreased its overall copy number at a child node, a 'loss' was recorded, and accordingly, when its frequency increased, it was classified as a 'duplication', disregarding if the origin of the extra copy was cis-duplication or HGT.
To assess the importance of each of these events over time, and to explain how domains of distinct functions evolved, Fig. 5 shows the distribution of domain gains, losses, and duplications across four-time intervals, starting from the time to most recent common ancestor (tMRCA) of all members of the Herpesviridae family.
It is possible to observe, for example, that events of domain gain outnumber losses and duplications in all intervals. Between the Devonian (D) and the end of the Cretaceous (K), similar numbers of losses and duplications of domains of distinct functions were observed. As the number of sampled taxa differs in Alpha-, Beta-, and GammaHVs, to compare the impact of domain gains, losses, and duplications in the evolution of their repertoires, we established event rates by normalizing the total number of events (e) by the tMRCA of Hespesviridae ($394 Mya), and the total number of taxa (spp) per subfamily (e Â tMRCA À1 /spp; Fig. 6A).
For domain gains, these results revealed that Beta-and GammaHVs evolved under higher rates of acquisitions (14.77 and 13.83 Â 10 À3 gains per Myr per species), nearly twice as high as AlphaHVs (8.15 Â 10 À3 ). The levels of domain losses in all subfamilies were very similar (around 5 Â 10 À3 ), but lower than gains and duplications. Finally, rates of domain duplications varied among the three HV groups, with BetaHVs showing the highest rate (11.19 Â 10 À3 duplications per Myr per species).

The primordial domain repertoire of HVs
Prior to the split of Herpesviridae ancestors into their three known subfamilies, their last common ancestors ($394 Mya) encoded a minimal set of forty-one domains (Table 1). In this set, nearly one third of the domains (twenty-seven out of fortyone) are involved in processes of 'Capsid assembly and structure' or 'DNA Replication, recombination, and metabolism', an aspect that matches previous analysis of functional conservation in HVs (Alba 2001). Domains encoded by thirty-three HV core genes are found in this conserved set, which are shown in Table 1 with their corresponding annotation in the Human Herpesvirus 1 (HHV1) genome. The following sections provide details about how these events impacted the evolution of HV domain repertoires and each functional category.

Domains with auxiliary functions
Domains classified as 'Auxiliary' perform distinct enzymatic and accessory functions, and were mainly acquired by Alphaand GammaHVs, between the Paleogene (Pe) and the Quaternary (Q) periods (Fig. 5A). In these periods, AlphaHVs acquired five auxiliary domains more than one time, they are: Lipase; dNK; Herpes_IR6; zf-RING_UBOX; and Keratin_B2_2. For GammaHVs, three domains were gained in multiple independent events: Lectin_C; DHFR_1; and RINGv.
Duplications of such domains were not as common as in other functional categories, being kept at very low levels until the Cretaceous (Fig. 5A). Only from the Paleogene a few domains got duplicated, especially in GammaHVs, which expanded the number of copies of the domains Herpes_LP and Herpes_ORF11. At the same period, AlphaHVs experienced multiple duplications of Herpes_IR6, a domain only found in this subfamily.
Losses were uncommon among 'Auxiliary' domains; however, a few of them were lost in more than one occasion, in distinct time intervals. The domain dNK (Deoxynucleoside kinase), for example, was lost in most Alpha-and BetaHVs along the Triassic-Jurassic interval. Later, between the Paleogene and Quaternary periods, AlphaHVs from the genus Simplexvirus lost zf-RING_2 (a Ring finger domain) at least three times. Finally, at the same period, GammaHVs from different genera lost one of their multiple copies of Herpes_ORF11, a domain found in dUTPase proteins.

Capsid domains: conservation as a rule
The functional category of domains implied in 'Capsid assembly and structure' were the most conserved and less flexible group. Among the seventeen existing capsid domains, fourteen were inherited from the common ancestors of Herpesviridae (Table 1), meaning that only three domains of this kind were acquired along the evolution of these viruses (Fig. 5A), one per subfamily (Figs 2 and 6B). AlphaHVs acquired the domain Herpes_UL35 along the Triassic-Jurassic interval; and at the same period, BetaHVs acquired the domain HV_small_capsid; and GammaHVs acquired the domain Herpes_capsid much earlier, most likely along the Cretaceous and Permian period.
Although very rare, duplications of capsid domains were slightly more common than gains. The domain Herpes_env is an interesting exception, being duplicated in most Beta-and  GammaHVs, while present as single copy in AlphaHVs (Fig. 2). This very domain family was also the one facing more losses, which took place especially in more recent times (Pe-Q interval).

Domains performing DNA-related functions
As observed for 'Capsid' domains, the functional category 'DNA Replication, recombination and metabolism' represents the second most conserved and less flexible set of domains of HVs. Apart from thirteen domains already present in the primordial repertoire of Herpesviridae ancestors (Table 1)  importantly, dUTPase, observed as two copies in several GammaHVs.
Losses of DNA-related domains, albeit uncommon, were recurrent mainly in BetaHVs (Fig. 6B), which lost some domains inherited from the common ancestors of Herpesviridae, such as Ribonuc_red_lgN and Ribonuc_red_sm, which are ribonucleotide reductase domains; and Herpes_TK, a herpesviral thymidine kinase. These domains are mostly absent in all BetaHVs since the Permian and Triassic periods.

Envelope domains: a highly flexible and redundant functional group
The functional category of 'Envelope' domains was undoubtedly the most flexible and redundant domain group in HVs (Fig. 2), which were especially acquired by BetaHVs of the genus Cytomegalovirus (Fig. 6B). Overall, the gain rates of envelope domains were much higher than those of other functional categories (Fig. 5A). The following domains, which are of animal origin (Holzerlandt 2002), were gained in multiple and independent events, especially between the Paleogene and Quaternary periods: immunoglobulin domain; immunoglobulin V-set domain; immunoglobulin C1-set domain; CD80-like C2set immunoglobulin domain; Class I histocompatibility antigen; TNFR/NGFR cysteine-rich domain; 7 transmembrane receptor; and Serpentine type 7TM GPCR chemoreceptor Srsx.
Duplication of envelope domains were mostly common from the Paleogene to present times, being very infrequent until the end of the Jurassic period (Fig. 5B). Proportionally to the rates of gains, BetaHVs experienced frequent events of envelope domain duplications (Fig. 6B), especially involving those most likely acquired by HGT. For AlphaHVs, the envelope domain The domains are shown with their absolute frequency among the fully sequenced genomes of members of the Herpesviridae family (75 ¼ shared by all viruses), their distribution among the three subfamilies (a, b, and c), and their functional categories. Genes are named as found in the HHV1 genome annotation.
-, genes with no homologs in HHV1.
Herpes_glycop_D, found in glycoproteins D/GG/GX, was duplicated in at least three independent occasions. Duplications of envelope domains in GammaHVs took place mostly along the Pe-N-Q interval, involving domains unique to this subfamily, in particular Herpes_BLLF1, encoded by a major outer envelope glycoprotein, and the domains encoded by latent membrane proteins, Herpes_LMP1 and Herpes_LMP2. Losses of envelope domains were kept at low rates until the Late Cretaceous (Fig. 5C). Notably, these losses took place predominantly among AlphaHVs (Fig. 6B), which lost HV-specific domains in multiple occasions, they are: Herpes_glycop_D; Herpes_glycop_H; Herpes_UL43; Herpes_UL49_5; Herpes_UL73 and Herpes_US9.

Domains involved in modulation and control
Domains encoded by proteins implied in 'Modulation and Control' of host physiology and gene expression evolved under a regime similar to that of 'Envelope' domains, with distinct subfamilies, genera, and species developing their specific repertoires. Only a single modulation domain is shared among all members of the Herpesviridae family: Herpes_UL69, a transcriptional regulator inherited from the ancestors of all HVs ( Table 1). The rates of gains, duplications, and losses of these domains increased steadily over time, being among the highest rates when compared with other functional categories (Fig. 5). Acquisitions of domains in this category were mostly common among members of the subfamily Gammaherpesvirinae (Fig. 6B), which gained at least twenty modulation domains, now strictly found in a few species, such as: the interleukin (IL) domains IL6, IL17, IL22, and IL31; the interferon regulatory factors IRF and IRF-3; the cyclin domains K-cyclin_vir_C and Herp-Cyclin; and other domains participating in distinct pathways, such as CARD; BH4; Bcl-2_3; bZIP_2; TNF; EBV-NA3; Herpes_LAMP2; and KSHV_K8. Finally, two interleukin domains-IL8 and IL10gained at multiple independent events, are now present in all subfamilies, although not in all species (Fig. 2).
Duplications of modulation domains were frequent all over the evolution of HVs, and most of them involved domains specific to a single HV subfamily. AlphaHVs, for example, obtained duplicates that are now ubiquitous among them, as the Zinc finger domains zf-C3HC4, zf-C3HC4_2, and zf-C3HC4_3; and the viral regulatory domains Herpes_IE68, and Herpes_ICP4_C. BetaHVs, on the other hand, obtained extra copies of domains found only in a few viral lineages, as the immediate early protein domains HHV6-IE (unique to Roseolovirus) and Herpes_IE1 (unique to Cytomegalovirus). In GammaHVs, the main domains with duplicates were: Sushi domain, mostly found in multiple copies in Rhadinovirus, and; the death effector domain, mainly encoded by members of Rhadinovirus and Percavirus.
Modulation domains were among the main functions lost after the Jurassic period (Fig. 5C), and a large number of them were encoded by GammaHVs (Fig. 6B). The oldest losses of modulation domains in this subfamily took place most likely in the Cretaceous period (K), when the domain Bcl-2, an apoptosis regulator, was independently extinct at ancestors of Macavirus, Rhadinovirus, and Percavirus. In BetaHVs, the interleukin domain IL8 was lost at least four times, mostly along the Pe-N-Q interval, similar to what happened with AlphaHVs, where the zinc finger domain zf-C3HC4_2 was extinct in distinct species of Simplexvirus and Mardivirus.

Domains of tegument proteins
Tegument proteins establish mutual interactions forming a tight matrix that links the capsid and the envelope of HVs (Mocarski 2007;Owen, Crump, and Graham 2015). 'Tegument' domains are highly conserved across subfamilies, and four of them were inherited from the common ancestors of Herpesviridae: Herpes_UL7; Herpes_UL16; Herpes_UL24; and Herpes_teg_N, the last two still present in all HV species (Fig. 2). The total numbers of tegument domains acquired by each subfamily were similar, and notably this was the only functional category where rates of domain gain decreased over time (Fig. 5A), with the level of acquisition reaching its peak in early times, between the Devonian and the Permian. Along these periods, AlphaHVs acquired a set of seven domains still encoded by almost all their members: Herpes_UL14; Herpes_UL21; Herpes_UL36; Herpes_UL37_1; Herpes_UL46; Herpes_UL51; and Pkinase. At the same time interval, BetaHVs acquired four subfamily-specific domains: Herpes_U59; Herpes_pp85; DUF2664; and UL97, a domain with protein kinase activity (GO:0004672). Finally, during those early times, GammaHVs acquired their own tegument domains, such as: Herpes_BTRF1; Herpes_BLRF2; Tegument_dsDNA; DUF832; DUF848; and DUF2733.
Following an inversely proportional trend of domain gains, the rate of 'Tegument' domain duplication increased over time (Fig. 5B), with BetaHVs providing a major contribution to that (Fig. 6B). The main duplications in this subfamily took place in the Pe-N-Q interval, and involved the domains Herpes_pp85, Herpes_UL82_83, and US22. Finally, duplications in Alpha-and GammaHVs involved mostly the domains Pkinase and Tegument_dsDNA, respectively.
Overall, the functional category 'Tegument' experienced the largest number of losses (Figs 5B and 6B), being the main function lost between the Paleogene and Quaternary, especially by BetaHVs. In this period, this viral group lost copies of US22, while a few domains were extinct in some species, in particular the domain U71, mostly extinct in Cytomegalovirus; Herpes_U30, lost in some Proboscivirus; and DUF2664, independently lost in distinct genera.

Domains with other/unknown functions in HVs
Although domains classified with other/unknown functions do not make a functional category per se, it is important to highlight a few instances of gains, duplications and losses of such domains, as it could provide hints on their possible role on HV infection cycle. Overall, although at low levels, the flux of such domains in and out of HV repertoires was constant. Between the Devonian and the Permian periods, AlphaHVs gained domains that now are ubiquitous in their subfamily, such as Herpes_UL4 and Herpes_UL3 (found in nuclear proteins), US2, and Gene66. Around the same periods, BetaHVs acquired a set of domains to date found only among them: Herpes_U5; Herpes_U55; DUF587; and DUF570. GammaHVs also gained their subfamily-specific domains, in particular Herpes_BBRF1 and DUF717, now present in most members of that group. Finally, most probably in the Devonian, ancestors of Beta-and GammaHVs acquired three domains still conserved in all their species: Herpes_UL87; Herpes_UL92; and Herpes_UL95.
Duplications were mainly observed in recent times (Pe-N-Q; Fig. 5B), with AlphaHVs obtaining duplicates of Gene66 and Collagen, the later also duplicated in a few GammaHV species. In BetaHVs of the genus Roseolovirus, multiple duplications of DUF865 took place, and now such viruses encode from eight to twenty-two copies of this uncharacterized domain. Finally, losses involved mostly domains unique to AlphaHVs (Fig. 6B), such as Herpes_UL3, Gene66, and US2, which were lost along the Neogene and Paleogene periods.

Discussion
This study provides a thorough overview on the evolution of domain repertoires of HVs (Family: Herpesviridae), tracking gains, losses, and duplications of domains from distinct functions along the viral phylogeny. Since all possible ORFs were scanned for domains disregarding their sizes, initial codons, or whether or not they were annotated in the original genome annotations, our approach detected a large set of at least 274 domains encoded by members of the Herpesviridae family. It was only possible due to the sensitivity of HMM profile screenings, which allows the detection of remote homology between sequences, a task that conventional methods of sequence comparison are unable to achieve (Mistry et al. 2013). Ancestral character reconstructions can be performed using Parsimony, Maximum Likelihood, and Bayesian approaches (Joy et al. 2016). The domain repertoires of HVs evolved at distinct rates, and the interplay among gains, losses, and duplications of domains from distinct functional categories determined the current composition of HV domain repertoires. In this study, a simple linear meristic parsimony model was used to infer ancestral states. Future studies would benefit from using and comparing Parsimony approaches with Maximum Likelihood or Bayesian ones, which although being more computationally expensive, can deal better with ancertaints (Joy et al. 2016).
The evolution of Herpesviridae was largely characterized by high rates of domain acquisition. Since the early Devonian, these HVs nearly doubled the size of their domain repertoires, increasing from around forty-one domains encoded by their ancestors, to an average of 75.6 6 8.4 domains encoded by present-day HVs. Most domains acquired by these viruses are from the functional categories 'Auxiliary', 'Envelope', and 'Modulation and Control', and were mainly gained after the Jurassic (Fig. 5A), coinciding with the rapid adaptive radiation underwent by mammalian species in that period (Lee and Beck 2015). As host range and cell tropism are mostly defined by cell surface receptors, intracellular factors, and the ability of viruses to manipulate host physiology (Werden, Rahman, and McFadden 2008;Grinde 2013), the high levels of gains of domains performing 'Auxiliary', 'Envelope', and 'Modulation and Control' functions was probably an adaptive strategy adopted at least by mammalian HVs to cope with the rapid diversification of their host species. Many of these domains allowed viruses to evade and modulate the host immune system (Bernet et al. 2003), particularly by mimicking and/or blocking cellular processes, giving HVs the ability to establish latency and lifelong infections (Holzerlandt 2002;Grinde 2013).
Duplications were particularly common for domains involved in 'Tegument', 'Modulation and Control', and 'Envelope' functions. The rates of these events varied across subfamilies, with BetaHVs showing the highest rate of duplications. In this HV subfamily, 'Envelope' domains were the main category with duplicates, most of them involving cytomegaloviruses. When a gene or domain gets duplicated, their copies can have two possible fates: domain loss, or neofunctionalization (Wagner 1998). Especially in Beta-and GammaHVs, the overall rates of domain duplications were higher than losses (Fig. 6), indicating that most duplicated domains were kept in their repertoires. With duplicates following independent evolutionary paths, neofunctionalization can operate by means of multiple mutations leading to radical changes in function, or by fine changes on protein affinity and specificity (Levasseur and Pontarotti 2011). Several Beta-and GammaHV domain duplicates are known to have diverged to perform distinct functions. This is observed, for example, in dUTPase homologs, where neofunctionalization generated a broad set of proteins that perform distinct roles in viral infection (Davison and Stow 2005;McGeoch, Rixon, and Davison 2006). Another known example of duplication leading to neofunctionalization involves the domain Herpes_glycop_D, found only in AlphaHVs and present as two copies in most species: the glycoprotein D, an envelope protein that inhibits cellto-cell adhesion and plays a role in cell entry and fusion (Zhang et al. 2011); and the glycoprotein G, which by mimicry shows chemokine binding activity, antagonizing cellular chemokine receptors (Bryant 2003;Van de Walle et al. 2009).
Nearly a quarter of the domains (67 out of 274) in the repertoire of Herpesviridae are currently present as more the one copy in some species. As retention of domain duplicates was higher than losses in most HVs, our results indicate that, apart from the aforementioned examples, neofunctionalization may be a more common process in HVs than previously thought. Cases of duplication illustrated in Fig. 2 provide room for further investigations about the functions carried out by duplicates during infections.
The rates of domain losses for each HV subfamily were very similar, although the functional categories of lost domains differed across these viral groups (Fig. 6). Although domain gains and duplications outnumbered losses, the elimination of domains from HV repertoires played a central role as a source of genetic variation across species, genera, and subfamilies. Along the evolutionary history, gene losses can be either an adaptive or neutral phenomenon, which brings about two main hypotheses explaining the evolution of genomes: the 'less-is-more' hypothesis, in a context of adaptive evolution, and; the 'reductive evolution' hypothesis, in a scenario of neutrality (Albalat and Canestro 2016). Considering the less-is-more hypothesis, while HVs evolve and explore new hosts or tissues, adaptations in response to environmental changes and new pressures are essential to maintain viral fitness (Daugherty and Malik 2012;Grinde 2013). If a domain proves to be highly antigenic in a certain biological context, instead of evading the host defense via multiple mutations, losses of whole domains can be an efficient and fast mode of adaptation (Daugherty and Malik 2012;Elde et al. 2012;Albalat and Canestro 2016). Another important aspect of adaptive evolution via domain losses regards the negative stoichiometric effect caused by the excessive expression of certain gene products, a phenomenon known as 'dosage sensitivity' (Albalat and Canestro 2016). Interestingly, it was found that some eukaryote gene families associated with 'DNA-related' functions tend to be duplication-resistant, and their extra copies are prone to be lost (Albalat and Canestro 2016). Our results have revealed similar trends for HV domains associated with 'DNA Replication, recombination, and metabolism', which have shown low rates of gains and duplications (see Figs 2, 5C, and 6B), suggesting that domains belonging to this functional category, when duplicated, have a low probability of retention, being quickly purged from viral populations as a result of the deleterious effects of dosage imbalance (Levasseur and Pontarotti, 2011;Albalat and Canestro 2016).
Apart from dosage sensitivity, another aspect favoring the loss of duplicates is the 'dominant negative effect', a phenomenon that emerges when a second copy of a protein interferes with the functioning of its original version (Levasseur and Pontarotti 2011). Such mechanism can lead, for example, to the formation of defective complexes (Perica et al. 2012), and as such, it could be a possible explanation for the low frequency of duplicates in domains involved in 'Capsid assembly and structure' (see Figs 2, 5C, and 6B). Probably, keeping duplicates of such protein domains may lead to structurally distinct copies of capsid subunits, which could potentially harm the capsid structural integrity by dominant negative subunits.
In the evolution of 'Tegument' domains, losses were very common, while their gains exceptionally decreased over time. Tegument proteins are known to establish a myriad of interactions among each other in the space between the capsid and the viral envelope, strategic position that allow them to perform structural and/or modulatory roles upon cell entry (Owen, Crump, and Graham 2015). Inside mature virions, these proteins create a densely packed and symmetric inner layer of proteins surrounding the capsid, and an outer layer of loosely packed proteins around it, closer to the viral envelope (Zhou et al. 1999;Yu et al. 2011). Given such heterogeneous structural and functional properties, it is not clear what phenomenon could have played a major role driving losses and preventing gains of tegument domains. Given their essential roles in viral infection, and their poor levels of conservation across HV subfamilies (Fig. 2), further studies would be necessary to explain why these viral groups now encode their own specific sets of tegument domains.
Finally, considering the hypothesis of regressive evolution in a context of neutrality, genetic drift can also operate removing domains according to their levels of conditional dispensability. A genetic element can be considered dispensable when the environmental pressures justifying their maintenance in the genome cease to exist, generating a condition where keeping the gene/domain is not strictly advantageous (Albalat and Canestro 2016). This phenomenon was experimentally demonstrated, for example, in Poxviruses, where losses of duplicates were frequent upon absence of specific selective pressures (Elde et al. 2012). In scenarios of domain neofunctionalization after duplication, one of the copies can keep its original function, whereas the other can evolve faster. Most extra copies of genes/domains tend to accumulate loss-of-function mutations leading to the generation of pseudogenes, which can be lost due to their high levels of dispensability (Wagner 1998). In a similar way, domains that are functionally linked in pathways/complexes with domains encoded by pseudogenes can be co-eliminated in consequence of functional bias (Albalat and Canestro 2016). Since our search for domains screened all possible ORFs, disregarding their sizes or initial codons, some domains do not match existing genes annotated in the reference genomes (please see Supplementary Data and Data Availability). Further experimental studies would be necessary to clarify this aspect, and to determine what mode of evolution (adaptive or neutral) prevails at shaping the evolution of HV repertoires and driving losses of domains from distinct functional categories.
The evolutionary origins of viruses remain unclear despite decades of debates, raising distinct (and to some extent opposing) hypothesis to explain their origins, such as: (1) the 'Escape' hypothesis, which considers viruses as genetic elements that escaped from cells and expanded themselves by gene acquisitions; (2) the 'Regressive' hypothesis, which claims that viruses derive from free-living unicellular organisms that lost several genes, thus becoming obligate intracellular parasites; and (3) the 'virus-first' hypothesis, which asserts that viruses originated before cellular organisms, and co-evolved with them (Wessner 2010;Nasir and Caetano-Anolles 2015;Harish et al. 2016).
Using phylogenetic and domain composition information, our reconstruction of ancestral characters revealed that ancestors of HVs encoded a primordial set of at least forty-one domains, which perform fundamental functions such as 'Capsid assembly and structure' and 'DNA Replication, recombination and metabolism'. As only twenty-eight of these domains are still encoded by all members of the Herpesviridae family, at least thirteen domains were lost along their evolution, proving not to be strictly essential. Despite that, events of domain accretion outnumbered losses, and HV genomes expanded progressively by capturing or generating domains de novo to perform specific functions (Davison 2002;Holzerlandt 2002). This observation agrees mostly with the 'Escape' hypothesis, once acquisition of extrinsic genetic elements was predominant in HVs.
When HVs from the families Alloherpesviridae and Malacoherpesviridae had their genomes scanned for domains using the same approach used for members of Herpesviridae, we observed that only one domain is conserved in all members of the Herpesvirales order: the domain DNA_pol_B, found in DNA polymerase family B. Remarkably, this domain is also encoded by tailed bacteriophages (order Caudovirales; Petrov and Karam 2004), which also share a common DNA packing machinery with members of Herpesvirales (Rixon and Schmid 2014). Based on these similarities it has been hypothesized that these two viral orders may have had a common ancestor (Rixon and Schmid 2014), which most probably existed prior to the split between Eukaryotes and Akaryotes (Bacteria and Archaea) at least 2000 Mya, in the Paleoproterozoic Era (Lovisolo, Hull, and Rö sler 2003;Gradstein et al. 2012;Rixon and Schmid, 2014). Considering these genetic and functional similarities between members of Herpesvirales and Caudovirales, it is plausible to suggest that from a small set of core genes, these DNA viruses co-evolved with their hosts, developing specific repertoires by acquiring new genes via HGT (Lovisolo, Hull, and Rö sler 2003;de Andrade Zanotto and Krakauer 2008;Krupovic and Koonin 2017) and by innovative and diversifying processes of duplication and de novo generation of domains (Davison 2002;Mocarski 2007).
By associating information about the domain content of large DNA viruses, and the viral phylogenetic history, ancestral character reconstruction can provide important insights about the origins of these genetic elements, and of viruses per se. Acquisitions, duplications and losses of domains over timeshaped HV domain repertoires, and are important drivers of genetic diversity. The flow of domain coding regions in and out of HV genomes varied greatly in distinct time periods, and across viral groups, not only in terms of event rates, but also on domain functional groups. Remarkably, rates of domain loss were nearly four times lower than gains and duplications, an aspect that explains the expansion of HV genomes. Despite sharing a core set of twenty-eight domains, along more than 400 millions of years of evolution, co-infections with other viruses, and coevolution with their hosts, HVs developed very specific domain repertoires at the subfamily, genus, and species levels. Further studies focused on the role of such group-specific domains could help us understand which genetic elements define HV host range and tissue tropism, and could also suggest new targets for diagnosis and treatment of HV infections.

HV species and protein domain identification
To identify domains encoded by genomes of members of the Herpesviridae family, all possible ORFs from seventy-five fully sequenced genomes available on NCBI Viral Genomes (Supplementary Table S3; Brister et al. 2015) were translated, and included in our query database. Using the program hmmscan implemented in HMMER-3.1 (Mistry et al. 2013), all peptides were searched against profile HMMs generated using Pfam-A seed alignments (version 31.0;Finn et al. 2014). To decide which hits were reliable enough to be kept as true domains, strict inclusion thresholds of 0.001 were defined for both persequence e-value and per-domain conditional e-value. Considering these thresholds, python scripts were used to retrieve the domain repertoire of each herpesviral species from the hmmscan outputs. By integrating information from Pfam (Finn et al. 2014), Interpro, Pfam2GO (Mitchell et al. 2015), and previous studies (Davison 2002;Mocarski 2007), each domain was classified by its main functional category, such as: 'Auxiliary'; 'Capsid assembly and structure'; 'DNA Replication, recombination, and metabolism'; 'Envelope'; 'Tegument'; 'Modulation and Control'; or 'Other/Unknown' in case it does not fit in any of the previous categories.

Phylogenetic analysis
To understand the evolution of protein domains, the inference of a herpesviral phylogenetic tree is an essential step. Using MAFFT (Katoh and Standley 2013) we generated three multiple sequence alignments (MSAs) of conserved viral proteins encoded by the genes UL15, UL27, and UL30, which were treated as three independent partitions. These genes have proven to contain enough signal to accurately reconstruct the evolutionary relationships of Herpesviridae members (Alba 2001;McGeoch, Dolan, and Ralph 2000;McGeoch and Gatherer, 2005;Escalera-Zamudio et al. 2016). Amino acid substitution models were determined using ProtTest (Darriba et al. 2011). The MSAs were used as partitions on *Beast to reconstruct a time-calibrated maximum clade credibility (MCC) tree using a Markov Chain Monte Carlo Bayesian approach implemented in BEAST v2.4.5 (Bouckaert et al. 2014). Two divergence time priors were used to date internal nodes: the Herpesviridae MRCA, according to (McGeoch and Gatherer 2005); and the MRCA of HHV1 and HHV2 according to (Wertheim et al. 2014). Priors of monophyletic constrains were incorporated following taxonomic classification provided by the International Committee on Taxonomy of Viruses (King et al. 2018). The analysis was run for 35 million generations using the Yule model as a coalescent prior, and relaxed (uncorrelated lognormal) molecular clock.

Ancestral state reconstruction
Taking the domain repertoires of the existing HV species, a matrix of domain counts was constructed using Python scripts (see Data availability section below). Using the 'trace ancestral character' function implemented in Mesquite (Maddison and Maddison 2009), this matrix was used to infer the ancestral states, treating domains as meristic (additive) characters, and using linear parsimony reconstruction. This model reconstructs the ancestral states (number of domains) at the internal nodes of a tree, while trying to minimize the number of changes (events) given the phylogeny and the matrix (domain distribution). This approach allowed us to infer events of domain gain, losses and duplications, and to map them along the branches of the Herpesviridae phylogeny using iTOL (Letunic and Bork 2016), highlighting a possible parsimonious scenario for the evolution of HV domain repertoires.

Supplementary data
Supplementary data are available at Virus Evolution online.

Data availability
All accession numbers of genomes, domains and GO terms mentioned in this study are listed in Supplementary Materials. All data and codes generated in this study are deposited in the following repository on GitHub: https://github.com/anderson brito/openData/tree/master/brito_2019_HVsDomains