The Eukaryotic Ancestor Had a Complex Ubiquitin Signaling System of Archaeal Origin

The origin of the eukaryotic cell is one of the most important transitions in the history of life. However, the emergence and early evolution of eukaryotes remains poorly understood. Recent data have shown that the last eukaryotic common ancestor (LECA) was much more complex than previously thought. The LECA already had the genetic machinery encoding the endomembrane apparatus, spliceosome, nuclear pore, and myosin and kinesin cytoskeletal motors. It is unclear, however, when the functional regulation of these cellular components evolved. Here, we address this question by analyzing the origin and evolution of the ubiquitin (Ub) signaling system, one of the most important regulatory layers in eukaryotes. We delineated the evolution of the whole Ub, Small-Ub-related MOdifier (SUMO), and Ub-fold modifier 1 (Ufm1) signaling networks by analyzing representatives from all major eukaryotic, bacterial, and archaeal lineages. We found that the Ub toolkit had a pre-eukaryotic origin and is present in three extant archaeal groups. The pre-eukaryotic Ub toolkit greatly expanded during eukaryogenesis, through massive gene innovation and diversification of protein domain architectures. This resulted in a LECA with essentially all of the Ub-related genes, including the SUMO and Ufm1 Ub-like systems. Ub and SUMO signaling further expanded during eukaryotic evolution, especially labeling and delabeling enzymes responsible for substrate selection. Additionally, we analyzed protein domain architecture evolution and found that multicellular lineages have the most complex Ub systems in terms of domain architectures. Together, we demonstrate that the Ub system predates the origin of eukaryotes and that a burst of innovation during eukaryogenesis led to a LECA with complex posttranslational regulation.


Introduction
Of the three domains of life, eukaryotes have the most complex forms of cell organization. Understanding the emergence and early evolution of the eukaryotic cell is a major challenge for evolutionary biology. Recent findings have profoundly changed our long-held view of a simple last eukaryotic common ancestor (LECA) (Cavalier-Smith 1987, 1991, pointing instead to an ancestor that was already equipped with the machinery required for many of the cellular processes occurring in extant eukaryotes. These include, for instance, the cell division machinery (Makarova et al. 2010), the endomembrane apparatus (Brighouse et al. 2010), the spliceosome (Collins and Penny 2005), nuclear pores (Mans et al. 2004), a wide repertoire of transcription factors (de Mendoza et al. 2013), the RNA interference machinery (Shabalina and Koonin 2008), and cytoskeletal motors (Wickstead and Gull 2011;Seb e-Pedr os et al. 2014). It is unclear, however, whether the LECA already used tightly regulated signaling pathways to control these cellular processes.
We know that signaling systems are crucial in complex cells, as they provide the basis for finely tuned regulation of processes such as transcription Turjanski et al. 2007;Whitmarsh 2007), the cell cycle (Harashima et al. 2013), interactions with the milieu (Seger and Krebs 1995;Deshmukh et al. 2010;Suga et al. 2012), and localization of components within the cell (Field and Dacks 2009;Brighouse et al. 2010). Many of these functions rely on kinase activity and posttranslational protein modification, two signaling strategies of prokaryotic origin that gained importance at the origin of eukaryotes . In eukaryotes, posttranslational protein modification by ubiquitin (Ub) constitutes a major source of proteome regulation (Hochstrasser 2009). Thus, understanding the evolution of Ub signaling can provide clues not only into how the LECA regulated its cellular processes but also into the role of signaling systems during the origin and early evolution of eukaryotes. Despite some evolutionary studies devoted to specific gene families (Gagne et al. 2002;Mar ın 2009aMar ın , 2009bMar ın , 2010aMar ın , 2010bMar ın , 2010cMar ın , 2013Eme et al. 2011;Grau-Bov e et al. 2013), however, a global picture of the evolution of Ub posttranslational signaling in eukaryotes is still missing.
Ubiquitination consists of the posttranslational modification of proteins by the covalent attachment of Ub, a adaptor and target recognition subunits (Cardozo and Pagano 2004;Willems et al. 2004;Petroski and Deshaies 2005;Stone et al. 2005;Deshaies and Joazeiro 2009).
The ligase activity of E3s can be reversed by DUBs, isopeptidase enzymes that cleave Ub chains after the C-terminus of the peptide label (Amerik and Hochstrasser 2004). Some DUBs are specific to a particular kind of Ub linkage (usually Lys48 or Lys63) but most are unspecific and promiscuous (Komander et al. 2009). According to their catalytic mechanism, DUBs are divided into cysteine proteases (UCH, USP, OTU, and Josephin) and metalloproteases (JAB). Finally, the SUMO and Ufm1 systems employ specific E3 and peptidase protein families. There are two E3s (zf-MIZ, RINGs, and IR1-M) and three peptidases (ULP/SENP, WLM, and C97) in SUMO; and one E3 (DUF2042) and one peptidase (C78) in Ufm1.
In this work, we use comparative genomics to decipher the origin and evolution of three Ub-like systems: Ub itself, SUMO, and Ufm1. Our reconstruction shows that the ubiquitination toolkit of the LECA was as complex as that of most modern eukaryotes, in terms of diversity of gene families. Furthermore, various species of Archaea belonging to three different lineages (Euryarchaeota, Crenarchaeota, and Aigarchaeota) already had a minimal but complete ubiquitination toolkit. Thus, Ub signaling existed prior to the origin of eukaryotes and underwent a profound process of innovation during eukaryogenesis, resulting in a complex Ub system in the LECA. Analysis of the subsequent evolution of the Ub-like posttranslational systems in eukaryotes shows that E1 and E2 predate the LECA and underwent little innovation during early eukaryotic evolution, whereas most E3 families appeared concomitantly with eukaryotes and underwent multiple lineage-specific expansions and diversifications of protein domain architectures. We also describe two independent expansions of the Ub signaling system at the origins of multicellularity in animals and plants. Overall, we show that the complexity of the LECA involved the capacity to perform posttranslational regulation of different cell processes by Ub and Ub-like systems. This suggests that Ub signaling was key to the origin of eukaryotes and was later expanded in some specific, mostly multicellular, lineages.

A Comparative Survey of the Ub System Reveals an Archaeal Origin and a Complex Toolkit in the LECA
To elucidate the origin and evolution of Ub-like systems, we first examined the presence and abundance of 40 protein families related to Ub, SUMO, and Ufm1 signaling in a broad range of eukaryotic genomes (see supplementary table S1, Supplementary Material online, and Materials and Methods). Specifically, we surveyed the generalist E1 and E2 enzyme protein families, 27 specific components of the Ub system (including the peptide label, E3s, and peptidases), 7 families related to SUMO, and 4 related to Ufm1 (see supplementary table S2, Supplementary Material online, and Materials and Methods). Our survey revealed that 38 of these 40 protein families are widespread among eukaryotic groups ( fig. 1). We found that complete toolkits for Ub, SUMO, and Ufm1 systems exist in all the main groups of eukaryotes except for Fungi, in which Ufm1 is missing (see below). This phylogenetic distribution indicates that Ub, SUMO, and Ufm1 are ancient systems that were already present in the LECA ( fig. 2).
To trace back the origin of the different signaling systems, we also examined a comprehensive database of prokaryotic genomes (see Materials and Methods). Although none of the analyzed bacterial genomes contained a complete Ub toolkit, many bacteria were found to possess signaling systems that employ JAB peptidases, and E1 and E2 enzymes akin to the ones acting in ubiquitination Hochstrasser 2009;Humbard et al. 2010). These bacterial homologs act in functional contexts unrelated to protein labeling, such as molybdopterin and thyamin biosynthesis (ThiF E1) and siderophora biosynthesis (JAB) Koonin 2006). We also found F-box, U-box, and DUB enzymes in a few genomes of obligate intracellular parasitic bacteria, such as Agrobacterium tumefaciens, Legionella pneumophila, Candidatus Amoebophilus asiaticus, or various Chlamydiae, probably as a result of independent horizontal gene transfer (HGT) events (Koonin et al. 2001;Spallek et al. 2009;Schmitz-Esser et al. 2010). Despite lacking Ub systems of their own, these pathogens exploit their hosts' by mimicking various signaling effectors (Spallek et al. 2009). Overall, the Ub-specific components analyzed clearly evolved after the origin of bacteria.
Unlike in bacteria, Ub-specific protein families were observed in many Archaea. Previous work by Nunoura et al. (2011) 2). Interestingly, the number of Ub-related genes in some of these genomes was found to be quite high, including nine C3H2C3 RING (zf_RING_2 domain) E3s in an aigarchaeote and up to six C3H2C3 RING plus a RINGv in the euryarchaeote. In addition, C3H2C3 RING genes have also been detected in two unclassified archaea ( fig. 2 and supplementary file S1, Supplementary Material online).
To determine whether HGT of eukaryotic sequences into prokaryotic genomes could have occurred, we conducted Basic Local Alignment Search Tool (BLAST) similarity searches for all the protein families present in Archaea and phylogenetic analyses of Ub, UQ_con, and UCH (see Materials and Methods for details). None of the prokaryotic genes were found to be unexpectedly similar to eukaryotic sequences according to these methods. Thus, under the current taxon sampling, we can rule out a HGT origin for the archaeal toolkit (supplementary figs. S6 and S7 and file S3, Supplementary Material online).
In contrast, both SUMO and Ufm1 were found to be absent from Archaea and Bacteria. Thus, extant archaeal genomes contain a complete Ub toolkit that includes Ub label, E1 ThiF enzyme, E2 UQ_con enzyme, two different E3 ligases (C3H2C3 RING and RINGv), and two different DUBs (JAB and USP) ( fig. 2), whereas SUMO and Ufm1 are specific to eukaryotes.
Evolution of Ub Signaling in Eukaryotes: Massive Secondary Losses, Few Gains, and Expansion of Gene Families To better understand the evolution of the Ub system in eukaryotes, we examined the counts of two generalist gene families (E1 and E2 enzymes) and 38 protein families that are specific to a particular Ub-like system (peptide labels, E3 ligases, and peptidases) ( fig. 1). We then reconstructed the patterns of gains and losses of each Ub-like signaling toolkit across eukaryotes using information of the phylogenetic distribution of each protein family ( fig. 3). Finally, we also checked for statistically significant gene enrichments and depletions between eukaryotic groups ( fig. 3), that is, significant quantitative changes in the number of proteins of a particular family. In contrast, gains and losses are defined as zero-to-one or one-to-zero state changes.
Our analysis indicates that the LECA already had most of the surveyed gene families, independently of whether we root eukaryotes between unikonts/amorpheans and bikonts (Derelle and Lang 2012) or between excavates and the rest (He et al. 2014). In particular, under the modified "unikontbikont" hypothesis for the root of eukaryotes ( fig. 3), we identified only two gains: SOCS-box and IR1-M gene families (part of the Ub and SUMO E3 toolkits, respectively). Under the assumption of the "Excavata-first" hypothesis, the sole difference was the appearance of Sina E3s after the divergence of excavates (supplementary fig. S1A, Supplementary Material online). Finally, using likelihood-based gain/loss reconstruction (supplementary fig. S1B and C, Supplementary Material online), we obtained a similar result compared with the parsimony-based analysis (33 and 36 gene families in the LECA, respectively, under the "unikont-bikont" hypothesis for the root of eukaryotes). This shows that the recruitment of novel machinery in Ub-like systems is a relatively exceptional event during eukaryotic evolution, especially when compared with the frequent losses of individual system-specific gene families.
Among Ub-like signaling systems, we found that ubiquitination is the most gene-rich pathway in most of the examined eukaryotes, followed by SUMO and Ufm1

FECA
Pre-eukaryotic evolution of Ub and Ub-like systems. The dashed lines indicate two possible phylogenetic scenarios: the Eocyte hypothesis for the origin of eukaryotes within Archaea (1) (Williams et al. 2012) and the "three domains" hypothesis for the relationships among Eukaryota, Bacteria and Archaea (2) (Woese et al. 1990). The reconstruction is the same with both hypotheses. The Ub, SUMO, and Ufm1 toolkits before and after the eukaryogenesis process (i.e., the FECA and the LECA) are shown. Boxes to the right of the cladogram represent the components of the Ub toolkit found in each archaeal group.

729
Origin and Evolution of Ub Signaling . doi:10.1093/molbev/msu334 MBE supplementary information, Supplementary Material online, we describe our findings for specific components of the system.

The Diversification of the Eukaryotic Ub System Is Driven by Architectural Rearrangements
To further analyze the diversification of Ub-like systems in eukaryotes, we used the array of domain architectures of each protein family as a proxy to assess the diversity and versatility of the Ub, SUMO, and Ufm1 toolkits. In particular, we compared the number of different protein domains that co-occur alongside the core protein domain of each protein family (see Materials and Methods). The most abundant families (e.g., canonical RINGs, F-box, BTB, DUBs, and deSUMOylases) are also the most diverse in terms of architectures ( fig. 1  To test whether there are phylogenetic patterns in the profiles of gene counts and architectural diversity of the Ub-like systems, we performed principal component analyses (PCA, see Materials and Methods for details) ( fig. 4). The PCA based on gene counts revealed that embryophytes and metazoans have gene content profiles that differ from those of other eukaryotes ( fig. 4A). In particular, we found that the principal component 1 identified a group of genomes rich in genes related to Ub-like signaling systems, including embryophytes, many animals (especially eumetazoans: Homo sapiens, Capitella teleta, or Nematostella vectensis) and ichthyosporeans (Abeoforma whisleri, Pirum gemmata, and Amoebidium parasiticum). Furthermore, PC2 differentiated most  holozoans from embryophytes, which both clustered separately from the rest of the eukaryotes due to the loadings of many protein families that appeared or expanded in holozoans (e.g., HECT, BTB, SOCS-box, IR1-M, and C3HC4 RINGs) and plants (e.g., F-box, U-box, and C3H2C3 RINGs), respectively. The distinction between plants and holozoans (particularly animals) was also recovered by the PCA based on protein architectures ( fig. 4B) 5A-G, see fig. 3 for the phylogenetic positions of the reconstructed nodes). We inferred that many Ub-related genes already employed multiple accessory protein domains (in black) in several Ub-related genes in the LECA ( fig. 5D), although less than in most extant eukaryotes. For example, the LECA's Ub toolkit used highly promiscuous domains such as Ankyrin repeats (linked to C3HC4 RINGs), UBA (Ub-associated domain, linked to USPs and Ub), and LRR (linked to F-box). Architectural diversification during eukaryogenesis also led to specific domain combinations in E1 and E2 protein families, which use exclusive sets of accessory domains (e.g., E1s have UBA_e1_thiolCys, UBACT, and UBA_e1_C domains) and have little interconnection with other nodes. These E1 and E2 types are conserved in all the other ancestral nodes and characterize the eukaryotic Ub network. Also, the usage of multidomain proteins in the early eukaryote appeared as an important difference compared with archaeal systems, in which all genes encode single-domain proteins.
Since the origin of eukaryotes, the connectivity and network density of Ub and SUMO toolkits independently increased in Amorphea and Bikonta, although to a lesser extent in Bikonta. This led to rich signaling systems in multicellular animals and plants ( fig. 5A-C and E-G), confirmed by the PCA based on domain architectures ( fig. 4B). Nevertheless, we found that the network structure of the deep ancestors influenced later ancestors and extant organisms. For example, the urembryophyte's less extensively connected domain network could be traced back to the urbikont ( fig. 5E-G). This phylogenetic inertia constrained the Ub and SUMO systems of plants, whose expansion was not accompanied by a significant increase of protein architectures. Conversely, the diversified toolkits of animals were recapitulated in the denser domain networks of the urmetazoan, the urholozoan, and the uramorphean ( fig. 5A-C).
Despite these differences in network density, patterns common to all the ancestral networks emerged (fig. 5A-C and E-G). The most abundant catalytic machinery of Ub signaling employed a similar core of highly connected nodes in all the post-LECA ancestors. This included the C3HC4 variants (which shared most of their accessory domains and often co-occurred themselves), C3H2C3/zf-RING_2 (highly connected but not directly linked to other RINGs), IBR, or U-box. The CRL substrate recognition subunits BTB and F-box were both highly connected, particularly to protein-binding domains. In contrast, BTB and F-box shared few nodes, thus suggesting independent diversifications. For example, F-box often co-occurred with Kelch (in plants), LRR, and WD40, whereas BTB used Ankyrin, Kelch, BACK, and NPH3 (a signal-transducing motif that appears at the origin of plants).

The Ancient Ub System and the Origin of Eukaryotes
Our data show that the core components of the eukaryote Ub system originated in Archaea and predate the process of eukaryogenesis that led to the LECA. In particular, the core Ub toolkit inferred from extant Archaea includes Ub, E1s, E2s, two different RING E3s, and two different DUBs ( fig. 2). Interestingly, ubiquitination has been hypothesized to be a key mechanism for the symbiogenic origin of eukaryotes, during which it would be needed to act as a barrier against aberrant proteins resulting from the massive invasion of bacterial Group II introns into the host archaeal genome (Koonin 2006(Koonin , 2011. Thus, our results are consistent with the presence of a complete Ub signaling toolkit in the theoretical protoeukaryote, termed the first eukaryotic common ancestor (FECA) (Koonin 2011;Koumandou et al. 2013).
The initial toolkit was expanded during the stem phase of eukaryotic evolution with the addition of numerous new types of enzymes and an increase in the number of genes in some families ( fig. 2). Similarly, the network of accessory domains of the LECA ( fig. 5D) reveals that eukaryotic Ub-like systems switched to the use of multidomain protein families during their early evolution, whereas archaeal toolkits consist only of the catalytic protein domains. The presence of accessory domains within protein families reflect their ability to physically interact with other cellular components (Basu et al. 2008), which indicates that the rise of new protein families during eukaryogenesis was accompanied by an increasingly connected Ub domain architecture network. Interestingly, this increase in the LECA's regulatory potential was concomitant with the appearance of eukaryote-specific cellular functions regulated by ubiquitination, such as endocytosis, vesicle trafficking, and histone modification, as well as nuclei-specific DNA repair machinery. Altogether, we find that Ub signaling expanded in multiple ways as the first complex eukaryotes evolved.
Overall, our analyses indicate that the LECA had a rich and complex repertoire of Ub signaling genes, generating an extensive ancestral core machinery shared by most of the extant eukaryotic lineages. Given that some gene families were also secondarily, and recurrently, lost during eukaryotic evolution ( fig. 3), our results suggest that there were two phases in the evolution of Ub signaling: 1) an initial period of rapid innovation during eukaryogenesis, in which the minimal FECA toolkit was enriched with new gene families exclusive to eukaryotes and 2) a long process of toolkit contraction (loss of gene families) in various eukaryotic lineages. These findings fit the biphasic model of reductive genome evolution proposed by Wolf and Koonin (2013) and strengthen the idea of eukaryogenesis as a burst of innovation in the history of life.

Diversification of Ub Signaling and the Origins of Multicellularity
Our data show that the core machineries of Ub, SUMO, and Ufm1 signaling were already present in the LECA (fig. 2)

733
Origin and Evolution of Ub Signaling . doi:10.1093/molbev/msu334 MBE systems. This dynamic evolutionary history was mainly driven by lineage-specific gene expansions, architectural diversification of protein domains, occasional recruitment of new machinery, and abundant gene losses.
Gene expansions mostly affected E3 ligases and peptidases of Ub and SUMO toolkits, that is the effector enzymes responsible for substrate selection. Also, we found that the most enriched E3 and peptidase families often made use of promiscuous protein-binding domains, namely RINGs (canonical, IBR and U-box) and CRLs' substrate selector subunits (BTB and F-box), HECTs, and USPs. Likewise, HECTs are also rich in motifs that bind to lipids, complex sugars, and poly-A tails of RNA (Grau-Bov e et al. 2013). The presence of such domains in the effector enzymes increases the substrate specificity and fine-tuned localization of Ub and SUMO (Tordai et al. 2005;Bhattacharyya et al. 2006;Di Roberto and Peisajovich 2013). Thus, the expansions of Ub and SUMO signaling brought an increased regulatory accuracy and functional diversification.
Our analysis also reveals that deSUMOylases are more abundant and diverse than SUMO E3s in most eukaryotes. The opposite pattern is found in ubiquitination, where Ub E3s outnumber DUBs (fig. 6). We therefore propose that two different strategies underlie the specificity of SUMO and Ub labeling in eukaryotes: SUMO relies on postlabeling regulation mediated by peptidases, whereas Ub depends on directed E3 activity. Consistent with this hypothesis, the expansion of SUMO peptidases in Arabidopsis thaliana entailed sub-and neofunctionalization events, whereas its E3s are often redundant (Chosed et al. 2006;Colby et al. 2006). In addition, humans, yeast, and Ar. thaliana can tune SUMOylation using a substrate-specific SUMO paralogs and paralog-specific peptidases (Saitoh and Hinchey 2000;Mukhopadhyay and Dasso 2007;Hickey et al. 2012). We also know that SUMO E2s can directly affect signaling in a nonspecific manner, without using E3s (Reverter and Lima 2005). We see how, from an identical pathway in the early eukaryote, different modes of posttranslational signaling regulation evolved for SUMO and Ub.
Comparing the two structural types of Ub E3s, we see that RING families are more abundant and architecturally diverse than HECTs in all eukaryotes ( fig. 1 and supplementary fig. S4, Supplementary Material online). This might be explained by the fact that HECTs' tertiary structure is intrinsically constrained, as they require their catalytic site to be at the Cterminus to be active (Huang et al. 1999;Verdecia et al. 2003;Rotin and Kumar 2009). Consequently, they do not undergo C-terminal domain shuffling in any eukaryote (Grau-Bov e et al. 2013). Also, the evolvability of RING-based catalysts was further increased by the emergence of CRLs, a combinatorial system of modular subunits with specific functions (e.g., interaction with E2s and substrates). Thus, historical and protein structural constraints explain the prevalence of RINGbased catalysts in eukaryotes.
The greatest sophistication of Ub-like signaling systems is found in embryophytes and metazoans. These groups have the richest and most diverse Ub and SUMO systems among all eukaryotes ( fig. 1). Moreover, the reconstruction of domain networks of ancestral Ub toolkits reveal that extensive innovation occurred at the origin of both animals and plants, probably through processes of domain shuffling that made use of already-in-place molecular machineries ( fig. 5).

MBE
Although most of the surveyed protein families existed prior to the origins of animals and plants, we find that ubiquitination diversified extensively in these multicellular contexts through new domain combinations and gene number expansions ( fig. 1 and supplementary fig. S4, Supplementary Material online). This may be due to the complex multicellularity of plants and animals, which requires fine-tuned regulation of cellular functions. Indeed, parallel to this complexification of posttranslational regulation, animals and plants are known to have a rich transcriptional regulation machinery, probably related to their complex development (de Mendoza et al. 2013). Despite their similarities, the expansions of Ub-like signaling in multicellular animals and plants were independent: Each lineage expanded different protein families ( fig. 4A) and diversified its toolkit with different accessory domains ( fig. 5). This lack of protein architecture conservation among eukaryotes is common in other multidomain protein families (Basu et al. 2008(Basu et al. , 2009. The rise and diversification of multidomain protein families by shuffling is also recurrent in animal genomes (Tordai et al. 2005) and is regarded as a key genomic event to explain the origin of multicellularity (King et al. 2008). Shuffling of ubiquitous and promiscuous domains is a major source of evolvability in eukaryotic signaling networks (Basu et al. 2008), as exemplified by tyrosine kinases (Deshmukh et al. 2010;Suga et al. 2012), Notch (King et al. 2008;Gazave et al. 2009), or Hedgehog toolkits (Snell et al. 2006;Adamska et al. 2007). Here, we identify independent bursts of innovation by domain shuffling underlying the complex Ub and SUMO systems of both animals and plants.

Conclusions
In summary, we found that Ub signaling predates the origin of eukaryotes, as core components of the pathway are present in three different archaeal groups: Aigarchaeota, Crenarchaeota, and Euryarchaeota. The Ub machinery of the earliest eukaryotes thus consisted of E1 and E2 enzymes (common to all three domains of life), two RING E3 types (canonical C3H2C3 and RINGv), and two peptidases (USP and JAB). This early Ub system underwent an important process of innovation during the eukaryogenic phase that led to the LECA.
We propose that three processes shaped Ub signaling during early eukaryotic evolution. First, almost all the Ub-related gene families seen in extant eukaryotes emerged at that time. This includes new catalytic mechanisms (e.g., HECTs and new peptidases) and, most importantly, two eukaryotespecific signaling systems (SUMO and Ufm1). Second, some gene families underwent massive expansions (e.g., RINGs and the highly versatile multisubunit CRLs). Finally, new and diverse protein domain architectures were acquired in both ancient and new enzyme families (e.g., E1s and CRLs' substrate selectors BTB and F-box). Altogether, these events identify the stem phase of eukaryotic evolution as a period of rapid and intense innovation in posttranslational signaling.
After the initial eukaryotic radiation, the Ub and Ub-like systems further evolved by protein family expansion and domain architectural diversification, in a largely lineage-specific manner. There was, however, little protein family innovation, with only IR1-M (animal SUMO E3s) and SOCS-box selectors (holozoan CRLs) evolving later on. These diversification processes particularly affected E3s ligases (in the case of the Ub system) and delabeling peptidases (in the case of the SUMO system) probably because they are in charge of the target selection specificity. In this sense, the diversification of domain architectures in these families is related to the substrate specificity, with new accompanying domains allowing selective interaction with other proteins, complex sugars, lipids or nucleic acids. This process of architectural innovation was especially intense at the origin of animals and plants, coinciding with their need for a precise regulation of multicellularity-related protein products and processes. Thus, alongside the eukaryogenic phase of Ub expansion, the origins of multicellular animals and plants represent the main bursts of innovation in Ub systems in eukaryotes.
Overall, our investigation into the diversity of early eukaryotic Ub signaling clearly points to an important burst of evolutionary innovation at the origin of eukaryotes. This suggests that the LECA was much more complex than previously thought, not only in terms of cellular machineries but also in terms of elaborate regulation systems such as Ub signaling.

Materials and Methods
We obtained all the proteins related to Ub, SUMO, and Ufm1 systems from a selection of 78 eukaryotic proteomes, the nonredundant Archaea and Bacteria protein database from National Center for Biotechnology Information (NCBI), and genomic data from the Microbial Dark Matter project (Rinke et al. 2013) (supplementary table S1, Supplementary Material online). The selection of eukaryotic taxa includes 14 animals, 10 unicellular holozoans, 16 fungi, 1 apusozoan, 4 amoebozoans, 7 embryophytes, 7 unicellular algae (chlorophytes, rhodophytes, and glaucophytes), 6 heterokonts/stramenopiles, 5 alveolates, 1 rhizarian, 1 haptophyte, 1 cryptophyte, and 5 excavates. We obtained the proteomes from publicly available databases, with the exception of Oscarella carmela and Mnemiopsis leidyi, kindly provided by Scott A. Nichols (University of Denver) and Andy Baxevanis (National Human Genome Research Institute), respectively. We also used RNA-Seq data generated in-house (Ministeria vibrans, P. gemmata, Abeoforma whisleri, A. parasiticum, and Corallochytrium limacisporum) (de Mendoza et al. 2013). We performed a Pfamscan on all eukaryotic proteomes and transcriptomes using Pfam A version 26 and selecting the gathering threshold as a conservative approach to minimize false positives (Punta et al. 2012). The identification of bacterial and archaeal sequences was done using HMMER (Eddy 1998), searching the hmm profiles of all the domains (supplementary table S2, Supplementary Material online) against the NCBI Bacteria and Archaea databases and the Microbial Dark Matter project database (Rinke et al. 2013).
We unambiguously assigned each protein of interest (including labeling peptides and E1, E2, E3, and delabeling enzymes) to a certain Pfam domain, referred to as the core defining domains of each protein family (see supplementary  (Burroughs et al. 2009); zf-MIZ were selected by picking those architectures involving this domain combined with PINIT and/or SAP motifs; and DCAFs were identified by selecting proteins composed of WD40 domains and then retaining those that had a DWD motif (He et al. 2006;Hua and Vierstra 2011) with the following logo: Using R (R Development Core Team 2008), we built heat maps based on 1) the number of proteins involving a given core domain in each genome and 2) the number of accessory domains (i.e., total number of different domains that appear with a particular core domain in the same predicted ORF). Additional heat maps of the domain architectures in which each core domain is involved were built (supplementary fig.  S5, Supplementary Material online). Statistical analyses were performed using R to detect enrichments or depletions in gene content in different lineages, using the Wilcoxon rank sum tests with a significance threshold of P < 0.01.
We used the BLAST (Camacho et al. 2009) to look for a potential HGT origin for the archaeal Ub, UQ_con, zf-RING_2, RINGv, and UCH proteins (supplementary fig. S7 and table S3, Supplementary Material online). We searched all the archaeal sequences (identified by HMMER searches, see above) with a cut-off value of 10 À5 and against a combined database including the full NCBI nonredundant protein database, the Microbial Dark Matter database, and the full genomes and transcriptomes included in this study. We took the top 50 hits and searched them back to the same combined database, with a cut-off value of 10 À10 . The network visualizations of this reciprocal BLAST analyses were generated using Cytoscape 3.1.1 (Smoot et al. 2011). We included the raw BLAST outputs in supplementary file S3, Supplementary Material online. Additionally, we performed phylogenetic analyses with UQ_con, UCH, and Ub families (zf-RING_2 and RINGv are not suitable for phylogenetic analysis because they are defined by short and poorly-conserved amino acid motifs). For these analyses, we used 1) all the Pfamscan-identified proteins from our selection of eukaryotes, 2) the identified archaeal sequences from NCBI and the Microbial Dark Matter databases, and 3) the top 100 hits from the BLAST searches in these databases. The alignments were performed using the Mafft L-INS-i algorithm, optimized for local sequence homology (Katoh and Standley 2013), and inspected and manually revised. We used the matched-pairs test of symmetry (Ababneh et al. 2006), implemented in Homo 1.2 for amino acids (http:// www.csiro.au/Homo last accessed 1 October 2014), to determine whether the aligned sequences of amino acids are consistent with evolved under time-reversible conditions (assumed by most model-based phylogenetic programs). Based on the PP plots shown in supplementary figure S6A, Supplementary Material online, it was concluded that the data did not violate this assumption. The phylogenetic trees of UQ_con, UCH, and Ub were estimated using the Le and Gascuel (LG; evolutionary model with a discrete gamma (G) distribution of among-site variation rates (four categories), according to the respective analyses performed with ProtTest 3.4 (Darriba et al. 2011). The LG+G model with four categories was used in 1) maximum likelihood (ML) phylogenetic trees estimated with RaxML 7.2.8, using 100 bootstrap replicates as statistical support for the bipartitions (Stamatakis 2006) and 2) Bayesian inference trees calculated with PhyloBayes 3.3 (Lartillot et al. 2009), using two parallel runs for 500,000 generations and sampling every 100; and using Bayesian posterior probabilities as statistical support.
The reconstruction of ancestral states of each core element was inferred with Mesquite 2.75 using both a parsimony criterion and the AsymmMk likelihood model (http://mesquiteproject.org, last accessed 1 October 2014). We assumed two scenarios for the root of eukaryotes: 1) the modified "unikontbikont" hypothesis (Derelle and Lang 2012) but renaming Unikonta as Amorphea (Adl et al. 2012) and 2) the "Discoba-first" hypothesis (He et al. 2014). For the relationships between Eukaryota, Bacteria, and Archaea, we contemplated both the "Eocyte" (eukaryotes root within Archaea) (Williams et al. 2012;Williams et al. 2013) and "three domains" hypotheses (Woese et al. 1990). The AsymmMk model was implemented with bias of 0.1 between gain and loss rates, with rates of change estimated by the model and taking into account branch lengths. To estimate the branch lengths, we built a multiprotein alignment with Hsp90, Hsp70, and actin homologs using Mafft L-INS-i (Katoh and Standley 2013), which was manually inspected. The matched-pairs test of symmetry performed using Homo showed that these sequences did not violate the time-reversibility assumption (supplementary fig. S1D, Supplementary Material online). In this case, ProtTest showed that the best evolutionary model for our data set was LG with a G distribution of four discrete categories and a proportion of invariable sites (LG+G+I). Using this model (PROTGMMAILG), we used RAxML with a fixed topology (consensus eukaryotic phylogeny, as in fig. 3  and supplementary fig. S1A, Supplementary Material online).
A PCA was performed using built-in R prcomp function, using scaling (so that all variables have unit variance before the analysis takes place) and a covariance matrix, and plotted using bpca R package. We used scaling because our data, although presenting the same units (counts of number of genes), show very different ranges of values (with some families having hundreds of genes and others just one or two). The PCA of the protein counts ( fig. 4A) was based on the number of genes of each family in each species. In the PCA of protein domain architectures ( fig. 4B), instead, the species were clustered based on the number of proteins with a particular domain architecture. To this end, we first created a list of all the existing protein domain architectures (for all protein families) and then counted how many proteins (with each particular architecture) each species has. These raw counts can be visualized in supplementary figure S5, Supplementary Material online.
Finally, we inferred the accessory protein domains of each protein family at ancestral nodes of the eukaryotic tree by comparing domain architectures (same raw data as for the PCA in fig. 4B and supplementary fig. S5, Supplementary Material online) within the corresponding clades. We represented these reconstructions as networks of co-occurring domains using Cytoscape 3.1.1 (Smoot et al. 2011). Our criterion linked core domains (central nodes, listed in supplementary table S2, Supplementary Material online) to accessory domains (other protein domains that co-occur with a core domain in the same protein) if such co-occurrence existed in at least the earliest-branching lineage of a clade and another internal taxon. We used a nested approach, first reconstructing the most external nodes and proceeding inward (e.g., first Bilateria, then Eumetazoa, followed by Metazoa, Holozoa, etc.). The abundance of each core domain (represented by the size of the node) at the reconstructed ancestors of particular clades was estimated with the median gene count of all the analyzed species in that clade (e.g., in the Urmetazoan in fig. 5A, the median of the counts of a particular core domain in all animals included in this study). The frequency of each domain co-occurrence (represented by the thickness of the edge between nodes) was estimated analogously. We calculated the network density index of each reconstructed ancestor using the Cytoscape NetworkAnalyzer module (Assenov et al. 2008).