Citrullination Was Introduced into Animals by Horizontal Gene Transfer from Cyanobacteria

Abstract Protein posttranslational modifications add great sophistication to biological systems. Citrullination, a key regulatory mechanism in human physiology and pathophysiology, is enigmatic from an evolutionary perspective. Although the citrullinating enzymes peptidylarginine deiminases (PADIs) are ubiquitous across vertebrates, they are absent from yeast, worms, and flies. Based on this distribution PADIs were proposed to have been horizontally transferred, but this has been contested. Here, we map the evolutionary trajectory of PADIs into the animal lineage. We present strong phylogenetic support for a clade encompassing animal and cyanobacterial PADIs that excludes fungal and other bacterial homologs. The animal and cyanobacterial PADI proteins share functionally relevant primary and tertiary synapomorphic sequences that are distinct from a second PADI type present in fungi and actinobacteria. Molecular clock calculations and sequence divergence analyses using the fossil record estimate the last common ancestor of the cyanobacterial and animal PADIs to be less than 1 billion years old. Additionally, under an assumption of vertical descent, PADI sequence change during this evolutionary time frame is anachronistically low, even when compared with products of likely endosymbiont gene transfer, mitochondrial proteins, and some of the most highly conserved sequences in life. The consilience of evidence indicates that PADIs were introduced from cyanobacteria into animals by horizontal gene transfer (HGT). The ancestral cyanobacterial PADI is enzymatically active and can citrullinate eukaryotic proteins, suggesting that the PADI HGT event introduced a new catalytic capability into the regulatory repertoire of animals. This study reveals the unusual evolution of a pleiotropic protein modification.


Introduction
Posttranslational modifications (PTMs) allow for temporal and spatial control of protein function in response to cellular and environmental signals and comprise an integral part of cellular and organismal life. The development of ever more sensitive and quantitative analytical methods has made possible the identification of PTMs within cells and has enhanced our understanding of the molecular and cellular functions they regulate. This has led to renewed interest in studying previously known, as well as newly identified modifications. Although PTMs have been classically studied in eukaryotic organisms, an increasing number of them are also found in bacteria (Koonin 2010). Some PTMs, such as phosphorylation, acetylation, and glycosylation are ubiquitous across all domains of life suggesting that the enzymes that catalyze them existed in the Last Universal Common Ancestor (LUCA) (Beltrao et al. 2013). In other cases, such as protein ubiquitylation, this is less clear. Although ubiquitin itself is absent from eubacteria and archaea, other ubiquitin-like domains have been identified and shown to be added and removed from proteins in a similar manner in bacteria (Iyer et al. 2008;Pearce et al. 2008;Hochstrasser 2009;Koonin 2010;Macek et al. 2019).
Citrullination is the posttranslational conversion of a protein arginine residue to the noncoded amino acid citrulline and is catalyzed by peptidylarginine deiminase (PADI) enzymes in a calcium-dependent manner (Sugawara et al. 1982;Wang and Wang 2013). Although citrullination involves a small mass change of only 0.98 Da, the removal of a positive charge from the arginine side chain can lead to profound biochemical changes and is known to alter protein structure, subcellular localization and affinity to other proteins and nucleic acids (Tanikawa et al. 2009(Tanikawa et al. , 2018Guo and Fast 2011;Stadler et al. 2013;Christophorou et al. 2014;Snijders et al. 2015;Sharma et al. 2019). Via these alterations PADIs regulate fundamental physiological processes such as gene expression, chromatin compaction, and the innate immune response to bacterial infection (Wang et al. 2009;Wang and Wang 2013;Christophorou et al. 2014). Notably, deregulation of PADIs is strongly implicated in the etiology of a host of pathologies including autoimmunity (rheumatoid arthritis, ulcerative colitis, psoriasis, and type I diabetes), neurodegeneration (multiple sclerosis, Alzheimer's, and prion diseases), and metastatic cancer (Suzuki et al. 2003;Musse et al. 2008;Zhang et al. 2011;Wang and Wang 2013;Yuzhalin et al. 2018), whereas loss of PADI activity compromises neurodevelopment, fertility, and embryo development (Christophorou et al. 2014;Xu et al. 2016;Falcão et al. 2019).
In an evolutionary context, PADIs are puzzling. Orthologs of the human PADIs are ubiquitous in bony fish, birds, reptiles, amphibians, and mammals, but are unexpectedly missing from many eukaryotes including plants, yeast, worms, and insects. The PADI gene is widely thought to have appeared first in the last common ancestor of teleosteans and mammals (Balandraud et al. 2005;Wang and Wang 2013;Nicholas and Bhattacharya 2014), with duplications in subsequent lineages resulting in five mammalian paralogs. Therefore, citrullination seemingly defies the perception that PTMs are of ancient origin (Beltrao et al. 2013).
Mammalian PADIs consist of three structural domains, the N-terminal (PAD_N, Pfam annotation: PF08526), middle (PAD_M, Pfam annotation: PF08527), and catalytic C-terminal domain (PAD_C, Pfam annotation: PF03068). Although PADI proteins are widely considered to be specific to vertebrates, their crystal structures (Arita et al. 2004;Slade et al. 2015) hint at a possibly more ancient origin as they reveal that the catalytic (PAD_C) domain adopts the same pentein fold as a variety of other widely distributed proteins that otherwise show little similarity in terms of amino acid conservation (Shirai et al. 2001;Linsky and Fast 2010) (supplementary fig.  S1, Supplementary Material online). The pentein-fold containing group of proteins comprises a broad family of guanidino-group (the functional group of the side chain of arginine and agmatine) modifying enzymes that possess hydrolase, dihydrolase, and amidinotransferase catalytic activity, sharing a catalytic core of a Cys, His, and two polar guanidinebinding residues-Asp or Glu (Linsky and Fast 2010). Two such proteins with citrullinating activity are known among some bacteria and eukaryotes: pPAD, an extended agmatine deiminase found in Porphyromonas gingivalis and giardiaADI, an extended form of the free L-arginine deiminase gADI, found in the human parasite Giardia lamblia (Touz et al. 2008;Goulas et al. 2015). These enzymes contain a distant PAD_C domain but lack PAD_N and PAD_M domains, are highly divergent in sequence, and have different substrate specificities. In addition, mammalian genomes encode two distant homologs of the PAD_C domain: N(G),N(G)dimethylarginine dimethylaminohydrolase (DDAH) and glycine amidinotransferase (AGAT) (Linsky and Fast 2010). Both DDAH and AGAT are divergent in sequence, also lack PAD_N and PAD_M domains and do not appear to catalyze citrullination. The presence of this ancient fold and catalytic triad within PAD_C suggests that it may have been present early in cellular life, but the evolutionary provenance of the animal PADI enzymes has remained unclear.
A 2015 study by Crisp et al., identified possible PADI homologs in some bacterial species. Based on the finding that a possible homolog could be identified in prokaryotes but not in multiple Drosophila and Caenorhabditis species, the authors included PADIs among a list of 145 genes proposed to have been transferred into the genome of a vertebrate ancestor of extant mammals by horizontal gene transfer (HGT, also known as lateral gene transfer) (Crisp et al. 2015). HGT is the nonheritable transmission of genetic material from one organism to another, often via a virus or mobile genetic element and involving endosymbiotic or commensal relationships between donor and recipient (Boto 2014;Soucy et al. 2015). HGT is widespread among prokaryotes and is recognized as a mechanism that shapes the evolution and adaptive potential of bacteria, for example, in the acquisition of antibiotic resistance (Ochman et al. 2000;Koonin et al. 2001). Although many cases of horizontal transfer have been reported between bacteria and unicellular eukaryotes, fewer bacteria-to-animal HGT events have been reported to date (Keeling and Palmer 2008;Dunning Hotopp 2011;Boto 2014). The majority of cases involve transfer into an invertebrate host, such as an insect or worm (Gladyshev et al. 2008;Moran and Jarvik 2010;Chou et al. 2015;Lacroix and Citovsky 2016;Dunning Hotopp 2018). Moreover, it has been proposed that HGT into animals with specialized germline cells is very rare (Jensen et al. 2016). These few accounts of bacteria-to-animal HGT have been the topic of intense debate (Stanhope et al. 2001;Crisp et al. 2015;Martin 2017;Salzberg 2017;Husnik and McCutcheon 2018;Leger et al. 2018). The genome-wide approach employed by Crisp et al. to search for possible HGT events in vertebrates was disputed by Salzberg, and 45 of the highest confidence candidates were reanalyzed and rebutted on a case-by-case basis. In the instance of the PADI gene, this reanalysis showed that a PADI can also be identified in Priapulus caudatus (a marine worm) Cummings et al. . doi:10.1093/molbev/msab317 MBE and therefore that the lack of PADI in at least Drosophila spp. must be explained by gene loss (Salzberg 2017). Salzberg additionally recalculated the HGT index for many of the possible HGT candidates, including the PADIs, in light of additional sequences that can be identified showing that they no longer pass the original parametric criterion for HGT proposed by Crisp et al. (Salzberg 2017). Individual claims of HGT should be considered carefully and tested against the alternative hypothesis of widespread independent gene losses (Salzberg 2017). In light of the absence of PADI homologs in most invertebrate animals, PADI evolution requires detailed consideration.

Comprehensive Identification of PADI Homologs
In order to understand the distribution and evolution of citrullination, we sought to identify all PADI homologs from across life. We started by collecting orthologous PADIs using the EggNOG database, employing an unsupervised clustering algorithm of all proteins contained in 2031 genomes across cellular life . To expand on this list, we used HMMER searches to identify all sequences in current sequence databases that contain a PAD_C domain, as defined by having significant sequence similarity (E-value <1Â10 À3 ), and assessed these for the presence of critical substratebinding and calcium-binding residues annotated to human PADIs (Slade et al. 2015). This was supplemented by additional iterative jackhmmer searches as well as TBlastN and Position-Specific Iterated BLAST (PSI-BLAST) searches of genomic databases.
The taxonomic distribution of PADIs and proportion of species that harbor a PADI ortholog are presented in table 1. PADIs are not ubiquitous across the metazoa but are present across major branches of vertebrates, including jawless fish, sharks and rays, bony fish, amphibians, reptiles, birds, and mammals. Out of all species whose genomes have been sequenced to date, the earliest diverging invertebrate animals with a PADI gene are P. caudatus (an ecdysozoan), Saccoglossus kowalevskii (a hemichordate), and Branchiostoma belcheri (a cephalochordate). In addition, we identified a number of PADI sequences with conservation of substrate and calcium-binding residues in bacteria and fungi. PADIs are also not ubiquitous across bacteria (found in fewer than 1% of bacterial species), and are most prevalent within cyanobacteria (found in 11% of cyanobacteria). No eukaryotes diverging before opisthokonta have a detectable PADI homolog. Our searches also returned two outliers, one in archaea and one in viruses. However, upon closer inspection, both hits were determined to be due to misattribution (supplementary figs. S2 and S3, Supplementary Material online; see also Materials and Methods) and were therefore not included in further analyses. This taxonomic distribution could suggest an evolutionary model in which PADI genes were lost independently in many separate lineages. In this scenario, gene loss occurred in all early branching lineages leading to at least 306 nonopisthokont eukaryotes and in other lineages, for example, those leading to Drosophila and Caenorhabditis.
To explore the relationship of PADIs to other distantly related sequences, we aligned fungal, bacterial, and animal PADIs with sequences possessing significant HMMER similarity to pPAD and gADI and conducted phylogenetic analysis under a time-reversible model (supplementary fig. S4, Supplementary Material online). Bacterial, fungal, and animal PADIs form a single outgroup that excludes both pPAD and gADI enzyme types, showing that each of the three types of protein is phyletically distinct. The pPAD and gADI type proteins can therefore be excluded from further consideration of the evolutionary origin of animal PADIs.
A Strongly Supported Clade Contains Cyanobacterial and Animal but Not Fungal PADIs Firstly, we used HMMER to obtain all PADI sequences in the UniProtKB rp55 database and performed phylogenetic analysis using MrBayes and IQTree, recovering a clade of animal and bacterial PADIs distinct from fungal and other bacterial PADIs (supplementary fig. S5a and b, Supplementary Material online). We then repeated the phylogenetic analyses on a subset of 150 sequences, ensuring the length of the alignment of PADI sequences (495 columns) was at least three times the number of taxa considered in the tree, to limit "rough likelihood surface" issues that may arise with data sets of relatively few sites and many taxa (Stamatakis et al. 2020) (supplementary fig. S5c, Supplementary Material online). To avoid possible biases in subsampling, we took all bacterial PADI sequences contained within the Pathosystems Resource Integration Center (PATRIC) database for analysis (82 sequences). We then included 35 fungal sequences that cover the broadest span in HMMER sequence similarity to the human sequence (E-values between 5Â10 À26 and 1.4Â10 À46 ). Finally, we subsampled metazoan sequences to maximize lineage representation in species maintaining a PADI (the five paralogs in Homo sapiens, Pongo abelii, and in Mus musculus, the three paralogs found in Gallus, Chelonia mydas, and Alligator mississipiensis, and the single paralog found in Xenopus laevis, Takifugu rubripes, Tetraodon nigroviridis, Astyanax mexicanus, Danio rerio, Oncorhynchus mykiss, Callorhinchus milii, B. floridae, and P. caudatus). Amino acid sequences were used as this enables more reliable alignment among widely divergent taxa. This is especially important as PADI sequences span across bacteria, fungi, and metazoa. All sequences, intermediate alignments, and trees are provided in supplementary files 1-5, Supplementary Material online. Very strong bootstrap support (>95%) was obtained for a clade restricted to certain cyanobacterial and animal PADIs that excludes a fully supported outgroup clade containing fungal, actinobacterial, and proteobacterial sequences (supplementary fig. S5a-c, Supplementary Material online). With full branch support, the fungal and actinobacterial sequences were recovered as clades and found to be sister taxa in the tree. This tree topology, whereby animal sequences have closer affinity to those in cyanobacteria than to other eukaryotic (fungal) sequences is surprising because it is inconsistent with the known species tree.
Phylogenetic tree inferences, in particular those obtained from single genes, are subject to errors. It is possible that the HGT of Citrullination from Cyanobacteria to Animals . doi:10.1093/molbev/msab317 MBE observed topology represents the failure of phylogenetic inference in the case of this individual gene, such that an artifact (e.g., model misspecification) might explain the affinity of the separate eukaryotic PADIs to different bacterial PADI types. For instance, using a fixed rate matrix of amino acid substitutions to produce the tree (Jones et al. 1992;Whelan and Goldman 2001;Kalyaanamoorthy et al. 2017) can be inappropriate if there is evolutionary rate variation over different parts of the tree or deviation from typical protein substitution rates. In particular, attention has been drawn previously to heterotachous evolution, where the evolutionary substitution rate of a given site may change over time (Lopez et al. 2002). Heterotachy is particularly plausible in the case of the PADI gene tree because PADI is found in species across the tree of life (animals, fungi, cyanobacteria, actinobacteria). This could be detected if the tree topology was found to vary under different models of rate variation.
To analyze whether our phylogenetic tree may be subject to model violation, we undertook more parameter-rich analyses on 50 sequences that were subsampled from the larger tree, and assessed their topological congruence and node support. We removed multiple paralogs in metazoa using the basal paralog PADI2 and removed sequences with close branches such that we were able to maintain the maximum sequence diversity in the tree (nine fungi, 13 metazoa, 29 bacteria). We then performed the same fixed empirical rate matrix phylogenetic analysis on the smaller set of sequences to check for congruence, before undertaking a number of phylogenetic analyses ( fig. 1 and supplementary file 6, Supplementary Material online). This included a Bayesian approach that samples over different fixed empirical rate matrices (Ronquist et al. 2012); a maximum likelihood approach using a mixture model of 20 different fixed amino acid rate matrices (C20) (Quang et al. 2008); a Bayesian approach that allows for infinite mixture model categories sampled from the alignment by making use of a Dirichlet process prior (CAT-GTR) (Lartillot and Philippe 2004); and a maximum likelihood approach, designed specifically for heterotachous data sets, that allows different branch length classes across the tree (GHOST model) (Crotty et al. 2020). In addition, we produced maximum likelihood trees where eukaryotic sequences were constrained to be monophyletic under the best performing models (Trees 8 and 9, fig. 1b and supplementary file 7, Supplementary Material online).
All of the above analyses recovered a single topology in support of a clade of cyanobacterial and animal sequences to the exclusion of a clade of fungal and actinobacterial sequences ( fig. 1a, clades Ai, Aii, Bi, and Bii). Posterior probabilities or bootstrap values for this topology were high, approaching 100% for each of the diverse methods ( fig. 1b). The analysis was repeated using additional bootstrap algorithms, including the full nonparametric bootstrap, obtaining full support (Felsenstein 1985;Hoang et al. 2018). Topology constraint tests rejected a number of randomly generated trees, which confirmed the high branch support values. These alternative trees and the constrained trees for the expected model where eukaryotic PADIs are restricted to a monophyletic group were all significantly rejected (P < 0.001) by multiple statistical tests including the AU-test (Shimodaira 2002;Strimmer and Rambaut 2002;Susko 2014) (fig. 1b). We conclude that the topology of a clade of cyanobacterial and animal PADI sequences to the exclusion of fungal and actinobacterial sequences is robust to differently specified models.

Cyanobacterial and Animal PADIs Share Unique Synapomorphies
The high bootstrap values and congruent topologies across a wide variety of methods lend strong support to our tree topology. Nevertheless, we sought to identify features of the Felsenstein bootstrap values with 100 replicates for tree 2; or posterior probabilities for trees 10 and 11 are presented in the table for the nodes labeled in the tree that are critical to different evolutionary scenarios. Log likelihoods and the Bayesian information criterion are presented for all maximum likelihood trees. In addition, maximum likelihood constraint trees 8 and 9 were constructed where opisthokonta were constrained to be monophyletic under the maximum likelihood models used for tree 1 and tree 5. Trees were concatenated and analyzed using the AU-test with 10,000 replicates. Nomenclature for the different models is as used in IQtree 1.6.12. The best supported maximum likelihood tree and the Bayesian trees are shown in bold.
HGT of Citrullination from Cyanobacteria to Animals . Secondly, we analyzed representative fungal, actinobacterial, cyanobacterial, and metazoan PADI sequences for the conservation of calcium-binding and active site residues ( fig. 2b). The allosteric binding of up to six calcium ions allows formation of the PADI2 active site cleft and is an absolute requirement for catalytic activity (Slade et al. 2015). All catalytic residues and substrate-binding residues are fully . The coloring scheme indicates the average BLOSUM62 scores of each alignment column: red (>3.5), violet (between 3.5 and 2), and light yellow (between 2 and 0.5). Peach arrows shown below the cyanobacterial sequences indicate PsiPred predicted secondary structure (beta sheets). Green arrows (beta sheets) correspond to the known secondary structure of the PAD_N domain of human PADI2. (b) Analysis of synapomorphic regions, representing six PADI sequences from each of metazoa, cyanobacteria, actinobacteria, and fungi. Consensus sites across the six species are shown with standard single letter amino acid abbreviations. "nc" (nonnonserved) represents the absence of consensus conservation to one or two amino acids across the six species. The numbering given above the alignment and corresponds to the ungapped site of human PADI2 such that residues can be compared with Slade et al.
Sites showing conservation across all four domains are colored in green; sequence features common to metazoan and cyanobacterial PADIs that are excluded from fungal/actinobacterial sequences are colored in purple; sequence features common to fungal and actinobacterial PADIs that are excluded from metazoanand cyanobacterial sequences are colored in yellow. The existence of both purple and yellow sequence features is indicative of synapomorphic primary sequence features. (c) Crystal structure of human PADI2 presented with PAD_N domain colored in black, PAD_M domain in gray, and PAD_C domain in white. Synapomorphic regions are colored in cyan and calcium ions are shown as yellow spheres.  . 2b). In addition, calcium-binding sites 3 and 1 appear to be fully conserved, whereas calcium site 5 is also likely conserved. Calciumbinding site 6 is likely to be conserved functionally, as the substitution of D125 to N and E131 to D, which are present in both actinobacterial and fungal sequences, are expected to preserve ion binding. Intriguingly, however, calcium sites 2 and 4 appear to be exclusive to Clade Ai (late diverging cyanobacterial and metazoan) sequences. The fungal and actinobacterial sequences diverge from binding sites 2 and 4 to a different amino acid motif. Critically, only Clade Ai PADI sequences conserve the calcium switch residue D389 (residues: 369-389). In actinobacterial and fungal sequences, this residue is substituted to glycine and therefore incompetent for metal coordination (Slade et al. 2015) (fig. 2b). This indicates that the ordered, sequential calcium binding in the PAD_M domain, which is responsible for the allosteric communication between PAD_M and the catalytic PAD_C domain in human PADI2 (Slade et al. 2015) is likely to be conserved only in Clade Ai PADIs. As a result, a potentially different mode of calcium regulation operates in the fungal and actinobacterial PADIs.
In addition, we find that fungal and actinobacterial sequences share features that are not present in the Clade Ai PADIs. This includes a conserved region within calcium-binding sites 3-5 that is absent from the metazoan and cyanobacterial sequences ( fig. 2b: amino acids 155-180, where differences conserved between fungal and actinobacterial sequences are highlighted in yellow). Also of interest is a highly conserved ten amino acid beta sheet that connects the PAD_M and PAD_C domains ( fig. 2b: amino acids 292-302). This region is conserved closely in fungal and actinobacterial sequences, but to a different ten amino acid sequence containing a distinctive triple histidine motif ( fig. 2b: amino acids 300-302).
We therefore find primary and tertiary amino acid sequences that are specific to either the cyanobacterial/metazoan or the actinobacterial/fungal PADIs. It is implausible that blocks of sequence of up to ten amino acids were derived convergently and independently in these two groups of PADIs. Thus these sequence features are indicative of a common ancestry of actinobacterial and fungal PADIs that is distinct from the ancestry of cyanobacterial and metazoan PADIs and constitute synapomorphies. The phylogenetic topology presented in figure 1 is consistent whether built with or without the above synapomorphic sequence features and PAD_N domain (supplementary fig. S6b, Supplementary Material online). As these features occur at the level of the amino acid sequence and at the level of a whole protein domain ( fig. 2c), they are robust to differences in rate variation across the tree and to saturated sequence artifacts (Doolittle 1994;Zhang and Kumar 1997;Bazykin et al. 2007;Baalsrud et al. 2018). These features therefore provide strong additional support of the phylogenetic topology presented in figure 1.

The PADI Sequence Divergence between Cyanobacteria and Animals Is Anachronistically Low
The remarkably high similarity of Clade Ai cyanobacterial and animal PADIs prompted us to examine the rate of sequence change between them in more detail. To do this, we firstly sought to understand the extent of change of PADIs relative to other highly conserved proteins in the species that bridge the closest PADI homologs. We therefore analyzed a large number of the most conserved proteins in life to approximate a mean minimum extent of accumulated genetic divergence (AGD), represented by sequence change, occurring between Cyanothece sp. 8801 and B. belcheri and compared this with the divergence of the PADI sequence between these two species ( fig. 3a). As a negative control, we compared the difference in bitscore density (Dbitscore) for these conserved proteins and for the PADI sequence between B. belcheri and H. sapiens ( fig. 3b). We also analyzed 19 proteins of likely endosymbiont gene transfer (EGT) origin and ten proteins encoded in mitochondrial genomes. The approach used to calculate the AGD of a given protein between its homologs in H. sapiens, Branchiostoma, and Cyanothece is described in supplementary figure S7, Supplementary Material online. Since mitochondrial and EGT-derived proteins in metazoa may be closer to their bacterial homologs than might be expected for other vertically inherited genes, the AGD for these classes of protein may be even lower than the AGD for highly conserved ribosomal proteins. We reasoned that the AGD of mitochondrial and EGT-derived proteins may therefore mimic that of an anciently horizontally transferred gene into eukaryotes and the AGD calculated for PADIs may be even lower than these classes of proteins if it was acquired more recently than the mitochondrion (as we hypothesize for the PADI gene). Indeed, EGT and mitochondrially encoded proteins have an average AGD that is significantly lower than that of vertically acquired proteins between Cyanothece sp. 8801 and B. belcheri ( fig. 3a), but not between B. belcheri and H. sapiens ( fig. 3b). We find that the AGD of PADI falls below that calculated for vertically transferred protein sequences, as assessed over the same timescale ( fig. 3a), falling six SDs below that of vertically transferred protein sequences, but behaves as expected between B. belcheri and H. sapiens ( fig. 3b). PADIs show less sequence change than all proteins individually analyzed over this timescale and less even than ribosomal RNA (see Materials and Methods). Indeed, they fall two SDs below even the mean of EGT candidate genes ( fig. 3a). Finally, we calculated the AGD for each mitochondrially encoded protein as compared with its own closest bacterial homolog (as opposed to the homolog from Cyanothece sp. 8801). PADIs exhibit a lower AGD than any of the individual mitochondrially encoded proteins relative to each of their nearest bacterial homologs. With a P value of 0.0073 (see Materials and Methods), we reject the null hypothesis that PADIs fall within the normal distribution of AGD values calculated for mitochondrially encoded proteins relative to their closest bacterial homolog. A model of vertical descent of PADIs from bacteria, or PADI acquisition via EGT, requires that, across lineages where PADIs cannot be observed in modern genomes, in addition to the large number of independent gene losses, PADIs would have been under greater constraint than any other known sequence in life (Isenbarger et al. 2008).
We then used a Bayesian phylogenetic approach to predict the divergence time between Ai Clade cyanobacterial and HGT of Citrullination from Cyanobacteria to Animals .  (Drummond et al. 2006;Drummond and Suchard 2010;Bouckaert et al. 2014). In the relaxed UCLN clock model, distinct rates are given along each branch with rates drawn at random from a lognormal distribution. Under a model of descent from bacteria, or under a model of EGT, these predictions are expected to be at least as old as the last eukaryotic ancestor, since horizontal transfer is known to be common in bacteria and archaea (Betts et al. 2018). In general, the prediction of the divergence time of a node derived from analysis of a single gene would be significantly greater than the global estimate, as evolutionary rates for a single gene may be greater than the minimum in either lineage. We performed parallel analysis on the median gene from our EGT candidates above (enolase or ENO) to provide an internal comparison for the divergence time predicted by PADI sequences and calibrations from fossil ages. Our analysis yielded an estimate of less than 1 Gy for the age of the root of the tree as estimated by PADI sequences ( fig. 3c-e). Under all approaches, the divergence times were not congruent with the geologically defined divergence and were found to be 1.7 Gy (strict clock) or 1.3 Gy (UCLN relaxed clock) lower than that predicted by the ENO gene ( fig. 3c-e). The upper bound of our divergence times (95% credible interval) was found to be below the lower bound of the range of globally and geographically defined estimates for the date of the LUCA (>3,900 Ma), the date for eukaryogenesis (1,866-1,679 or 1,842-1,210 Ma), and the date of the symbiotic origin of mitochondria (2,053-1,210 Ma). The use of ENO as a control is likely to be conservative as seen from its AGD, which is lower than any individual ribosomal protein ( fig. 3a). These divergence time estimates are therefore inconsistent with vertical descent of metazoan PADIs from bacteria or with descent via EGT and are instead consistent with a horizontal acquisition event that is more recent than the acquisition of the mitochondrion by eukarya. The divergence times predicted by these clock models are approximately dated at the time of divergence of the last common ancestor of PADI-harboring metazoa.

The Cyanobacterial PADI Protein Is Catalytically Active
Considering the high degree of similarity between Clade Ai cyanobacterial and metazoan PADIs, including all necessary catalytic residues and calcium-binding residues, we hypothesized that the ancestral cyanobacterial protein is likely to be catalytically active and calcium dependent ( fig. 4a). To test this, we prepared a recombinant version of the three-domain PADI from Cyanothece sp. 8801 (here referred to as "cyanoPADI") and assayed its catalytic activity alongside human PADI4. Analogously to the human enzyme, cyanoPADI can citrullinate multiple proteins in mouse cell lysates ( fig. 4b).
In addition, cyanoPADI shows absolute dependence on calcium for activity. This demonstrates that the calciumdependent regulation found in mammalian PADIs is also a feature of the ancestral cyanobacterial protein and suggests that the conserved calcium-binding sites, which were used in the evolutionary analysis as signifiers of synapomorphy, are functional (figs. 2b and 4). Remarkably, and despite the absence of histones from bacteria, cyanoPADI catalyzes citrullination of histone H3 ( fig. 4c), which is a known target of mammalian PADI4. The enzyme is additionally active at a physiologically relevant temperature for cyanobacteria ( fig. 4c). Thus cyanoPADI is a bona fide calcium-dependent PADI with sufficient similarity or promiscuity to catalyze citrullination of mammalian substrates.

Discussion
It has been hypothesized that very few protein modification types existed in the LUCA and these have been diversified to give rise to the >200 PTMs known today (Beltrao et al. 2013). We sought to map the evolutionary origin of citrullination, which is implicated in the regulation of a variety of physiological and pathological processes in humans. Our analyses of PADI homologs across life reveal the existence of two clearly discernible PADI types: one containing three structural domains and sharing functionally relevant sequence features and one containing two structural domains and divergent sequence features. The taxonomic distribution of these two types of homologs is highly unusual, in that three-domain PADIs are present in animal and late-diverging cyanobacteria, whereas two-domain PADIs are present in fungi and all other bacteria ( figs. 1 and 2; supplementary fig. S6, Supplementary Material online). This evidence can be reconciled with vertical evolutionary descent if the last eukaryotic common ancestor (LECA) harbored two paralogous PADI genes which underwent widespread and mutually exclusive losses throughout evolution: firstly, the three-domain PADI present in latediverging cyanobacteria and metazoa was lost from lineages leading to every other species in life; and secondly, the twodomain PADI present in fungi, actinobacteria, and proteobacteria must be separately accounted for in independent gene losses in lineages leading to all other species. In lineages that harbor no PADI, the two paralogs must have been lost independently (supplementary fig. S8, Supplementary Material online). It is notable that no species is observed to possess both PADI types.
The above scenario, although highly unparsimonious, would be supported if rates of PADI sequence evolution across a species phylogeny were consistent with respect to geologically defined timings and with genes well known to have been inherited vertically from bacteria or by EGT from the LECA. Our analyses of sequence divergence provide evidence to the contrary. In absolute terms, the similarity of cyanobacterial and branchiostomal PADIs to human PADIs is almost identical: 70.20% versus 70.90%, respectively, by pairwise amino acid similarity. However, a much greater amount of time has elapsed since the cyanobacterial and human genes have shared a last common ancestor than the genes from the other species pair (branchiostoma and humans). Even under assumptions of heterotachy, where rates of evolution may differ between different lineages, a minimal amount of nearly neutral genetic divergence HGT of Citrullination from Cyanobacteria to Animals . doi:10.1093/molbev/msab317 MBE nonetheless accumulates over evolutionary timescales in all lineages (Takahata 1996;Isenbarger et al. 2008). Under the assumption of vertical descent, the observed PADI sequence changes are anachronistically low even compared with the most highly conserved genomic sequences in life, including ribosomal proteins and even EGT candidates and genes encoded in the mitochondrion.
The explanation for the observation of such little sequence change is more mundane under the assumption of horizontal transfer ( fig. 5). A HGT event from late-diverging SPM/NX clade cyanobacteria to a last common ancestor within the animal lineage, although ancient, would have occurred much more recently than the LUCA and also more recently than the mitochondrion. HGT can therefore fully account for the phylogenetic distribution, as well as the slow rates of evolution observed. The two lines of evidence are complementary and independent. The timing of transfer (neoproterozoic: 1,000-542 Ma) is consistent with the presence of marine nitrogen fixing cyanobacteria with specialized arginine catabolic pathways (Schriek et al. 2007), and with the emergence of metazoa in the cyanobacterial habitat (Erwin et al. 2011;Yuan et al. 2011;S anchez-Baracaldo et al. 2014). A second HGT event, from actinobacterial species that are known to be fungal pathogens, most parsimoniously explains the existence of the two-domain fungal PADI (Clade Bi in fig. 1 and supplementary fig. S6, Supplementary Material online; figs. 2 and 5). This is consistent with the absence of a PADI gene either in eukaryotic species diverging before opisthokonts or in early diverging fungi such as yeast.
Closer examination of PADI phylogeny in bacteria provides additional support for HGT and indicates the directionality of horizontal transfer (supplementary figs. S9 and S10, Supplementary Material online). Firstly, strong support is found for bacterial PADIs that form an outgroup to both the two-domain and three-domain PADI sequences (supplementary figs. S5 and S10, Supplementary Material online). These bacterial outgroup sequences suggest that PADIs were not horizontally acquired by bacteria. Secondly, the fact that the metazoan-type three-domain PADI only emerges in the late-diverging SPM and NX clades of All but one metazoan PADI sequence identified by our comprehensive searches in genomic and proteomic databases were found in deuterostomes-the exception being found in the P. caudatus genome, a protostome. This suggests that the HGT took place either at the root of the deuterostomes, or possibly at the root of bilateria. Note that this part of the tree of life remains poorly resolved, with an extremely short branch between the bilaterian common ancestor and the deuterostomes (Philippe et al. 2019).
Biochemical analyses of the ancestral three-domain PADI (cyanoPADI) show that it is competent for catalysis ( fig. 4), whereas a recent study has identified catalytically active PADI homologs in the thermotolerant fungi Emericella dentata and Aspergillus nidulans (El-Sayed et al. 2019). The discovery of catalytically active PADI orthologs in bacteria and fungi offers fertile ground for investigation of the roles of citrullination in these organisms.
Our finding that the cyanoPADI can citrullinate mammalian substrates ( fig. 4) indicates that a novel catalytic capability HGT of Citrullination from Cyanobacteria to Animals . doi:10.1093/molbev/msab317 MBE was added to the regulatory repertoire of metazoan cells by HGT. The newly acquired regulatory function is likely to have enhanced biochemical diversity in animals. Fish genomes contain a single PADI gene, but duplications resulted in five tandem repeated paralogs in mammalian genomes (Chavanas et al. 2004) (supplementary fig. S10, Supplementary Material online). The fact that these duplicated genes were retained across many animal genomes suggests that they were unlikely to be functionally redundant. In the course of vertebrate evolution, citrullination was thus expanded in scope and adapted to a variety of cellular contexts, ranging from neutrophil extracellular trap release to stem cell potency, and from oligodendrocytes to bone marrow and keratinocytes (Nicholas and Bhattacharya 2014). The emerging physiological roles of the vertebrate PADIs, such as in the regulation of pluripotency and embryonic development (Brahmajosyula and Miyake 2013;Christophorou et al. 2014;Xu et al. 2016;Xiao et al. 2017), and the newly described role of the fish PADI in tissue regeneration (Golenberg et al. 2020), point to possible selective advantages conferred to metazoans by PADIs and offer a possible explanation for the fact that PADIs were retained so widely (Huang 2013). In a similar vein, it is interesting to consider our findings in light of the proposal that genes with a role in antimicrobial defence are amenable to co-option by eukaryotic innate immune systems (Chou et al. 2015). The extent to which the molecular mechanisms that regulate the human PADIs were also conserved from cyanobacteria or were newly co-opted in vertebrates remains an intriguing open question.
It is notable that no citrullination-reversing enzyme has been identified in any species to date. The evolutionary analysis of PADIs presented here adds extra complexity as to whether the reverse catalytic process might have also arisen or been propagated. It has been postulated that "toolkits" of PTM writer, eraser, and reader enzymes may have evolved in a coordinated fashion and this has been studied formally in the context of protein phosphorylation (Lim and Pawson 2010). In this context, the investigation into potential reverse catalysis for citrullination should be extended to include bacterial and fungal enzymes.
A related consideration is prompted by the known role of PADIs in autoimmunity. It has been proposed that the exogenous citrullinating activity of pPAD at sites of periodontal infection is an initiating event in the development of rheumatoid arthritis, by predisposing individuals with prior periodontal infection to the development of autoantibodies against citrullinated endogenous proteins (Anti-Citrullinated Protein Antibodies, ACPAs) (Mikuls et al. 2012). It is therefore of note that pPAD and gADI genes are more widespread than previously thought (supplementary fig. S4, Supplementary Material online) and that the PADIs described in this paper can be found in a number of human pathogens and in Stachybotrys chlorohalonata (black mold). A re-evaluation of the initiating events responsible for citrullination-specific breaks in immune tolerance may therefore be warranted.
This work reveals the remarkable evolutionary trajectory of the PADI gene family and uncovers the origin of a protein modification with diverse functions in human physiology and disease. In combination, the pieces of evidence presented above comprise a compelling case of ancient horizontal transfer of a bacterial gene into animals.

Structural Analyses
Structural homology searches were performed using the Dali server v3.1 with the extracted PAD_C domain used as query (Holm and Rosenström 2010). Superposition of known structures was performed in Chimera (Pettersen et al. 2004) using the MatchMaker tool (Meng et al. 2006). Briefly, the two structures (PDB: 4n2c and 1xkn) were aligned for the bestaligning pair of chains using the Needleman-Wunsch algorithm and BLOSUM62 matrix. A secondary structure score of 30% was included. The superposition was iterated by pruning long atom pairs such that no pair exceeds 2.0 Å.

Identification of PADI Orthologs
A graph-based unsupervised clustering algorithm used by the EggNOG database was used to infer PADI orthologous groups from 2,031 genomes across the tree of life (ENOG410ZKF3: 217 proteins from 74 species) (Huerta-Cepas, Szklarczyk, et al. 2016). Phylogenetic reconstruction for the identified PADI orthologs was performed within EggNOG as implemented within the ETE3 suite (eggnog41) and described at http:// eggnogdb.embl.de/#/app/methods (last accessed June 7, 2020) (Huerta-Cepas, Serra, et al. 2016;Huerta-Cepas, Szklarczyk, et al. 2016). In addition, a list of proteins with significant similarity (E-value < 1Â10 À3 ) to the metazoan PAD_C domain from human PADI2 were collected using HMMER searches against Reference Proteomes and UniProtKB databases (Potter et al. 2018). Additional more sensitive sequence searches and iterative searches were performed using TBlastN and psiblast against nr/nt; jackhammer against reference proteomes and UniProtKB; and hhpred against Pfam-A, COG_KOG, and PDB_mmCIF70 (Altschul et al. 1997;Söding 2005;Alva et al. 2016). To verify the exhaustive nature of our search for PADI homologs, we employed two state-of-the-art remote homology detection tools. The first, HHBlits, is used to search databases of hidden Markov models (HMM) generated with clustered proteomic data sets with a query hmm and is available in the HHSuite (Hildebrand et al. 2009;Remmert et al. 2012). The second tool, hmmsearch, is included in the HMMER suite (Eddy 2011) and used to search proteomic data sets with an HMM. A PADI alignment was generated from our initial data set of known homologs for use with both tools using clustal omega on default parameters for three iterations to generate our query HMM. HHBlits was used to search the Uniclust30 database. Its construction and contents are detailed on the MMseqs website (Steinegger and Söding 2017)  MBE download/). The HHBlits search results were filtered with a cutoff of 90À probability and 50 amino acids. No additional hits were found to sequences in clades that were not in the starting data set.

Analysis of Spurious Viral and Archaeal Hits
The HMMsearch results were filtered with an E-value cutoff of 10 À10 and 50 amino acids. Two sequences attributed to unexpected clades were found: RefSeq identifiers AXN91134.1 and RCV64870.1 which are found in Namao virus and Methanophagales archaeon, respectively. The two anomalous sequences are the only representatives of the PADI family within their taxonomic kingdoms and this extremely sparse distribution of these sequences would either imply many independent gene loss events or an extremely recent horizontal transfer event of PADI to these clades if the genes are in fact correctly attributed to their genome. To verify the validity of the attribution of these sequences to their respective genomes, we calculated a phylogeny including by aligning a subset of high confidence PADI sequences with the two putative homologs. The alignment was then used with IQTree (Trifinopoulos et al. 2016) on default parameters and automatic selection of the appropriate model to generate a phylogeny. The resulting tree was visualized with figtree (Rambaut 2016) (supplementary fig. S2, Supplementary Material online). The placement of the two sequences in the phylogeny does not agree well with a plausible evolutionary scenario considering their taxonomic origin; rather their placement suggests that the allegedly Methanophagales archaeon sequence is in fact a cyanobacterial sequence, and that the allegedly Namao virus sequence, is in fact a fish sequence. These hypotheses are corroborated by the origin of the samples used to obtain them: metagenomic isolates in the case of Methanophagales archaeon, and infected tissue samples taken from fish in the case of Namao virus. In both cases, the samples were susceptible to gene misattribution due to incorrect binning (Sangwan et al. 2016) or contamination. To further probe the association of these genes to their respective genomes and discount the possibility of the PADI genes belonging to a transferred genomic segment, we used the k-mer spectra of the genomes to study the possibility of horizontal transfer events. These analytics are regularly used to find transferred regions in prokaryotic genomes (Bernard et al. 2018). Normalized k-mer spectra for DNA sequences were generated by counting occurrences of all kmers and normalizing by the total amount of words counted to give a unit vector. The results presented in supplementary figure S3, Supplementary Material online, were derived using 4-mers. To detect possible horizontally transferred genomic regions, an average spectrum for the entire genome was calculated. A spectrum was then calculated for a sliding window of 1 kb using 500-bp steps and subtracted from the genomic average at each window position. The absolute value of the difference between the genomic average and window spectra is represented over the entire genome. The code for running these kmer-based analyses is available at https://github.com/ DessimozLab/PADI (last accessed June 7, 2020).

Phylogenetic Methods
For all phylogenetic trees, branch support information was visualized and figures produced using FigTree v1.4.3 and iTOL (Letunic and Bork 2016). Amino acid sequences for PADI homologs were obtained from UniProtKB, NCBI, and Pathosystems Resource Integration Center (PATRIC) databases using HMMER and BLAST searches (Altschul et al. 1990;Finn et al. 2015;Wattam et al. 2017). PADI2 was used for species with multiple PADI paralogs, as it closest resembles the PADI gene in metazoa with one PADI (such as fish; György et al. 2006), and with the PADI2 from metazoan species with three PADIs such as birds or reptiles (supplementary fig. S7, Supplementary Material online).

Phylogenetic Analysis of Other Citrullinating Enzymes
Sequences of the arginine deiminase from Giardia lamblia (gADI; Touz et al. 2008) and the porphyromonas-type peptidylarginine deiminase from Porphyromonas gingivalis (pPAD; McGraw et al. 1999) were used as a seed for HMM searches of reference proteomes to identify sequences from other species of similar length and most significant similarity (Finn et al. 2015;Potter et al. 2018). These amino acid sequences were aligned with 25 representative PADI sequences using MAFFT L-ins-I (Katoh et al. 2018) and singly aligning columns were removed. IQTree was used to produce a maximum likelihood phylogenetic tree (Nguyen et al. 2015;Trifinopoulos et al. 2016). The LG empirical rate matrix with eight categories of rate variation under the FreeRate model (LG þ R8) was used, as determined by ModelFinder (Le and Gascuel 2008;Kalyaanamoorthy et al. 2017) according to the corrected Akaike Information Criterion. The Ultrafast Bootstrap 2 with 1,000 replicates (Hoang et al. 2018), Shimodaira-Hasegawa (SH)-like approximate likelihood-ratio test (aLRT) with 1,000 replicates (Shimodaira andHasegawa 1999, 2001;Guindon et al. 2010), and aBayes parametric tests (Anisimova et al. 2011) were used to assess node support.

Phylogenetic Analysis of PADI Orthologs
All PADI sequences in the UniProtKB rp55 database were obtained using HMMER and fragment sequences (<450 amino acids) were removed . Sequences were aligned using MAFFT L-ins-I (Katoh et al. 2018) and the alignment trimmed with TrimAL using gappyout settings (495 columns) (Capella-Guti errez et al. 2009). Bayesian phylogenetic analysis was performed using MrBayes v3.2.7 x64 using the CIPRES Gateway on XSEDE with the Markov chain Monte Carlo (MCMC) sampling different amino acid rate matrices according to their probability (Aamodelpr¼mixed) and five gamma distributed rate categories to allow among site rate variation (Ronquist et al. 2012). Maximum likelihood phylogenetic analysis was performed using IQTree (WAGþR5þF) with node support tested using Ultrafast Bootstrap 2 with 1,000 replicates (Nguyen et al. 2015;Trifinopoulos et al. 2016;Hoang et al. 2018).
Meaningful statistical inference becomes challenging if the number of parameters exceeds the sample size. A useful proxy for the sample size in phylogenetic analysis is given by the HGT of Citrullination from Cyanobacteria to Animals . doi:10.1093/molbev/msab317 MBE number of columns in the alignment setting a constraint for the total number of taxa that can be analyzed in the single gene tree. An unbiased subsample of bacterial sequences was obtained by including all bacterial PADI sequences contained in the PATRIC database when the analysis was performed (82 taxa in total) (Wattam et al. 2017). Metazoan sequences were subsampled to maximize representation of lineages maintaining a PADI (33 in total: the five paralogs in Homo sapiens, Pongo abelii, and in Mus musculus, the three paralogs found in Gallus gallus, Chelonia mydas, and Alligator mississipiensis, and the single paralog found in Xenopus laevis, Takifugu rubripes, Tetraodon nigroviridis, Astyanax mexicanus, Danio rerio, Oncorhynchus mykiss, Callorhinchus milii, Branchiostoma floridae, and Priapulus caudatus). Finally, 35 fungal sequences were subsampled to span the range of sequence diversity with respect to the human sequence according to HMMER bitscore spanning a range of significances of fungal proteins giving E-values between 5.0Â10 À26 and 1.4Â10 À46 . Sequences are provided in full in supplementary files 3 and 6, Supplementary Material online. The collected amino acid sequences were aligned using MAFFT L-ins-I and singly aligning columns were removed (1,100 columns) (Katoh et al. 2018). IQTree was used to produce maximum likelihood phylogenetic trees (Nguyen et al. 2015;Trifinopoulos et al. 2016). The WAG empirical rate matrix with ten categories of rate variation under the FreeRate model with base frequencies counted from the alignment (WAGþR10þF) was used, as determined by ModelFinder according to the corrected Akaike Information Criterion (Whelan and Goldman 2001;Kalyaanamoorthy et al. 2017). Ultrafast Bootstrap 2 with 1,000 replicates, SH-like aLRT with 1,000 replicates, and aBayes parametric tests were used to assess node support (Shimodaira andHasegawa 1999, 2001;Guindon et al. 2010;Anisimova et al. 2011;Hoang et al. 2018). The tree is shown rooted at the midpoint with solid circles indicating consensus node support of >95%. The critical nodes for testing different evolutionary hypotheses mentioned in later analyses are labeled in full.

Phylogenetic Analysis of Subsampled PADI Orthologs for Topology Testing
For parameter rich analyses, 50 sequences were subsampled from the larger tree. We removed multiple paralogs in metazoa using the basal paralog PADI2 and removed sequences with close branches so as to maintain the maximum sequence diversity in the tree (9 fungi, 13 metazoa, 29 bacteria). In addition, both the closest and the most distant bacterial homologs with respect to the metazoan sequence were retained to allow for the broadest distribution of protein sequences. To assess the effect of sequence subsampling, a maximum likelihood phylogenetic tree using the fixed WAG empirical rate matrix was performed, with five categories of rate heterogeneity across sites allowed under the FreeRate model, and with base frequencies estimated by maximum likelihood (WAGþR5þFO) to check for congruence with the larger tree topology. This analysis was then repeated using the original Felsenstein bootstrap with 100 replicates.
Additional maximum likelihood phylogenetic analyses were performed using IQTree using the CIPRES Gateway on XSEDE (Quang et al. 2008) or using IQTree version 1.6.12 with node support assessed by Ultrafast Boostrap 2 with 1,000 replicates. Additional parameter rich models included the C20 mixture model of fixed empirical rate matrices with base frequencies estimated from the alignment (C20þFO), the C20 mixture model of empirical rate matrices where rate heterogeneity across sites was also relaxed according to the free rate model (WAGþFþC20þR5), the GHOST model (Crotty et al. 2020), which is specifically designed to analyze heterotachous data sets, where different classes of branch lengths are inferred across the tree (WAGþFOþH4), and the most general form of the GHOST model, where relative rate and base frequency parameters are unlinked and separated across the different branch length classes (WAGþFO*H4). In addition, a constraint tree was inferred using the WAGþFþC20þR5 model, where sequences from opisthokonta were constrained to be monophyletic. All trees were concatenated and used for topology testing in IQTree using the AU test. Log likelihoods and the Bayesian Information Criterion are presented alongside P values for the AU test in figure 1b.
Bayesian phylogenetic inference was firstly performed using MrBayes v3.2.6 x64 using CIPRES Gateway on XSEDE with mixed model MCMC jumping across different fixed empirical rate matrices and five different gamma distributed rate categories (Ronquist et al. 2012). Analysis was performed with four runs each of 1,000,000 chains. The average SD of split frequencies was observed to be <0.005, parameters all had an effective sample size (ESS) >500 and potential scale reduction factor of 1.000 (to four significant figures). The summary tree was generated with a burn-in of 25% over the runs. Posterior probability was used for node support-that is, where posterior probability was 100, the topology was congruent in every tree sampled by the MCMC after burn-in. The aminoacid model prior was set as "mixed" such that the MCMC jumps across different models, that is, mixture of models with fixed rate matrices. Poisson, Jones, Dayhoff, Mtrev, Mtmam, Wag, Rtrev, Cprev, Vt, and Blosum models were used and all have equal prior probability. The WAG model had posterior probability of 1.000, and SD <0.0001-and was exclusively sampled from the posterior (Whelan and Goldman 2001). This is consistent with the WAG model being identified as the best empirical matrix identified according to ModelFinder and the corrected Akaike information criterion from the maximum likelihood analysis in IQTree.
A second approach to Bayesian phylogenetic inference was performed using PhyloBayes under the CAT-GTR model (Lartillot and Philippe 2004;Lartillot et al. 2013). This is an infinite mixture model of rate matrices making use of a Dirichlet process prior. Eight chains were performed in parallel for 24 h such that more than 20,000 cycles were achieved as recommended in the PhyloBayes manual using the MRC IGMM and University of Edinburgh computing cluster Eddie3. Readpb, bpcomp, tracecomp tools in PhyloBayes and Tracer software were then used to analyze runs. Posterior consensus trees were generated for each run and were reproducible across the eight different runs. The trace plots for independent runs were also analyzed to assess for apparent stationarity aiming for an ESS of at least 100. Maxdiff was observed to be <0.1 (maxdiff ¼ 0.06209, meandiff ¼ 0.00330).
Tree topologies were congruent across the different methods with tree files provided in full ( fig. 1b and supplementary  files 4 and 5, Supplementary Material online). Topology testing of parameter rich models and maximum likelihood constraint trees was performed using IQTree version 1.6.12 and results are provided in figure 1b. Additional topology testing was performed in PAUP*4.0a163, where 100 random trees were generated along with the maximum likelihood constrained tree with fungal and metazoan sequences constrained to be monophyletic. The SH test, approximately unbiased (AU) test and expected likelihood weight (ELW) tests were performed and all other alternative trees, including the constraint tree were rejected (P < 0.001) (Shimodaira and Hasegawa 2001;Shimodaira 2002;Strimmer and Rambaut 2002;Susko 2014).

Phylogenetic Analysis Excluding Synapomorphic Regions
Phylogenetic analyses from figure 1 were repeated using an alignment with the PAD_N domain removed and with an alignment in which both the PAD_N domain and regions of synapomorphy (supplementary fig. S4, Supplementary Material online) were removed. Maximum likelihood analysis using IQTree with ModelFinder using the same best performing fixed empirical rate matrix (WAGþR5þFO) as above (Nguyen et al. 2015;Trifinopoulos et al. 2016;Kalyaanamoorthy et al. 2017;Hoang et al. 2018). Topologies were congruent with analysis of the whole alignment and node support values (Ultrafast Bootstrap 2) for the clades labeled in figure 1

PADI Domain Annotation
To identify putative locations for the three PAD domains within PADI homolog sequences from bacteria and fungi, each target PADI sequence was aligned to five metazoan sequences using TCoffee (Di Tommaso et al. 2011). Putative domain sequence regions were then used as a target query for HMMER or HHPred searches Finn et al. 2015). HMMER searches were made against the UniProtKB database and HHPred searches were performed, firstly against a database of HMM profiles of protein domains in the Protein Data Bank (PDB_mmCIF_4_Aug) and secondly, against a database of profiles from Pfam (Pfam-A_v31.0) (Alva et al. 2016). Once individual sequences were identified as possessing a specific domain architecture, multiple sequence alignments of groups of sequences with common putative domain architecture were made and these were used as queries for each type of search.
For the reported E-values in supplementary figure S4a, Supplementary Material online, the following method was used. All sequences from the highlighted clade in the phylogenetic tree were aligned using TCoffee. PAD_C, PAD_M, and PAD_N domains from the cyanobacterial sequences, and secondly PAD_C and PAD_M domains from the clade containing a mixture of bacterial and fungal sequences were extracted. These alignments were used as a seed for searches with HHPred against a database of profiles made of the entire human proteome, and against a database of profiles of Pfam domains (Pfam-A_v31.0). HHPred searches were performed using the MPI Bioinformatics Toolkit of the Max Planck Institute for Developmental Biology, Tübingen, Germany (Alva et al. 2016;Zimmermann et al. 2018).

Multiple Sequence Alignment of PAD_N Domain
Amino acid sequences were aligned using the TCoffee algorithm (Edgar 2004;Di Tommaso et al. 2011) and visualized using Jalview (Waterhouse et al. 2009). Putative PAD_N domains from the SPM/NX clade cyanobacterial PADI sequences were identified using HHPred as showing significant statistical evidence for affinity (E-value: 2.5Â10 À5 ) (Alva et al. 2016;Zimmermann et al. 2018). These were aligned with the PAD_N domain from human PADI paralogs and Rhincodon typus (whale shark). The alignment was presented with the program Belvu using a coloring scheme indicating the average BLOSUM62 scores (which are correlated with amino acid conservation) of each alignment column (Henikoff and Henikoff 1992), as represented in figure 2a. PsiPred (Jones 1999) was used to predict secondary structure for the cyanobacterial PAD_N domains (beta sheets) and presented with the alignment. The experimental secondary structure of the PAD_N domain of human PADI2 was identified from the crystal structure (PDB: 4n2a) (Slade et al. 2015).

Synapomorphy Analysis of PADI Calcium-Binding Sites
Representative fungal, actinobacterial, cyanobacterial, and metazoan PADI sequences were analyzed for the conservation of all of the calcium-binding sites (a minimum of three residues coordinate each calcium-binding site) and for other critical residues contained at the active site ( fig. 2) Bitscore density is calculated by taking the bitscore of a query sequence to the target sequence produced by HMMER and dividing by the bitscore of the query sequence to itself (longer sequences have higher bitscores), which gives a value between 0 and 1 (Finn et al. 2015). The bitscore densities of the similarity of 1) the cyanobacterial homolog to the human sequence: ÂbitscoreD Cy-Hu (ACþAH) and 2) of the branchiostomal homolog to the human sequence: DbitscoreD Br-Hu (XBþXH) were both calculated (supplementary fig. S7, Supplementary Material online). A measure of the total accumulated genetic divergence between late-diverging cyanobacteria (Cyanothece spp.) and the last common ancestor of Branchiostoma spp. and Homo sapiens was then calculated by subtracting DbitscoreD Br-Hu from the DbitscoreD Cy-Hu . This accumulated genetic divergence (AGD) value was calculated for: 1) 26 ribosomal proteins (uS2, uS3, uS4, uS5, uS7, uS8, uS9, uS10, uS11, uS12, uS13, uS17, uS19, uL1, uL2, uL3, uL4, uL5, uL6, uL11, uL13, uL14, uL15, uL22, uL23, uL24), 2) 19 sequences whose proteins are mitochondrially located so are reasonable EGT candidates from the mitochondrion (OTC, ASS1, ARLY, CPS1, PGK, ENO, GAPDH, PK, NAXE, G6PD, RPIA, FUMH, SDHB, SDHA, CS, MDHM, DLAT, DLDH, ACLY) (Timmis et al. 2004), and 3) all ten proteins still encoded in the mitochondrial genome (MT-ATP6, MT-CO1, MT-CO2, MT-CO3, MT-CYB, MT-1, MT-2, MT-3, MT-4, MT-5). It is notable that by definition, only very highly conserved proteins have an AGD that can be calculated in this extreme example between the last common ancestor of late diverging cyanobacteria and humans: if a protein has diverged substantially then the similarity of the human homolog to the cyanobacterial will not be discernible and no bitscore can be calculated. AGD values of proteins in each category were tested for deviation from a normal distribution using the Shapiro-Wilk test (W¼b 2 / SS) (Shapiro and Wilk 1965). Where the calculated P value exceeded 0.05, the null hypothesis was retained, and the data treated as being normally distributed. Kurtosis and skew were also within the range of the normal distribution. The AGD for PADI proteins (AGD PADI proteins ¼ 0.07) was then compared with the mean AGD of each category of control proteins (e.g., AGD ribosomal proteins ¼ 0.70) and the zscores were calculated and are presented as P values.
To compare the extent of divergence relative to ribosomal RNA (rRNA), nucleotide sequences for rRNA were obtained from the SILVA database (Yilmaz et al. 2014). Nucleotide sequences for PADIs were obtained from NCBI and exons extracted. Comparisons were made with EMBOSS Needle using the Needleman-Wunsch global alignment algorithm (Needleman and Wunsch 1970) (gap open: 10, gap extend: 0.5).

Sequence Divergence Analyses
From the AGD analysis performed above, the median EGT candidate protein was selected as a control (ENO). Sequence divergence analysis was performed on ENO and PADI DNA sequences. BEAST v2.4.8 was used to produce a time tree of the clade of subsampled metazoan PADIs and the full clade of closest SPM/NX cyanobacteria contained within the PATRIC database using the GTR model with four gamma-distributed rate categories (Drummond and Rambaut 2007;Bouckaert et al. 2014;Uyeda et al. 2016). ENO sequences from the same species under the same model specifications as PADIs were used for the control analysis. The following metazoan species were used: Homo sapiens (HS), Mus musculus ( . DNA sequences were translated in silico and sequence before the start codon and after the stop codon was removed. DNA sequences were aligned using MAFFT L-ins-I and singly aligning columns were removed. Metazoan divergence times from the fossil record were obtained from timetree.org with bounds on the distributions chosen to span the range of times reported in the literature centered on the median value (Kumar and Hedges 2011). The calibrated Yule model was used as the tree prior. XML files were generated in BEAUti and the MCMC analysis was run using BEAST2 on the CIPRES Gateway on XSEDE. An initial MCMC run of 5,000,000 chains was run for each clock model (Drummond et al. 2006;Drummond and Suchard 2010). Then analysis was performed with two independent runs of 10,000,000 chains under two different clock models-the strict clock model and the relaxed uncorrelated lognormal (UCLN) clock model. The UCLN model relaxes the strict clock by allowing rate heterogeneity across branches: each branch is assumed to have its own rate that is drawn from a shared parametric rate distribution (the log-normal distribution). The different analyses were additionally run under the tree prior (i.e., in the absence of sequence data). Analysis of parameters was performed in Tracer to assess apparent stationarity for the different tree parameters and for acceptable ESS values and congruence was assessed across the independent runs. The predicted divergence time of the metazoan and cyanobacterial clades was given by the marginal posterior distribution of the age of the root of the whole tree. This is given by the TreeHeight parameter. These data were plotted with the kernel density estimate against the TreeHeight parameter for the different runs and summarized in a box and whisker plot. Summary data for the TreeHeight parameter are provided in figure 3E and include the highest posterior density 95% credible interval.
In calculating AGD for each mitochondrially encoded protein as compared with its own closest bacterial homolog (as opposed to the homolog from Cyanothece sp 8801), we tested for normality using the Shapiro-Wilk test (P ¼ 0.109), retaining the null hypothesis that the points are normally distributed. We then calculate the z-statistic for the PADI AGD to its Cummings et al. . doi:10.1093/molbev/msab317 MBE nearest homolog (z¼À2.439). This corresponds to a P value ¼ 0.0073. With a P value of 0.0073, we therefore reject the null hypothesis that PADIs fall within the normal distribution of AGD values calculated for mitochondrially encoded proteins relative to their closest bacterial homolog.

Preparation of Recombinant Proteins
PADI gene sequences were obtained from NCBI and synthesized by Thermo GeneArt with flanking EcoRI (at the 5 0 end) and XhoI (at the 3 0 end) restriction sites. Cyanothece sp. 8801 PADI and human PADI4 sequences were subcloned into a modified pGEX vector, which included an additional 10Â His tag immediately N-terminal of the enzyme sequence (generous gift from Dr Martin Reijns, MRC Human Genetics Unit), by InFusion cloning. GST-His-PADI4 and GST-His-cyanoPADI were expressed in BL21 (DE3) in 2TY cultures. Cells were grown (37 C; 180 rpm) to an OD 600 of 0.6 and induced overnight at 18 C with 0.5 mM b-D-1-thiogalactopyranoside (IPTG). Bacterial pellets were harvested by centrifugation (8,000Âg; 10 min) and frozen at À80 C. Cell pellets were resuspended in 50 mM Tris pH 7.5, 500 mM NaCl, 20 mM imidazole, 5% glycerol, 1 mM DTT (1 g dry cell mass in 4 ml), 1Â EDTA-free protease inhibitors (Roche), 5 mM MgCl 2 , and 10 units benzonase at 4 C with stirring. Cells were lysed on ice by sonication (7Â45 s, with 45 s breaks) and the lysate was cleared by centrifugation (20,000Âg; 20 min). Supernatant was sterile filtered (0.2 lm) before loading by Superloop onto a 5 ml HisTrap column, which was pre-equilibrated with binding buffer. Proteins were purified using an AKTA FPLC system (GE Healthcare). The column was washed with 50 mM Tris pH 7.5, 500 mM NaCl, 25 mM imidazole, 5% glycerol, 1 mM DTT, and the recombinant proteins were eluted with 50 mM Tris pH 7.5, 500 mM NaCl, 250 mM imidazole, 5% glycerol, 1 mM DTT. The purified sample was concentrated using Vivaspin MWCO filters into 50 mM HEPES pH 7.5, 150 mM NaCl, 5 mM DTT, 5% (v/v) glycerol, and the concentration determined using Nanodrop.

Citrullination Activity Assays
Using Mouse Cell Lysates E14 mouse embryonic stem cells were cultured in GMEM supplemented with 10% fetal calf serum, 0.1 mM nonessential amino acids, 2 mM L-glutamine, 1 mM sodium pyruvate, 0.1 mM beta-mercaptoethanol, and 10 6 units/l leukemia inhibitory factor (ESGRO, Millipore) and grown on a six well plate until 70% confluent. Cells were harvested in 0.5% NP-40, 20 mM Tris pH 7.6, 1Â EDTA-free protease inhibitors, 5 mM DTT, after two washes in PBS (one in PBS containing 2 mM EDTA and one in plain PBS). To shear chromatin and clarify lysates, benzonase and 2 mM MgCl 2 were added and samples were rotated at 4 C for 30 min, sheared by passing through a 25G needle and centrifuged at 20,000xg for 5 min. Citrullination activity assays were performed with 500 nM recombinant enzyme in 50 mM HEPES pH 7.5, 150 mM NaCl, 5 mM DTT, 5% (v/v) glycerol, either in the presence of 5 mM CaCl 2 or water. Reactions were incubated for 30 min at 37 C and quenched by boiling at 95 C for 5 min. Samples were stored at À80 C before immunoblotting.
Using Recombinant Histone H3 Substrate Reactions were performed in 50 mM HEPES, 137 mM NaCl, 5 mM DTT with 1.5 lM recombinant H3 (New England Biolabs), vehicle or 500 nM recombinant enzyme, and in the presence of either 5 mM CaCl 2 or water. Reactions incubated for 30 min at 15 C or 37 C and quenched by boiling at 95 C for 5 min before immunoblotting.

Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.