Phylogenomic Test of the Hypotheses for the Evolutionary Origin of Eukaryotes

The evolutionary origin of eukaryotes is a question of great interest for which many different hypotheses have been proposed. These hypotheses predict distinct patterns of evolutionary relationships for individual genes of the ancestral eukaryotic genome. The availability of numerous completely sequenced genomes covering the three domains of life makes it possible to contrast these predictions with empirical data. We performed a systematic analysis of the phylogenetic relationships of ancestral eukaryotic genes with archaeal and bacterial genes. In contrast with previous studies, we emphasize the critical importance of methods accounting for statistical support, horizontal gene transfer, and gene loss, and we disentangle the processes underlying the phylogenomic pattern we observe. We first recover a clear signal indicating that a fraction of the bacteria-like eukaryotic genes are of alphaproteobacterial origin. Then, we show that the majority of bacteria-related eukaryotic genes actually do not point to a relationship with a specific bacterial taxonomic group. We also provide evidence that eukaryotes branch close to the last archaeal common ancestor. Our results demonstrate that there is no phylogenetic support for hypotheses involving a fusion with a bacterium other than the ancestor of mitochondria. Overall, they leave only two possible interpretations: one based on the early-mitochondria hypotheses, which suppose an early endosymbiosis of an alphaproteobacterium in an archaeal host, and one based on the slow-drip autogenous hypothesis, in which early eukaryotic ancestors were particularly prone to horizontal gene transfers.


Figure S2:
Phylogenies of the LECA gene HBG298928_1 ("Alpha/beta hydrolase fold protein") using 183 (left) or 882 (right) prokaryotic genomes. This figure illustrates one of the subtle issues caused by sampling. This gene was initially labeled "actinobacteria-related" on the basis of the 183-prokaryotic-genomes dataset, as all its homologs were actinobacterial. The 882-prokaryotic-genomes dataset then showed that it was also present at low frequencies in at least three orders of Gammaproteobacteria. In addition, depending on where the larger tree was rooted, the sister group of Eukarya was Actinobacteria, Gammaproteobacteria, or both. Thus an important part of the molecular history of this family of homologs was missed, which rendered the initial annotation questionable.

Figure S3:
Phylogeny of the LECA gene HBG487932_1 ("Cytochrome b"). This tree illustrates the limits of considering only the support of the node at the base of the stem of Eukaryotes. For this LECA clade, the NBS support was 7%, even though the tree is rather well resolved. In contrast, the SGS score was 68%.

Figure S4:
LECA clades displayed according to the first and second levels of the KEGG ORTHOLOGY ontology. Some LECA clades may appear in several categories.

Figure S5:
Comparison of the annotations of LECA clade positions based on the "reference" configurations (main-text Fig. 1), a "relaxed" criterion, or a "naive" one. (A) Schematic diagrams for the criteria. The relaxed criterion was similar to the reference one, except that eukaryotes were allowed to branch as a sister group of the putatively related prokaryotic sequences (whereas the reference criterion required that they branch among them). With the naive criterion, a relationship was inferred whenever eukaryotes had a taxonomically homogeneous sister clade, even when this clade consisted of only one sequence (e.g. main-text Fig. 1C). (B) Annotations obtained using the naive (left), relaxed (middle), or reference (right) criteria. The figure is to be read like main-text Fig. 2. The LECA clades appear in the same order in the three panels. The sorting is based on the left panel (note that the rows of the right panel are the same as those of main-text Fig. 2, but re-sorted).
The greatest difference was between the naive and relaxed criteria, that is, when taxonomic representativeness criteria were introduced. Most often, the relaxed and reference criteria differed only quantitatively. The reference criterion was adopted because it was immune to rooting issues.
Interestingly, the ability of the method to detect putatively alphaproteobacterial genes varied little with the criterion used. Indeed, the numbers of candidate alphaproteobacteria-related LECA clades using the naive and reference criteria were 46 (>50% of alphaproteobacteria-related bootstrap replicates) and 41 (>5% of alphaproteobacteria-related replicates, representing more than 80% of all replicates annotated to a specific taxonomic group), respectively.
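
The two candidate thresholds quoted above can be illustrated with a minimal sketch. It assumes that each bootstrap replicate of a LECA clade has already been annotated with a taxonomic group name, or with `None` when no specific group was inferred. The function names and the input layout are hypothetical and not taken from the original analysis, and the reading of the reference thresholds (a fraction of all replicates, then a share of the group-annotated replicates) is one plausible interpretation of the caption.

```python
from typing import Optional, Sequence

ALPHA = "Alphaproteobacteria"


def is_candidate_naive(annotations: Sequence[Optional[str]],
                       group: str = ALPHA,
                       min_fraction: float = 0.50) -> bool:
    """Candidate if more than `min_fraction` of all replicates point to `group`."""
    if not annotations:
        return False
    n_group = sum(1 for a in annotations if a == group)
    return n_group / len(annotations) > min_fraction


def is_candidate_reference(annotations: Sequence[Optional[str]],
                           group: str = ALPHA,
                           min_fraction: float = 0.05,
                           min_share: float = 0.80) -> bool:
    """Candidate if `group` exceeds `min_fraction` of all replicates AND
    `min_share` of the replicates annotated to any specific group."""
    if not annotations:
        return False
    n_group = sum(1 for a in annotations if a == group)
    # Replicates annotated to some specific taxonomic group (assumption:
    # None marks replicates with no specific annotation).
    n_specific = sum(1 for a in annotations if a is not None)
    if n_specific == 0:
        return False
    return (n_group / len(annotations) > min_fraction
            and n_group / n_specific > min_share)


# Hypothetical clade: 10 alphaproteobacterial replicates, 1 actinobacterial,
# 89 without a specific annotation.
weak_but_consistent = [ALPHA] * 10 + ["Actinobacteria"] + [None] * 89
# Hypothetical clade: signal split between two groups.
split_signal = [ALPHA] * 60 + ["Gammaproteobacteria"] * 40
```

Under these assumptions, `weak_but_consistent` fails the naive threshold but passes the reference one, whereas `split_signal` passes the naive threshold but fails the reference one because only 60% of its group-annotated replicates are alphaproteobacterial.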