Main

The origin of the eukaryotic cell, with its complex and compartmentalized features, is regarded as the biggest evolutionary discontinuity since the advent of cellular life on Earth1,2. Yet, many key details regarding eukaryogenesis, the series of evolutionary events that led to the emergence of the eukaryotic cell from prokaryotic ancestors some 2 billion years ago3,4, remain elusive. The eukaryotic cell is the result of a symbiosis comprising an archaea-related host cell5,6 and a bacterial endosymbiont, the mitochondrial progenitor7,8. While the identity of the endosymbiont was traced back to the Alphaproteobacteria several decades ago9,10, the archaeal host remained elusive until recently. This changed with the discovery of Asgard archaea (phylum Asgardarchaeota11), which were shown to represent the closest prokaryotic relatives of the archaeal host cell from which eukaryotes evolved12,13,14,15. Analysis of Asgard archaeal genomes revealed the presence of numerous homologues of proteins previously deemed eukaryote-specific—so-called eukaryotic signature proteins (ESPs)16. Intriguingly, many of these ESPs represent fundamental building blocks of eukaryotic cellular complexity, including proteins essential for vesicular biogenesis and trafficking, as well as for the dynamic eukaryotic cytoskeleton. Recent work has indicated that several Asgard archaeal ESPs function similarly to their eukaryotic counterparts17,18,19,20, suggesting that Asgard archaea might display eukaryote-like cellular features beyond the dynamic actin cytoskeleton observed in the first enrichment cultures21,22. However, the detailed cellular characteristics and level of complexity of present-day Asgard archaea and the Asgard archaeal ancestor of eukaryotes remain unclear.

The definition, identification and characterization of ESPs are crucial for reconstructing the ancestral Asgard archaeal lineage and understanding its contributions to eukaryogenesis. Yet, the identification process is currently limited by several factors. Defining ESPs has proven challenging as increasingly sensitive homology search algorithms and improved sampling of genomic diversity across the tree of life have facilitated the discovery of ESP homologues in diverse prokaryotes13,15,23, including Asgard archaea12,13,15,23. While this has clarified the prokaryotic origins of many proteins in the last eukaryotic common ancestor, it has also reduced the set of strictly eukaryote-specific proteins. Therefore, a more relaxed definition of ESPs has been adopted, referring to proteins associated with conserved key eukaryotic processes6, or more specifically related to cellular complexity22. Furthermore, many eukaryotic proteins, especially those absent in common model organisms, remain poorly characterized. This, coupled with the limitations of sequence homology detection, makes it difficult to identify ESPs. Given the extensive divergence between present-day Asgard archaeal and eukaryotic proteins, reliable homology detection remains challenging. It becomes increasingly difficult to infer homology between two proteins with decreasing sequence similarity24. As the stem separating eukaryotes from their archaeal relatives represents one of the longest branches in the tree of life14,15, sequences from present-day Asgard archaea and eukaryotes have diverged extensively. Therefore, homology between these two groups might not even be detected, even when using sensitive methods24. However, protein structure is several times more conserved than protein sequence25, and structural information has been shown to increase the sensitivity of sequence homology inference26. 
Recent advances in de novo protein structure prediction using AlphaFold27 and related tools enable the large-scale generation of high-quality protein structure models. Combined with new methods to efficiently search large databases for similar structures28, it has become feasible to identify highly divergent homologues by using structural information29,30. This is particularly useful for non-model organisms, for which very few protein structures have been resolved. For example, the Protein Data Bank currently contains fewer than 50 Asgard archaeal protein structures (accessed on 31 March 2025).

Here, we explore these recent advances in protein structure prediction and comparison tools to expand the identification and characterization of ESPs in Asgard archaea beyond sequence similarity. By analysing an extended Asgard archaeal pangenome, we identified 908 new structure-based ‘isomorphic’ ESPs (iESPs), more than tripling the overall number of reported Asgard archaeal ESPs. Our structural catalogue of the Asgard archaeal pangenome reveals a marked increase of Asgard archaeal ESPs involved in information storage and processing, and in cellular processes and signalling, suggesting that the archaeal ancestor of eukaryotes was more eukaryote-like than was previously assumed.

Results

Structural modelling of the Asgard archaeal pangenome

To generate structural models of representative proteins encoded by the Asgard archaeal pangenome, we analysed a diverse set comprising 936 Asgard archaeal draft genomes (Fig. 1a and Supplementary Data 1), including 497 metagenome-assembled genomes (MAGs) that were compiled and described in a recent study31. In addition to the previously sampled Asgard archaeal diversity11,15, this expanded dataset encompasses MAGs from Atabeyarchaeia32 and Ranarchaeia31, two additional deep-branching clades (Extended Data Fig. 1a). We grouped protein sequences encoded by these Asgard archaeal genomes by combining reference-based clustering into previously established Asgard archaeal clusters of orthologous genes (AsCOGs)23 with de novo gene clustering (Fig. 1b). This resulted in 96% of Asgard archaeal proteins grouped in 37,313 clusters of at least 5 proteins, including 22,609 de novo clusters (Fig. 1b). For computational feasibility, we selected one evolutionary representative protein sequence per cluster (Methods) to generate a high-quality structural model (Fig. 1c).

Fig. 1: Modelling the Asgard archaeal structural pangenome.

a, The number of Asgard archaeal draft genomes per group in the database used for pangenome-wide structural analyses (also see Extended Data Fig. 1a). Fill colour indicates publicly available genomes (grey) and newly added Asgard archaeal draft genomes (blue), respectively. b, Protein sequence clustering into existing Asgard archaeal COGs and de novo clustering with unassigned proteins. The x axis indicates the number of proteins and the y axis the number of respective clusters. Fill indicates protein sequences from publicly available genomes (grey) and added Asgard archaeal draft genomes (blue), respectively. c, Workflow for the pangenome-wide prediction of Asgard archaeal protein structures. d, Scatter plot depicting pLDDT scores of structure predictions of 100 randomly selected Prometheoarchaeum syntrophicum proteins computed with the default (x axis) and the Asgard archaea-enriched (y axis) ColabFold database, respectively. The diagonal black line indicates x = y, and the purple line indicates linear correlation fitted to the data. e, The distribution of average pLDDT scores of 37,223 predicted Asgard archaeal protein structures. MSA, multiple sequence alignment.

To determine an efficient and effective approach for de novo structure prediction, we modelled structures for 100 randomly selected proteins of the Asgard archaeon Prometheoarchaeum syntrophicum (Supplementary Data 2). As AlphaFold relies on homology information to predict protein structure, it tends to perform poorly if few homologues are found within its reference sequence database27. To address this issue, we used ColabFold33, an accelerated AlphaFold workflow, with an expanded database containing all available Asgard archaeal protein sequences. In addition, we used ESMfold34, a prediction tool based on a protein language model that circumvents the time-consuming sequence homology search. We classified predictions as high quality if they had an average predicted local distance difference test (pLDDT) score of at least 80. We found that incorporating the Asgard archaeal proteins into the ColabFold homology search database led to better models for some proteins (Fig. 1d and Extended Data Fig. 1b). Overall, we obtained the largest number of high-quality structure predictions when combining protein language model and sequence alignment-based techniques (Extended Data Fig. 1c). To optimize workflow efficiency, we first predicted a structure for each representative protein sequence using the fast ESMfold algorithm, and employed the more time-consuming ColabFold method only if the average pLDDT score was below 80 (Fig. 1c and Extended Data Fig. 1c,d). This approach resulted in 37,223 predicted structures with a median pLDDT of 82 (interquartile range (IQR) 71–86), covering 99.8% of all clusters (Fig. 1e).
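The two-tier prediction strategy can be illustrated with a minimal sketch. The predictor callables below are hypothetical stand-ins for ESMfold and ColabFold, not their real interfaces, and the rule of keeping the higher-scoring model after a rerun is an assumption; the text states only the pLDDT threshold of 80 that triggers the slower method.

```python
from typing import Callable, Tuple

PLDDT_CUTOFF = 80.0  # high-quality threshold stated in the text

# A predictor maps a sequence to (model, mean pLDDT).
# Both predictors passed in are hypothetical stand-ins, not real tool APIs.
Predictor = Callable[[str], Tuple[str, float]]


def predict_with_fallback(
    seq: str,
    fast: Predictor,
    slow: Predictor,
    cutoff: float = PLDDT_CUTOFF,
) -> Tuple[str, float, str]:
    """Run the fast predictor first; call the slow one only if mean pLDDT < cutoff."""
    model, plddt = fast(seq)
    if plddt >= cutoff:
        return model, plddt, "fast"
    slow_model, slow_plddt = slow(seq)
    # Assumption: keep whichever model scores higher after the rerun.
    if slow_plddt > plddt:
        return slow_model, slow_plddt, "slow"
    return model, plddt, "fast"
```

With this design the expensive alignment-based step runs only for the subset of representatives that the language-model predictor handles poorly.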

Annotation beyond the twilight zone of sequence similarity

Using sensitive sequence- and structure-based annotation methods, we identified homologues for nearly half of the Asgard archaeal protein clusters (Fig. 2a). Structure-based searches enhanced the detection of homologues (Fig. 2b), particularly for clusters with high divergence, recovering significant hits in the SwissProt database for 47% of clusters (n = 17,309) versus 29% using sequence homology detection (n = 10,681). Of note, almost half of the protein representatives with both a highly confident (sequence-based) cluster of orthologous genes (COG) and structural hit displayed less than 20% sequence identity to their best structure hit (n = 4,263; median 18.6%, IQR 14.2–28.0%), falling below the ‘twilight zone’ of sequence identity (the zone between 20% and 35% sequence identity where homology becomes challenging to predict with regular algorithms)24. To illustrate the ability of our approach to annotate protein clusters even in cases of low sequence identity, we recovered the recently discovered distant Asgard archaeal homologue of Vps29 (ref. 15), a component of the eukaryotic retromer and retriever complexes, with sequence similarity searches (best structure hit amino-acid identity = 27.5%, HHsearch P = 99.8%), as well as with local and global structural alignment (Foldseek E-value = 1.9 × 10^−20, DaliLite Z-score = 30; Fig. 2c). Extended analyses of these annotations, including domain-specific enrichment and sequence divergence patterns, are detailed in Supplementary Information (also see Supplementary Fig. 1).
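To make the identity thresholds concrete, the following sketch computes percent identity from a pairwise alignment and assigns it to a zone using the 20% and 35% boundaries given above. The choice of denominator (columns where neither sequence has a gap) is an assumption; other conventions exist and the tools used in this study may apply a different one.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def identity_zone(pid: float) -> str:
    """Classify percent identity relative to the 20-35% twilight zone."""
    if pid < 20.0:
        return "below twilight zone"
    if pid <= 35.0:
        return "twilight zone"
    return "above twilight zone"
```

Under this classification, the Vps29 example (27.5% identity) falls within the twilight zone, whereas the median identity of 18.6% reported above falls below it.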

Fig. 2: Structural information recovers significantly more eukaryotic best hits.

a, Workflow to annotate Asgard archaeal proteins based on homology using sequence and structural similarity (also see Supplementary Fig. 1). b, Venn diagram depicting the number of cluster-representative protein structures annotated using sequence homology detection with HHsearch against the COG/KOG database (orange) and structural similarity searches against AF2 SwissProt (violet). The intersection of both techniques is marked in pink. c, Structure prediction of the Vps29 Asgard archaeal representative (left), its most similar SwissProt prediction (right; cattle Vps29, Q3T0M0) and their overlay with the eukaryotic protein in violet (bottom). AA-identity, amino acid identity to best structure hit; P, HHsearch probability.

Expanding the repertoire of ESPs

Next, we used structure-based similarity searches to identify novel iESPs in Asgard archaea (Fig. 3a). We define an iESP as an Asgard archaeal protein structure that exhibits either exclusively eukaryotic hits, or a statistically significant overrepresentation of eukaryotic protein structures in (1) all hits or (2) the top 95% bit-score quantile of hits (Fig. 3b; Methods). This structure-based approach refines previous ESP classifications by incorporating a quantitative enrichment threshold rather than relying solely on presence/absence criteria. Unlike earlier definitions, which varied in their strictness or permissiveness, our method applies a standardized framework for assessing the overrepresentation of eukaryotic homologues for the investigated protein. This ensures that ESP identification remains systematic, biologically relevant and statistically justified.
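The enrichment criterion can be sketched as a one-sided Fisher's exact test, computed here as a hypergeometric upper tail with a Bonferroni correction. This is a minimal illustration under stated assumptions: the background is taken to be the fraction of eukaryotic structures in the searched database, and the function names, the significance level `alpha` and the number of tests `n_tests` are hypothetical. The exact procedure is described in Methods and may differ.

```python
from math import comb


def eukaryote_enrichment_p(euk_hits: int, total_hits: int,
                           euk_db: int, total_db: int) -> float:
    """One-sided Fisher's exact P value: P(X >= euk_hits) for
    X ~ Hypergeometric(total_db, euk_db, total_hits)."""
    upper = min(total_hits, euk_db)
    denom = comb(total_db, total_hits)
    tail = sum(
        comb(euk_db, k) * comb(total_db - euk_db, total_hits - k)
        for k in range(euk_hits, upper + 1)
    )
    return tail / denom


def is_iesp(euk_hits: int, total_hits: int, euk_db: int, total_db: int,
            n_tests: int, alpha: float = 0.05) -> bool:
    """Call an iESP if all hits are eukaryotic or eukaryotes are overrepresented."""
    if total_hits == 0:
        return False
    if euk_hits == total_hits:
        return True  # exclusively eukaryotic hits
    p = eukaryote_enrichment_p(euk_hits, total_hits, euk_db, total_db)
    # Bonferroni correction for the number of query structures tested
    return min(1.0, p * n_tests) < alpha
```

The same test could be applied either to all hits of a query or only to its top-scoring hits, corresponding to criteria (1) and (2) above.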

Fig. 3: Structure-guided identification of functionally diverse iESP structural clusters.

a, Workflow to cluster protein structures and identify iESPs. b, Identification of Asgard archaeal iESPs based on structural similarity. c, Bar chart summarizing the grouping of previously described ESP and newly identified iESP protein structures into structural clusters. d, Sankey diagram displaying functional categories of newly identified iESP clusters and clusters containing previously established ESPs. Categories are inferred from the EggNOG annotation of the best SwissProt hit. ‘Multiple’ indicates an association of a structural cluster with multiple functional categories. e, Subgraph of the protein structure similarity network, highlighting small GTPase (black outline) and Argonaute proteins. P, probability.

We identified 1,319 iESPs that have thus far not been identified as Asgard archaeal ESPs (Fig. 3b). Of note, we captured only 46% (611 proteins) of the 1,323 previously established Asgard archaeal ESPs, indicating that previous definitions for ESPs have been rather permissive (also see above; Fig. 3b and Supplementary Data 3). For example, 40 AsCOGs containing roadblock/LC7 domains were considered ESPs in a previous study, and Asgard archaeal proteins have been shown to form similar structures to their eukaryotic relatives35. However, only four Asgard archaeal roadblock/LC7 clusters (cog.000673, cog.000921, cog.006948 and cog.008459) are enriched in eukaryotes in our study. The marked change in coverage of previous ESPs is caused by our enrichment-based approach, which, rather than simply relying on sequence-based homology, is based on the overrepresentation of eukaryotic hits in structural similarity searches. Indeed, roadblock/LC7 domain (PF03259) containing proteins are common in prokaryotes with 24,892 and 2,494 such proteins encoded by bacterial and archaeal genomes, respectively, compared with 5,724 proteins in eukaryotes (Pfam database accessed 12 June 2024). While roadblock/LC7 domain proteins have important functions in eukaryotic cells, their widespread presence in prokaryotes suggests that previous studies may have overestimated the Asgard archaeal provenance of these proteins.

To reduce redundancy, and to obtain an overview of the structural connectivity within the (i)ESP landscape, we clustered the 37,223 predicted Asgard archaeal protein structures on the basis of their similarity, which we amalgamated into 19,775 structural clusters (Methods; Fig. 3a and Extended Data Fig. 2a). In total, the 1,319 newly identified iESPs and all 1,323 previously identified ESP protein structures are contained in 908 and 425 structural clusters (Fig. 3c), respectively, indicating that our structure-based approach more than triples the potential number of Asgard archaeal proteins that entered the eukaryotic stem lineage. A high-level functional assessment revealed remarkable differences between iESP and ESP structural clusters (Fig. 3d and Supplementary Data 3), despite the largely sparse distribution across Asgard archaeal genomes (Extended Data Figs. 3 and 4). For example, 64% of previously identified ESP clusters (336 of 425) have functions in cellular processing and signalling, including a hub of 59 clusters collectively encompassing 932 Asgard archaeal small GTPase protein representative structures (Fig. 3e), which are known to have undergone extensive duplication in both eukaryotes and Asgard archaea12,13,23,36,37. By contrast, only 28% of iESP clusters’ eukaryotic counterparts (258 of 908) are involved in cellular processing and signalling functions (when including clusters containing multiple functional categories). Among these, we identified a single cluster containing eight Argonaute-related Asgard archaeal iESPs (Extended Data Fig. 2). Argonautes are involved in DNA and RNA interference in prokaryotes and eukaryotes, respectively38. Recent studies indicate that some Asgard archaeal Argonautes appear to exhibit similar functions to their eukaryotic counterparts39,40. We obtained the best structural hits to eukaryotic AGO and PIWI proteins (Fig. 3e and Extended Data Fig. 2), illustrating their higher structural conservation despite their high level of sequence divergence38.

We also retrieved many iESP clusters specific to metabolism (Fig. 3d, n = 137), a category that was thus far poorly represented among previously found ESPs in Asgard archaea (n = 24; Extended Data Figs. 3 and 4). For example, we identified diverse iESPs, including best hits to proteins of the eukaryote-type mevalonate pathway (phosphomevalonate kinase, SwissProt accession: Q2KIU2), the oxygen-dependent degradation of prenylated proteins (PCYOX1, Q5R748), and reactive oxygen species defence (SOD1, P80566). As an outstanding feature, we identified many (n = 271) iESP clusters involved in information storage and processing functions, of which 169 are related to translation, ribosomal structure and biogenesis, a function in eukaryotes that is known to have an archaeal provenance41. iESPs identified within the latter functional category included best structural hits to eukaryotic elongation factor 1A lysine methyltransferase 1 (EEF1AKMT1, Q17QF2) and the malignant T-cell-amplified sequence 1 that is involved in translation re-initiation (MCT-1, Q2KIE4) (Supplementary Data 3). Altogether, our structure-based and functionally unbiased approach identified hundreds of new ESPs, bearing relevance for efforts to reconstruct the physiology and cell biological features of both extant Asgard archaea and the archaeal ancestor of eukaryotes.

iESPs indicate extended Asgard archaeal cellular complexity

The emergence of intricate cellular compartments has been a hallmark process of eukaryogenesis, yet the origins of many genes responsible for the formation of these compartments remain elusive42. To identify Asgard archaeal proteins potentially involved in cellular compartmentalization, we investigated iESPs with robust structural assignment but limited ‘twilight zone’ sequence similarity (Fig. 3d) and examined their relationship to their evolutionary eukaryotic counterparts. By using targeted sequence-based searches with iterative refinement guided by structural similarity, we could link several iESPs at the sequence level, after which we constructed multiple sequence alignments and performed phylogenetic analyses (Methods).

One of the eukaryotic complexes with a role in cell compartment biology and lacking a clear prokaryotic ancestry is the vault, the largest reported ribonucleoprotein complex conserved in diverse eukaryotes. This complex has been suggested to be involved in transport between cellular compartments, signal transmission, cellular stress protection and immune response43. Vaults are primarily composed of two symmetric cups, each consisting of 39 molecules of the major vault protein (MVP)44. While prokaryotic homologues of MVP have so far been described in only a few Bacteria45, we identified an Asgard archaeal protein structure with a reciprocal best hit to Xenopus laevis MVP (Q6PF69; Extended Data Fig. 5). In total, we found ten Asgard archaeal MVP homologues, half of which in our phylogenetic analysis affiliate with a clade including eukaryotic MVPs (Fig. 4a and Extended Data Fig. 5a). The representative Asgard archaeal MVP displays a predicted structure similar to the resolved rat MVP, including the cap helix, shoulder and repeat domains, even though the Asgard archaeal homologue contains only five instead of nine repeat domains present in the rat protein46 (Fig. 4b). While estimating multimeric stoichiometries remains a computationally challenging task in the absence of experimental data, here we used structural modelling to build a first model of the Asgard archaeal vault. Multimer structure modelling suggests a closed cup with ten Asgard archaeal MVP molecules (interface predicted template modelling score (ipTM) = 0.525, average pLDDT = 71.4; Extended Data Fig. 5). While the role of MVP homologues in Asgard archaea remains unknown, our findings support a prokaryotic—possibly Asgard archaeal—origin of eukaryotic MVP.

Fig. 4: Asgard archaeal protein complexes implicating cellular compartmentalization.

a–f, Asgard archaeal proteins related to eukaryotic MVPs (a–c) and COMMD-containing proteins (d–f). a, Phylogeny of prokaryotic and eukaryotic full-length MVPs. See Extended Data Fig. 5a for tree based only on the shoulder domain. b, Rat MVP complex46 next to Lokiarchaeial MVP (predicted structure) indicating the cap helix, shoulder and repeat domains (R). c, Biological assembly of the rat MVP cap (left) next to a multimer model of the Asgard archaeal homodecamer (right). d, Human COMMD2 next to Lokiarchaeial homologue indicating the HN and COMM domains. e, Phylogeny of prokaryotic and eukaryotic COMMD-containing proteins. f, Resolved human COMMD heterodecamer47 next to a multimer model of the Asgard archaeal homodecamer. g,h, Identification of Asgard archaeal iESPs of eukaryotic Ufm1 (g) and CINP (Hodarchaeales clade indicated with grey background) (h). Asgard archaeal query protein structure, best-scoring SwissProt target structural model and phylogenetic analysis of related protein sequences are indicated in the left, middle and right panel, respectively. Structural models exclude long terminal disordered regions. Additional data include Foldseek E-value, Dali Z-score, enrichment of eukaryotic structures (Fisher’s exact test, Bonferroni-corrected P value, ‘p-EukEnr’) and amino-acid identity to best structure hit (‘AA-identity’). Phylogenetic analyses highlight sequences for query and target structures, input MSA positions and substitution model. Scale bar, 1 amino acid substitution per position. Multimer model confidence measures (pLDDT, pTM and ipTM) are indicated. pTM, predicted template modelling score.

Another eukaryotic complex with an elusive origin is Commander. This complex is required for endosomal recycling of diverse transmembrane cargos and is composed of 16 subunits arranged into the CCC and retriever subcomplexes. While some retriever components have been reported in Asgard archaea before (Vps29, Fig. 2c; Vps35)47, the CCC (named after its components CCDC22, CCDC93 and COMMD) subunits, including the heterodecamer-forming COMMD proteins, thus far lacked prokaryotic homologues47. Our structure-based searches retrieved an Asgard archaeal iESP that displayed the characteristic COMMD protein structure, that is, an α-helical N-terminal (HN) domain and a C-terminal COMM domain48, while displaying extremely low sequence identity (8.5%) (Fig. 4d). Subsequent sensitive HMM-based searches yielded homologues in diverse Asgard archaea (Lokiarchaeales, Helarchaeales and Heimdallarchaeia) and some other prokaryotes. In our phylogenetic analysis, eukaryotic COMMD proteins (COMMD1-10) form a near-monophyletic group (Fig. 4e), confirming that eukaryote-specific gene duplications gave rise to the COMMD heterodecamer47,49. While our phylogenetic analyses failed to resolve the origin of eukaryotic COMMD, multimer modelling of an Asgard archaeal homologue suggests that 8, 10 or 12 molecules may form a homomultimeric complex with high confidence (homomultimeric n = 10 in Fig. 4f; ipTM = 0.889, pLDDT = 88.4; see other homomultimers in Extended Data Fig. 5d,e).

In addition to homologues of eukaryotic proteins involved in cellular compartmentalization, we newly identified some proteins uniquely shared between eukaryotes and Asgard archaea. Despite limited sequence similarity, Ubiquitin fold modifier 1 (Ufm1) exhibits structural similarities to ubiquitin50 and is implicated in DNA damage and endoplasmic reticulum stress responses, although it has not been characterized extensively51. We identified Ufm1 homologues in nine of the major Asgard archaeal clades, but not in any other prokaryote (Fig. 4g), indicating an Asgard archaeal provenance of Ufm1 in eukaryotes. Similarly, no prokaryotic homologues have yet been reported for the cyclin-dependent kinase 2-interacting protein (CINP), a component of the DNA replication complex involved in DNA damage control52,53 that was recently also implicated in eukaryotic ribosome biogenesis54. Our sequence similarity searches revealed that it is present in five major Asgard archaeal clades, but not in other prokaryotes. Phylogenetic analyses revealed that eukaryotic sequences are monophyletic and cluster with Hodarchaeal sequences with good support (Fig. 4h, UFBOOT: 99%), suggesting that eukaryotes inherited this protein from their Heimdallarchaeial ancestor15.

Discussion

This study leverages state-of-the-art structural prediction tools to uncover a broader spectrum of ESPs in Asgard archaea. Large-scale analyses of the protein structure universe are becoming powerful approaches to predicting the origins and functions of proteins beyond the capabilities of standard sequence-based homology searches55,56. Here, we explored the potential of these tools to gain insight into the archaeal provenance of the eukaryotic cell. By building and analysing a structural catalogue of the Asgard archaeal pangenome, we improved the annotation of Asgard archaeal proteins lacking significant sequence similarity. Our approach revealed many Asgard archaeal protein families, iESPs, that are structurally most similar to those of eukaryotes. As in previous studies that relied on sequence similarity searches to identify ESPs12,13,15,23, we identified iESPs involved in cellular processes and signalling, including many that participate in intracellular trafficking, secretion and vesicular transport. However, our extended analyses retrieved many iESPs involved in additional processes, such as information storage and processing. This observation is in line with the general conception that many eukaryotic proteins involved in translation, transcription, replication and DNA repair have an archaeal provenance57. Furthermore, we found that iESPs are also relatively enriched in metabolic functions, which contrasts with previous work indicating that metabolic functions in eukaryotes predominantly are of bacterial origin58,59. The underlying reason for this observation is unclear. Yet, in congruence with recent work showing that eukaryotic central carbon metabolic pathways are in part of Asgard archaeal origin60, these metabolic iESPs represent ancient homologues of eukaryotic proteins that have evolved beyond the limit of reliable sequence similarity detection. 
Given the scale of our dataset and the inclusion of high-confidence structure predictions independent of domain annotations, we anticipate that future studies will uncover novel domain architectures or previously uncharacterized folds among these proteins. Altogether, our analyses suggest that a thus far underappreciated fraction of the eukaryotic metabolic repertoire is of Asgard archaeal provenance. We point out that iESPs do not necessarily represent eukaryotic proteins that were directly inherited from Asgard archaea. Instead, they are Asgard proteins whose closest structural matches—often highly similar—are disproportionately found in eukaryotes. This pattern of enrichment suggests functional and evolutionary relevance, but not necessarily direct ancestry. Phylogenetic analyses to investigate the exact evolutionary relationship between iESPs and eukaryotic proteins are often hampered due to limited sequence similarity.

While several studies have revealed that some previously identified ESPs, such as small GTPases, actin homologues and several subunits of the endosomal sorting complex required for transport (ESCRT complex), are nearly universally distributed across Asgard archaeal genomes, many ESPs display a rather patchy distribution13,15,23. This patchiness is evident, for example, for Asgard archaeal homologues of adaptor proteins, Golgi-associated retrograde protein, homotypic fusion and protein sorting, and class C core vacuole/endosome tethering complexes15. A similar observation can be made for iESPs, which predominantly display patchy distribution patterns across Asgard archaeal taxa. These patchily distributed ESPs and iESPs probably represent ancient protein families that were already present in the Asgard archaeal lineage from which eukaryotes emerged, and were subject to multiple loss events or horizontal gene transfers among Asgard archaeal lineages. Overall, given their patchy distribution, combined with the evolutionary distance between present-day Asgard archaeal and eukaryotic proteins, it remains unclear to what extent Asgard archaeal iESPs are functionally equivalent to their eukaryotic counterparts. While structural conservation has been shown to be tightly linked to protein function, even at high levels of sequence divergence61, future studies are needed to corroborate the functions of Asgard archaeal iESPs and ESPs. Biochemical studies and high-resolution structural analyses will be crucial in determining whether these iESPs operate in cellular contexts analogous to their eukaryotic counterparts. Such efforts will provide deeper insights into the transitional features of eukaryotic common ancestors and refine our models of early eukaryotic evolution.

Methods

Genome dataset selection

Dataset assembly

To construct a representative initial dataset, we retrieved all publicly available Asgard genomes from the National Center for Biotechnology Information (NCBI)62 up to 6 October 2022. This collection also included the recently published Asgard archaeal MAGs from refs. 31,32. To ensure data quality, MAGs were evaluated using CheckM v1.2.1 (ref. 63). MAGs with estimated completeness below 50% or estimated contamination exceeding 10% were considered low quality and excluded from the initial dataset. Taxonomic classification of the initial dataset was conducted using GTDB-Tk v2.3.2 (ref. 64) with default parameters. The final dataset comprised 936 genomes (Supplementary Data 1) covering all known Asgard archaeal lineages. Gene prediction was performed using Prokka v1.14.6 (ref. 65) (options ‘--metagenome --kingdom Archaea’).

Phylogenomic inference of the species tree

To obtain an adequate outgroup dataset for inferring the phylogenetic relationships among the different Asgard archaeal lineages, we downloaded genus-level representatives of other archaeal lineages from the Genome Taxonomy Database (GTDB), release 214 (ref. 66). We based our selection on genome quality score (GQS), defined as GQS = completeness (%) − 5 × contamination (%), as described in ref. 67. In cases where two genomes had equal GQS, a random selection was made between the two. The final outgroup dataset included 311 genus-level representatives classified as members of the Thermoproteota (excluding Korarchaeia, to avoid artefacts derived from their uncertain affiliation68 and their strong thermophilic compositions15), Methanobacteria B and Hadarchaeota lineages.
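The selection of one representative per genus by GQS, with random tie-breaking, can be sketched as follows. The input tuple layout, the function names and the fixed random seed are assumptions made for illustration; only the GQS formula and the random choice between tied genomes come from the text.

```python
import random
from collections import defaultdict


def gqs(completeness: float, contamination: float) -> float:
    """Genome quality score: completeness (%) - 5 x contamination (%)."""
    return completeness - 5.0 * contamination


def pick_genus_representatives(genomes, seed: int = 0) -> dict:
    """Pick the highest-GQS genome per genus; break ties at random.

    genomes: iterable of (genome_id, genus, completeness, contamination).
    """
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    by_genus = defaultdict(list)
    for genome_id, genus, completeness, contamination in genomes:
        by_genus[genus].append((gqs(completeness, contamination), genome_id))
    representatives = {}
    for genus, scored in by_genus.items():
        best = max(score for score, _ in scored)
        tied = sorted(gid for score, gid in scored if score == best)
        representatives[genus] = rng.choice(tied)
    return representatives
```

Because contamination is weighted five-fold, a genome with 95% completeness and 2% contamination (GQS 85) ranks above one with 99% completeness and 4% contamination (GQS 79).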

To infer the species tree, we performed phylogenomic analysis based on 47 non-ribosomal proteins, which were selected from a set of 200 markers previously identified as core archaeal proteins69 (Supplementary Data 1). Homologous sequences within the final genome dataset were recruited using PSI-BLAST70 v2.10.0+ (‘-evalue 1e-10’). All recruited sequences per taxon per protein marker were selected, aligned using MAFFT L-INS-i71 v7.453, followed by trimming with trimAl72 v1.4.rev22 (‘-gt 0.5’) and removal of sequences with more than 60% gaps. We constructed the individual protein phylogenies using IQ-TREE73 v2.1.3, incorporating model selection from ModelFinder74. The best-fitting model was selected among the combination of the LG, Q.pfam and WAG models by adding the mixture model C20 with rate heterogeneity (+R4 or +G4) (‘-mset LG+C20,Q.pfam+C20,WAG+C20 -mrate G4,R4 -mfreq ""’). We assessed branch robustness for each marker with 1,000 ultrafast bootstraps75 and Shimodaira–Hasegawa-like approximate likelihood ratio tests (SH-aLRT)76. From the resulting phylogenies, we removed sequences indicative of contamination, paralogy or horizontal gene transfer events and realigned and trimmed the remaining sequences as described above. The curated alignments were then concatenated into a supermatrix containing 1,244 sequences. To mitigate effects related to compositional bias, we performed heterogeneous site removal using χ2 trimming77 where the 50% most heterogeneous sites were removed, resulting in an alignment of 8,068 amino acid positions. We inferred a species phylogeny for the χ2-trimmed alignment using ModelFinder within IQ-TREE v2.1.3 to select among the LG + C10, Q.pfam + C10 and WAG + C10 models and rate heterogeneity components (+R4 or +G4). A posterior mean site frequency (PMSF) approximation78 of the best-fitting model (WAG + C10 + R4) using the resulting tree was then employed to reconstruct a final tree with 100 non-parametric bootstrap pseudoreplicates.
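As a small illustration of one curation step above, the sketch below drops sequences with more than 60% gaps from a trimmed alignment. The dictionary representation of the alignment and the function name are assumptions; this is not the script used in the study.

```python
def drop_gappy_sequences(alignment: dict, max_gap_fraction: float = 0.6,
                         gap_char: str = "-") -> dict:
    """Keep aligned sequences whose gap fraction does not exceed max_gap_fraction.

    alignment: mapping of sequence name -> aligned sequence (equal lengths).
    """
    kept = {}
    for name, seq in alignment.items():
        if not seq:
            continue  # skip empty entries rather than divide by zero
        gap_fraction = seq.count(gap_char) / len(seq)
        if gap_fraction <= max_gap_fraction:
            kept[name] = seq
    return kept
```

A sequence with exactly 60% gaps is retained here, because the text specifies removal only for sequences with more than 60% gaps.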

Clustering and selection of representative protein sequences

The dataset of 936 Asgard archaeal genomes comprised 2.68 million proteins. We assigned AsCOG domains23 to 2.1 million Asgard archaeal proteins according to the best hit to an AsCOG member using MMseqs2 (ref. 79) with ‘-e 0.001’ and ‘-s 9’, requiring that at least 80% of the best hit be covered. Unassigned proteins (0.6 million) or protein fragments (0.2 million) of at least 60 amino acids were clustered de novo using MMseqs2 (ref. 80) v14.7e284 at 20% sequence identity and a coverage of 50%. We built sequence profiles for 14,467 AsCOGs (representing 2,084,964 proteins) and 22,846 de novo clusters (representing 448,812 proteins) with at least five members. To select an evolutionary representative sequence per cluster, we searched the members of these 37,313 clusters against their respective cluster profile using MMseqs2 (‘mmseqs search’), ranked them by bit-score and selected the highest-ranked sequence per cluster as the representative sequence79.
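
The final selection step, picking the member with the highest bit-score against its own cluster profile, can be sketched as below. The input is assumed to be already parsed into (cluster, member, bit-score) tuples; the function name and this layout are illustrative and do not reflect the MMseqs2 output format or the study's scripts.

```python
def select_representatives(profile_hits):
    """Return the highest-scoring member per cluster.

    profile_hits: iterable of (cluster_id, member_id, bit_score), where
    bit_score is the score of the member against its own cluster profile.
    On ties, the member seen first is kept (an assumption of this sketch).
    """
    best = {}
    for cluster_id, member_id, bit_score in profile_hits:
        current = best.get(cluster_id)
        if current is None or bit_score > current[1]:
            best[cluster_id] = (member_id, bit_score)
    return {cluster: member for cluster, (member, _) in best.items()}
```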

Protein structure prediction

Supplementing the ColabFold database with Asgard archaeal proteins

Protein structure prediction using AlphaFold2 (ref. 27) has been shown to generally perform poorly if few sequences can be aligned to the target sequence27. We therefore wanted to evaluate whether adding our Asgard archaeal protein dataset to the ‘genetic search’ workflow of ColabFold33, an accelerated adaptation of AlphaFold2, would increase overall prediction quality. To this end, we implemented a version of the ‘genetic search’ workflow of ColabFold that queries the Asgard archaeal protein dataset (‘enriched’) in addition to the default databases (‘default’). For the enriched workflow, we added a third MMseqs2 sequence search step against the Asgard archaeal protein database, run after the searches against the two default ColabFold databases and with the same parameters.

Comparing performance of structure prediction algorithms for an Asgard archaeon

To evaluate the performance of different structure prediction algorithms, as well as of the ColabFold ‘default’ versus the ‘enriched’ database, we created a test set of 100 Asgard archaeal proteins (Supplementary Data 2). These proteins were randomly selected from a reference Asgard archaeal proteome, P. syntrophicum, downloaded from UniProt (Proteome ID: UP000321408; accessed on 17 January 2023). We first predicted structural models from the primary sequences using the protein language model-based ESMFold v2.0.0 (ref. 34) with option ‘-r 12’. To measure the quality of predictions, we used the average pLDDT score, which ranges from 0 (low confidence) to 100 (high confidence). We considered predictions with an average pLDDT ≥80 as high-quality, as a compromise between the suggested pLDDT ≥90 for ‘high accuracy’ and pLDDT ≥70 for ‘general correct backbone’ according to ref. 27. Second, we generated multiple sequence alignments with the ‘genetic search’ module of ColabFold v1.3.0 (ref. 33) with the default and enriched databases, respectively. We then ran the ColabFold prediction workflow on each alignment using the default ‘exhaustive’ setting and an early stopping rule (‘early-stop’) designed to reduce computation time; specifically, the algorithm terminates if a pLDDT of at least 85 is reached or if the first prediction yields a pLDDT below 50 (‘--stop-at-score 85 --stop-at-score-below 50’). The ‘genetic search’ module was run on a computer equipped with two AMD EPYC 7H12 processors (64 cores each, 2.6 GHz, 280 W) and 1 TiB of memory, whereas the ‘prediction’ module was run on a system with four NVIDIA A100 graphics processing units (40 GiB HBM2 memory each).

Protein structure prediction workflow

Based on the highest ratio of high-quality predictions and the lowest computational resource demands for our 100 test proteins, we opted for a hybrid approach combining protein language model-based and multiple sequence alignment-based prediction algorithms. We first used ESMFold v2.0.0 (ref. 34) with ‘-r 12’ to calculate structural models for each representative Asgard archaeal protein. Second, structures with an average pLDDT <80 in ESMFold were predicted again using ColabFold v1.3.0 (ref. 33) with the enriched database and the ‘early-stop’ settings. For large proteins that could not be folded with ESMFold v2.0.0 and ColabFold v1.3.0 because of excessive memory demands, we attempted folding with ColabFold v1.5.2.
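
The routing logic of the hybrid workflow reduces to a threshold on the mean per-residue pLDDT. The sketch below illustrates that decision only; the function names and the in-memory input (a mapping from protein identifier to per-residue pLDDT values) are assumptions, and no structure prediction tool is invoked.

```python
def mean_plddt(per_residue_plddt):
    """Average pLDDT (0-100) over all residues of one predicted model."""
    if not per_residue_plddt:
        raise ValueError("empty pLDDT list")
    return sum(per_residue_plddt) / len(per_residue_plddt)


def split_by_confidence(first_pass_models, threshold=80.0):
    """Separate first-pass models into accepted and to-be-repredicted sets.

    first_pass_models: dict mapping protein_id -> list of per-residue pLDDT
    values from the language model-based first pass (hypothetical layout).
    Models with mean pLDDT below the threshold are flagged for a second,
    alignment-based prediction.
    """
    accepted, repredict = [], []
    for protein_id, plddt in first_pass_models.items():
        if mean_plddt(plddt) < threshold:
            repredict.append(protein_id)
        else:
            accepted.append(protein_id)
    return accepted, repredict
```

A model with a mean pLDDT of exactly 80 is accepted, consistent with the ≥80 definition of high quality and the <80 criterion for re-prediction.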

Structure similarity searches

Best structural hit annotation

We searched Asgard archaeal structures reciprocally against SwissProt predicted structures (downloaded 8 July 2022) using Foldseek v6.29e2557 (ref. 28) ‘foldseek search’ with ‘--max-seqs 10000’. To ensure robustness in structural comparisons, we used the default local structural alignment of Foldseek rather than relying on global fold similarity (for example, TM-score). This mitigates potential biases introduced by differences between ColabFold and ESMFold models, as functionally relevant local motifs remain detectable regardless of global conformational variations. We retained the highest bit-score non-overlapping hits along the query sequence to accommodate fusion proteins and checked for reciprocal best hits. We mapped the annotation of the SwissProt best hits to each query protein. In the same way, but unidirectionally, we searched Asgard archaeal structures against the Protein Data Bank and UniProt50 databases (downloaded 9 February 2023).
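
Retaining the highest bit-score non-overlapping hits along a query can be done greedily, as sketched below. The exact overlap rule used in the study is not stated, so this example assumes that any shared query position counts as an overlap; the function name and tuple layout are likewise illustrative.

```python
def non_overlapping_best_hits(hits):
    """Greedy selection of top-scoring hits that do not overlap on the query.

    hits: list of (target_id, query_start, query_end, bit_score) with
    1-based inclusive query coordinates (assumed layout).
    Returns the retained hits ordered by query start.
    """
    kept = []
    # Visit hits from best to worst score; keep a hit only if its query
    # interval is disjoint from every interval already kept.
    for hit in sorted(hits, key=lambda h: h[3], reverse=True):
        _, start, end, _ = hit
        if all(end < k_start or start > k_end for _, k_start, k_end, _ in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: h[1])
```

For a fusion protein whose N-terminal and C-terminal regions match different targets, both targets are retained, whereas a lower-scoring hit spanning the junction is discarded.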

EggNOG annotation of SwissProt best hits

Proteins representing the best SwissProt hits were mapped against EggNOG v5 (ref. 81) with the emapper user interface (http://eggnog-mapper.embl.de/) with default parameters, and we extracted root non-supervised orthologous group (NOG) and eukaryotic NOG identifiers and functional categories.

Identification of eukaryotic hit-enriched structures

For each Asgard archaeal predicted structure, we collected the best 10,000 hits among the predicted UniProt50 structures (downloaded 9 February 2023). This database contains proteins from all domains of life, ensuring that our ESP identification pipeline inherently considers homologues across bacteria, archaea and eukaryotes. Per Asgard archaeal protein representative, we performed a one-tailed Fisher’s exact test with the function ‘fisher.test’ and the ‘alternative=less’ parameter, followed by Bonferroni correction with the function ‘p.adjust’, in R v4.2.1 (ref. 82) on the domain-level taxonomy of hit UniProt proteins to test for a statistical enrichment in eukaryotic sequences. To test for eukaryotic enrichment in only the most similar proteins, we also performed the same statistical test using only the top 5% bit-score percentile of the hits. Structures with an enrichment in hits to eukaryotic proteins were classified as candidate isomorphic (i)ESPs, that is, proteins that look structurally similar to proteins that are overrepresented in eukaryotes. We clustered all Asgard archaeal structures with Foldseek ‘foldseek cluster’ into clusters of isomorphic protein structures and identified the structural clusters contributed uniquely by iESPs.
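
For readers unfamiliar with the test, the sketch below computes a one-tailed Fisher's exact P value from the hypergeometric distribution and applies a Bonferroni correction. It is written in Python rather than R, and several points are assumptions of this example rather than details from the study: the layout of the 2 × 2 table (hits versus a background, eukaryotic versus other), the background counts themselves and the function names. Which tail of R's ‘fisher.test’ corresponds to enrichment depends on how the table is oriented, which is not specified in the text; the code tests the upper tail for the orientation given in its docstring.

```python
from math import comb


def fisher_enrichment_p(euk_hits, other_hits, euk_background, other_background):
    """One-tailed Fisher's exact test for over-representation.

    Table (assumed orientation):
                    eukaryotic        other
        hits        euk_hits          other_hits
        background  euk_background    other_background

    Returns P(X >= euk_hits) under the hypergeometric null.
    """
    n_hits = euk_hits + other_hits
    n_euk = euk_hits + euk_background
    total = n_hits + euk_background + other_background
    upper = min(n_hits, n_euk)
    numerator = sum(
        comb(n_euk, k) * comb(total - n_euk, n_hits - k)
        for k in range(euk_hits, upper + 1)
    )
    return numerator / comb(total, n_hits)


def bonferroni(p_values):
    """Bonferroni adjustment: multiply by the number of tests, cap at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]
```

With one test per Asgard archaeal representative, the Bonferroni factor equals the number of representatives tested, which makes the correction conservative.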

NCBI COG and KOG annotation of gene families

We created multiple sequence alignments for each Asgard archaeal protein cluster using FAMSA v2.2.2 (ref. 83) with ‘-refine_mode on’. We performed profile–profile searches with the HHsuite3 (ref. 84) program HHsearch v3.3.0 with parameters ‘-glob -M 50’ against the profile COG–eukaryotic orthologous groups (KOG) database (ftp://ftp.tuebingen.mpg.de/pub/protevo/toolkit/databases/hhsuite_dbs/COG_KOG.tar.gz)85.

Mapping of ESPs described by Eme et al. (2023)

To identify conserved protein domains in the proteomes of the Asgard archaeal dataset, we used InterProScan v5.57-90.0 (ref. 86) with default parameters and using hidden Markov models (HMMs) from the databases AntiFam v7.0 (ref. 87), CDD v3.18 (ref. 88), Coils v2.2.1 (ref. 89), Gene3D v4.3.0 (ref. 90), MobiDBLite v2.0 (ref. 91), PANTHER v15.0 (ref. 92), Pfam v35.0 (ref. 93), PIRSF v3.10 (ref. 94), PRINTS v42.0 (ref. 95), SFLD v4 (ref. 96), SMART v7.1 (ref. 97), SUPERFAMILY v1.75 (ref. 98) and TIGRFAM v15.0 (ref. 99).

We then identified the AsCOG and de novo cluster protein domains containing at least 80% of the length of a Pfam or InterPro domain reported as an ESP15.

Phylogenetic inferences of iESPs

iESP selection

To illustrate how iESPs confer information about the origins of eukaryotic functions and their proteins, we selected several iESPs for phylogenetic analysis, based on the following criteria: the Asgard archaeal query structure is well covered (>80% of protein length) by its alignment to its best structure hit; the best (eukaryotic) structure hit reciprocally has the Asgard archaeal query structure as its best hit; eukaryotic structures are overrepresented among the hits (Fig. 3b); the eukaryotic hit structures are consistent (are evidently homologous to one another); they comprise eukaryote-relevant functions; neither the query nor the hit appears to embody particularly complex evolutionary histories (for example, they do not contain repeat domains or highly composite multidomain architectures); and the Asgard archaeal query is unlikely to represent contamination, as it is found in more than one Asgard archaeal taxon. Finally, we required that the candidates lack a well-scoring sequence-based hit to eukaryotic sequences, as determined by HHsearch; consequently, they fall into the ‘twilight zone’ of sequence homology (Fig. 3c).

Establishing remote sequence similarity between iESP and eukaryotic structure hits

Subsequently, we found that the iESPs, although divergent, retain sequence signals that connect them to the eukaryotic proteins they match structurally. To establish this, we gradually expanded the homologue set of each iESP via manually supervised, iterative profile HMM searches. In each round, we checked the newly hit proteins before adding them to the multiple sequence alignment, ensuring that they were genuine homologues by inspecting both their sequences and (predicted) protein structures. We executed these profile HMM-based searches using online tools (HHpred and the HMMER web server) as well as local hmmsearch runs against our local databases (see description below). Note that, in addition to eukaryotic and Asgard archaeal sequences, we included bacterial and other archaeal sequences in the search database, as they may also have homologues that could help link the iESP and related Asgard archaeal sequences to their eukaryotic structural hits.

Selecting homologues for phylogenetic inference

We made use of three sequence datasets for retrieving sequences for the phylogenetic analysis. First, we subsampled our in-house Asgard archaeal set, including only a single representative protein set per species. This representative for a given species was selected based on the quality of the predicted proteomes, as reflected by their predicted completeness and contamination, measured by CheckM63. Note that ‘species’ here signifies groups of genomes that can be clustered at the 95% average nucleotide identity level. Second, we used a subsampled version of an in-house eukaryotic dataset100, including 25 eukaryotic taxa of all of the major eukaryotic groups, taking the taxon with the best, most complete, predicted proteome quality, as measured by BUSCO101. Third, we used a subsampled version of GTDB (r207)66, from which we first removed the Asgard archaea and then selected the best assembly for each family, also based on the CheckM quality parameters. Using the final, most inclusive yet accurate profile HMM obtained, and our manually determined bit-score cut-offs (described above), we ran hmmsearch against these three datasets and retrieved all sequences meeting the cut-off. Because COMMD and CINPL comprised virtually full-length hits, both in the structural comparisons and in our sequence similarity searches, we extracted the entire protein sequence of each hit protein. For Ufm1, we observed that some hits in our sequence searches were not full-length, and others contained multiple hit regions; in these cases, we extracted only the protein segment corresponding to the best-scoring hit. For the MVP, in addition to the smaller full-length phylogeny (Fig. 4a), we performed a broader phylogenetic analysis of the shoulder domain only, which is a type of Band 7 domain found in many prokaryotic and eukaryotic proteins45,102; these proteins are united in the SPFH (for stomatins, prohibitins, flotillins and HflK/C) family ‘clan’ (https://www.ebi.ac.uk/interpro/set/pfam/CL0433/entry/pfam/).

Phylogenetic analysis and annotation of the phylogeny

For each family, we inferred gene trees using multiple sequence alignments generated by MAFFT (v7.505, mode L-INS-i)71 and the web server of PROMALS3D103. For the latter, we used the default options, except for detecting and using homologues with three-dimensional structures (included DaliLite v5 (ref. 104)), pairwise alignments between input three-dimensional structures (included DaliLite) and aligning sequences within groups in the first alignment stage (PROMALS instead of MAFFT). We supplemented PROMALS3D with predicted protein structures from diverse sequences in the AlphaFold Protein Structure Database, as well as with structures from our own predictions (described above), including those of the iESPs and, where available, other Asgard archaeal homologues. Before inferring the gene trees, we trimmed the multiple sequence alignment using BMGE v1.12 (settings: ‘-m BLOSUM30 --h 0.6 -g 0.7 -b 3’)105, which selects good-quality aligned positions. However, in some cases (for example, the COMMD MAFFT alignment), this produced very short alignments, prompting us to switch to trimAl (v1.4.1, mode ‘gappyout’)72. For phylogenetic inference in a maximum-likelihood framework, we used IQ-TREE v2.0.3 (settings ‘-B 1000 -m MFP -mset LG,JTT,Q.pfam,WAG,LG+C20,LG+C40,LG+C60,LG+C20+R+F,LG+C40+R+F,LG+C60+R+F,WAG+C20,WAG+C40,WAG+C60,WAG+C20+R+F,WAG+C40+R+F,WAG+C60+R+F,JTT+C20,JTT+C40,JTT+C60,JTT+C20+R+F,JTT+C40+R+F,JTT+C60+R+F,Q.pfam+C20,Q.pfam+C40,Q.pfam+C60,Q.pfam+C20+R+F,Q.pfam+C40+R+F,Q.pfam+C60+R+F’)73 to first select the best evolutionary model using ModelFinder74 and then infer a phylogeny with 1,000 ultrafast bootstraps75. For each iESP/family, we subsequently selected the most informative and probably most accurate phylogeny, which entailed selecting the alignment algorithm post hoc (MAFFT L-INS-i versus PROMALS3D) on the basis of ultrafast bootstrap support values at key branches and the monophyly of expected monophyletic sequence groups. We coloured the branches in the tree according to the species group the sequences belong to: Eukaryota, Asgard archaea, Archaea (other) and Bacteria. We also annotated the eukaryotic clades with the names of their proteins, specifically labelling each clade reflecting a single gene in the last eukaryotic common ancestor. Trees were visualized using iTOL106.

Visual representation of protein structures

Structural models were either visualized in ChimeraX v1.6.1 (Fig. 4b–f)107 or in R with the ‘r3dmol’ package v0.1.2 (Fig. 4g,h) (https://github.com/swsoyee/r3dmol)108.

Statistics and reproducibility

No statistical method was used to predetermine sample size. No data were excluded from the analyses. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment. For benchmarking structure prediction methods (Extended Data Fig. 1b–d), 100 proteins were randomly sampled from the proteome of P. syntrophicum (UniProt ID: UP000321408; Supplementary Data 2). Each protein was evaluated once per prediction condition; no technical replicates were performed. This sample size was selected to provide a representative yet computationally feasible comparison.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.