Prediction of eukaryotic cellular complexity in Asgard archaea using structural modelling

Köstlbacher, Stephan; van Hooff, Jolien J. E.; Panagiotou, Kassiani; Tamarit, Daniel; De Anda, Valerie; Appler, Kathryn E.; Baker, Brett J.; Ettema, Thijs J. G.

doi:10.1038/s41564-026-02273-y


Article
Open access
Published: 05 March 2026

Nature Microbiology volume 11, pages 747–758 (2026)

Abstract

Asgard archaea played a key role in the origin of the eukaryotic cell, with extant genomes encoding relatives of diverse eukaryotic signature proteins (ESPs) involved in cellular organization. However, their often punctuated distribution and the absence of detectable homologues for many eukaryotic proteins limit our ability to reconstruct the cellular complexity of the Asgard archaeal ancestor of eukaryotes. Here we used de novo protein structure modelling and sequence similarity detection across an expanded Asgard archaeal genomic dataset to build a structural catalogue of the Asgard archaeal pangenome. We identified 908 ‘isomorphic’ ESPs—Asgard archaeal proteins with statistically enriched structural matches to eukaryotic proteins, often bridging deep sequence divergence. These isomorphic ESPs are enriched in information storage and processing roles and contain key components of the eukaryotic Vault (MVP) and Commander (COMMD) complexes, with potential roles in cellular compartmentalization and endosomal processing. These findings expand the repertoire of eukaryotic-like proteins in Asgard archaea and suggest a higher degree of eukaryote-like cellular complexity in the archaeal ancestor of eukaryotes.

Main

The origin of the eukaryotic cell, with its complex and compartmentalized features, is regarded as the biggest evolutionary discontinuity since the advent of cellular life on Earth^1,2. Yet, many key details regarding eukaryogenesis, the series of evolutionary events that led to the emergence of the eukaryotic cell from prokaryotic ancestors some 2 billion years ago^3,4, remain elusive. The eukaryotic cell is the result of a symbiosis comprising an archaea-related host cell^5,6 and a bacterial endosymbiont, the mitochondrial progenitor^7,8. While the identity of the endosymbiont was traced back to the Alphaproteobacteria several decades ago^9,10, the archaeal host remained elusive until recently. This changed with the discovery of Asgard archaea (phylum Asgardarchaeota¹¹), which were shown to represent the closest prokaryotic relatives of the archaeal host cell from which eukaryotes evolved^12,13,14,15. Analysis of Asgard archaeal genomes revealed the presence of numerous homologues of proteins previously deemed eukaryote-specific—so-called eukaryotic signature proteins (ESPs)¹⁶. Intriguingly, many of these ESPs represent fundamental building blocks of eukaryotic cellular complexity, including proteins essential for vesicular biogenesis and trafficking, as well as for the dynamic eukaryotic cytoskeleton. Recent work has indicated that several Asgard archaeal ESPs function similarly to their eukaryotic counterparts^17,18,19,20, suggesting that Asgard archaea might display eukaryote-like cellular features beyond the dynamic actin cytoskeleton observed in the first enrichment cultures^21,22. However, the detailed cellular characteristics and level of complexity of present-day Asgard archaea and the Asgard archaeal ancestor of eukaryotes remain unclear.

The definition, identification and characterization of ESPs are crucial for reconstructing the ancestral Asgard archaeal lineage and understanding its contributions to eukaryogenesis. Yet, the identification process is currently limited by several factors. Defining ESPs has proven challenging as increasingly sensitive homology search algorithms and improved sampling of genomic diversity across the tree of life have facilitated the discovery of ESP homologues in diverse prokaryotes^13,15,23, including Asgard archaea^12,13,15,23. While this has clarified the prokaryotic origins of many proteins in the last eukaryotic common ancestor, it has also reduced the set of strictly eukaryote-specific proteins. Therefore, a more relaxed definition of ESPs has been adopted, referring to proteins associated with conserved key eukaryotic processes⁶, or more specifically related to cellular complexity²². Furthermore, many eukaryotic proteins, especially those absent in common model organisms, remain poorly characterized. This, coupled with the limitations of sequence homology detection, makes it difficult to identify ESPs. Given the extensive divergence between present-day Asgard archaeal and eukaryotic proteins, reliable homology detection remains challenging. It becomes increasingly difficult to infer homology between two proteins with decreasing sequence similarity²⁴. As the stem separating eukaryotes from their archaeal relatives represents one of the longest branches in the tree of life^14,15, sequences from present-day Asgard archaea and eukaryotes have diverged extensively. Therefore, homology between these two groups might not even be detected, even when using sensitive methods²⁴. However, protein structure is several times more conserved than protein sequence²⁵, and structural information has been shown to increase the sensitivity of sequence homology inference²⁶. 
Recent advances in de novo protein structure prediction using AlphaFold²⁷ and related tools enable the large-scale generation of high-quality protein structure models. Combined with new methods to efficiently search large databases for similar structures²⁸, it has become feasible to identify highly divergent homologues by using structural information^29,30. This is particularly useful for non-model organisms, for which very few protein structures have been resolved. For example, the Protein Data Bank currently contains fewer than 50 Asgard archaeal protein structures (accessed on 31 March 2025).

Here, we explore these recent advances in protein structure prediction and comparison tools to expand the identification and characterization of ESPs in Asgard archaea beyond sequence similarity. By analysing an extended Asgard archaeal pangenome, we identified 908 new structure-based ‘isomorphic’ ESPs (iESPs), more than tripling the overall number of reported Asgard archaeal ESPs. Our structural catalogue of the Asgard archaeal pangenome reveals a marked increase of Asgard archaeal ESPs involved in information storage and processing, and in cellular processes and signalling, suggesting that the archaeal ancestor of eukaryotes was more eukaryote-like than was previously assumed.

Results

Structural modelling of the Asgard archaeal pangenome

To generate structural models of representative proteins encoded by the Asgard archaeal pangenome, we analysed a diverse set comprising 936 Asgard archaeal draft genomes (Fig. 1a and Supplementary Data 1), including 497 metagenome-assembled genomes (MAGs) that were compiled and described in a recent study³¹. In addition to the previously sampled Asgard archaeal diversity^11,15, this expanded dataset encompasses MAGs from Atabeyarchaeia³² and Ranarchaeia³¹, two additional deep-branching clades (Extended Data Fig. 1a). We grouped protein sequences encoded by these Asgard archaeal genomes by combining reference-based clustering into previously established Asgard archaeal clusters of orthologous genes (AsCOGs)²³ with de novo gene clustering (Fig. 1b). This resulted in 96% of Asgard archaeal proteins grouped in 37,313 clusters of at least 5 proteins, including 22,609 de novo clusters (Fig. 1b). For computational feasibility, we selected one evolutionary representative protein sequence per cluster (Methods) to generate a high-quality structural model (Fig. 1c).

**Fig. 1: Modelling the Asgard archaeal structural pangenome.**

To determine an efficient and effective approach for de novo structure prediction, we modelled structures for 100 randomly selected proteins of the Asgard archaeon Prometheoarchaeum syntrophicum (Supplementary Data 2). As AlphaFold relies on homology information to predict protein structure, it tends to perform poorly if few homologues are found within its reference sequence database²⁷. To address this issue, we used ColabFold³³, an accelerated AlphaFold workflow, with an expanded database containing all available Asgard archaeal protein sequences. In addition, we used ESMfold³⁴, a prediction tool based on a protein language model that circumvents the time-consuming sequence homology search. We classified predictions as high quality if they had an average predicted local distance difference test (pLDDT) score of at least 80. We found that incorporating the Asgard archaeal proteins into the ColabFold homology search database led to better models for some proteins (Fig. 1d and Extended Data Fig. 1b). Overall, we obtained the most high-quality structure predictions when combining protein language model and sequence alignment-based techniques (Extended Data Fig. 1c). To optimize workflow efficiency, we predicted structures for each representative protein sequence using the fast ESMfold algorithm, and only if the average pLDDT score was below 80 did we employ the more time-consuming ColabFold method (Fig. 1c and Extended Data Fig. 1c,d). This approach resulted in 37,223 predicted structures with a median pLDDT of 82 (interquartile range (IQR) 71–86), covering 99.8% of all clusters (Fig. 1e).
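The two-tier strategy described above can be sketched as follows. This is an illustration only, not the pipeline used in the study: `fast_predictor` and `slow_predictor` are hypothetical callables standing in for ESMfold and ColabFold, each assumed to return per-residue pLDDT values, and keeping the higher-scoring model after the fallback is an assumption, as the text does not state which model is retained.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

HIGH_QUALITY_PLDDT = 80.0  # average pLDDT threshold stated in the text

# Hypothetical predictor type: sequence in, per-residue pLDDT values out.
Predictor = Callable[[str], Sequence[float]]


def predict_tiered(
    sequence: str,
    fast_predictor: Predictor,
    slow_predictor: Predictor,
    threshold: float = HIGH_QUALITY_PLDDT,
) -> Tuple[str, float]:
    """Run the fast predictor first; fall back to the slow one only if needed.

    Returns the tier that supplied the retained model and its mean pLDDT.
    """
    fast_score = mean(fast_predictor(sequence))
    if fast_score >= threshold:
        # Fast model is already high quality, so the slow step is skipped.
        return "fast", fast_score
    slow_score = mean(slow_predictor(sequence))
    # Assumption: retain whichever model has the higher mean pLDDT.
    if slow_score > fast_score:
        return "slow", slow_score
    return "fast", fast_score
```

With stub predictors, a sequence whose fast model averages 85 never triggers the slow predictor, whereas one averaging 65 does.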

Annotation beyond the twilight zone of sequence similarity

Using sensitive sequence and structure-based annotation methods, we identified homologues for nearly half of the Asgard archaeal protein clusters (Fig. 2a). Structure-based searches enhanced the detection of homologues (Fig. 2b), particularly for clusters with high divergence, recovering significant hits in the SwissProt database for 47% of clusters (n = 17,309) versus 29% using sequence homology detection (n = 10,681). Of note, almost half of the protein representatives with both a highly confident (sequence-based) cluster of orthologous genes (COG) and structural hit displayed less than 20% sequence identity to their best structure hit (n = 4,263; median 18.6%, IQR 14.2–28.0%), falling below the ‘twilight zone’ of sequence identity (the zone between 20% and 35% sequence identity where homology becomes challenging to predict with regular algorithms)²⁴. To illustrate the ability of our approach to annotate protein clusters even in cases of low sequence identity, we recovered the recently discovered distant Asgard archaeal homologue of Vps29 (ref. ¹⁵), a component of the eukaryotic retromer and retriever complexes, with sequence similarity searches (best structure hit amino-acid identity = 27.5%, HHsearch probability = 99.8%), as well as with local and global structural alignment (Foldseek E-value = 1.9 × 10⁻²⁰, DaliLite Z-score = 30; Fig. 2c). Extended analyses of these annotations, including domain-specific enrichment and sequence divergence patterns, are detailed in Supplementary Information (also see Supplementary Fig. 1).

**Fig. 2: Structural information recovers significantly more eukaryotic best hits.**

Expanding the repertoire of ESPs

Next, we used structure-based similarity searches to identify novel iESPs in Asgard archaea (Fig. 3a). We define an iESP as an Asgard archaeal protein structure that exhibits either exclusively eukaryotic hits, or a statistically significant overrepresentation of eukaryotic protein structures in (1) all hits or (2) the top 95% bit-score quantile of hits (Fig. 3b; Methods). This structure-based approach refines previous ESP classifications by incorporating a quantitative enrichment threshold rather than relying solely on presence/absence criteria. Unlike earlier definitions, which varied in their strictness or permissiveness, our method applies a standardized framework for assessing the overrepresentation of eukaryotic homologues for the investigated protein. This ensures that ESP identification remains systematic, biologically relevant and statistically justified.
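The enrichment criterion can be illustrated with a one-sided hypergeometric test. This is a minimal sketch only: the statistical procedure actually used is specified in Methods, so the choice of test, the `alpha` cut-off, the absence of multiple-testing correction and all function names below are assumptions made for illustration.

```python
from math import comb


def hypergeom_upper_tail(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when drawing n items without replacement from N items,
    K of which are 'successes' (here: eukaryotic database entries)."""
    upper = min(n, K)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


def looks_like_iesp(
    n_hits: int,
    n_euk_hits: int,
    db_size: int,
    db_euk: int,
    alpha: float = 0.05,  # assumed significance cut-off
) -> bool:
    """Flag a query if its hits are exclusively eukaryotic or if eukaryotic
    hits are over-represented relative to the database composition."""
    if n_hits == 0:
        return False
    if n_euk_hits == n_hits:
        return True  # exclusively eukaryotic hits
    return hypergeom_upper_tail(n_euk_hits, n_hits, db_euk, db_size) < alpha
```

The same function can be applied either to all hits of a query or to a subset restricted to the highest bit scores, mirroring criteria (1) and (2) in the definition.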

**Fig. 3: Structure-guided identification of functionally diverse iESP structural clusters.**

We identified 1,319 iESPs that have thus far not been identified as Asgard archaeal ESPs (Fig. 3b). Of note, we captured only 46% (611 proteins) of the 1,323 previously established Asgard archaeal ESPs, indicating that previous definitions for ESPs have been rather permissive (also see above; Fig. 3b and Supplementary Data 3). For example, 40 AsCOGs containing roadblock/LC7 domains were considered ESPs in a previous study, and Asgard archaeal proteins have been shown to form similar structures to their eukaryotic relatives³⁵. However, only four Asgard archaeal roadblock/LC7 clusters (cog.000673, cog.000921, cog.006948 and cog.008459) are enriched in eukaryotes in our study. The marked change in coverage of previous ESPs is caused by our enrichment-based approach, which, rather than simply relying on sequence-based homology, is based on the overrepresentation of eukaryotic hits in structural similarity searches. Indeed, roadblock/LC7 domain (PF03259) containing proteins are common in prokaryotes with 24,892 and 2,494 such proteins encoded by bacterial and archaeal genomes, respectively, compared with 5,724 proteins in eukaryotes (Pfam database accessed 12 June 2024). While roadblock/LC7 domain proteins have important functions in eukaryotic cells, their widespread presence in prokaryotes suggests that previous studies may have overestimated the Asgard archaeal provenance of these proteins.

To reduce redundancy, and to obtain an overview of the structural connectivity within the (i)ESP landscape, we clustered the 37,223 predicted Asgard archaeal protein structures on the basis of their similarity, which we amalgamated into 19,775 structural clusters (Methods; Fig. 3a and Extended Data Fig. 2a). In total, the 1,319 newly identified iESPs and all 1,323 previously identified ESP protein structures are contained in 908 and 425 structural clusters (Fig. 3c), respectively, indicating that our structure-based approach more than triples the potential number of Asgard archaeal proteins that entered the eukaryotic stem lineage. A high-level functional assessment revealed remarkable differences between iESP and ESP structural clusters (Fig. 3d and Supplementary Data 3), despite the largely sparse distribution across Asgard archaeal genomes (Extended Data Figs. 3 and 4). For example, 64% of previously identified ESP clusters (336 of 425) have functions in cellular processing and signalling, including a hub of 59 clusters collectively encompassing 932 Asgard archaeal small GTPase protein representative structures (Fig. 3e), which are known to have undergone extensive duplication in both eukaryotes and Asgard archaea^12,13,23,36,37. By contrast, only 28% of iESP clusters’ eukaryotic counterparts (258 of 908) are involved in cellular processing and signalling functions (when including clusters containing multiple functional categories). Among these, we identified a single cluster containing eight Argonaute-related Asgard archaeal iESPs (Extended Data Fig. 2). Argonautes are involved in DNA and RNA interference in prokaryotes and eukaryotes, respectively³⁸. Recent studies indicate that some Asgard archaeal Argonautes appear to exhibit similar functions to their eukaryotic counterparts^39,40. We obtained the best structural hits to eukaryotic AGO and PIWI proteins (Fig. 3e and Extended Data Fig. 2), illustrating their higher structural conservation despite their high level of sequence divergence³⁸.

We also retrieved many iESP clusters specific to metabolism (Fig. 3d, n = 137), which was thus far poorly represented among previously found ESPs in Asgard archaea (n = 24; Extended Data Figs. 3 and 4). For example, we identified diverse iESPs, including best hits to proteins of the eukaryote-type mevalonate pathway (phosphomevalonate kinase, SwissProt accession: Q2KIU2), the oxygen-dependent degradation of prenylated proteins (PCYOX1, Q5R748), and reactive oxygen species defence (SOD1, P80566). As an outstanding feature, we identified many (n = 271) iESP clusters involved in information storage and processing functions, of which 169 are related to translation, ribosomal structure and biogenesis, a function in eukaryotes that is known to have an archaeal provenance⁴¹. iESPs identified within the latter functional category included best structural hits to eukaryotic elongation factor 1A lysine methyltransferase 1 (EEF1AKMT1, Q17QF2) and the malignant T-cell-amplified sequence 1 that is involved in translation re-initiation (MCT-1, Q2KIE4) (Supplementary Data 3). Altogether, our structure-based and functionally unbiased approach identified hundreds of new ESPs, bearing relevance for efforts to reconstruct the physiology and cell biological features of both extant Asgard archaea and the archaeal ancestor of eukaryotes.

iESPs indicate extended Asgard archaeal cellular complexity

The emergence of intricate cellular compartments has been a hallmark process of eukaryogenesis, yet the origins of many genes responsible for the formation of these compartments remain elusive⁴². To identify Asgard archaeal proteins potentially involved in cellular compartmentalization, we investigated iESPs with robust structural assignment but limited ‘twilight zone’ sequence similarity (Fig. 3d) and examined their relationship to their evolutionary eukaryotic counterparts. By using targeted sequence-based searches with iterative refinement guided by structural similarity, we could link several iESPs at the sequence level, after which we constructed multiple sequence alignments and performed phylogenetic analyses (Methods).

One of the eukaryotic complexes with a role in cell compartment biology and lacking a clear prokaryotic ancestry is the vault, the largest reported ribonucleoprotein complex conserved in diverse eukaryotes. This complex has been suggested to be involved in transport between cellular compartments, signal transmission, cellular stress protection and immune response⁴³. Vaults are primarily composed of two symmetric cups, each consisting of 39 molecules of the major vault protein (MVP)⁴⁴. While prokaryotic homologues of MVP have so far been described in only a few Bacteria⁴⁵, we identified an Asgard archaeal protein structure with a reciprocal best hit to Xenopus laevis MVP (Q6PF69; Extended Data Fig. 5). In total, we found ten Asgard archaeal MVP homologues, half of which in our phylogenetic analysis affiliate with a clade including eukaryotic MVPs (Fig. 4a and Extended Data Fig. 5a). The representative Asgard archaeal MVP displays a predicted structure similar to the resolved rat MVP, including the cap helix, shoulder and repeat domains, even though the Asgard archaeal homologue contains only five instead of nine repeat domains present in the rat protein⁴⁶ (Fig. 4b). While estimating multimeric stoichiometries remains a computationally challenging task in the absence of experimental data, here we used structural modelling to build a first model of the Asgard archaeal vault. Multimer structure modelling suggests a closed cup with ten Asgard archaeal MVP molecules (interface predicted template modelling score (ipTM) = 0.525, average pLDDT = 71.4; Extended Data Fig. 5). While the role of MVP homologues in Asgard archaea remains unknown, our findings support a prokaryotic—possibly Asgard archaeal—origin of eukaryotic MVP.

**Fig. 4: Asgard archaeal protein complexes implicating cellular compartmentalization.**

Another eukaryotic complex with an elusive origin is Commander. This complex is required for endosomal recycling of diverse transmembrane cargos and is composed of 16 subunits arranged into the CCC and retriever subcomplexes. While some retriever components have been reported in Asgard archaea before (Vps29, Fig. 2c; Vps35)⁴⁷, the CCC (named after its components CCDC22, CCDC93 and COMMD) subunits, including the heterodecamer-forming COMMD proteins, thus far lacked prokaryotic homologues⁴⁷. Our structure-based searches retrieved an Asgard archaeal iESP that displayed the characteristic COMMD protein structure, that is, an α-helical N-terminal (HN) and a C-terminal COMMD domain⁴⁸, while displaying extremely low sequence identity (8.5%) (Fig. 4d). Subsequent sensitive HMM-based searches yielded homologues in diverse Asgard archaea (Lokiarchaeales, Helarchaeales and Heimdallarchaeia) and some other prokaryotes. In our phylogenetic analysis, eukaryotic COMMD proteins (COMMD1-10) form a near-monophyletic group (Fig. 4e), confirming that eukaryote-specific gene duplications gave rise to the COMMD heterodecamer^47,49. While our phylogenetic analyses failed to resolve the origin of eukaryotic COMMD, multimer modelling of an Asgard archaeal homologue suggests that 8, 10 or 12 molecules may form a homomultimeric complex with high confidence (homomultimeric n = 10 in Fig. 4f; ipTM = 0.889, pLDDT = 88.4; see other homomultimers in Extended Data Fig. 5d,e).

In addition to homologues of eukaryotic proteins involved in cellular compartmentalization, we newly identified some proteins uniquely shared between eukaryotes and Asgard archaea. Despite limited sequence similarity, Ubiquitin fold modifier 1 (Ufm1) exhibits structural similarities to ubiquitin⁵⁰ and is implicated in DNA damage and endoplasmic reticulum stress responses, although it has not been characterized extensively⁵¹. We identified Ufm1 homologues in nine of the major Asgard archaeal clades, but not in any other prokaryote (Fig. 4g), indicating an Asgard archaeal provenance of Ufm1 in eukaryotes. Similarly, no prokaryotic homologues have yet been reported for the cyclin-dependent kinase 2-interacting protein (CINP), a protein involved in DNA replication complex and DNA damage control^52,53 that was recently also implicated in eukaryotic ribosome biogenesis⁵⁴. Our sequence similarity searches revealed it is present in five major Asgard archaeal clades, but not in other prokaryotes. Phylogenetic analyses revealed that eukaryotic sequences are monophyletic and cluster with Hodarchaeal sequences with good support (Fig. 4h, UFBOOT: 99%), suggesting that eukaryotes inherited this protein from their Heimdallarchaeial ancestor¹⁵.

Discussion

This study leverages state-of-the-art structural prediction tools to uncover a broader spectrum of ESPs in Asgard archaea. Large-scale analyses of the protein structure universe are becoming powerful approaches to predicting the origins and functions of proteins beyond the capabilities of standard sequence-based homology searches^55,56. Here, we explored the potential of these tools to gain insight into the archaeal provenance of the eukaryotic cell. By building and analysing a structural catalogue of the Asgard archaeal pangenome, we improved the annotation of Asgard archaeal proteins lacking significant sequence similarity. Our approach revealed many Asgard archaeal protein families, iESPs, that are structurally most similar to those of eukaryotes. As in previous studies that relied on sequence similarity searches to identify ESPs^12,13,15,23, we identified iESPs involved in cellular processes and signalling, including many that participate in intracellular trafficking, secretion and vesicular transport. However, our extended analyses retrieved many iESPs involved in additional processes, such as information storage and processing. This observation is in line with the general conception that many eukaryotic proteins involved in translation, transcription, replication and DNA repair have an archaeal provenance⁵⁷. Furthermore, we found that iESPs are also relatively enriched in metabolic functions, which contrasts with previous work indicating that metabolic functions in eukaryotes predominantly are of bacterial origin^58,59. The underlying reason for this observation is unclear. Yet, in congruence with recent work showing that eukaryotic central carbon metabolic pathways are in part of Asgard archaeal origin⁶⁰, these metabolic iESPs represent ancient homologues of eukaryotic proteins that have evolved beyond the limit of reliable sequence similarity detection. 
Given the scale of our dataset and the inclusion of high-confidence structure predictions independent of domain annotations, we anticipate that future studies will uncover novel domain architectures or previously uncharacterized folds among these proteins. Altogether, our analyses suggest that a thus far underappreciated fraction of the eukaryotic metabolic repertoire is of Asgard archaeal provenance. We point out that iESPs do not necessarily represent eukaryotic proteins that were directly inherited from Asgard archaea. Instead, they are Asgard proteins whose closest structural matches—often highly similar—are disproportionately found in eukaryotes. This pattern of enrichment suggests functional and evolutionary relevance, but not necessarily direct ancestry. Phylogenetic analyses to investigate the exact evolutionary relationship between iESPs and eukaryotic proteins are often hampered due to limited sequence similarity.

While several studies have revealed that some previously identified ESPs, such as small GTPases, actin homologues and several subunits of the endosomal sorting complex required for transport (ESCRT complex), are nearly universally distributed across Asgard archaeal genomes, many ESPs display a rather patchy distribution^13,15,23. This patchiness is evident, for example, for Asgard archaeal homologues of adaptor proteins, Golgi-associated retrograde protein, homotypic fusion and protein sorting, and class C core vacuole/endosome tethering complexes¹⁵. A similar observation can be made for iESPs, which predominantly display patchy distribution patterns across Asgard archaeal taxa. These patchily distributed ESPs and iESPs probably represent ancient protein families that were already present in the Asgard archaeal lineage from which eukaryotes emerged, and were subject to multiple loss events or horizontal gene transfers among Asgard archaeal lineages. Overall, given their patchy distribution, combined with the evolutionary distance between present-day Asgard archaeal and eukaryotic proteins, it remains unclear to what extent Asgard archaeal iESPs are functionally equivalent to their eukaryotic counterparts. While structural conservation has been shown to be tightly linked to protein function, even at high levels of sequence divergence⁶¹, future studies are needed to corroborate the functions of Asgard archaeal iESPs and ESPs. Biochemical studies and high-resolution structural analyses will be crucial in determining whether these iESPs operate in cellular contexts analogous to their eukaryotic counterparts. Such efforts will provide deeper insights into the transitional features of eukaryotic common ancestors and refine our models of early eukaryotic evolution.

Methods

Genome dataset selection

Dataset assembly

To construct a representative initial dataset, we retrieved all publicly available Asgard genomes from the National Center for Biotechnology Information (NCBI)⁶² up to 6 October 2022. This collection also included the recently published Asgard archaeal MAGs from refs. ^31,32. To ensure data quality, MAGs were evaluated using CheckM v1.2.1 (ref. ⁶³). Those MAGs with estimated completeness below 50% and estimated contamination exceeding 10% were identified as low-quality and consequently excluded from the initial dataset. Taxonomic classification of the initial dataset was conducted using GTDB-Tk v2.3.2 (ref. ⁶⁴) with default parameters. The final dataset comprised 936 genomes (Supplementary Data 1) covering all known Asgard archaeal lineages. Gene prediction was performed using Prokka v1.14.6 (ref. ⁶⁵) (options ‘--metagenome --kingdom Archaea’).

Phylogenomic inference of the species tree

To obtain an adequate outgroup dataset for inferring the phylogenetic relationships among the different Asgard archaeal lineages, we downloaded genus-level representatives of other archaeal lineages from the Genome Taxonomy Database (GTDB), release 214 (ref. ⁶⁶). We based our selection on genome quality score (GQS), defined as GQS = completeness (%) − 5 × contamination (%), as described in ref. ⁶⁷. In cases where two genomes had equal GQS, a random selection was made between the two. The final outgroup dataset included 311 genus-level representatives classified as members of the Thermoproteota (excluding Korarchaeia, to avoid artefacts derived from their uncertain affiliation⁶⁸ and their strong thermophilic compositions¹⁵), Methanobacteria B and Hadarchaeota lineages.
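The selection rule above (highest GQS per genus, with a random choice among ties) can be sketched as follows. The `Genome` record, its field names and the seeded random generator are hypothetical and chosen for illustration; only the GQS formula itself is taken from the text.

```python
import random
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Genome:
    accession: str
    genus: str
    completeness: float   # estimated completeness, in percent
    contamination: float  # estimated contamination, in percent


def gqs(genome: Genome) -> float:
    """Genome quality score: completeness (%) - 5 x contamination (%)."""
    return genome.completeness - 5.0 * genome.contamination


def pick_genus_representatives(
    genomes: Iterable[Genome], rng: Optional[random.Random] = None
) -> Dict[str, Genome]:
    """Return one genome per genus: the highest GQS, ties broken at random."""
    rng = rng or random.Random(0)  # seeded here only for reproducibility
    by_genus: Dict[str, list] = {}
    for genome in genomes:
        by_genus.setdefault(genome.genus, []).append(genome)
    representatives = {}
    for genus, members in by_genus.items():
        best = max(gqs(m) for m in members)
        tied = [m for m in members if gqs(m) == best]
        representatives[genus] = rng.choice(tied)
    return representatives
```

For instance, a genome with 95% completeness and 2% contamination scores 85, so it loses to a genome from the same genus with 90% completeness and 0.5% contamination, which scores 87.5.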

To infer the species tree, we performed phylogenomic analysis based on 47 non-ribosomal proteins, which were selected from a set of 200 markers previously identified as core archaeal proteins⁶⁹ (Supplementary Data 1). Homologous sequences within the final genome dataset were recruited using PSI-BLAST⁷⁰ v2.10.0+ (‘-evalue 1e-10’). All recruited sequences per taxon per protein marker were selected, aligned using MAFFT L-INS-i⁷¹ v7.453, followed by trimming with trimAl⁷² v1.4.rev22 (‘-gt 0.5’) and removal of sequences with more than 60% gaps. We constructed the individual protein phylogenies using IQ-TREE⁷³ v2.1.3, incorporating model selection from ModelFinder⁷⁴. The best-fitting model was selected among the combination of the LG, Q.pfam and WAG models by adding the mixture model C20 with rate heterogeneity (+R4 or +G4) (‘-mset LG+C20,Q.pfam+C20,WAG+C20 -mrate G4,R4 -mfreq ""’). We assessed branch robustness for each marker with 1,000 ultrafast bootstraps⁷⁵ and Shimodaira–Hasegawa-like approximate likelihood ratio tests (SH-aLRT)⁷⁶. From the resulting phylogenies, we removed sequences indicative of contamination, paralogy or horizontal gene transfer events and realigned and trimmed the remaining sequences as described above. The curated alignments were then concatenated into a supermatrix containing 1,244 sequences. To mitigate effects related to compositional bias, we performed heterogeneous site removal using χ² trimming⁷⁷ where the 50% most heterogeneous sites were removed, resulting in an alignment of 8,068 amino acid positions. We inferred a species phylogeny for the χ²-trimmed alignment using ModelFinder within IQ-TREE v2.1.3 to select among the LG + C10, Q.pfam + C10 and WAG + C10 models and rate heterogeneity components (+R4 or +G4). A posterior mean site frequency (PMSF) approximation⁷⁸ of the best-fitting model (WAG + C10 + R4) using the resulting tree was then employed to reconstruct a final tree with 100 non-parametric bootstrap pseudoreplicates.
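One step of this procedure, the removal of sequences with more than 60% gaps after trimming, is simple enough to sketch directly. The function names and the dictionary representation of an alignment are assumptions for illustration; the study's own scripts are not shown in this section.

```python
from typing import Dict


def gap_fraction(aligned_seq: str, gap_chars: str = "-") -> float:
    """Fraction of alignment columns in which this sequence has a gap."""
    if not aligned_seq:
        return 1.0  # treat an empty row as all gaps
    return sum(ch in gap_chars for ch in aligned_seq) / len(aligned_seq)


def drop_gappy_sequences(
    alignment: Dict[str, str], max_gap_fraction: float = 0.6
) -> Dict[str, str]:
    """Keep only sequences whose gap fraction does not exceed the cut-off.

    `alignment` maps sequence names to aligned (equal-length) strings.
    """
    return {
        name: seq
        for name, seq in alignment.items()
        if gap_fraction(seq) <= max_gap_fraction
    }
```

A sequence with exactly 60% gaps is retained under this reading, since the text excludes only sequences with more than 60% gaps.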

Clustering and selection of representative protein sequences

The dataset of 936 Asgard archaeal genomes comprised 2.68 million proteins. We assigned AsCOG domains²³ to 2.1 million Asgard archaeal proteins according to the best hit to an AsCOG member using MMseqs2 (ref. ⁷⁹) with ‘-e 0.001’ and ‘-s 9’, requiring that at least 80% of the best hit be covered. Unassigned proteins (0.6 million) or protein fragments (0.2 million) of at least 60 amino acids were clustered de novo using MMseqs2 (ref. ⁸⁰) v14.7e284 at 20% sequence identity and a coverage of 50%. We built sequence profiles for 14,467 (2,084,964 represented proteins) AsCOGs and 22,846 (448,812 represented proteins) de novo clusters with at least 5 members. To select an evolutionary representative sequence per cluster, we searched members of the 37,313 clusters with at least five members against their respective cluster profile using MMseqs2 (‘mmseqs search’), ranked them based on their bit-score and selected the highest-ranked sequence per cluster as the representative sequence⁷⁹.
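The final selection step can be sketched as a simple maximum over member-versus-profile bit scores. This is a hedged illustration: the tuple layout of `hits`, the `cluster_sizes` mapping and the tie-breaking behaviour (first sequence seen wins) are assumptions, and parsing of the actual MMseqs2 output is omitted.

```python
from typing import Dict, Iterable, Tuple


def select_representatives(
    hits: Iterable[Tuple[str, str, float]],
    cluster_sizes: Dict[str, int],
    min_members: int = 5,
) -> Dict[str, str]:
    """Pick, per cluster, the member with the highest bit score against the
    cluster's own profile.

    `hits` yields (cluster_id, protein_id, bit_score) tuples; clusters with
    fewer than `min_members` proteins are skipped.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for cluster_id, protein_id, bit_score in hits:
        if cluster_sizes.get(cluster_id, 0) < min_members:
            continue
        current = best.get(cluster_id)
        # Strict '>' means that, on a tie, the first member seen is kept.
        if current is None or bit_score > current[1]:
            best[cluster_id] = (protein_id, bit_score)
    return {cluster_id: protein for cluster_id, (protein, _) in best.items()}
```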

Protein structure prediction

Supplementing the ColabFold database with Asgard archaeal proteins

Protein structure prediction using AlphaFold2 (ref. ²⁷) has been shown to generally perform poorly if few sequences can be aligned to the target sequence²⁷. We therefore evaluated whether adding our Asgard archaeal protein dataset to the ‘genetic search’ workflow of ColabFold³³, an accelerated adaptation of AlphaFold2, would increase overall prediction quality. To this end, we implemented a version of the ‘genetic search’ workflow of ColabFold that queries the Asgard archaeal protein dataset (‘enriched’) in addition to the default databases (‘default’). For the enriched workflow, we added a third MMseqs2 sequence search step against the Asgard archaeal protein database, run with the same parameters after the searches against the two default ColabFold databases.

Comparing performance of structure prediction algorithms for an Asgard archaeon

To evaluate the performance of different structure prediction algorithms, as well as the ColabFold ‘default’ versus ‘enriched’ database, we created a test set of 100 randomly selected proteins from a reference Asgard archaeal proteome, P. syntrophicum, downloaded from UniProt (Supplementary Data 2; Proteome ID: UP000321408; accessed on 17 January 2023). We first predicted structural models from the primary sequences using the protein language model-based ESMFold v2.0.0 (ref. ³⁴) with option ‘-r 12’. To measure the quality of predictions, we used the average pLDDT score, which ranges from low to high confidence (0–100). We considered predictions with an average pLDDT ≥80 as high-quality, as a compromise between the suggested pLDDT ≥90 for ‘high accuracy’ and pLDDT ≥70 for ‘general correct backbone’ according to ref. ²⁷. Second, we generated multiple sequence alignments with the ‘genetic search’ module of ColabFold v1.3.0 (ref. ³³) using the default and the enriched database, respectively. We then ran the ColabFold prediction workflow on each alignment using the default ‘exhaustive’ setting and a premature stopping rule (‘early-stop’) designed to reduce computation time; specifically, the algorithm terminates if a pLDDT of at least 85 is reached or if the first prediction yields a pLDDT below 50 (‘--stop-at-score 85 --stop-at-score-below 50’). The ‘genetic search’ module was run on a computer equipped with two AMD EPYC 7H12 processors (64 cores each, 2.6 GHz, 280 W) and 1 TiB of memory, whereas the ‘prediction’ module was run on a system with four NVIDIA A100 graphics processing units (40 GiB HBM2 memory each).
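
The ‘early-stop’ rule described above can be restated as a short Python sketch. This is not ColabFold code; it only mirrors the rule as worded in the text (stop once a prediction reaches pLDDT 85, or stop if the first prediction falls below 50), and the function name and input format are hypothetical.

```python
def predictions_run_with_early_stop(plddt_scores, stop_at=85.0, stop_below=50.0):
    """Return the scores of the predictions that would be computed.

    plddt_scores: pLDDT of each successive model prediction, in run order
    (an assumed input used only to illustrate the stopping rule).
    """
    evaluated = []
    for index, score in enumerate(plddt_scores):
        evaluated.append(score)
        if score >= stop_at:
            break  # confident enough, no further predictions needed
        if index == 0 and score < stop_below:
            break  # first prediction is poor, give up on this protein
    return evaluated
```

For instance, a protein whose first prediction scores 40 is abandoned after one model, whereas one scoring 60 and then 86 stops after the second model.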

Protein structure prediction workflow

Based on the highest ratio of high-quality proteins and the lowest computational resource demands for our 100 test proteins, we opted for a hybrid approach combining protein language model-based and multiple sequence alignment-based prediction algorithms. We first used ESMFold v2.0.0 (ref. ³⁴) with ‘-r 12’ to calculate structural models for each representative Asgard archaeal protein. Second, structures with an average pLDDT <80 in ESMFold were predicted again using ColabFold v1.3.0 (ref. ³³) with the enriched database and the ‘early-stop’ settings. For large proteins that could not be folded with ESMFold v2.0.0 or ColabFold v1.3.0 because of excessive memory demands, we attempted folding with ColabFold v1.5.2.
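
The routing decision in this hybrid workflow can be sketched in Python as follows. The sketch assumes per-residue pLDDT values are available as a list of floats for each ESMFold model; the function names are hypothetical and the code only illustrates the pLDDT <80 criterion stated above.

```python
from statistics import mean

HIGH_QUALITY_PLDDT = 80.0  # average pLDDT threshold used in the text


def mean_plddt(per_residue_plddt):
    """Average pLDDT over all residues of a predicted model."""
    return mean(per_residue_plddt)


def needs_colabfold(per_residue_plddt, threshold=HIGH_QUALITY_PLDDT):
    """True if the ESMFold model is below the threshold and should be
    predicted again with the alignment-based method."""
    return mean_plddt(per_residue_plddt) < threshold
```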

Structure similarity searches

Best structural hit annotation

We searched Asgard archaeal structures reciprocally against SwissProt predicted structures (downloaded 8 July 2022) using Foldseek v6.29e2557 (ref. ²⁸) ‘foldseek search’ with ‘--max-seqs 10000’. To ensure robustness in structural comparisons, we used the default local structural alignment via Foldseek rather than relying on global fold similarity (for example, TM-score). This mitigates potential biases introduced by differences between ColabFold and ESMFold models, as functionally relevant local motifs remain detectable regardless of global conformational variations. We retained the highest bit-score non-overlapping hits along the query sequence to accommodate fusion proteins and checked for reciprocal best hits. We mapped the annotation of the SwissProt best hits to each query protein. In the same way, but unidirectionally, we searched Asgard archaeal structures against the Protein Data Bank and UniProt50 databases (downloaded 9 February 2023).
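
Retaining the highest bit-score non-overlapping hits along a query and checking reciprocity can be sketched in Python as below. The greedy strategy, the rule that any shared query position counts as overlap, the hit dictionary keys and the function names are all assumptions for illustration; the text does not specify these details.

```python
def select_non_overlapping_hits(hits):
    """Greedily keep the best-scoring hits that do not overlap on the query.

    hits: list of dicts with keys 'target', 'qstart', 'qend', 'bits'
    (assumed layout; coordinates are 1-based and inclusive).
    """
    selected = []
    for hit in sorted(hits, key=lambda h: h["bits"], reverse=True):
        overlaps = any(
            hit["qstart"] <= kept["qend"] and kept["qstart"] <= hit["qend"]
            for kept in selected
        )
        if not overlaps:
            selected.append(hit)
    # Report hits in query order so fused domains read left to right.
    return sorted(selected, key=lambda h: h["qstart"])


def is_reciprocal_best_hit(query, target, best_forward, best_reverse):
    """best_forward maps query -> best target; best_reverse maps target -> best query."""
    return best_forward.get(query) == target and best_reverse.get(target) == query


example_hits = [
    {"target": "A", "qstart": 1, "qend": 100, "bits": 200.0},
    {"target": "B", "qstart": 50, "qend": 150, "bits": 150.0},
    {"target": "C", "qstart": 120, "qend": 220, "bits": 100.0},
]
kept_targets = [h["target"] for h in select_non_overlapping_hits(example_hits)]
```

In this toy case hit B is discarded because it overlaps the better-scoring hit A, while C is kept as a second, non-overlapping region of the query.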

EggNOG annotation of SwissProt best hits

Proteins representing the best SwissProt hits were mapped against EggNOG v5 (ref. ⁸¹) with the emapper user interface (http://eggnog-mapper.embl.de/) with default parameters, and we extracted root non-supervised orthologous group (NOG) and eukaryotic NOG identifiers and functional categories.

Identification of eukaryotic hit-enriched structures

For each Asgard archaeal predicted structure, we collected the best 10,000 hits of predicted UniProt50 structures (downloaded 9 February 2023), which contains proteins from all domains of life, ensuring that our ESP identification pipeline inherently considers homologues across bacteria, archaea and eukaryotes. Per Asgard archaeal protein representative, we performed a one-tailed Fisher’s exact test with the function ‘fisher.test’ and the ‘alternative=less’ parameter with Bonferroni correction with the function ‘p.adjust’ in R v4.2.1 (ref. ⁸²) on the domain-level taxonomy of hit UniProt proteins to test for a statistical enrichment in eukaryotic sequences. To test for eukaryotic enrichment in only the most similar proteins, we also performed the same statistical test using only the top 5% bit-score percentile of the hits. Structures with an enrichment in hits to eukaryotic proteins were classified as candidate isomorphic (i)ESPs, that is, proteins that look structurally similar to proteins that are overrepresented in eukaryotes. We clustered all Asgard archaeal structures with Foldseek ‘foldseek cluster’ into clusters of isomorphic protein structures and identified structural clusters that were uniquely added by iESPs.
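
The statistical step can be illustrated with a self-contained Python sketch of a one-tailed Fisher's exact test and a Bonferroni adjustment. The analysis in the text was run in R with ‘fisher.test’ and ‘p.adjust’; the sketch below is an independent re-implementation for illustration, and because the layout of the 2×2 contingency table is not stated in the text, both tails are exposed and the table orientation is left to the caller as an assumption.

```python
from math import comb


def fisher_exact_one_tailed(a, b, c, d, alternative="less"):
    """One-tailed Fisher's exact test on the 2x2 table [[a, b], [c, d]].

    'less' sums hypergeometric probabilities for top-left counts <= a,
    'greater' for counts >= a, with all margins held fixed.
    """
    row1 = a + b
    col1 = a + c
    total = a + b + c + d
    lowest = max(0, row1 - (total - col1))
    highest = min(row1, col1)
    if alternative == "less":
        values = range(lowest, a + 1)
    elif alternative == "greater":
        values = range(a, highest + 1)
    else:
        raise ValueError("alternative must be 'less' or 'greater'")
    numerator = sum(comb(col1, x) * comb(total - col1, row1 - x) for x in values)
    return numerator / comb(total, row1)


def bonferroni(p_values):
    """Bonferroni adjustment: multiply by the number of tests, cap at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]
```

Whether enrichment in eukaryotic hits corresponds to the ‘less’ or the ‘greater’ tail depends on which cell of the table holds the eukaryotic hit count, so the orientation should be checked against the original R code before reuse.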

NCBI COG and KOG annotation of gene families

We created multiple sequence alignments for each Asgard archaeal protein cluster using FAMSA v2.2.2 (ref. ⁸³) with ‘-refine_mode on’. We performed profile–profile searches with the HHsuite3 (ref. ⁸⁴) program HHsearch v3.3.0 with parameters ‘-glob -M 50’ against the profile COG–eukaryotic orthologous groups (KOG) database (ftp://ftp.tuebingen.mpg.de/pub/protevo/toolkit/databases/hhsuite_dbs/COG_KOG.tar.gz)⁸⁵.

Mapping of ESPs described by Eme et al. (2023)

To identify conserved protein domains in the proteomes of the Asgard archaeal dataset, we used InterProScan v5.57-90.0 (ref. ⁸⁶) with default parameters and using hidden Markov models (HMMs) from the databases AntiFam v7.0 (ref. ⁸⁷), CDD v3.18 (ref. ⁸⁸), Coils v2.2.1 (ref. ⁸⁹), Gene3D v4.3.0 (ref. ⁹⁰), MobiDBLite v2.0 (ref. ⁹¹), PANTHER v15.0 (ref. ⁹²), Pfam v35.0 (ref. ⁹³), PIRSF v3.10 (ref. ⁹⁴), PRINTS v42.0 (ref. ⁹⁵), SFLD v4 (ref. ⁹⁶), SMART v7.1 (ref. ⁹⁷), SUPERFAMILY v1.75 (ref. ⁹⁸) and TIGRFAM v15.0 (ref. ⁹⁹).

We then identified the AsCOG and de novo cluster proteins containing at least 80% of the length of a Pfam or InterPro domain reported as an ESP¹⁵.

Phylogenetic inferences of iESPs

iESP selection

To illustrate how iESPs confer information about the origins of eukaryotic functions and their proteins, we selected several iESPs for phylogenetic analysis, based on the following criteria: the Asgard archaeal query structure is well covered (>80% of protein length) by its alignment to its best structure hit; the best (eukaryotic) structure hit reciprocally has the Asgard archaeal query structure as its best hit; eukaryotic structures are overrepresented among the hits (Fig. 3b); the eukaryotic hit structures are consistent (are evidently homologous to one another); they comprise eukaryote-relevant functions; neither the query nor the hit appears to embody particularly complex evolutionary histories (for example, they do not contain repeat domains or highly composite multidomain architectures); and the Asgard archaeal query is unlikely to represent contamination, as it is found in more than one Asgard archaeal taxon. Finally, we required that the candidates lack a well-scoring sequence-based hit to eukaryotic sequences, as determined by HHsearch; consequently, they fall into the ‘twilight zone’ of sequence homology (Fig. 3c).

Establishing remote sequence similarity between iESP and eukaryotic structure hits

Subsequently, we found that the iESPs, although divergent, retain sequence signals that connect them to the eukaryotic proteins they match structurally. For this, we sought to gradually expand the homologue set of the iESP via manually supervised, iterative HMM searches. In each round, we checked the newly hit proteins before adding them to the multiple sequence alignment, inspecting both their sequences and (predicted) protein structures to ensure that they were genuine homologues. We executed these profile HMM-based searches using online tools (HHpred and HMMer web server) as well as hmmsearch runs against our local databases (see description below). Note that, in addition to eukaryotic and Asgard archaeal sequences, we included bacterial and other archaeal sequences in the search database, as they may also have homologues that could help link the iESP and related Asgard archaeal sequences to their eukaryotic structural hits.

Selecting homologues for phylogenetic inference

We made use of three sequence datasets for retrieving sequences for the phylogenetic analysis. First, we subsampled our in-house Asgard archaeal set, including only a single representative protein set per species. This representative for a given species was selected based on the quality of the predicted proteomes, as reflected by their predicted completeness and contamination, measured by CheckM⁶³. Note that ‘species’ here signifies groups of genomes that can be clustered at the 95% average nucleotide identity level. Second, we used a subsampled version of an in-house eukaryotic dataset¹⁰⁰, including 25 eukaryotic taxa of all of the major eukaryotic groups, taking the taxon with the best, most complete, predicted proteome quality, as measured by BUSCO¹⁰¹. Third, we used a subsampled version of GTDB (r207)⁶⁶, from which the Asgard archaea were first removed, and then selected the best assembly for each family, again based on the CheckM quality parameters. Using the final, most inclusive yet accurate profile HMM obtained, and our manually determined bit-score cut-offs (described above), we ran hmmsearch against these three datasets and retrieved all sequences meeting the cut-off. Because COMMD and CINPL comprised virtually full-length hits, both in the structural comparisons and in our sequence similarity searches, we extracted the entire protein sequence of each hit protein. For Ufm1, we observed that some hits in our sequence searches were not full-length, and others contained multiple hit regions; in these cases, we extracted only the protein segment corresponding to the best-scoring hit. For the MVP, in addition to the smaller full-length phylogeny (Fig. 4a), we performed a broader phylogenetic analysis of the shoulder domain only, which is a type of Band 7 domain found in many prokaryotic and eukaryotic proteins⁴⁵,¹⁰², which are united in the SPFH (for stomatins, prohibitins, flotillins and HflK/C) family ‘clan’ (https://www.ebi.ac.uk/interpro/set/pfam/CL0433/entry/pfam/).
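
Selecting one representative proteome per species can be sketched in Python as follows. The text states only that completeness and contamination from CheckM guided the choice; the ordering used here (highest completeness first, then lowest contamination), the record layout and the function name are assumptions for illustration.

```python
def pick_species_representatives(genomes):
    """Choose one genome per species by quality.

    genomes: iterable of dicts with keys 'genome', 'species',
    'completeness' and 'contamination' (assumed layout, values in percent).
    Ranking by completeness, then by lower contamination, is an assumed rule.
    """
    best = {}
    for record in genomes:
        quality = (record["completeness"], -record["contamination"])
        current = best.get(record["species"])
        if current is None or quality > current[0]:
            best[record["species"]] = (quality, record["genome"])
    return {species: genome for species, (_, genome) in best.items()}


example_genomes = [
    {"genome": "g1", "species": "sp_A", "completeness": 92.0, "contamination": 3.0},
    {"genome": "g2", "species": "sp_A", "completeness": 95.5, "contamination": 4.0},
    {"genome": "g3", "species": "sp_B", "completeness": 88.0, "contamination": 1.0},
    {"genome": "g4", "species": "sp_B", "completeness": 88.0, "contamination": 0.5},
]
species_representatives = pick_species_representatives(example_genomes)
```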

Phylogenetic analysis and annotation of the phylogeny

For each family, we inferred gene trees using multiple sequence alignments generated by MAFFT (v7.505, mode L-INS-i)⁷¹ and the web server of PROMALS3D¹⁰³. For the latter, we used the default options, except for detecting and using homologues with three-dimensional structures (included DaliLite v5 (ref. ¹⁰⁴)), pairwise alignments between input three-dimensional structures (included DaliLite) and aligning sequences within groups in the first alignment stage (PROMALS instead of MAFFT). We supplemented PROMALS3D with predicted protein structures from diverse sequences in the AlphaFold Protein Structure Database, as well as with structures from our own predictions (described above), including those of the iESPs and, where available, other Asgard archaeal homologues. Before inferring the gene trees, we trimmed the multiple sequence alignment using BMGE v1.12 (settings: ‘-m BLOSUM30 --h 0.6 -g 0.7 -b 3’)¹⁰⁵, which selects good-quality aligned positions. However, in some cases (for example, COMMD MAFFT alignment), this produced very short alignments, prompting us to switch to trimAl (v1.4.1, mode ‘gappyout’)⁷². For phylogenetic inference in a maximum-likelihood framework, we used IQ-TREE v2.0.3 (settings ‘-B 1000 -m MFP -mset LG,JTT,Q.pfam,WAG,LG+C20,LG+C40,LG+C60,LG+C20+R+F,LG+C40+R+F,LG+C60+R+F,WAG+C20,WAG+C40,WAG+C60,WAG+C20+R+F,WAG+C40+R+F,WAG+C60+R+F,JTT+C20,JTT+C40,JTT+C60,JTT+C20+R+F,JTT+C40+R+F,JTT+C60+R+F,Q.pfam+C20,Q.pfam+C40,Q.pfam+C60,Q.pfam+C20+R+F,Q.pfam+C40+R+F,Q.pfam+C60+R+F’)⁷³ to first select the best evolutionary model using ModelFinder⁷⁴ and then infer a phylogeny with 1,000 ultrafast bootstraps⁷⁵. For each iESP/family, we subsequently selected the most informative and probably most accurate tree, which entailed choosing the alignment algorithm (MAFFT L-INS-i versus PROMALS3D) post hoc, based on ultrafast bootstrap support values at key branches and on the monophyly of sequence groups expected to be monophyletic. We coloured the branches in the tree according to the species group the sequences belong to: Eukaryota, Asgard archaea, Archaea (other) and Bacteria. We also annotated the eukaryotic clades with the names of their proteins, specifically labelling each clade reflecting a single gene in the last eukaryotic common ancestor. Trees were visualized using iTOL¹⁰⁶.

Visual representation of protein structures

Structural models were either visualized in ChimeraX v1.6.1 (Fig. 4b–f)¹⁰⁷ or in R with the ‘r3dmol’ package v0.1.2 (Fig. 4g,h) (https://github.com/swsoyee/r3dmol)¹⁰⁸.

Statistics and reproducibility

No statistical method was used to predetermine sample size. No data were excluded from the analyses. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment. For benchmarking structure prediction methods (Extended Data Fig. 1b–d), 100 proteins were randomly sampled from the proteome of P. syntrophicum (UniProt ID: UP000321408; Supplementary Data 2). Each protein was evaluated once per prediction condition; no technical replicates were performed. This sample size was selected to provide a representative yet computationally feasible comparison.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Asgard archaeal genome data were obtained from NCBI GenBank (https://www.ncbi.nlm.nih.gov/nucleotide), and identifiers can be found in Supplementary Data 1. Archaeal outgroup proteomes were downloaded from Genome Taxonomy Database (GTDB) release 214 (https://data.gtdb.ecogenomic.org/releases/release214/214.0/genomic_files_reps/gtdb_proteins_aa_reps_r214.tar.gz) and proteins for gene phylogenies from release 207 (https://data.gtdb.ecogenomic.org/releases/release207/207.0/genomic_files_reps/gtdb_proteins_aa_reps_r207.tar.gz). The Asgard protein database and all predicted structures, original multiple sequence alignments and IQ-TREE outputs are available via figshare at https://doi.org/10.6084/m9.figshare.26057632 (ref. ¹⁰⁹). The uncollapsed phylogenies can be found via the iTOL¹⁰⁶ website at https://itol.embl.de/tree/62145192210399341699888333 (CINP; Fig. 4a), https://itol.embl.de/tree/13722425212199811699868285 (COMMD; Fig. 4c) and https://itol.embl.de/tree/62145192210319901699902102 (Ufm1; Fig. 4d).

Code availability

Custom code is available via GitHub at https://github.com/stephkoest/structural_genomics.

References

Stanier, R. Y., Doudoroff, M. & Adelberg, E. A. The Microbial World (Prentice-Hall, 1963).
Vosseberg, J. et al. The emerging view on the origin and early evolution of eukaryotic cells. Nature 633, 295–305 (2024).
Betts, H. C. et al. Integrated genomic and fossil evidence illuminates life’s early evolution and eukaryote origin. Nat. Ecol. Evol. 2, 1556–1562 (2018).
Mahendrarajah, T. A. et al. ATP synthase evolution on a cross-braced dated tree of life. Nat. Commun. 14, 7456 (2023).
Cox, C. J., Foster, P. G., Hirt, R. P., Harris, S. R. & Embley, T. M. The archaebacterial origin of eukaryotes. Proc. Natl Acad. Sci. USA 105, 20356–20361 (2008).
Eme, L., Spang, A., Lombard, J., Stairs, C. W. & Ettema, T. J. G. Archaea and the origin of eukaryotes. Nat. Rev. Microbiol. 15, 711–723 (2017).
Roger, A. J., Muñoz-Gómez, S. A. & Kamikawa, R. The origin and diversification of mitochondria. Curr. Biol. 27, R1177–R1192 (2017).
Martijn, J. & Ettema, T. J. G. From archaeon to eukaryote: the evolutionary dark ages of the eukaryotic cell. Biochem. Soc. Trans. 41, 451–457 (2013).
Schwartz, R. M. & Dayhoff, M. O. Origins of prokaryotes, eukaryotes, mitochondria, and chloroplasts: a perspective is derived from protein and nucleic acid sequence data. Science 199, 395–403 (1978).
Yang, D., Oyaizu, Y., Oyaizu, H., Olsen, G. J. & Woese, C. R. Mitochondrial origins. Proc. Natl Acad. Sci. USA 82, 4443–4447 (1985).
Tamarit, D. et al. Description of Asgardarchaeum abyssi gen. nov. spec. nov., a novel species within the class Asgardarchaeia and phylum Asgardarchaeota in accordance with the SeqCode. Syst. Appl. Microbiol. 47, 126525 (2024).
Spang, A. et al. Complex archaea that bridge the gap between prokaryotes and eukaryotes. Nature 521, 173–179 (2015).
Zaremba-Niedzwiedzka, K. et al. Asgard archaea illuminate the origin of eukaryotic cellular complexity. Nature 541, 353–358 (2017).
Williams, T. A., Cox, C. J., Foster, P. G., Szöllősi, G. J. & Embley, T. M. Phylogenomics provides robust support for a two-domains tree of life. Nat. Ecol. Evol. 4, 138–147 (2020).
Eme, L. et al. Inference and reconstruction of the heimdallarchaeial ancestry of eukaryotes. Nature 618, 992–999 (2023).
Hartman, H. & Fedorov, A. The origin of the eukaryotic cell: a genomic investigation. Proc. Natl Acad. Sci. USA 99, 1420–1425 (2002).
Akıl, C. & Robinson, R. C. Genomes of Asgard archaea encode profilins that regulate actin. Nature 562, 439–443 (2018).
Akıl, C. et al. Insights into the evolution of regulated actin dynamics via characterization of primitive gelsolin/cofilin proteins from Asgard archaea. Proc. Natl Acad. Sci. USA 117, 19904–19913 (2020).
Survery, S. et al. Heimdallarchaea encodes profilin with eukaryotic-like actin regulation and polyproline binding. Commun. Biol. 4, 1024 (2021).
Hatano, T. et al. Asgard archaea shed light on the evolutionary origins of the eukaryotic ubiquitin-ESCRT machinery. Nat. Commun. 13, 3398 (2022).
Imachi, H. et al. Isolation of an archaeon at the prokaryote–eukaryote interface. Nature 577, 519–525 (2020).
Rodrigues-Oliveira, T. et al. Actin cytoskeleton and complex cell architecture in an Asgard archaeon. Nature 613, 332–339 (2023).
Liu, Y. et al. Expanded diversity of Asgard archaea and their relationships with eukaryotes. Nature 593, 553–557 (2021).
Rost, B. Twilight zone of protein sequence alignments. Protein Eng. 12, 85–94 (1999).
Illergård, K., Ardell, D. H. & Elofsson, A. Structure is three to ten times more conserved than sequence—a study of structural response in protein cores. Proteins 77, 499–508 (2009).
Vanni, C. et al. Unifying the known and unknown microbial coding sequence space. eLife 11, e67667 (2022).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
van Kempen, M. et al. Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. 42, 243–246 (2024).
Ruperti, F. et al. Cross-phyla protein annotation by structural prediction and alignment. Genome Biol. 24, 113 (2023).
Seong, K. & Krasileva, K. V. Prediction of effector protein structures from fungal phytopathogens enables evolutionary analyses. Nat. Microbiol. 8, 174–187 (2023).
Appler, K. E. et al. Oxygen metabolism in descendants of the archaeal-eukaryotic ancestor. Nature https://doi.org/10.1038/s41586-026-10128-z (2026).
Valentin-Alvarado, L. E. et al. Asgard archaea modulate potential methanogenesis substrates in wetland soil. Nat. Commun. 15, 6384 (2024).
Mirdita, M. et al. ColabFold: making protein folding accessible to all. Nat. Methods 19, 679–682 (2022).
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
Tran, L. T., Akıl, C., Senju, Y. & Robinson, R. C. The eukaryotic-like characteristics of small GTPase, roadblock and TRAPPC3 proteins from Asgard archaea. Commun. Biol. 7, 273 (2024).
Klinger, C. M., Spang, A., Dacks, J. B. & Ettema, T. J. G. Tracing the archaeal origins of eukaryotic membrane-trafficking system building blocks. Mol. Biol. Evol. 33, 1528–1541 (2016).
Vosseberg, J. et al. Timing the origin of eukaryotic cellular complexity with ancient duplications. Nat. Ecol. Evol. 5, 92–100 (2021).
Swarts, D. C. et al. The evolutionary journey of Argonaute proteins. Nat. Struct. Mol. Biol. 21, 743–753 (2014).
Bastiaanssen, C. et al. RNA-guided RNA silencing by an Asgard archaeal Argonaute. Nat. Commun. 15, 5499 (2024).
Leão, P. et al. Asgard archaea defense systems and their roles in the origin of eukaryotic immunity. Nat. Commun. 15, 6386 (2024).
Koonin, E. V. & Yutin, N. The dispersed archaeal eukaryome and the complex archaeal ancestor of eukaryotes. Cold Spring Harb. Perspect. Biol. 6, a016188 (2014).
Prokopchuk, G. et al. Lessons from the deep: mechanisms behind diversification of eukaryotic protein complexes. Biol. Rev. Camb. Philos. Soc. 98, 1910–1927 (2023).
Berger, W., Steiner, E., Grusch, M., Elbling, L. & Micksche, M. Vaults and the major vault protein: novel roles in signal pathway regulation and immunity. Cell. Mol. Life Sci. 66, 43–61 (2009).
Frascotti, G. et al. The Vault nanoparticle: a gigantic ribonucleoprotein assembly involved in diverse physiological and pathological phenomena and an ideal nanovector for drug delivery and therapy. Cancers 13, 707 (2021).
Daly, T. K., Sutherland-Smith, A. J. & Penny, D. In silico resurrection of the major vault protein suggests it is ancestral in modern eukaryotes. Genome Biol. Evol. 5, 1567–1583 (2013).
Casañas, A. et al. New features of vault architecture and dynamics revealed by novel refinement using the deformable elastic network approach. Acta Crystallogr. D 69, 1054–1061 (2013).
Healy, M. D. et al. Structure of the endosomal Commander complex linked to Ritscher–Schinzel syndrome. Cell 186, 2219–2237 (2023).
Healy, M. D. et al. Structural insights into the architecture and membrane interactions of the conserved COMMD proteins. eLife 7, e35898 (2018).
Laulumaa, S., Kumpula, E.-P., Huiskonen, J. T. & Varjosalo, M. Structure and interactions of the endogenous human Commander complex. Nat. Struct. Mol. Biol. 31, 925–938 (2024).
Komatsu, M. et al. A novel protein-conjugating system for Ufm1, a ubiquitin-fold modifier. EMBO J. 23, 1977–1986 (2004).
Zhou, X. et al. UFMylation: a ubiquitin-like modification. Trends Biochem. Sci. 49, 52–67 (2024).
Lovejoy, C. A. et al. Functional genomic screens identify CINP as a genome maintenance protein. Proc. Natl Acad. Sci. USA 106, 19304–19309 (2009).
Grishina, I. & Lattes, B. A novel Cdk2 interactor is phosphorylated by Cdc7 and associates with components of the replication complexes. Cell Cycle 4, 1120–1126 (2005).
Ni, C. et al. Labeling of heterochronic ribosomes reveals C1ORF109 and SPATA5 control a late step in human ribosome assembly. Cell Rep. 38, 110597 (2022).
Durairaj, J. et al. Uncovering new families and folds in the natural protein universe. Nature 622, 646–653 (2023).
Barrio-Hernandez, I. et al. Clustering predicted structures at the scale of the known protein universe. Nature 622, 637–645 (2023).
Yutin, N., Makarova, K. S., Mekhedov, S. L., Wolf, Y. I. & Koonin, E. V. The deep archaeal roots of eukaryotes. Mol. Biol. Evol. 25, 1619–1630 (2008).
Rivera, M. C. & Lake, J. A. The ring of life provides evidence for a genome fusion origin of eukaryotes. Nature 431, 152–155 (2004).
Brueckner, J. & Martin, W. F. Bacterial genes outnumber archaeal genes in eukaryotic genomes. Genome Biol. Evol. 12, 282–292 (2020).
Molina, C. S., Williams, T. A., Snel, B. & Spang, A. Chimeric origins and dynamic evolution of central carbon metabolism in eukaryotes. Nat. Ecol. Evol. 9, 613–627 (2025).
Friedberg, I. & Margalit, H. Persistently conserved positions in structurally similar, sequence dissimilar proteins: roles in preserving protein fold and function. Protein Sci. 11, 350–360 (2002).
Sayers, E. W. et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 50, D20–D26 (2022).
Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).
Chaumeil, P.-A., Mussig, A. J., Hugenholtz, P. & Parks, D. H. GTDB-Tk v2: memory friendly classification with the genome taxonomy database. Bioinformatics 38, 5315–5316 (2022).
Seemann, T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30, 2068–2069 (2014).
Rinke, C. et al. A standardized archaeal taxonomy for the Genome Taxonomy Database. Nat. Microbiol. 6, 946–959 (2021).
Parks, D. H. et al. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat. Microbiol. 2, 1533–1542 (2017).
Tahon, G. et al. Phylogenomics and ancestral reconstruction of Korarchaeota reveals genomic adaptation to habitat switching. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/2023.09.28.559970v2 (2023).
Petitjean, C., Deschamps, P., López-García, P., Moreira, D. & Brochier-Armanet, C. Extending the conserved phylogenetic core of archaea disentangles the evolution of the third domain of life. Mol. Biol. Evol. 32, 1242–1254 (2015).
Schäffer, A. A. et al. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res. 29, 2994–3005 (2001).
Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).
Capella-Gutiérrez, S., Silla-Martínez, J. M. & Gabaldón, T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25, 1972–1973 (2009).
Minh, B. Q. et al. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol. 37, 1530–1534 (2020).
Kalyaanamoorthy, S., Minh, B. Q., Wong, T. K. F., von Haeseler, A. & Jermiin, L. S. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat. Methods 14, 587–589 (2017).
Hoang, D. T., Chernomor, O., von Haeseler, A., Minh, B. Q. & Vinh, L. S. UFBoot2: improving the ultrafast bootstrap approximation. Mol. Biol. Evol. 35, 518–522 (2018).
Guindon, S. et al. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59, 307–321 (2010).
Viklund, J., Ettema, T. J. G. & Andersson, S. G. E. Independent genome reduction and phylogenetic reclassification of the oceanic SAR11 clade. Mol. Biol. Evol. 29, 599–615 (2012).
Wang, H.-C., Minh, B. Q., Susko, E. & Roger, A. J. Modeling site heterogeneity with posterior mean site frequency profiles accelerates accurate phylogenomic estimation. Syst. Biol. 67, 216–235 (2018).
Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
Steinegger, M. & Söding, J. Clustering huge protein sequence sets in linear time. Nat. Commun. 9, 2542 (2018).
Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47, D309–D314 (2019).
R Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2021).
Deorowicz, S., Debudaj-Grabysz, A. & Gudyś, A. FAMSA: fast and accurate multiple sequence alignment of huge protein families. Sci. Rep. 6, 33964 (2016).
Steinegger, M. et al. HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinformatics 20, 473 (2019).
Article PubMed PubMed Central Google Scholar
Biegert, A., Mayer, C., Remmert, M., Söding, J. & Lupas, A. N. The MPI Bioinformatics Toolkit for protein sequence analysis. Nucleic Acids Res. 34, W335–W339 (2006).
Article CAS PubMed PubMed Central Google Scholar
Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240 (2014).
Article CAS PubMed PubMed Central Google Scholar
Eberhardt, R. Y. et al. AntiFam: a tool to help identify spurious ORFs in protein annotation. Database 2012, bas003 (2012).
Article PubMed PubMed Central Google Scholar
Lu, S. et al. CDD/SPARCLE: the conserved domain database in 2020. Nucleic Acids Res. 48, D265–D268 (2020).
Article CAS PubMed PubMed Central Google Scholar
Linding, R. et al. Protein disorder prediction: implications for structural proteomics. Structure 11, 1453–1459 (2003).
Article CAS PubMed Google Scholar
Lees, J. et al. Gene3D: a domain-based resource for comparative genomics, functional annotation and protein network analysis. Nucleic Acids Res. 40, D465–D471 (2012).
Article CAS PubMed Google Scholar
Necci, M., Piovesan, D., Dosztányi, Z. & Tosatto, S. C. E. MobiDB-lite: fast and highly specific consensus prediction of intrinsic disorder in proteins. Bioinformatics 33, 1402–1404 (2017).
Article CAS PubMed Google Scholar
Thomas, P. D. et al. PANTHER: making genome-scale phylogenetics accessible to all. Protein Sci. 31, 8–22 (2022).
Article CAS PubMed Google Scholar
Mistry, J. et al. Pfam: the protein families database in 2021. Nucleic Acids Res. 49, D412–D419 (2021).
Article CAS PubMed PubMed Central Google Scholar
Wu, C. H. et al. PIRSF: family classification system at the Protein Information Resource. Nucleic Acids Res. 32, D112–114 (2004).
Article CAS PubMed PubMed Central Google Scholar
Attwood, T. K. The PRINTS database: a resource for identification of protein families. Brief. Bioinform. 3, 252–263 (2002).
Article CAS PubMed Google Scholar
Akiva, E. et al. The Structure–Function Linkage Database. Nucleic Acids Res. 42, D521–D530 (2014).
Article CAS PubMed Google Scholar
Letunic, I., Doerks, T. & Bork, P. SMART 7: recent updates to the protein domain annotation resource. Nucleic Acids Res. 40, D302–D305 (2012).
Article CAS PubMed Google Scholar
Wilson, D. et al. SUPERFAMILY—sophisticated comparative genomics, data mining, visualization and phylogeny. Nucleic Acids Res. 37, D380–D386 (2009).
Article CAS PubMed Google Scholar
Haft, D. H., Selengut, J. D. & White, O. The TIGRFAMs database of protein families. Nucleic Acids Res. 31, 371–373 (2003).
Article CAS PubMed PubMed Central Google Scholar
de Potter, B., Raas, M. W. D., Seidl, M. F., Verrijzer, C. P. & Snel, B. Uncoupled evolution of the Polycomb system and deep origin of non-canonical PRC1. Commun. Biol. 6, 1144 (2023).
Article PubMed PubMed Central Google Scholar
Manni, M., Berkeley, M. R., Seppey, M., Simão, F. A. & Zdobnov, E. M. BUSCO update: novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Mol. Biol. Evol. 38, 4647–4654 (2021).
Article CAS PubMed PubMed Central Google Scholar
Sokolskyi, T. H. Bacterial Major Vault Protein homologs shed new light on origins of the enigmatic organelle. Preprint at bioRxiv https://doi.org/10.1101/872010 (2019).
Pei, J. & Grishin, N. V. PROMALS3D: multiple protein sequence alignment enhanced with evolutionary and three-dimensional structural information. Methods Mol. Biol. 1079, 263–271 (2014).
Article PubMed PubMed Central Google Scholar
Holm, L. Dali server: structural unification of protein families. Nucleic Acids Res. 50, W210–W215 (2022).
Article CAS PubMed PubMed Central Google Scholar
Criscuolo, A. & Gribaldo, S. BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC Evol. Biol. 10, 210 (2010).
Article PubMed PubMed Central Google Scholar
Letunic, I. & Bork, P. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res. 49, W293–W296 (2021).
Article CAS PubMed PubMed Central Google Scholar
Meng, E. C. et al. UCSF ChimeraX: tools for structure building and analysis. Protein Sci. 32, e4792 (2023).
Article CAS PubMed PubMed Central Google Scholar
Rego, N. & Koes, D. 3Dmol.js: molecular visualization with WebGL. Bioinformatics 31, 1322–1324 (2015).
Article PubMed Google Scholar
Dataset for: Prediction of eukaryotic cellular complexity in Asgard archaea using structural modelling. figshare https://doi.org/10.6084/m9.figshare.26057632 (2026).

Download references

Acknowledgements

We thank F. Homa and V. de Jager for technical assistance and SURF (www.surf.nl) for supporting the use of the National Supercomputer Snellius, facilitated through a grant from the Dutch Research Council (NWO-2021.059, T.J.G.E.). This work was supported by the European Research Council Consolidator and Advanced Grants 817834 and 101142180, respectively (T.J.G.E.), the Dutch Research Council VI.C.192.016 (T.J.G.E.) and VI.Veni.212.099 (J.J.E.v.H.), the Volkswagen Foundation Grant 96725 (T.J.G.E.) and the Simons Foundation as part of the Moore-Simons Project on the Origin of the Eukaryotic Cell (Grant 73592LPI; https://doi.org/10.46714/735925LPI) (T.J.G.E. and B.J.B.). Computational resources were provided by the SURF Cooperative, grant no. EINF-2953.

Author information

Stephan Köstlbacher
Present address: AITHYRA GmbH, Research Institute for Biomedical Artificial Intelligence of the Austrian Academy of Sciences, Vienna, Austria
Valerie De Anda
Present address: Department of Functional and Evolutionary Ecology, University of Vienna, Vienna, Austria

Authors and Affiliations

Laboratory of Microbiology, Wageningen University and Research, Wageningen, The Netherlands
Stephan Köstlbacher, Jolien J. E. van Hooff, Kassiani Panagiotou, Daniel Tamarit & Thijs J. G. Ettema
Theoretical Biology and Bioinformatics, Department of Biology, Utrecht University, Utrecht, The Netherlands
Daniel Tamarit
Department of Marine Science, Marine Science Institute, University of Texas at Austin, Port Aransas, TX, USA
Valerie De Anda, Kathryn E. Appler & Brett J. Baker
Department of Integrative Biology, University of Texas at Austin, Austin, TX, USA
Valerie De Anda & Brett J. Baker

Authors

Stephan Köstlbacher
Jolien J. E. van Hooff
Kassiani Panagiotou
Daniel Tamarit
Valerie De Anda
Kathryn E. Appler
Brett J. Baker
Thijs J. G. Ettema

Contributions

S.K. and T.J.G.E. conceptualized the study. S.K. led orthology assignment, protein modelling and sequence homology searches, with support from J.J.E.v.H. Structural genomics analyses were performed by S.K. and J.J.E.v.H. Genome data generation and curation were carried out by K.E.A., B.J.B. and V.D.A., while phylogenetic analyses were conducted by J.J.E.v.H., S.K. and K.P. Data interpretation involved S.K., J.J.E.v.H., K.P., K.E.A., D.T. and T.J.G.E. Supervision was provided by T.J.G.E. S.K., J.J.E.v.H. and T.J.G.E. wrote the original draft, and all authors (S.K., J.J.E.v.H., K.P., D.T., K.E.A., V.D.A., B.J.B. and T.J.G.E.) contributed to reviewing and editing the paper.

Corresponding authors

Correspondence to Stephan Köstlbacher or Thijs J. G. Ettema.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Microbiology thanks Damien Devos, Robert Robinson and Rui Zhao for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Asgard archaeal phylogenomic tree and comparison of structural prediction algorithms.

a, Maximum-likelihood phylogenetic tree of 935 Asgard archaeal genomes, using Euryarchaeota and Thermoproteota archaeal representatives as outgroup. The tree is based on 47 concatenated non-ribosomal proteins (8,068 sites and 1,244 taxa), using IQ-TREE under the WAG+C10+R4 model. PMSF-approximated non-parametric bootstrap support ≥70 is indicated on branches. The scale bar represents the average expected substitutions per site. b–d, The evaluations of different aspects of the structure prediction workflow shown in the following panels were performed on a set of 100 randomly selected proteins of Prometheoarchaeum syntrophicum (n = 100; see Methods). b, Number of aligned reference sequences (x-axis) and average structure model pLDDT (y-axis) with the default (blue) and enriched (purple) database. Each dot represents a structure prediction for one of the 100 proteins. Lines show the linear regression between pLDDT and the number of aligned sequences, with shaded 95% confidence intervals. c, Number of high-quality structure predictions (pLDDT ≥80) based on different prediction strategies. d, Inference times of ColabFold prediction modules with different inference strategies, including the default setting and database, or the enriched database with either default settings or an early stop criterion (see Methods). Boxes represent the interquartile range (IQR), with the centre line showing the median. Whiskers extend to the most extreme data points within 1.5×IQR from the box. Outliers are shown as individual points.

Extended Data Fig. 2 Analyses of the Asgard archaeal protein structure similarity network.

a, Subgraph complementing the protein structure similarity network depicted in Fig. 3, once again highlighting Argonaute proteins. b, Distribution across Asgard archaeal groups of eight Asgard archaeal Argonaute-related iESPs contained in a single structural cluster.

Extended Data Fig. 3 Distribution of ESPs and iESPs in Asgard archaea.

Complementing Fig. 3d, this heatmap displays the presence of ESPs (green) and iESPs (purple) across Asgard archaeal genomes. Genomes (y-axis) are grouped by taxonomy, and structural clusters (x-axis) are sorted by conservation across genomes and functional categories.

Extended Data Fig. 4 Distribution of ESPs and iESPs across Asgard archaea by functional category of best structural hit.

Presence of eukaryotic signature proteins (ESPs) and isomorphic ESPs (iESPs) across Asgard archaeal genomes grouped by taxonomy (y-axis). Each column represents a distinct structural cluster, and clusters are ordered by the functional category (x-axis) assigned from predicted functional annotations. Asgard archaeal genomes are grouped into taxonomic lineages (abbreviation on y-axis). Black lines demarcate major Asgard archaeal clades. ESPs (green) and iESPs (purple) show distinct patterns of conservation across taxonomic groups and functional categories. This extended dataset builds on the high-level summary in Fig. 3 and Extended Data Fig. 3, providing deeper resolution into functional distributions of ESPs and iESPs. Functional categories (x-axis labels) follow COG annotations, reflecting major biological processes (information storage and processing, cellular processes and signaling, and metabolism), and are labeled by their letter codes: J, Translation, ribosomal structure and biogenesis; A, RNA processing and modification; K, Transcription; L, Replication, recombination and repair; B, Chromatin structure and dynamics; D, Cell cycle control, cell division, chromosome partitioning; T, Signal transduction mechanisms; M, Cell wall/membrane/envelope biogenesis; Z, Cytoskeleton; W, Extracellular structures; U, Intracellular trafficking, secretion, and vesicular transport; O, Posttranslational modification, protein turnover, chaperones; C, Energy production and conversion; G, Carbohydrate transport and metabolism; E, Amino acid transport and metabolism; F, Nucleotide transport and metabolism; H, Coenzyme transport and metabolism; I, Lipid transport and metabolism; P, Inorganic ion transport and metabolism; Q, Secondary metabolites biosynthesis, transport and catabolism; and Multiple, for clusters assigned to more than one category.

Extended Data Fig. 5 Phylogenetic and structural analyses of iESPs.

a, Protein domain phylogeny based on Band 7, MVP and related shoulder domains. The depicted phylogenetic tree is based on 90 aligned positions and was generated under the LG+C60+R7 model (see Methods). b, ipTM score of Asgard archaeal MVP homopolymers modeled with different numbers of subunits, with local optima highlighted. c, Multimer models of Lokiarchaeial MVP with different numbers of subunits. d, ipTM score of Asgard archaeal COMMD homopolymers modeled with different numbers of subunits, with local optima highlighted. e, Homo-multimer models of Lokiarchaeial COMMD-containing protein with different numbers of subunits.

Supplementary information

Supplementary Information

Supplementary Discussion and Fig. 1.

Reporting Summary

Supplementary Data 1–3

Supplementary Data 1. Spreadsheet with genome information of the outgroup genomes used for Extended Data Fig. 1a and the dataset of 936 Asgard archaeal draft genomes. Supplementary Data 2. Spreadsheet of UniProt protein IDs and annotations of the sampled Prometheoarchaeum syntrophicum proteins. Supplementary Data 3. Spreadsheet including the annotation of structures in the ESP and iESP structural clusters, as well as ESP and iESP proteins in Prometheoarchaeum syntrophicum.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Köstlbacher, S., van Hooff, J.J.E., Panagiotou, K. et al. Prediction of eukaryotic cellular complexity in Asgard archaea using structural modelling. Nat Microbiol 11, 747–758 (2026). https://doi.org/10.1038/s41564-026-02273-y

Received: 22 November 2024
Accepted: 20 January 2026
Published: 05 March 2026
Version of record: 05 March 2026
Issue date: March 2026
DOI: https://doi.org/10.1038/s41564-026-02273-y