Atlas of predicted protein complex structures across kingdoms

Qi, Xianzhi; Ye, Cheng; Liang, Jianqiang; Wen, Shimin; Li, Yuanyuan; Ding, Kai; Hao, Yongfu; Fei, Junjie; Mao, Weian; Li, Liupeng; Lin, Zhiyu; Shen, Yichong; Zhu, Hongjie; Hu, Yayun; Zhang, Rui; Ji, Pengli; Lu, Yafei; Liu, Bonan; Wang, Han; Chen, Yuxuan; Ma, Zhenguo; Yang, Peiyuan; Xu, Xinyu; Wu, Junlong; Zhu, Youyuan; Zou, Qiaosha; Zhu, Wencheng; Yao, Kelu; Li, Shuya; Xin, Hongyi; Ergu, Daji; Zeng, Jianyang; Xiao, Zhi-Xiong Jim; Shen, Chunhua; Cai, Ying; Yi, Yong; Ma, Dacheng

doi:10.1038/s41467-026-70884-4

Article
Open access
Published: 25 March 2026

Nature Communications volume 17, Article number: 4397 (2026)


Abstract

Protein complexes are fundamental to all biological processes. Public repositories have expanded to include millions of potential protein–protein interactions (PPIs) from human and diverse model organisms. Yet large-scale structural characterization of these complexes, especially across different biological kingdoms, has lagged far behind, leaving most potential and unidentified interactions unresolved. Here, we present a comprehensive atlas of 1.1 million predicted protein–protein interaction structures generated with the AlphaFold2-based ColabFold framework. This dataset spans proteome-wide interactions from bacteria, archaea, humans, mice, plants, and human–virus pairs. Overall, we identify 181,671 high-confidence protein complex structures, including 37,855 in the human interactome. Structural clustering revealed numerous conserved protein complex architectures shared across kingdoms, providing insights into previously uncharacterized biological functions. Supported by co-immunoprecipitation experiments, we further identify candidate viral receptors for Human mastadenovirus A and Papiine alphaherpesvirus 2. Comparative analyses integrating our complex structures with the AlphaFold monomeric structure database uncovered widespread gene fusion and fission events during evolution. Finally, we demonstrate how our dataset can enhance protein binding–surface prediction using deep learning approaches, illustrating its broad utility beyond structural modeling alone. Altogether, this atlas represents, to our knowledge, one of the most extensive cross-kingdom resources and opens avenues for future discoveries in various biomedical applications.

Introduction

Deep learning has transformed biological research by enabling accurate modeling of the three-dimensional structures of proteins and their complexes. Revolutionary tools such as AlphaFold2 (AF2)¹, RoseTTAFold², and ESMFold³ have considerably enhanced our understanding of protein mechanics by predicting protein monomer structures from amino acid sequences. AlphaFold-Multimer, an extension tailored for multimeric proteins, has significantly improved the accuracy of complex-structure prediction⁴. Additionally, combinatorial and hierarchical assembly algorithms and Monte Carlo tree search approaches built on AF2 have yielded high-confidence predictions for large protein complexes^5,6. Recently, unified deep-learning frameworks that integrate proteins, nucleic acids, small molecules, metals, and chemical modifications, including RoseTTAFold All-Atom⁷ and AlphaFold3⁸, have achieved unprecedented accuracy across the biomolecular spectrum. These technological breakthroughs are reshaping biological research, accelerating therapeutic discovery, and expanding the reach of structural modeling into nearly every aspect of molecular biology.

Proteome-scale monomer prediction has profoundly expanded our understanding of protein function, evolution, and drug design^9,10,11. The AlphaFold Protein Structure Database (AFDB) now contains over 214 million structures predicted by AlphaFold2, while the ESM Metagenomic Atlas contains more than 617 million structures predicted by ESMFold³. Structural-alignment-based clustering algorithms, such as Foldseek, enable large-scale mining of these protein structure databases^12,13,14. These algorithms have revealed numerous previously unidentified structures and enzymes through the clustering of a broad spectrum of protein structures^14,15. Two recent studies have predicted 67,715 and more than 350,000 novel viral protein structures, revealing unprecedented architectural diversity^16,17. Furthermore, AI-guided structural mining has enabled a new base-editing tool with enhanced activities and minimal off-target effects¹⁸. Collectively, these developments illustrate the growing role of large-scale computational prediction in diverse areas of biological discovery.

The assembly of both stable and transient protein complexes forms the cornerstone of virtually all biological processes. Understanding the architectures of these complexes is essential for elucidating and modifying their functional roles. Repositories of potential protein-protein interactions (PPIs) generated from high-throughput experimental techniques such as yeast two-hybrid systems or computational predictions have burgeoned, now comprising millions of entries^19,20. However, while small-scale predictions and experimental studies have revealed numerous bacterial and eukaryotic heterodimers and homo-oligomers^8,21,22,23, comprehensive and accurate structural characterization of complexes spanning multiple kingdoms remains limited. This gap hampers our ability to analyze interaction mechanisms, evolutionary conservation, and functional regulation across species.

Here, we develop an approach to identify potential protein-protein interacting pairs in prokaryotes using co-localization analysis and operon prediction. We compile potential protein interactions in 36 pathogenic bacteria and adjacent protein pairs in 188 prokaryotic species, identifying a total of 553,814 pairs for structural modeling, of which 108,879 are classified as high-confidence protein-protein interactions. Further, we predict 559,960 protein complexes in human, plant, mouse, and human-virus interactions. Collectively, we predict 1.1 million protein complexes, among which we identify 181,671 high-confidence protein pair structures, including 37,855 high-confidence predicted protein complexes in the human interactome. Building upon this extensive protein-protein interaction atlas, we conduct a series of in-depth analyses that include identifying large protein complexes, clustering these complexes across kingdoms for evolutionary analysis, coupling the structural human interactome with the human-virus interactome to assess viral interference in the context of the human protein network, and analyzing gene fusion and fission events throughout evolution. Finally, we illustrate the applicability of the high-confidence predicted protein complex dataset for identifying protein interaction surfaces. Our results provide a unified structural complex landscape of the proteome across biological domains, revealing evolutionary patterns and mechanistic insights that could not be obtained from sequence data alone. We anticipate that our large-scale atlas of predicted protein complexes across kingdoms and host-pathogen interactions will prove valuable for further research and applications.

Results

Identification of loci-linked protein complexes in bacteria and archaea

In bacteria and archaea, genes encoding proteins that participate in related biological functions are frequently co-localized on the genome, often forming operons. Such organization enables coordinated transcription and efficient regulation through a single promoter, producing polycistronic mRNA²⁴. To identify possible physical PPIs, we first detected operons using the Operon-mapper server²⁵ for 36 species of widely distributed pathogenic bacteria (Supplementary Table 1) and identified 21,609 operons (gene counts >1), with an average of 3.3 genes per operon (Fig. 1a and Supplementary Fig. 1). To capture protein pairs that are not located in the same operon, we further developed a pipeline that uses the co-localization of protein pairs as an indicator of their interdependence (Fig. 1a). Such an approach has been widely used in the discovery of CRISPR-associated proteins^26,27. In this study, we systematically cataloged all pairs of proteins located within 3 kilobases (kb) of each other on the genomes of pathogenic bacteria. A comprehensive protein dataset, including information about their positions in cognate contigs, was collected. This dataset consists of 18 million contigs from bacterial genomes sourced from the National Center for Biotechnology Information (NCBI) and encompasses ~0.89 billion proteins identified in coding sequence (CDS) regions. The co-localization score was defined as the proportion of contigs where homologs of two query proteins appeared within 3 kb, normalized by the total number of contigs containing either homolog. To determine a suitable threshold, we calculated the co-localization score of proteins in a batch of biosynthetic gene clusters²⁸. We observed a significant enrichment of protein pairs with co-localization scores exceeding 0.4. Consequently, we established a cutoff of 0.4 for subsequent analyses (Supplementary Fig. 2). In total, we identified 164,037 potential protein pairs across 36 pathogenic bacteria using both operon analysis and co-localization mining (Fig. 1b).
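
The co-localization score defined above can be illustrated with a minimal Python sketch. It assumes homolog hits are available as (contig, start, end) tuples and that "within 3 kb" refers to the gap between the two coding regions; the function name and input layout are hypothetical and are not taken from the authors' pipeline.

```python
from collections import defaultdict

MAX_GAP_BP = 3_000  # 3 kb window stated in the text


def colocalization_score(hits_a, hits_b, max_gap=MAX_GAP_BP):
    """Fraction of contigs carrying either homolog in which both lie within max_gap.

    hits_a, hits_b: iterables of (contig_id, start, end) for homologs of
    query proteins A and B (hypothetical input layout).
    """
    by_contig_a = defaultdict(list)
    by_contig_b = defaultdict(list)
    for contig, start, end in hits_a:
        by_contig_a[contig].append((start, end))
    for contig, start, end in hits_b:
        by_contig_b[contig].append((start, end))

    # Denominator: contigs that contain a homolog of A or of B.
    either = set(by_contig_a) | set(by_contig_b)
    if not either:
        return 0.0

    def gap(x, y):
        # Distance between two intervals; 0 when they overlap (assumed convention).
        return max(0, max(x[0], y[0]) - min(x[1], y[1]))

    # Numerator: contigs where at least one A/B homolog pair is close enough.
    close = 0
    for contig in set(by_contig_a) & set(by_contig_b):
        if any(gap(a, b) <= max_gap
               for a in by_contig_a[contig]
               for b in by_contig_b[contig]):
            close += 1
    return close / len(either)
```

Under this reading, a pair would be retained when the score exceeds the 0.4 cutoff chosen in the text.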

We used ColabFold v1.5.5 to generate 5 PDBs for each protein pair with up to 20 cycles (Supplementary Fig. 3)²⁹. While preparing this manuscript and in step with the release of AlphaFold3⁸, we also compared the prediction accuracy of AlphaFold3 and other AlphaFold3-inspired models, including Boltz-2³⁰, Chai-1³¹ and Protenix³², with that of ColabFold Multimer. We found that, on heterodimers, AlphaFold3 achieved the highest accuracy, with Boltz-2 performing comparably; for homodimers, multiple sequence alignment (MSA)–based methods showed similar accuracy (Supplementary Fig. 4). To distinguish true interactions from artifacts, we assessed each predicted complex using multiple established confidence metrics, including Predicted DockQ (pDockQ)³³, pDockQ2³⁴, Local Interaction Score (LIS), Local Interaction Area (LIA)³⁵, AlphaFold Multimer-derived interface predicted Template Modeling (ipTM) score, and linear models specifically designed for homodimer identification³⁶. A previous study³⁵ demonstrated that, when the top-ranked model among the generated PDBs is evaluated for each predicted PPI complex, the Best LIS metric provides better Receiver Operating Characteristic (ROC) performance (that is, better discrimination between true positives and false positives) for heterodimer identification than ipTM, pDockQ, and pDockQ2. Furthermore, as that study recommended, we confirmed that the combination of Best LIS and Best LIA metrics yields the highest precision rate among the evaluated scoring schemes (Supplementary Fig. 5a). Therefore, we adopted a threshold of Best LIS ≥ 0.203 and Best LIA ≥ 3432 as criteria to identify high-confidence heterodimeric PPIs for subsequent analyses. As a result, we identified a total of 22,216 high-confidence protein pairs across the 36 species of pathogenic bacteria (Fig. 1d). We also observed that increasing the Best LIS threshold improved precision but concurrently reduced the recall rate. For instance, applying a Best LIS threshold of 0.6 resulted in a precision rate exceeding 90%, whereas the recall rate fell below 15% (Supplementary Fig. 5b). Therefore, we recommend selecting thresholds tailored to the specific objectives and tolerance for false negatives of each study.
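
The threshold rule and the precision–recall trade-off described above can be sketched as follows. This is an illustrative example only: it assumes Best LIS and Best LIA have already been computed for each pair, and the function names and benchmark layout are hypothetical.

```python
LIS_MIN = 0.203  # Best LIS threshold stated in the text
LIA_MIN = 3432   # Best LIA threshold stated in the text


def is_high_confidence(best_lis, best_lia, lis_min=LIS_MIN, lia_min=LIA_MIN):
    """Apply the joint Best LIS / Best LIA cutoff to one predicted pair."""
    return best_lis >= lis_min and best_lia >= lia_min


def precision_recall(scored, lis_min=LIS_MIN, lia_min=LIA_MIN):
    """Precision and recall on a labeled benchmark.

    scored: list of (best_lis, best_lia, is_true_interaction) tuples
    (hypothetical layout for a benchmark with known labels).
    """
    tp = fp = fn = 0
    for lis, lia, truth in scored:
        called = is_high_confidence(lis, lia, lis_min, lia_min)
        if called and truth:
            tp += 1
        elif called and not truth:
            fp += 1
        elif truth:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall
```

Sweeping `lis_min` over a benchmark in this way is one simple route to choosing a threshold that matches a study's tolerance for false negatives.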

In addition to heterodimer predictions, we also predicted the homo-oligomeric states of 76,429 proteins within the identified potential protein pairs. To evaluate the metrics for identifying true homodimers in ColabFold modeling, we constructed a benchmark dataset comprising 411 proteins from the PDB, curated in previous studies³⁶ to remove redundancy and eliminate overlap with the AlphaFold2 training set. Initially, proteins resolved as monomers by X-ray crystallography were used as the negative dataset. As for heterodimers, Best LIS showed the best ROC performance when we analyzed the top-ranked model of each prediction (Supplementary Fig. 6a). Following previous research³⁶, we also reclassified certain assemblies from dimers to monomers to correct for likely crystal-packing artifacts. This adjustment improved the overall classification performance, as evidenced by an increased AUC (Supplementary Fig. 6a and b). When we applied thresholds of Best LIS ≥ 0.203 and Best LIA ≥ 3432 (Supplementary Fig. 6c), a significant proportion of these proteins were predicted to form homodimers (Fig. 1e), consistent with the previous report²¹.

As expected, proteins encoded by adjacent CDSs exhibited the highest likelihood of forming high-confidence physical PPIs (Fig. 1f), and higher co-localization scores were positively associated with a greater prevalence of predicted high-confidence PPIs (Supplementary Fig. 7). To broaden our understanding of protein-protein interactions across the bacterial and archaeal domains, we collected representative genomes from each of the 188 prokaryotic phyla, encompassing all known archaeal or bacterial phyla, after excluding phyla related to our collected pathogens (Fig. 1g, Supplementary Fig. 8, and Supplementary Data 1).

After selecting a representative genome for each phylum, we examined the proportion of the entire proteome at the species, phylum, and domain (super-kingdom) levels that is accounted for by our selected proteins and their homologs. To this end, we performed protein clustering using MMseqs2³⁷. As expected, at the species level, whether in representative species of archaea or bacteria, the vast majority of protein clusters, apart from a few small ones, contained proteins from the target genome and collectively covered more than 90% of all proteins found in the sampled genomes of that species (Supplementary Figs. 9 and 10). When the analysis was extended to the phylum level, coverage naturally decreased but remained substantial. For example, at a cluster size threshold of ≥200 members, 84.7% of all Chlamydiota proteins were included, and more than 90% of the clusters still contained C. trachomatis proteins. Similarly, across 2,168 Cyanobacteriota genomes, applying a ≥ 1000-protein cluster threshold retained 44.4% of all proteins, and more than 90% of these clusters contained N. linckia proteins. These results suggest that numerous high-abundance, conserved proteins are shared well beyond species boundaries. At the domain level, the same trend was observed (Supplementary Fig. 11). Together, these data suggest that selecting only representative species from each phylum is sufficient to capture most evolutionarily conserved proteins, whereas additional predictions are required to resolve proteins such as strain-specific gene products that lie in the long tail of small clusters.

In total, we extracted 313,348 neighboring CDS pairs and used ColabFold to predict their corresponding protein assemblies. Notably, 47,668 predicted heterodimers were identified as high-confidence interacting pairs (Supplementary Fig. 12), underscoring the remarkable diversity of protein complexes across bacterial and archaeal kingdoms and highlighting their potential as a valuable resource for further investigation.
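
As a rough illustration of how neighboring CDS pairs might be enumerated, the sketch below pairs consecutive coding sequences on each contig. It assumes that "neighboring" means adjacent in genomic order on the same contig regardless of strand, which the text does not specify; the record layout is hypothetical.

```python
from itertools import groupby


def neighboring_cds_pairs(cds_records):
    """Return pairs of protein IDs whose CDSs are consecutive on a contig.

    cds_records: iterable of (contig_id, start, protein_id) tuples
    (hypothetical layout; strand is ignored in this sketch).
    """
    # Order by contig, then by start coordinate within each contig.
    ordered = sorted(cds_records, key=lambda r: (r[0], r[1]))
    pairs = []
    for _, group in groupby(ordered, key=lambda r: r[0]):
        ids = [record[2] for record in group]
        # Pair each CDS with its immediate downstream neighbor.
        pairs.extend(zip(ids, ids[1:]))
    return pairs
```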

Beyond colocalized gene pairs, we aimed to identify interactions between proteins encoded at distant loci within prokaryotic genomes. As a proof of concept, we analyzed the Campylobacter jejuni genome and surveyed 21,676 reference genomes. As shown in Supplementary Fig. 13, reducing the original 200-step diffusion process to a single step yielded a time-efficient MSA-free Chai-1 model (MSA-free Light Chai-1 model) that, although lacking 3D structure generation capability, produced ipTM scores that correlated well with ColabFold Multimer predictions (r ≈ 0.68) and exhibited strong discriminatory power for PPIs (AUC = 0.783) (Supplementary Fig. 14). Therefore, all Campylobacter jejuni protein pairs that passed stringent genomic co-occurrence (score ≥ 0.5) and length filters were first evaluated using the MSA-free Light Chai-1 model, and the top-ranked candidates were subsequently modeled in 3D with ColabFold-Multimer. For example, this screen revealed a heterodimer between an RDD family protein and an SH3 domain–containing C40 family peptidase. Such an integrated approach could serve as a general strategy to expand the putative prokaryotic interactome (Supplementary Fig. 15).
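
The two-stage screening strategy described above (a cheap pre-filter and fast scoring, followed by full 3D modeling of the top candidates) can be outlined as below. This is a schematic sketch: `fast_score` stands in for the single-step Chai-1 ipTM estimate, `length_ok` for the unspecified length filter, and `top_k` for the unspecified number of candidates passed on; all three are placeholders supplied by the caller.

```python
def two_stage_screen(candidates, fast_score, top_k,
                     min_cooccurrence=0.5, length_ok=lambda a, b: True):
    """Shortlist protein pairs for expensive 3D modeling.

    candidates: iterable of (protein_a, protein_b, cooccurrence_score).
    fast_score: callable (protein_a, protein_b) -> float, a cheap
        interaction score (placeholder for the fast ipTM estimate).
    length_ok: callable implementing the length filter (bounds not given
        in the text, so it defaults to accepting everything).
    """
    # Stage 0: genomic co-occurrence and length filters.
    shortlisted = [
        (a, b) for a, b, cooc in candidates
        if cooc >= min_cooccurrence and length_ok(a, b)
    ]
    # Stage 1: rank by the fast score, highest first.
    ranked = sorted(shortlisted, key=lambda pair: fast_score(*pair), reverse=True)
    # Stage 2 (not shown): model the returned pairs in 3D.
    return ranked[:top_k]
```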

In total, we predicted 553,814 protein pairs, including both heterodimers and homodimers, and identified 108,879 high-confidence physical protein complexes for subsequent analyses.

Reconstruction of multi-component protein complexes in prokaryotes

To de novo chart multi-component protein assemblies from our predicted binary PPIs, we built comprehensive interaction maps across 36 pathogenic bacterial species. In these networks, proteins form discrete communities corresponding to putative complexes; for example, Klebsiella pneumoniae exhibits highly cohesive clusters (Fig. 2a). A prominent and recurring feature across species is the ribosome, which forms a robust community in most organisms (Supplementary Fig. 16). Additionally, we recovered multiple large assemblies, including the ethanolamine-utilization carboxysome, type I fimbriae, the urease complex, and the recently described propanediol-utilization metabolosome³⁸. Across the 36 pathogens, we identified 3,803 communities comprising >2 proteins, with the per-species counts ranging from 16 to 242. The largest community contained 39 subunits (Fig. 2b). Collectively, these results indicate that community detection on predicted PPI maps can recover diverse multi-component protein assemblies at proteome scale.
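
To make the idea of grouping binary PPIs into putative complexes concrete, the following sketch groups proteins into connected components of the high-confidence interaction graph. The text does not name the community-detection algorithm that was used, so connected components are only a simple stand-in here, and a modularity-based method would generally split large components further.

```python
def communities(edges, min_size=3):
    """Group proteins into connected components of a PPI graph.

    edges: iterable of (protein_a, protein_b) high-confidence pairs.
    min_size: smallest group to report (3 mirrors the ">2 proteins"
        criterion in the text).
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    # Largest groups first.
    return sorted((g for g in groups.values() if len(g) >= min_size),
                  key=len, reverse=True)
```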

Having delineated communities largely from heterodimeric edges, we next asked whether their constituent subunits also self-associate, thereby assembling into higher-order complexes. As a case study in a virulence-linked module, we examined copper homeostasis. Hosts can deploy copper intoxication to restrict invading pathogens, and bacteria counter with copper-sensing and efflux systems³⁹. In Salmonella enterica, we identified a CopRS two-component system in which the sensor kinase CopS was predicted to form a homodimer, while CopS and the response regulator CopR were predicted to associate as a heterodimer. By combining these interfaces, we reconstructed a quaternary CopRS assembly using CombFold⁶ (Fig. 2c), providing a structural hypothesis that could inform the design of targeted inhibitors.

Building on complexes inferred from heterodimeric interactions, we next examined whether homodimerization could nucleate higher-order assemblies, especially in virulence modules. The virulence factor database encompasses 27,982 potential virulence factors across various bacterial species, covering a spectrum of mechanisms such as exotoxin production, adherence, and immune modulation⁴⁰. We generated a comprehensive set of 26,490 homodimeric structures and identified structural symmetries using QSproteome²¹. For instance, we reconstructed the hexameric assembly of the Hcp family type VI secretion system effector, which displays a hollow, ring-shaped configuration (Fig. 2d). We hypothesized that in the 36 pathogenic bacteria, some of the predicted homodimers could further assemble into homo-oligomers, whose subunits might also interact with other proteins to form higher-order assemblies. To test this hypothesis, we developed a computational pipeline for comprehensive screening (Supplementary Fig. 17 and Fig. 2e–j). For example, we discovered that in Listeria monocytogenes, EutL and a TIGR02536 family ethanolamine utilization protein interact with each other and can each form a trimer separately. We therefore assembled the two-layered 6-subunit complex using our pipeline. Furthermore, we reconstructed several higher-order complexes, such as the phage baseplate, the phage tail, and the ethanolamine utilization complex, revealing the potential for the assembly of higher-order complexes through the integration of homodimers and heterodimers (Fig. 2e–j).

Collectively, these analyses show that predicted structural interactomes recover a wide range of multi-component assemblies, underscoring the architectural and mechanistic breadth of prokaryotic protein complexes.

The atlas of human and model organism protein complex structures

To extend the atlas to eukaryotic systems, we compiled candidate PPIs from multiple large-scale resources, including the HI-Union Human Reference Interactome⁴¹, which has previously been modeled with FoldDock^23,33, and experimentally supported human protein–protein interaction candidates from the STRING database¹⁹. We filtered the candidates from the STRING database using the pLDDT score of each monomer obtained from the AFDB database, retaining only those with pLDDT scores above 70 and protein lengths between 150 and 800 amino acids. The two datasets overlap by 3231 pairs, and in total we collected 278,167 potential candidates. Subsequently, we predicted the structures with ColabFold using up to 20 cycles to improve accuracy. We identified 37,855 high-confidence structures (Best LIS ≥ 0.203 and Best LIA ≥ 3432), representing a 12-fold increase relative to previously published AlphaFold-based predictions of human PPIs²³ (Fig. 3a-c).
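
The monomer-level pre-filter applied to the STRING candidates can be sketched as follows. The sketch assumes that the pLDDT value is a per-protein mean, that the length bounds are inclusive, and that both partners must pass; none of these details is stated explicitly in the text, and the names are hypothetical.

```python
def passes_monomer_filter(mean_plddt, length,
                          min_plddt=70.0, min_len=150, max_len=800):
    """Check one monomer against the pLDDT and length criteria."""
    return mean_plddt > min_plddt and min_len <= length <= max_len


def filter_pairs(pairs, monomers):
    """Keep candidate pairs whose two partners both pass the monomer filter.

    pairs: iterable of (id_a, id_b).
    monomers: dict mapping protein ID -> (mean_plddt, length); proteins
        missing from the dict are treated as failing (assumption).
    """
    kept = []
    for a, b in pairs:
        if (a in monomers and b in monomers
                and passes_monomer_filter(*monomers[a])
                and passes_monomer_filter(*monomers[b])):
            kept.append((a, b))
    return kept
```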

**Fig. 3: The Atlas of Human Protein Complex Structures.**

Since entries from the STRING database were included without applying internal confidence score (combined score) filtering, a total of 1.47 million redundant protein–protein interaction pairs were initially considered. We further compared the STRING combined scores with our predicted LIS scores. Overall, no clear linear relationship was observed between the two measures (Pearson correlation coefficient, r = 0.155), suggesting that even interactions with low STRING scores should be considered in subsequent predictions (Supplementary Fig. 18).

Compared with proteins from prokaryotic species, human proteins are generally longer, which complicates structural prediction and reduces accuracy (Supplementary Fig. 19a). Accordingly, we observed a higher proportion of low-confidence predictions among human proteins (Supplementary Fig. 19b). To further assess whether AlphaFold-based modeling can discriminate true human protein interactions, we selected the PRKCZ–NPM1 pair for detailed analysis. As shown in Fig. 3d, although both monomers contained extensive low-confidence regions and the ipTM score was low (0.366), the Best LIS and Best LIA metrics exceeded the defined thresholds, indicating a high-confidence interaction. In endogenous assays using H1299 cells, we performed co-immunoprecipitation (Co-IP) and cellular co-localization analyses, which confirmed a significant protein–protein interaction (Fig. 3d-g).

In addition, we collected 200,558 potential PPIs across multiple model organisms, including Mus musculus and Arabidopsis thaliana, leading to the identification of a total of 19,812 high-confidence protein complex structures (Supplementary Figs. 20 and 21). Comparative structural analyses across divergent kingdoms may also reveal the evolutionary trajectories that have driven the diversification of these complexes. Remarkably, several protein complexes were found to be highly conserved across the tree of life, underscoring their fundamental biological importance (Fig. 3h). For example, human glutaredoxin 3 (Glrx3) is an essential [2Fe-2S]-binding protein that forms [2Fe-2S]-bridged complexes with human BolA2. It plays key roles in immune cell responses, embryogenesis, cancer cell proliferation, and the regulation of cardiac hypertrophy⁴². We found that this complex closely resembles the complex formed by the BolA/IbaG family iron-sulfur metabolism protein and the Grx4 family monothiol glutaredoxin in Brucella melitensis. Notably, human Glrx3 has evolved to contain three tandem-repeat domains (Fig. 3i and j). Altogether, we identified 57,667 high-confidence protein-protein interaction complexes in humans and model organisms. This expansion represents a potentially significant step forward in our understanding of the structural landscape of protein complexes in humans and other model organisms.

Atlas of structure-predicted human-virus protein interactions

Viruses exploit a sophisticated network of host-virus PPIs to hijack cellular processes, including endocytosis, nuclear transport, protein translation, and secretion. In response, host cells activate a complex transcriptional program mediated by PPIs, thereby triggering innate antiviral defenses, regulating viral replication, and stimulating the adaptive immune response⁴³. High-throughput experimental and computational approaches have collectively identified a large number of PPI candidates, greatly advancing our understanding of the human-virus interactome⁴⁴. However, the number of host-virus PPI complexes with experimentally resolved three-dimensional structures remains extremely limited. The comprehensive structural characterization of these PPIs, including the precise three-dimensional arrangements of the interacting proteins, will provide critical insights into the molecular mechanisms underlying viral pathogenesis and host defense strategies. Two curated databases of predicted human–virus protein–protein interactions, HVIDB⁴⁵ and P-HIPSTer⁴⁴, provide candidate pairs for structure prediction. Due to computational constraints (P-HIPSTer >280k interactions; HVIDB 48,643 PPIs), we retained all HVIDB entries and subsampled P-HIPSTer. After removing duplicates, we collected a total of 81,235 unique PPI candidates covering 3531 virus proteins and 8532 host proteins (Fig. 4a). Novelty was evaluated against the AF-Multimer training set, and 62 virus–human PPIs (at a 70% sequence identity threshold) were found to match training-set structures for both chains. Among the analyzed virus–human PPIs, 5119 (5.72%) were predicted to have Best LIS scores above 0.203 and Best LIA scores above 3432, indicating high-confidence physical interactions.

In addition, we found that some human proteins could interact with distinct viral proteins, whereas certain viral proteins were capable of binding to multiple host target proteins, suggesting the potential involvement of key viral mediators and critical human target proteins (Fig. 4a). Among these interactions, the top 15 viral or human proteins are illustrated in Fig. 4b and c, where several 14-3-3 family members display broad interactions with diverse viral proteins. We specifically focused on host membrane proteins, given that their interactions may mediate viral entry into cells and thus represent important targets for therapeutic intervention. CD4, a crucial receptor for Human Immunodeficiency Virus (HIV) and an important cellular biomarker, was predicted to interact with both the glycoprotein of Zaire ebolavirus and the membrane glycoprotein E3 CR1-beta of Human mastadenovirus A (Fig. 4d). These interactions exhibited similar binding sites on CD4, and we validated the interaction between the extracellular domains of CD4 and E3 from Human mastadenovirus A using co-immunoprecipitation (Co-IP) assays (Fig. 4e). We also confirmed the interaction between human HLA-DRA and UL44 of Papiine alphaherpesvirus 2 using the same experimental approach (Fig. 4e), suggesting that our AlphaFold-based protein complex structure modeling approach could potentially identify key viral entry receptors on host cells (Supplementary Fig. 22a).

We hypothesized that viral proteins interfere with normal host protein–protein interactions by competing for binding sites or forming alternative complex assemblies. We integrated our predicted human protein complex atlas with the human–virus interactome and identified 78,191 ternary relationships with high confidence. In these relationships, viral proteins interact with human proteins that, in turn, engage additional host partners. To distinguish whether a viral protein forms a ternary complex with two host proteins or instead competes with host proteins for binding, we calculated the binding interface violations for each pairwise combination. Our analysis revealed that 57.4% of human protein-protein interactions could potentially be disrupted by viral proteins, as indicated by interface violations exceeding 50% (Supplementary Fig. 22b). For example, we found that the ORF128 ankyrin repeat protein from Orf virus can potentially interact with human SKP1 (Fig. 4f), which also binds to human FBXL20. SKP1 and FBXL20 constitute core components of the SCF (Skp1–Cul1–F-box) ubiquitin ligase complex, which regulates cell-cycle progression, DNA-damage responses, autophagy, and apoptosis⁴⁶. Interestingly, the ORF128 ankyrin repeat protein is predicted to target a surface on SKP1 that overlaps with the FBXL20-binding surface. Such competitive binding may displace FBXL20 from SKP1, thereby potentially facilitating Orf virus pathogenesis (Fig. 4f).
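
One simple way to quantify an interface violation is the fraction of the host–host interface on the shared protein that is also contacted by the viral protein. The sketch below uses that definition purely for illustration; the exact formula used in the study is not given in the text, and the function names are hypothetical.

```python
def interface_violation(host_interface, viral_interface):
    """Fraction of host-host interface residues also used by the viral partner.

    host_interface: residue indices on the shared human protein that contact
        its human partner.
    viral_interface: residue indices on the same protein that contact the
        viral protein.
    """
    host = set(host_interface)
    if not host:
        return 0.0
    return len(host & set(viral_interface)) / len(host)


def classify_ternary(host_interface, viral_interface, cutoff=0.5):
    """Label a ternary relationship using the 50% violation cutoff from the text."""
    violation = interface_violation(host_interface, viral_interface)
    return "competitive" if violation > cutoff else "compatible"
```

Under this definition, a "compatible" label suggests the viral and host partners could bind the shared protein at the same time, whereas "competitive" suggests mutually exclusive binding.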

Collectively, we constructed a structural atlas of virus–human interactions, identifying key host factors and potential viral receptor proteins. Integration of predicted human interactomes revealed potential competitive interactions in which viral proteins may disrupt native human protein–protein interactions, underscoring the importance of the AlphaFold-derived structural atlas in elucidating viral pathogenic mechanisms. In total, we predicted 1.1 million protein complexes, covering proteome-wide interactions across bacteria, archaea, humans, mice, plants, and human–virus pairs, thereby substantially expanding structural coverage and providing a comprehensive reference dataset for diverse analyses (Supplementary Fig. 23).

Identifying protein fusion and fission events through structural alignments of protein monomers and complexes

The AlphaFold Protein Structure Database (AFDB), which contains structural predictions for over 200 million protein monomers, has garnered widespread enthusiasm for its transformative potential, not only in structural biology but across the entire field of life sciences. During evolution, proteins undergo fusion and fission events, in which open reading frames (ORFs) merge or split, thereby enabling the acquisition of new functions⁴⁷. However, the discovery of protein fusion or fission events remains challenging because of the scarcity of protein complex structures and the difficulty in detecting remote sequence homology¹².

We hypothesized that these evolutionary transitions can be detected through comprehensive structural alignment analyses of two large datasets: the AlphaFold monomer structure database and our predicted cross-kingdom protein complex dataset, both encompassing eukaryotic and prokaryotic species. Because two subunits within a complex may share structural similarity and correspond to the same monomer position, we aligned each subunit separately to monomers in the pre-clustered AFDB50 database, which contains about 50 million protein structures. The alignment was performed using Foldseek with an overlap threshold of 30% (Fig. 5a). By comparing the predicted protein complexes from the bacterial, archaeal, human, Arabidopsis, mouse, and human–virus datasets with those in AFDB50, we identified 668,992 matches, the majority of which were found in prokaryotic species (Fig. 5b). Surprisingly, we found that thousands of AFDB50 entries also overlapped with human–viral complexes. One representative example, shown in Fig. 5c, involves thymidylate kinase from the vaccinia virus, which was predicted to bind to the human thymidylate kinase (gene: DTYMK). This complex exhibited high structural similarity to an uncharacterized protein from Bremia lactucae, a plant-pathogenic oomycete, suggesting that the virus may have evolved to preserve its dimerization capability for survival. Additionally, we identified an essential protein complex in human mitochondria that resembles a hypothetical protein from Perkinsus marinus, suggesting that our comparative analysis could aid in uncovering the unknown functions of proteins (Fig. 5d). Further comparative analyses revealed that such rearrangements are widespread across kingdoms (Supplementary Fig. 24a).
Conversely, in a distinct case, we observed both protein monomers and protein complexes with high sequence similarity within the same bacterial species, indicating the potential occurrence of ORF read-through interruptions caused by occasional stop codons (Supplementary Fig. 24b, c). We identified multiple instances in which protein complexes and their corresponding full-length monomers coexist within the same bacterial species, highlighting the prevalence of gene fusion and fission phenomena across prokaryotic lineages (Supplementary Fig. 24d, e). In addition to clearly identified direct protein fusions, we also discovered cases in which small protein complexes resembled partial regions of long monomers, possibly resulting from complex structural rearrangements of the proteins. Notably, a predicted protein complex from Promethearchaeota archaeon (archaea) exhibited structural similarity to a much larger protein from Macrostomum lignano (Fig. 5j).

**Fig. 5: Structure alignment of protein monomers and complexes to find protein fusion and fission events.**

Overall, through comprehensive comparative analyses of two large protein structure datasets, we identified numerous instances of protein fusion and fission events, underscoring their pivotal roles in protein evolution.

Improving protein-binding site prediction using our predicted protein complexes dataset

The AlphaFold Database (AFDB) of protein monomer structures not only provides an extensive repository of structural information but also serves as a critical source of training data for developing advanced predictive models. For example, high-quality subsets of AFDB data have been curated to train the AlphaFold3 model⁸. Over 40 million high-quality monomer structures were used to train SaProt, a unified structural and protein sequence embedding model⁴⁸. AFDB has also been used during fine-tuning for protein domain segmentation⁴⁹. Identifying functional sites on protein surfaces is crucial for drug discovery, protein engineering, and vaccine design, and our dataset provides a substantial array of protein interaction surfaces. MPBind, a recently developed model for protein-binding site prediction, is built upon protein language models and equivariant graph neural networks⁵⁰. We continued training MPBind on 9,609 predicted non-redundant protein complexes selected according to Best LIS ≥ 0.4 and pLDDT ≥ 80; the resulting model is referred to as MPBind-PHC.

To evaluate binding-site prediction performance, we first curated a validation dataset from predicted protein complex structures by removing redundant entries based on interface and monomer structural similarity using Foldseek. The fine-tuned model was then tested on an independent test set, Pro_Test_315, derived from experimentally resolved structures⁵¹, as well as on a filtered version of this dataset (Filtered Pro_Test_315) constructed with the same deduplication procedure. Across all three evaluation datasets, the fine-tuned model exhibited improved AUC values, highlighting the potential of predicted protein complexes as a valuable resource for applications in biomedicine and beyond (Supplementary Fig. 25).

In sum, we demonstrated deep-learning applications built on a large-scale protein complex dataset, showing that the massive predictions generated by AlphaFold not only facilitate the discovery of individual PPIs but also provide substantial value to protein science through high-quality predicted data-driven modeling.

Discussion

In this study, we employed a large-scale, AlphaFold-based structure-prediction pipeline to build an atlas of predicted protein complexes spanning viruses, archaea, bacteria, plants, and mammals. This resource advances our understanding of key biological mechanisms, including viral pathogenesis, the evolution of protein assemblies, and complex cellular processes. The rapid progress of high-accuracy structure prediction suggests that large-scale, data-centric modeling will become a broadly applicable approach for investigating diverse biological processes. Although such modeling offers advantages when integrated with high-throughput experiments, it also presents challenges. High-throughput screening identified numerous candidate protein–protein interactions; nevertheless, only a minority received high-confidence predictions from AlphaFold-Multimer, plausibly owing to a combination of screening-derived false positives and genuine—but transient, weak, or condition-specific—interactions that elude confident modeling given incomplete MSA coverage and current algorithmic constraints. To address these issues, it will be important to improve the accuracy and sensitivity of both computational models and experimental techniques, thereby enabling more comprehensive identification and characterization of protein interactions, particularly those that are transient or underrepresented in current datasets.

Recent advances in deep learning-based structure prediction, especially all-atom models such as AlphaFold3 and RoseTTAFold All-Atom⁷,⁸, extend beyond protein structure prediction to include modeling of protein-ligand and protein-nucleic acid complexes, and to account for post-translational modifications. Such interactions underpin diverse biological functions, including binding of transcription factors to DNA⁵², regulatory mechanisms between metabolites and enzymes⁵³, and phosphorylation-dependent signaling⁵⁴. Incorporating these interaction types into predictive frameworks enables more faithful modeling of dynamic cellular processes. We anticipate that the near future will bring an influx of expansive datasets, catalyzing a surge of biological discoveries and further enhancing the refinement and iteration of deep-learning models. This integration will not only deepen our understanding of molecular machinery but also drive innovation in drug design and genetic engineering. Taken together, our proteome-wide atlas of protein complex structures opens up new opportunities for systems biology, protein engineering, structure-guided omics, and structure-informed evolutionary analyses, ultimately advancing our understanding of the fundamental principles that govern protein interactions and link sequence, structure, and function.

Methods

Pathogenic bacteria operon prediction

We obtained the translated CDS FASTA files of the reference genomes for each of the 36 pathogenic bacterial species from the NCBI genome dataset (https://www.ncbi.nlm.nih.gov/datasets/genome/). The operons were predicted by Operon Mapper (https://biocomputo.ibt.unam.mx/operon_mapper/)²⁵. Proteins encoded within the same operon were combined pairwise as operonic protein pairs.
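
The pairwise combination step can be illustrated with a minimal sketch. The input layout (a mapping from operon ID to protein IDs) and all identifiers are hypothetical and do not reflect the actual Operon Mapper output format.

```python
from itertools import combinations


def operonic_pairs(operons):
    """Return every unordered protein pair that shares an operon.

    `operons` maps a (hypothetical) operon ID to the list of protein IDs
    predicted to belong to that operon.
    """
    pairs = []
    for operon_id, proteins in operons.items():
        # Deduplicate and sort so each pair is emitted once, in a stable order.
        # Operons with a single protein yield no pairs.
        for prot_a, prot_b in combinations(sorted(set(proteins)), 2):
            pairs.append((operon_id, prot_a, prot_b))
    return pairs


# Toy input: one three-gene operon and one single-gene operon.
example_operons = {"op1": ["pA", "pB", "pC"], "op2": ["pD"]}
example_pairs = operonic_pairs(example_operons)
```

An operon with k proteins contributes k(k-1)/2 pairs, so the three-gene toy operon yields three pairs and the single-gene operon yields none.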

Representative genome analysis of archaeal and bacterial phyla

For each phylum of Archaea and Bacteria, a representative sequence was selected from the GTDB database (https://gtdb.ecogenomic.org/tree). The selection criteria prioritized the largest order encompassing the greatest number of classes, followed by the largest family within that order, and continuing this hierarchical approach to identify the GTDB species representative. Corresponding translated CDS FASTA files were retrieved from NCBI based on the GenBank Assembly accession numbers. When a CDS FASTA file was unavailable, alternative representative sequences were chosen. The final selection included 21 representative sequences from Archaea (spanning 21 phyla) and 167 representative sequences from Bacteria (covering 181 phyla). From the acquired FASTA files, all adjacent protein sequence pairs located on the same genome were extracted.
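
Extraction of adjacent pairs can be sketched as follows. This is an illustration only: it assumes that the records are listed in genome order and, as a stricter reading of "the same genome", it pairs only neighbours that lie on the same sequence record (chromosome, plasmid, or contig). The record layout and identifiers are hypothetical.

```python
def adjacent_pairs(cds_records):
    """Return pairs of neighbouring proteins that lie on the same sequence.

    `cds_records` is a list of (sequence_id, protein_id) tuples, assumed to
    be in the order in which the CDSs occur along the genome.
    """
    pairs = []
    for (seq_a, prot_a), (seq_b, prot_b) in zip(cds_records, cds_records[1:]):
        # Assumption: neighbours are not paired across different sequences.
        if seq_a == seq_b:
            pairs.append((prot_a, prot_b))
    return pairs


# Toy genome with one chromosome and one plasmid.
example_records = [
    ("chr", "p1"), ("chr", "p2"), ("chr", "p3"),
    ("plasmid", "p4"), ("plasmid", "p5"),
]
example_adjacent = adjacent_pairs(example_records)
```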

Phylogenetic tree analysis

Phylogenetic tree analysis was conducted using OrthoFinder 2.5.5 (https://github.com/davidemms/OrthoFinder) on the genomic proteins of 36 pathogenic bacteria, 21 representative species of Archaea, and 167 representative species of Bacteria. The resulting phylogenetic trees were visualized with iTOL v6 (https://itol.embl.de/).

Protein complex prediction using ColabFold

We used ColabFold v1.5.5 (https://github.com/YoshitakaMo/localcolabfold) to predict protein complexes, with UniRef30 (uniref30_2302), ColabFoldDB (colabfold_envdb_202108), and the template database (pdb100_230517) as the databases for structure inference. ColabFold operates in two main stages: multiple sequence alignment (MSA) generation and protein structure prediction. For MSA generation, we used the colabfold_search command-line tool with the --use-templates 1 option enabled to perform template-based searches, while keeping all other parameters at their default settings.

Subsequently, we used the colabfold_batch command-line tool to predict protein structures with the alphafold2_multimer_v3 model and performed structure refinement on GPUs (--amber --use-gpu-relax). When template matches were identified during the MSA stage, we incorporated template information into the prediction using the --template option; otherwise, predictions were carried out without it. ColabFold uses the num-recycle parameter to control the number of prediction recycles; increasing it can improve prediction quality but increases computation time. Unless otherwise specified, we used 20 recycles (--num-recycle 20) to ensure high prediction quality. For mouse, Arabidopsis, and the 188 representative bacterial and archaeal species, we used 3 recycles (--num-recycle 3) to substantially reduce inference time while maintaining sufficient prediction quality. With the default setting --num-models 5, ColabFold outputs five PDB-formatted files, ranked from 1 to 5, for each protein complex prediction. We selected the PDB file with the highest ipTM as the final prediction. The corresponding confidence scores, including pLDDT, pTM, and ipTM, were recorded in the inference log files and scoring JSON files. The LIS and LIA scores were calculated following the guidance at https://github.com/flyark/AFM-LIS.
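
Selecting the model with the highest ipTM from the per-model scoring files can be sketched as below. This is not the authors' script: the JSON key names ("iptm", "ptm", "plddt") and the file names are assumptions about the scoring-file layout and may need adjusting for a given ColabFold version.

```python
import json
import tempfile
from pathlib import Path


def best_model_by_iptm(score_files):
    """Return (path, scores) for the scoring file with the highest ipTM.

    `score_files` is an iterable of paths to per-model scoring JSON files.
    The key "iptm" is an assumed field name.
    """
    best_path, best_scores = None, None
    for path in score_files:
        with open(path) as handle:
            scores = json.load(handle)
        if best_scores is None or scores["iptm"] > best_scores["iptm"]:
            best_path, best_scores = Path(path), scores
    return best_path, best_scores


# Demo with made-up scores written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for rank, iptm in enumerate([0.41, 0.78, 0.63], start=1):
        demo_file = Path(tmp) / f"model_{rank}_scores.json"
        demo_file.write_text(
            json.dumps({"iptm": iptm, "ptm": 0.70, "plddt": [85.0, 90.0]})
        )
    chosen, chosen_scores = best_model_by_iptm(
        sorted(Path(tmp).glob("*_scores.json"))
    )
    chosen_name = chosen.name
```

The PDB file belonging to the selected scoring file would then be taken as the final prediction.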

Comparison of AlphaFold3 with Protenix, Chai-1, Boltz-2 and ColabFold-Multimer

Heterodimeric and homodimeric complexes deposited in the PDB since 17 May 2023 were retrieved from the Dockground server⁵⁵ and filtered to a maximum resolution of 3.5 Å, an interface comprising at least 12 contacting residues, and a buried surface area greater than 800 Å². Redundancy was removed with Foldseek-Multimer using default parameters, yielding 54 heterodimers and 75 homodimers that served as the final test set. AlphaFold3 predictions were performed on the online server (https://alphafoldserver.com/). ColabFold-Multimer predictions were performed as described above. Boltz-2 predictions were run with the following command: boltz predict input_path --recycling_steps 3 --sampling_steps 200 --diffusion_samples 5 --use_msa_server. Protenix predictions with ESM embeddings were executed as protenix predict --input input.json --out_dir ./output_no_msa --seeds 101 --use_esm. Protenix predictions using MSAs and Chai-1 predictions were performed with scripts available on GitHub (https://github.com/wensm77/Protein-Complex-Atlas). Prediction quality was assessed using the lDDT module in OpenStructure⁵⁶ to evaluate residue-level accuracy and DockQ⁵⁷ to measure overall chain–chain docking performance.

Co-localization score calculation

Initially, we used each of the 36 pathogenic bacterial genomes as a query in a search against the NCBI database, which served as the target and comprised 890 million proteins (across 18 million contigs). Searches were performed using MMseqs2 version 15.6f452, retaining sequences with similarity between 0.3 and 0.9. The search parameter --max-seqs was set to 10,000 to ensure sufficient confidence in the subsequent co-localization score calculations. Before calculating the co-localization score matrix, we defined two protein sequences as co-localized when both of the following conditions are met: (1) the proteins are located on the same contig, and (2) the shortest distance between the two proteins does not exceed 3 kbp.

Considering a pathogenic bacterium with $n$ query protein sequences, we denote $q_i$ as the $i$-th query protein and $T_i=[t_1^i,t_2^i,\ldots,t_{L_i}^i]$ as the list of target proteins retrieved for $q_i$ after identity filtering, where $L_i\ge 0$ denotes the number of sequences found for the $i$-th query. For each query protein pair $q_i,q_j$, we iterated through every possible pair of protein sequences in $T_i$ and $T_j$ and counted how many were co-localized. In other words, $f_{count}(i,j)$ is initialized to 0 for all $i,j\le n$, and for every $t_u^i\in T_i$ and $t_v^j\in T_j$, $f_{count}(i,j)$ is incremented by 1 if $t_u^i$ and $t_v^j$ are co-localized. We calculated $f_{count}(i,j)$ for all combinations of $i,j\le n$ except $i=j$, since we ignore self-localization. Because $f_{count}(i,j)=f_{count}(j,i)$, it is unnecessary to calculate $f_{count}(i,j)$ separately for the $i<j$ case.

Based on the analysis above, the co-localization score matrix $M=[m_{ij}]\in \mathbb{R}^{n\times n}$ is calculated as follows:

$$m_{ij}=\begin{cases}0 & \text{if } i=j \text{ or } q_i,q_j \text{ are not co-localized or } {contig\_cnt}(T_i)=0\\ \dfrac{f_{count}(i,j)}{{contig\_cnt}(T_i)} & \text{otherwise}\end{cases}\tag{1}$$

where ${contig\_cnt}(T_i)$ indicates the number of contigs associated with $T_i$. It should be noted that if at least one of $\{L_i,L_j\}$ is small, then one of $\{m_{ij},m_{ji}\}$ may become disproportionately high while the other is close to 0. To avoid this situation and ensure the accuracy of co-localization scores, we applied the principle of mutual interaction between co-localized protein pairs, requiring that: (1) $m_{ij}\ge 0.4$ or $m_{ji}\ge 0.4$; (2) $q_i,q_j$ are co-localized; and (3) $L_i\ge 2000$ and $L_j\ge 2000$. Protein pairs satisfying these conditions were considered associated co-localized protein pairs.
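
The score in Eq. (1) and the three acceptance criteria can be sketched in a few lines. This is an illustrative brute-force version, not the implementation used in the study. The hit representation `(contig_id, start, end)`, the interpretation of "shortest distance" as the gap between gene intervals, and all function names are assumptions.

```python
from itertools import product

MAX_GAP_BP = 3000  # "does not exceed 3 kbp" in the definition above


def co_localized(hit_a, hit_b, max_gap=MAX_GAP_BP):
    """hit = (contig_id, start, end). Same contig and gap <= max_gap."""
    contig_a, start_a, end_a = hit_a
    contig_b, start_b, end_b = hit_b
    if contig_a != contig_b:
        return False
    # Gap between the two intervals; overlapping genes have a gap of 0.
    gap = max(0, max(start_a, start_b) - min(end_a, end_b))
    return gap <= max_gap


def f_count(hits_i, hits_j):
    """Number of co-localized (t_u^i, t_v^j) pairs; O(L_i * L_j)."""
    return sum(co_localized(u, v) for u, v in product(hits_i, hits_j))


def contig_cnt(hits):
    """Number of distinct contigs represented in a target list."""
    return len({contig for contig, _start, _end in hits})


def score(hits_i, hits_j, queries_co_localized=True):
    """m_ij from Eq. (1) for i != j."""
    n_contigs = contig_cnt(hits_i)
    if not queries_co_localized or n_contigs == 0:
        return 0.0
    return f_count(hits_i, hits_j) / n_contigs


def is_associated(m_ij, m_ji, len_i, len_j, queries_co_localized,
                  min_score=0.4, min_hits=2000):
    """Apply criteria (1)-(3) for associated co-localized pairs."""
    return (queries_co_localized
            and (m_ij >= min_score or m_ji >= min_score)
            and len_i >= min_hits and len_j >= min_hits)


# Toy target lists for two query proteins.
T_i = [("c1", 100, 400), ("c2", 50, 350), ("c3", 10, 300)]
T_j = [("c1", 900, 1200), ("c2", 9000, 9300)]
m_ij = score(T_i, T_j)  # one co-localized pair over three contigs
m_ji = score(T_j, T_i)  # one co-localized pair over two contigs
```

In the toy data only the pair on contig c1 lies within 3 kbp, so the two directions give different scores because the denominators differ. For target lists with thousands of entries, grouping hits by contig before comparing them would avoid the quadratic loop.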

Cell culture and plasmid transfection

H1299 (CRL-5803) and 293T (CRL-3216) cells were obtained from ATCC (Manassas, VA, USA) and cultured in DMEM medium (Gibco, Rockville, MD, USA) supplemented with 10% fetal bovine serum (FBS; Hyclone, Logan, UT, USA), 100 units/mL penicillin (Gibco), and 100 μg/mL streptomycin (Gibco). Cells were maintained in a humidified 37 °C incubator under a 5% CO₂ atmosphere. All cell lines used in this study routinely tested negative for mycoplasma contamination, were maintained at low passage numbers to preserve their identity, and were authenticated by morphology check and growth curve analysis.

For plasmid transfection, 293T cells at 60% confluence were transfected using Lipofectamine 2000 (Invitrogen, Carlsbad, CA, USA). The plasmids expressing Flag- or HA-tagged proteins (Flag-E3A, Flag-UL44, HA-CD4, and HA-HLA) were synthesized and constructed by GenScript Biotech (sequences shown in Supplementary Table 2).

Immunofluorescence staining

Cells grown on coverslips were fixed with 4% paraformaldehyde in PBS, permeabilized with 0.1% Triton X-100 in PBS, blocked with 4% bovine serum albumin in PBS, and incubated with an appropriate primary antibody (PKCζ: 1:100, 340815, Zenbio; NPM1: 1:200, MA5-12508, ThermoFisher), followed by incubation with a secondary antibody (Goat anti-Mouse Alexa Fluor 488, A-11029, or Goat anti-Rabbit Alexa Fluor 514, A-31558, ThermoFisher). The cells were counterstained with ProLong® Gold Antifade Reagent with DAPI (82961, CST) prior to visualization and photographed using a Leica TCS SP5II confocal laser scanning microscope. LAS X (version 3.3.0) was used to analyze fluorescent images. To evaluate the co-localization of PKCζ or NPM1 with DAPI, the free software ImageJ (Fiji) with the Coloc 2 plugin was used to calculate Pearson's correlation coefficients between the two fluorescence channels, and co-localized fluorescence quantifications were presented as scatter plots.

Co-immunoprecipitation (Co-IP) and western blot

Cells were lysed in IP buffer (1 mM Tris pH 7.5, 5 mM NaCl, 0.25% Nonidet P-40, 0.1% deoxycholate, and protease inhibitors), and equal amounts of total protein were incubated with primary antibodies or the indicated normal IgG overnight at 4 °C, after which 30 µl of protein A/G beads were added for an additional 2 h of incubation. For exogenous Co-IP, anti-HA beads were added to equal amounts of total protein and incubated overnight. Beads were centrifuged (500 × g for 30 s) and washed three times using wash buffer (20 mM Tris-HCl, 250 mM NaCl, 0.2 mM EGTA, and 0.1% Nonidet P-40). The beads were heated at 100 °C for 10 min before western blot analyses. Anti-HA magnetic beads (88836) were obtained from Thermo Fisher Scientific (Waltham, MA, USA).

For western blot, cell lysates were loaded, separated by SDS-PAGE, transferred to PVDF membranes (Millipore), and probed with an appropriate primary antibody and a horseradish peroxidase (HRP)-conjugated secondary antibody for subsequent detection by enhanced chemiluminescence (Bio-Rad). Western blot images were analyzed using Image Lab Software 5.0. Antibody for GAPDH (AB0037, 1:5000) was purchased from Abways (Shanghai, China). Antibody for SP1 (SC-59, 1:1000) was purchased from Santa Cruz (USA). Antibody for PKCζ (PRKCZ) (340815, 1:1000) was purchased from Zenbio (Chengdu, China). Antibody for KPNA2 (ET1705-61, 1:1000) was purchased from Huabio (Hangzhou, China). Antibody for NPM1 (MA5-12508, 1:1000) was purchased from ThermoFisher Scientific (Wilmington, DE, USA). Antibodies for HA (3724, 1:1000), rabbit IgG (2729, 1:1000), and Flag (117935, 1:1000) were purchased from Cell Signaling Technology (Danvers, MA, USA).

Assembly of large protein complexes containing homo-oligomers

To construct a large homo-oligomeric protein complex, we adopted a strategy that combines QSproteome with AlphaFold2-bigbang²¹. Specifically, we first used QSproteome to truncate the flexible regions of homodimers using the default parameters, and then calculated the symmetry-related structure based on this truncation. After determining the optimal symmetry, we used AnAnaS to generate the untruncated symmetric complex structure. This untruncated symmetric structure, along with the MSA and sequence information of the homologous dimers, was provided as input to AlphaFold2-bigbang to predict the full-length symmetric complex structure. This approach allowed us to overcome the negative impact of flexible region truncation on AlphaFold2’s prediction accuracy.

In the high-quality protein structure prediction database, for a heterodimer composed of two distinct monomers, a and b, we first identified the homodimers composed solely of a and solely of b, respectively. Then, using the aforementioned method, we calculated the full-length optimal symmetric structures for these two homodimers and determined their optimal symmetries. If the two homodimers shared the same optimal symmetry, we used CombFold (https://github.com/dina-lab3D/CombFold) as instructed to assemble them into a complete large protein complex containing homo-oligomers.

Fine-tuning MPBind

We fine-tuned MPBind to accommodate our specific task requirements. We curated a subset comprising 71,667 high-quality complexes by filtering based on Best LIS ≥ 0.4 and pLDDT ≥ 80. To ensure diversity and non-redundancy in the training data, we implemented a deduplication strategy based on structural similarity. Using Foldseek, we eliminated complexes where the TM-score between any two protein monomers was ≥ 0.5. Following this process, the final non-redundant training set consisted of 9,609 dimeric complexes. Based on this dataset, we conducted continued training of the MPBind model (https://github.com/jianlin-cheng/MPBind) for 45 epochs on an NVIDIA RTX 3090 GPU.

Subsequently, we deduplicated part of the remaining samples based on interface similarity and monomer similarity to build a validation set. In the interface deduplication phase, we employed a strategy that integrates structural alignment with interface site information to identify and eliminate proteins that, despite differing overall structures, exhibit highly similar key functional interfaces. We first used Foldseek to compute structural alignment information between all protein single chains in the candidate samples (command: foldseek easy-search query.pdb targetdb out.m8 tmp --format-output "query,target,evalue,bits,qstart,qend,tstart,tend" -s 9.5). For any pair of proteins (denoted as A and B), we examined the extent of overlap between their structural alignment regions and the pre-defined interface residues. Interface residues in these complexes were defined as surface residues with a relative solvent accessibility greater than 5% that lost more than 1 Å² of absolute solvent accessibility upon protein-protein complex formation. If the overlap between the alignment region of protein A and the entire interface region of protein B exceeded 50%, or if the overlap between the alignment region of protein B and the entire interface region of protein A exceeded 50%, the pair of proteins was deemed to exhibit interface similarity. Any complex containing protein chains determined to have interface similarity was removed from the validation set candidate pool.

In the monomer deduplication phase, we aimed to thoroughly exclude any single chains in the validation set complexes that display structural similarity with those in the training set. We employed Foldseek to perform structural clustering of all monomers in the dataset, using the command: foldseek easy-cluster pdb res tmp -c 0.9. Each monomer was mapped to a unique structural cluster representative. Any validation-candidate dimer sharing a representative (from either chain) with the training set was discarded, yielding a validation set whose dimers are structurally independent of all training chains. We ultimately obtained a validation set comprising 205 monomers. Similarly, we performed deduplication on the independent dataset Pro_Test_315, resulting in a test set containing 28 monomers derived from PDB structures. Predicted binding residues were selected using a threshold of 0.5, as before.
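
Two of the steps above lend themselves to a short sketch: the interface-residue definition and the cluster-based exclusion of validation dimers. This is illustrative only. It assumes that solvent accessibilities have already been computed with an external tool, that relative accessibility is measured on the unbound chain, and that the cluster table has two tab-separated columns (representative, member). All identifiers are hypothetical.

```python
def interface_residues(asa_free, asa_bound, rsa_free,
                       rsa_cutoff=0.05, delta_cutoff=1.0):
    """Residues with RSA > 5% that lose > 1 A^2 of ASA upon binding.

    All three inputs map residue index -> value; ASA values are in A^2 and
    RSA is a fraction (0-1) measured on the unbound chain (an assumption).
    """
    return sorted(
        idx for idx in asa_free
        if rsa_free[idx] > rsa_cutoff
        and asa_free[idx] - asa_bound[idx] > delta_cutoff
    )


def load_cluster_map(tsv_lines):
    """Map each member chain to its cluster representative."""
    member_to_rep = {}
    for line in tsv_lines:
        line = line.strip()
        if not line:
            continue
        representative, member = line.split("\t")[:2]
        member_to_rep[member] = representative
    return member_to_rep


def independent_dimers(candidates, training_chains, member_to_rep):
    """Drop candidate dimers that share a cluster with any training chain."""
    train_reps = {member_to_rep[chain] for chain in training_chains}
    kept = []
    for chain_a, chain_b in candidates:
        if (member_to_rep[chain_a] in train_reps
                or member_to_rep[chain_b] in train_reps):
            continue  # either chain resembles a training chain
        kept.append((chain_a, chain_b))
    return kept


# Toy example: val1_a clusters with a training chain, so dimer 1 is dropped.
cluster_lines = [
    "trainA\ttrainA", "trainA\tval1_a", "val1_b\tval1_b",
    "val2_a\tval2_a", "val2_b\tval2_b",
]
cluster_map = load_cluster_map(cluster_lines)
kept_dimers = independent_dimers(
    [("val1_a", "val1_b"), ("val2_a", "val2_b")], ["trainA"], cluster_map
)
iface = interface_residues(
    asa_free={1: 50.0, 2: 30.0, 3: 0.5},
    asa_bound={1: 20.0, 2: 29.5, 3: 0.5},
    rsa_free={1: 0.40, 2: 0.25, 3: 0.01},
)
```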

Genomic co-occurrence rate calculation

Reference genomes were downloaded from NCBI. Homologs were searched with MMseqs2, retaining hits with similarity between 0.3 and 0.9. The genomic co-occurrence rate was then calculated as the number of genomes in which both genes are present divided by the total number of genomes in which either gene is present.
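
This ratio is the Jaccard index of the two sets of genomes. A minimal sketch, with hypothetical genome identifiers and a return value of 0 chosen here for the case in which neither gene is found:

```python
def co_occurrence_rate(genomes_with_a, genomes_with_b):
    """Genomes containing both genes / genomes containing either gene."""
    set_a, set_b = set(genomes_with_a), set(genomes_with_b)
    either = set_a | set_b
    if not either:
        return 0.0  # convention chosen for this sketch
    return len(set_a & set_b) / len(either)


# Gene A occurs in g1-g3, gene B in g2-g4: two shared out of four in total.
example_rate = co_occurrence_rate({"g1", "g2", "g3"}, {"g2", "g3", "g4"})
```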

Protein complex clustering and gene fusion hit discovery

Protein complexes were clustered with Foldseek-Multimer using the command: foldseek easy-multimercluster INPUT_PDB_DIR OUTPUT_CLUSTER_DIR TMP_WORK_DIR --multimer-tm-threshold 0.65 --chain-tm-threshold 0.5 --cov-mode 2 -e 0.01 --exhaustive-search. The AFDB50 Foldseek database file was downloaded from https://foldseek.steineggerlab.workers.dev/. Each monomer extracted from the protein complexes was searched against AFDB50 with Foldseek (command: foldseek easy-search query.pdb targetdb out.m8 tmp --format-output "query,target,evalue,bits,qstart,qend,tstart,tend" -s 9.5). For each hit, we treated the aligned query and target segments as intervals, measured the length of their intersection (overlap) and union (combined coverage), and calculated the overlap/union ratio; only hits with a ratio <0.30 were accepted, to avoid two subunits mapping to the same region.
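
The overlap/union filter can be sketched as follows. The text does not fully specify which intervals are compared. This sketch assumes they are the target-side ranges (tstart, tend) that the two subunits of one complex occupy on the same AFDB50 monomer. The target identifiers and function names are hypothetical.

```python
def overlap_union_ratio(seg_a, seg_b):
    """Overlap/union of two segments given as (start, end), 1-based inclusive."""
    (start_a, end_a), (start_b, end_b) = seg_a, seg_b
    overlap = max(0, min(end_a, end_b) - max(start_a, start_b) + 1)
    len_a = end_a - start_a + 1
    len_b = end_b - start_b + 1
    union = len_a + len_b - overlap  # combined coverage
    return overlap / union


def fusion_candidates(hits_a, hits_b, max_ratio=0.30):
    """Targets hit by both subunits on largely distinct regions.

    hits_a / hits_b map a target ID to the (tstart, tend) range aligned to
    subunit A / subunit B of the same complex.
    """
    candidates = []
    for target in hits_a.keys() & hits_b.keys():
        ratio = overlap_union_ratio(hits_a[target], hits_b[target])
        if ratio < max_ratio:
            candidates.append((target, ratio))
    return sorted(candidates)


# Subunits A and B map to adjacent regions of monomer_X (kept) but to
# nearly the same region of monomer_Y (rejected).
hits_a = {"monomer_X": (1, 200), "monomer_Y": (10, 210)}
hits_b = {"monomer_X": (180, 400), "monomer_Y": (20, 220)}
candidates = fusion_candidates(hits_a, hits_b)
```

On monomer_X the two ranges share 21 of 400 covered positions, well below the 0.30 cutoff, whereas on monomer_Y they share 191 of 211 and the hit is discarded.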

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The 1.1 million predicted protein structures generated in this study have been deposited in the ModelScope database (https://www.modelscope.cn/collections/protein_complex_atlas-2ae5e7d4f4a343). The processed, curated high-confidence PPI structures are available at a companion website (https://www.biopredictnavigator.cn). Accession codes for analysed genomes of representative prokaryotes are available in Supplementary Data 1. Source data are provided with this paper.

Code availability

The code for this manuscript is provided in a GitHub repository (https://github.com/wensm77/Protein-Complex-Atlas) and on Zenodo (https://doi.org/10.5281/zenodo.18630539).

References

Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
Evans, R. et al. Protein complex prediction with AlphaFold-Multimer. bioRxiv, 2021.10.04.463034 (2021).
Bryant, P. et al. Predicting the structure of large protein complexes using AlphaFold and Monte Carlo tree search. Nat. Commun. 13, 6028 (2022).
Shor, B. & Schneidman-Duhovny, D. CombFold: predicting structures of large protein assemblies using a combinatorial assembly algorithm and AlphaFold2. Nat. Methods 21, 477–487 (2024).
Krishna, R. et al. Generalized biomolecular modeling and design with RoseTTAFold All-Atom. Science 384, eadl2528 (2024).
Abramson, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630, 493–500 (2024).
Akdel, M. et al. A structural biology community assessment of AlphaFold2 applications. Nat. Struct. Mol. Biol. 29, 1056–1067 (2022).
Bouatta, N. & AlQuraishi, M. Structural biology at the scale of proteomes. Nat. Struct. Mol. Biol. 30, 129–130 (2023).
Hammack, A. T. & Blaby-Haas, C. E. Machine learning sheds light on microbial dark proteins. Nat. Rev. Microbiol. 22, 63–63 (2024).
van Kempen, M. et al. Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. 42, 243–246 (2024).
Barrio-Hernandez, I. et al. Clustering predicted structures at the scale of the known protein universe. Nature 622, 637–645 (2023).
Durairaj, J. et al. Uncovering new families and folds in the natural protein universe. Nature 622, 646–653 (2023).
Bordin, N. et al. AlphaFold2 reveals commonalities and novelties in protein structure space for 21 model organisms. Commun. Biol. 6, 160 (2023).
Nomburg, J. et al. Birth of protein folds and functions in the virome. Nature 633, 710–717 (2024).
Kim, R. S., Levy Karin, E., Mirdita, M., Chikhi, R. & Steinegger, M. BFVD—a large repository of predicted viral protein structures. Nucleic Acids Res. 53, D340–D347 (2024).
Huang, J. et al. Discovery of deaminase functions by structure-based protein clustering. Cell 187, 4426–4428 (2024).
Szklarczyk, D. et al. The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res. 51, D638–D646 (2023).
Oughtred, R. et al. The BioGRID database: A comprehensive biomedical resource of curated protein, genetic, and chemical interactions. Protein Sci. 30, 187–200 (2021).
Schweke, H. et al. An atlas of protein homo-oligomerization across domains of life. Cell 187, 999–1010.e15 (2024).
Humphreys, I. R. et al. Computed structures of core eukaryotic protein complexes. Science 374, eabm4805 (2021).
Burke, D. F. et al. Towards a structurally resolved human protein interaction network. Nat. Struct. Mol. Biol. 30, 216–225 (2023).
Shine, M. et al. Co-transcriptional gene regulation in eukaryotes and prokaryotes. Nat. Rev. Mol. Cell Biol. 25, 534–554 (2024).
Taboada, B., Estrada, K., Ciria, R. & Merino, E. Operon-mapper: a web server for precise operon identification in bacterial and archaeal genomes. Bioinformatics 34, 4118–4120 (2018).
Altae-Tran, H. et al. Uncovering the functional diversity of rare CRISPR-Cas systems with deep terascale clustering. Science 382, eadi1910 (2023).
Makarova, K. S. et al. Evolutionary classification of CRISPR-Cas systems: a burst of class 2 and derived variants. Nat. Rev. Microbiol. 18, 67–83 (2020).
Terlouw, B. R. et al. MIBiG 3.0: a community-driven effort to annotate experimentally validated biosynthetic gene clusters. Nucleic Acids Res. 51, D603–D610 (2023).
Mirdita, M. et al. ColabFold: making protein folding accessible to all. Nat. Methods 19, 679–682 (2022).
Passaro, S. et al. Boltz-2: towards accurate and efficient binding affinity prediction. bioRxiv, 2025.06.14.659707 (2025).
Chai Discovery et al. Chai-1: Decoding the molecular interactions of life. bioRxiv, 2024.10.10.615955 (2024).
ByteDance AML AI4Science Team et al. Protenix: advancing structure prediction through a comprehensive AlphaFold3 reproduction. bioRxiv, 2025.01.08.631967 (2025).
Bryant, P., Pozzati, G. & Elofsson, A. Improved prediction of protein-protein interactions using AlphaFold2. Nat. Commun. 13, 1265 (2022).
Zhu, W., Shenoy, A., Kundrotas, P. & Elofsson, A. Evaluation of AlphaFold-multimer prediction on multi-chain protein complexes. Bioinformatics 39, btad424 (2023).
Kim, A.-R. et al. Enhanced protein-protein interaction discovery via AlphaFold-multimer. bioRxiv, 2024.02.19.580970 (2024).
Schweke, H. et al. An atlas of protein homo-oligomerization across domains of life. Cell 187, 999–1010.e15 (2024).
Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
Yang, M. et al. Biogenesis of a bacterial metabolosome for propanediol utilization. Nat. Commun. 13, 2920 (2022).
Chandrangsu, P., Rensing, C. & Helmann, J. D. Metal homeostasis and resistance in bacteria. Nat. Rev. Microbiol. 15, 338–350 (2017).
Liu, B., Zheng, D., Zhou, S., Chen, L. & Yang, J. VFDB 2022: a general classification scheme for bacterial virulence factors. Nucleic Acids Res. 50, D912–D917 (2022).
Luck, K. et al. A reference map of the human binary protein interactome. Nature 580, 402–408 (2020).
Li, H., Mapolelo, D. T., Randeniya, S., Johnson, M. K. & Outten, C. E. Human glutaredoxin 3 forms [2Fe-2S]-bridged complexes with human BolA2. Biochemistry 51, 1687–1696 (2012).
Liang, G. & Bushman, F. D. The human virome: assembly, composition and host interactions. Nat. Rev. Microbiol. 19, 514–527 (2021).
Lasso, G. et al. A structure-informed atlas of human-virus interactions. Cell 178, 1526–1541.e16 (2019).
Yang, X. et al. HVIDB: a comprehensive database for human-virus protein-protein interactions. Brief. Bioinform. 22, 832–844 (2021).
Xiao, J. et al. FBXL20-mediated Vps34 ubiquitination as a p53 controlled checkpoint in regulating autophagy and receptor degradation. Genes Dev. 29, 184–196 (2015).
Arnold, B. J., Huang, I. T. & Hanage, W. P. Horizontal gene transfer and adaptive evolution in bacteria. Nat. Rev. Microbiol. 20, 206–218 (2022).
Su, J. et al. SaProt: protein language modeling with structure-aware vocabulary. bioRxiv, 2023.10.01.560349 (2023).
Lau, A. M., Kandathil, S. M. & Jones, D. T. Merizo: a rapid and accurate protein domain segmentation method using invariant point attention. Nat. Commun. 14, 8445 (2023).
Wang, Y., Boadu, F. & Cheng, J. MPBind: multitask protein binding site prediction by protein language models and equivariant graph neural networks. bioRxiv, 2025.04.12.648527 (2025).
Fang, Y. et al. DeepProSite: structure-aware protein binding site prediction using ESMFold and pretrained language model. Bioinformatics 39, btad718 (2023).
Spitz, F. & Furlong, E. E. M. Transcription factors: from enhancer binding to developmental control. Nat. Rev. Genet. 13, 613–626 (2012).
Martínez-Reyes, I. & Chandel, N. S. Mitochondrial TCA cycle metabolites control physiology and disease. Nat. Commun. 11, 102 (2020).
Lee, J. M., Hammarén, H. M., Savitski, M. M. & Baek, S. H. Control of protein stability by post-translational modifications. Nat. Commun. 14, 201 (2023).
Douguet, D., Chen, H.-C., Tovchigrechko, A. & Vakser, I. A. Dockground resource for studying protein–protein interfaces. Bioinformatics 22, 2612–2618 (2006).
Biasini, M. et al. OpenStructure: an integrated software framework for computational structural biology. Acta Crystallogr. D. Biol. Crystallogr. 69, 701–709 (2013).
Mirabello, C. & Wallner, B. DockQ v2: improved automatic quality measure for protein multimers, nucleic acids, and small molecules. Bioinformatics 40, (2024).

Acknowledgements

This work was supported by the National Key R&D Program of China (No. 2023YFF1205400 to D.M.), the National Natural Science Foundation of China (No. 32571689 and No. 32301230 to D.M., No. 82573859 to Y.Y., and No. 72174172 to D.E.), the Zhejiang Laboratory PI start program, the Fundamental and Interdisciplinary Disciplines Breakthrough Plan of the Ministry of Education of China (JYB2025XDXM502 to J.Z.), the Noncommunicable Chronic Diseases-National Science and Technology Major Project (No. 2024ZD0525100 to J.Z.), and the Scientific and Technological Innovation Team for Qinghai-Tibetan Plateau Research in Southwest Minzu University (Grant No. 2024CXTD20). The authors thank Yuanzhao Pan (Beijing National Day School) for valuable support and insightful discussions. Xitong Li (Jiangnan University), Weizhen Ou (Jiangnan University), Jijun Fan (Jiangnan University), Wenbo Deng (China University of Mining and Technology), and Shuhao Niu (Jiangnan University) provided suggestions for language revisions.

Author information

These authors contributed equally: Xianzhi Qi, Cheng Ye, Jianqiang Liang, Shimin Wen, Yuanyuan Li, Kai Ding, Yongfu Hao, Junjie Fei, Weian Mao, Liupeng Li.

Authors and Affiliations

Zhejiang Lab, Hangzhou, China
Xianzhi Qi, Cheng Ye, Jianqiang Liang, Yuanyuan Li, Kai Ding, Yongfu Hao, Liupeng Li, Yichong Shen, Yayun Hu, Rui Zhang, Pengli Ji, Yafei Lu, Zhenguo Ma, Xinyu Xu, Youyuan Zhu, Qiaosha Zou, Kelu Yao & Dacheng Ma
College of Computer Science and Artificial Intelligence, Southwest Minzu University, Chengdu, China
Shimin Wen, Zhiyu Lin, Hongjie Zhu, Bonan Liu, Han Wang, Yuxuan Chen, Daji Ergu & Ying Cai
Center of Growth, Metabolism and Aging, Key Laboratory of Bio-Resource and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, China
Junjie Fei, Zhi-Xiong Jim Xiao & Yong Yi
Australian Institute for Machine Learning, The University of Adelaide, Adelaide, Australia
Weian Mao
Zhejiang University, Hangzhou, China
Weian Mao, Peiyuan Yang & Chunhua Shen
Department of Urology, Fudan University Shanghai Cancer Center, Shanghai, China
Junlong Wu
Department of Oncology, Shanghai Medical College, Fudan University, Shanghai, China
Junlong Wu
Institute of Neuroscience, CAS Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, Shanghai, China
Wencheng Zhu
School of Engineering, Westlake University, Hangzhou, China
Shuya Li & Jianyang Zeng
Global Institute of Future Technology, Shanghai Jiao Tong University, Shanghai, China
Hongyi Xin
School of Biotechnology, Jiangnan University, Wuxi, China
Dacheng Ma

Contributions

D.M. conceived this project. X.Q., C.Y., J.L., S.W., Yuanyuan L., K.D., Yongfu H., J.F., W.M., L.L., Z.L., Y.S., H.Z., Yayun H., R.Z., P.J., Yafei L., B.L., H.W., Yuxuan C., Z.M., P.Y., X.X., J.W., Y.Z., Q.Z., W.Z., K.Y., S.L., H.X., and D.E. performed the computational analysis. J.F. and Y.Y. performed wet-lab experiments. D.M., Y.Y., Ying C., and C.S. supervised the project. D.M., R.Z., Z.X., J.Z., D.E., H.X., and W.Z. wrote the manuscript.

Corresponding authors

Correspondence to Chunhua Shen, Ying Cai, Yong Yi or Dacheng Ma.

Ethics declarations

Competing interests

The authors declare no conflicts of interest.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information (PDF)

Transparent Peer Review file (PDF)

Reporting Summary (PDF)

Description of Additional Supplementary Files (PDF)

Supplementary Data 1 (XLSX)

Source data

Source Data (XLSX)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Qi, X., Ye, C., Liang, J. et al. Atlas of predicted protein complex structures across kingdoms. Nat Commun 17, 4397 (2026). https://doi.org/10.1038/s41467-026-70884-4

Received: 25 November 2024
Accepted: 04 March 2026
Published: 25 March 2026
Version of record: 18 May 2026
DOI: https://doi.org/10.1038/s41467-026-70884-4