Abstract
The global structural properties of a protein, such as shape, fold and topology, strongly affect its function. Although recent breakthroughs in diffusion-based generative models have greatly advanced de novo protein design, particularly in generating diverse and realistic structures, it remains challenging to design proteins of specific geometries without residue-level control over the topological details. A more practical, top-down approach is needed for prescribing the overall geometric arrangement of secondary structure elements in the generated protein structures. In response, we propose TopoDiff, an unsupervised framework that learns and exploits a global-geometry-aware latent representation, enabling both unconditional and controllable diffusion-based protein generation. The structure encoder, trained on the Protein Data Bank and CATH datasets, embeds protein global geometries into a 32-dimensional latent space; latent codes drawn by the latent sampler then serve as informative conditions for the diffusion-based backbone decoder. In benchmarks against existing baselines, TopoDiff performs comparably on established metrics, including designability, diversity and novelty, and markedly improves coverage of the fold types of natural proteins in the CATH dataset. Moreover, latent conditioning enables versatile manipulations at the global-geometry level to control the generated protein structures, through which we derived a number of novel folds of mainly beta proteins, supported by comprehensive experimental validation.
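The three-stage data flow described in the abstract (structure encoder, latent sampler, latent-conditioned backbone decoder) can be illustrated with a toy sketch. Only the 32-dimensional latent size comes from the text. Every function below is a hypothetical stand-in written in plain Python, not the TopoDiff implementation: the encoder is replaced by a pairwise-distance histogram, the latent sampler by a standard Gaussian draw, and the decoder by a simple iterative refinement loop that starts from noise.

```python
import math
import random

LATENT_DIM = 32  # latent dimensionality stated in the abstract


def encode(ca_coords):
    """Hypothetical encoder stand-in: map C-alpha coordinates to a
    fixed-size vector via a normalized histogram of pairwise distances."""
    hist = [0.0] * LATENT_DIM
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(ca_coords[i], ca_coords[j])
            # 2 Angstrom bins; the last bin collects all longer distances.
            hist[min(int(d / 2.0), LATENT_DIM - 1)] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]


def sample_latent(rng):
    """Hypothetical latent-sampler stand-in: a standard Gaussian draw
    (the real sampler is a learned model)."""
    return [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]


def decode(latent, length, rng, steps=20):
    """Hypothetical decoder stand-in: start from noise and repeatedly move
    each residue halfway toward a target that depends on the latent code."""
    coords = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(length)]
    # The condition enters only here: it sets the spacing of a straight chain.
    spacing = 3.8 * (1.0 + 0.1 * math.tanh(latent[0]))
    for _ in range(steps):
        for i, xyz in enumerate(coords):
            target = (spacing * i, 0.0, 0.0)
            for k in range(3):
                xyz[k] += 0.5 * (target[k] - xyz[k])
    return coords


# Unconditional generation: sample a latent code, then decode it.
rng = random.Random(0)
z = sample_latent(rng)
backbone = decode(z, length=8, rng=rng)

# Controllable generation: encode a reference structure and decode from its code.
reference = [(3.8 * i, 0.0, 0.0) for i in range(8)]
z_ref = encode(reference)
variant = decode(z_ref, length=8, rng=rng)
print(len(z), len(backbone), len(variant))
```

The sketch separates the two entry points into the decoder: a sampled latent code for unconditional generation, and an encoded or manipulated latent code for control at the global-geometry level.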
Data availability
The dataset used for model training, along with the trained model weights, benchmark data and protein designs selected for experimental validation, is available via Zenodo at https://zenodo.org/records/13879811 (ref. 90). The crystal structure models have been deposited in the Protein Data Bank (accession codes 9KGZ and 9KGY). Source data are provided with this paper.
Code availability
The TopoDiff model is implemented in PyTorch. Full scripts (including the training code) and guidance for utilizing the model are available via GitHub at https://github.com/meneshail/TopoDiff/tree/main (ref. 91). A reproducible code capsule of TopoDiff is available via CodeOcean at https://doi.org/10.24433/CO.8705528.v1 (ref. 92).
References
Chevalier, A. et al. Massively parallel de novo protein design for targeted therapeutics. Nature 550, 74–79 (2017).
Silva, D.-A. et al. De novo design of potent and selective mimics of IL-2 and IL-15. Nature 565, 186–191 (2019).
Roy, A. et al. De novo design of highly selective miniprotein inhibitors of integrins αvβ6 and αvβ8. Nat. Commun. 14, 5660 (2023).
Yeh, A. H.-W. et al. De novo design of luciferases using deep learning. Nature 614, 774–780 (2023).
Langan, R. A. et al. De novo design of bioactive protein switches. Nature 572, 205–210 (2019).
Chen, Z. et al. De novo design of protein logic gates. Science 368, 78–84 (2020).
Pan, X. & Kortemme, T. Recent advances in de novo protein design: principles, methods, and applications. J. Biol. Chem. 296, 100558 (2021).
Wu, K. E. et al. Protein structure generation via folding diffusion. Nat. Commun. 15, 1059 (2024).
Ni, B., Kaplan, D. L. & Buehler, M. J. Generative design of de novo proteins based on secondary-structure constraints using an attention-based diffusion model. Chem 9, 1828–1849 (2023).
Lee, J. S., Kim, J. & Kim, P. M. Score-based generative modeling for de novo protein design. Nat. Comput. Sci. 3, 382–392 (2023).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Baek, M. et al. Efficient and accurate prediction of protein structure using RoseTTAFold2. Preprint at bioRxiv https://doi.org/10.1101/2023.05.24.542179 (2023).
Anand, N. & Achim, T. Protein structure and sequence generation with equivariant denoising diffusion probabilistic models. Preprint at https://arxiv.org/abs/2205.15019 (2022).
Luo, S. et al. Antigen-specific antibody design and optimization with diffusion-based generative models for protein structures. In Proc. 36th International Conference on Neural Information Processing Systems (eds Koyejo, S. et al.) 9754–9767 (Curran Associates Inc., 2022).
Yim, J. et al. SE(3) diffusion model with application to protein backbone generation. In Proc. 40th International Conference on Machine Learning (eds Krause, A. et al.) 40001–40039 (JMLR.org, 2023).
Lin, Y. & AlQuraishi, M. Generating novel, designable, and diverse protein structures by equivariantly diffusing oriented residue clouds. In Proc. 40th International Conference on Machine Learning (eds Krause, A. et al.) 20978–21002 (PMLR, 2023).
Watson, J. L. et al. De novo design of protein structure and function with RFdiffusion. Nature 620, 1089–1100 (2023).
Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000).
Orengo, C. A. et al. CATH—a hierarchic classification of protein domain structures. Structure 5, 1093–1108 (1997).
Sillitoe, I. et al. CATH: increased structural coverage of functional space. Nucleic Acids Res. 49, D266–D273 (2021).
Bennett, N. R. et al. Atomically accurate de novo design of single-domain antibodies. Preprint at https://doi.org/10.1101/2024.03.14.585103 (2024).
Ingraham, J. B. et al. Illuminating protein space with a programmable generative model. Nature 623, 1070–1078 (2023).
Sadreyev, R. I., Kim, B.-H. & Grishin, N. V. Discrete-continuous duality of protein structure space. Curr. Opin. Struct. Biol. 19, 321–328 (2009).
Pascual-García, A., Abia, D., Ortiz, A. R. & Bastolla, U. Cross-over between discrete and continuous protein structure space: insights into automatic classification and networks of protein structures. PLoS Comput. Biol. 5, e1000331 (2009).
Martin, A. C. et al. Protein folds and functions. Structure 6, 875–884 (1998).
Hegyi, H. & Gerstein, M. The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. J. Mol. Biol. 288, 147–164 (1999).
Micheletti, C. Prediction of folding rates and transition-state placement from native-state geometry. Proteins 51, 74–84 (2003).
Wang, J. & Panagiotou, E. The protein folding rate and the geometry and topology of the native state. Sci. Rep. 12, 6384 (2022).
Luo, C. Understanding diffusion models: a unified perspective. Preprint at https://arxiv.org/abs/2208.11970 (2022).
Maaten, L. v. d. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540 (1995).
Hubbard, T. J., Murzin, A. G., Brenner, S. E. & Chothia, C. SCOP: a structural classification of proteins database. Nucleic Acids Res. 25, 236–239 (1997).
Cheng, H. et al. ECOD: an evolutionary classification of protein domains. PLoS Comput. Biol. 10, e1003926 (2014).
Day, R., Beck, D. A., Armen, R. S. & Daggett, V. A consensus view of fold space: combining SCOP, CATH, and the Dali Domain Dictionary. Protein Sci. 12, 2150–2160 (2003).
Csaba, G., Birzele, F. & Zimmer, R. Systematic comparison of SCOP and CATH: a new gold standard for protein structure analysis. BMC Struct. Biol. 9, 23 (2009).
Schaeffer, R. D., Kinch, L. N., Pei, J., Medvedev, K. E. & Grishin, N. V. Completeness and consistency in structural domain classifications. ACS Omega 6, 15698–15707 (2021).
Mura, C., Veretnik, S. & Bourne, P. E. The Urfold: structural similarity just above the superfold level? Protein Sci. 28, 2119–2126 (2019).
Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J. & Aila, T. Improved precision and recall metric for assessing generative models. In Proc. 33rd International Conference on Neural Information Processing Systems (eds Wallach, H. M. et al.) 3927–3936 (Curran Associates Inc., 2019).
Listov, D., Goverde, C. A., Correia, B. E. & Fleishman, S. J. Opportunities and challenges in design and optimization of protein function. Nat. Rev. Mol. Cell Biol. 25, 639–653 (2024).
Chu, A. E., Lu, T. & Huang, P.-S. Sparks of function by de novo protein design. Nat. Biotechnol. 42, 203–215 (2024).
Krishna, R. et al. Generalized biomolecular modeling and design with RoseTTAFold All-Atom. Science 384, eadl2528 (2024).
Naeem, M. F., Oh, S. J., Uh, Y., Choi, Y. & Yoo, J. Reliable fidelity and diversity metrics for generative models. In Proc. 37th International Conference on Machine Learning (eds Daumé, H. & Singh, A.) 7176–7185 (JMLR.org, 2020).
Zhang, Y. & Skolnick, J. Scoring function for automated assessment of protein structure template quality. Proteins 57, 702–710 (2004).
Greener, J. G. & Jamali, K. Fast protein structure searching using structure graph embeddings. Bioinform. Adv. 5, vbaf042 (2025).
Bose, A. J. et al. Proc. 12th International Conference on Learning Representations (OpenReview.net, 2024).
Lin, Y., Lee, M., Zhang, Z. & AlQuraishi, M. Out of many, one: designing and scaffolding proteins at the scale of the structural universe with Genie 2. Preprint at https://arxiv.org/abs/2405.15489 (2024).
Huguet, G. et al. Sequence-augmented SE(3)-flow matching for conditional protein generation. In Advances in Neural Information Processing Systems 37 (eds Globerson, A. et al.) 33007–33036 (Curran Associates, Inc., 2024).
Chronowska, M., Stam, M. J., Woolfson, D. N., Di Costanzo, L. F. & Wood, C. W. The Protein Design Archive (PDA): insights from 40 years of protein design. Nat. Biotechnol. 43, 669–671 (2024).
Hermosilla, A. M., Berner, C., Ovchinnikov, S. & Vorobieva, A. A. Validation of de novo designed water-soluble and transmembrane β-barrels by in silico folding and melting. Protein Sci. 33, e5033 (2024).
Liu, Y., Chen, L. & Liu, H. Diffusion in a quantized vector space generates non-idealized protein structures and predicts conformational distributions. Preprint at bioRxiv https://doi.org/10.1101/2023.11.18.567666 (2023).
Fu, C. et al. A latent diffusion model for protein structure generation. In Proc. Second Learning on Graphs Conference (eds Villar, S. & Chamberlain, B.) 29:1–29:17 (PMLR, 2024).
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. & Chen, M. Hierarchical text-conditional image generation with CLIP latents. Preprint at https://arxiv.org/abs/2204.06125 (2022).
Preechakul, K., Chatthee, N., Wizadwongsa, S. & Suwajanakorn, S. Proc. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2022).
Kim, S. W. et al. Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2023).
Praetorius, F. et al. Design of stimulus-responsive two-state hinge proteins. Science 381, 754–760 (2023).
Berger, S. et al. Preclinical proof of principle for orally delivered Th17 antagonist miniproteins. Cell 187, 4305–4317.e18 (2024).
Glögl, M. et al. Target-conditioned diffusion generates potent TNFR superfamily antagonists and agonists. Science 386, 1154–1161 (2024).
Huang, B. et al. Designed endocytosis-inducing proteins degrade targets and amplify signals. Nature 638, 796–804 (2024).
Baker, D. et al. De novo designed proteins neutralize lethal snake venom toxins. Nature 639, 225–231 (2024).
An, L. et al. Binding and sensing diverse small molecules using shape-complementary pseudocycles. Science 385, 276–282 (2024).
Chu, A. E. et al. An all-atom protein generative model. Proc. Natl Acad. Sci. USA 121, e2311500121 (2024).
Campbell, A., Yim, J., Barzilay, R., Rainforth, T. & Jaakkola, T. Generative flows on discrete state-spaces: enabling multimodal flows with applications to protein co-design. In Proc. 41st International Conference on Machine Learning (eds Salakhutdinov, R. et al.) 5453–5512 (JMLR.org, 2024).
Dietmann, S. et al. A fully automatic evolutionary classification of protein folds: Dali Domain Dictionary version 3. Nucleic Acids Res. 29, 55–57 (2001).
Xu, J. & Zhang, J. Impact of structure space continuity on protein fold classification. Sci. Rep. 6, 23263 (2016).
Skolnick, J., Arakaki, A. K., Lee, S. Y. & Brylinski, M. The continuity of protein structure space is an intrinsic property of proteins. Proc. Natl Acad. Sci. USA 106, 15690–15695 (2009).
Woolfson, D. N. et al. De novo protein design: how do we expand into the universe of possible protein structures? Curr. Opin. Struct. Biol. 33, 16–26 (2015).
Greener, J. G., Moffat, L. & Jones, D. T. Design of metalloproteins and novel protein folds using variational autoencoders. Sci. Rep. 8, 16189 (2018).
Hawkins-Hooker, A. et al. Generating functional protein variants with variational autoencoders. PLoS Comput. Biol. 17, e1008736 (2021).
Guo, X., Du, Y., Tadepalli, S., Zhao, L. & Shehu, A. Generating tertiary protein structures via interpretable graph variational autoencoders. Bioinform. Adv. 1, vbab036 (2021).
Eguchi, R. R., Choe, C. A. & Huang, P.-S. Ig-VAE: generative modeling of protein structure by direct 3D coordinate generation. PLoS Comput. Biol. 18, e1010271 (2022).
Lai, B., McPartlon, M. & Xu, J. End-to-end deep structure generative model for protein design. Preprint at bioRxiv https://doi.org/10.1101/2022.07.09.499440 (2022).
Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2022).
Podell, D. et al. Proc. 12th International Conference on Learning Representations (OpenReview.net, 2024).
Esser, P. et al. Scaling rectified flow transformers for high-resolution image synthesis. In Proc. 41st International Conference on Machine Learning (eds Salakhutdinov, R. et al.) 12606–12633 (JMLR.org, 2024).
Poličar, P. G., Stražar, M. & Zupan, B. openTSNE: a modular Python library for t-SNE dimensionality reduction and embedding. J. Stat. Softw. 109, 1–30 (2024).
Scott, D. W. Multivariate Density Estimation: Theory, Practice, and Visualization 1st edn (Wiley, 1992).
Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33, 2302–2309 (2005).
Kryshtafovych, A., Schwede, T., Topf, M., Fidelis, K. & Moult, J. Critical assessment of methods of protein structure prediction (CASP)-Round XV. Proteins 91, 1539–1549 (2023).
Greener, J. G. & Jamali, K. Fast protein structure searching using structure graph embeddings. Bioinform. Adv. 5, vbaf042 (2025).
Dauparas, J. et al. Robust deep learning-based protein sequence design using ProteinMPNN. Science 378, 49–56 (2022).
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
Van Kempen, M. et al. Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. 42, 243–246 (2023).
Song, J., Meng, C. & Ermon, S. Proc. 9th International Conference on Learning Representations (OpenReview.net, 2021).
Otwinowski, Z. & Minor, W. in Methods in Enzymology (ed. Carter, C. W. Jr) 307–326 (Elsevier, 1997).
Adams, P. D. et al. PHENIX: a comprehensive Python-based system for macromolecular structure solution. Acta Crystallogr. D 66, 213–221 (2010).
Emsley, P. & Cowtan, K. Coot: model-building tools for molecular graphics. Acta Crystallogr. D 60, 2126–2132 (2004).
The PyMOL Molecular Graphics System (Schrödinger, LLC, 2015).
Meng, E. C. et al. UCSF ChimeraX: tools for structure building and analysis. Protein Sci. 32, e4792 (2023).
Zhang, Y. et al. Improving diffusion-based protein backbone generation with global-geometry-aware latent encoding. Preprint at bioRxiv https://doi.org/10.1101/2024.10.05.616664 (2024).
Zhang, Y. meneshail/TopoDiff: v1.1.0. GitHub https://github.com/meneshail/TopoDiff/tree/main (2025).
Zhang, Y., Liu, Y., Ma, Z., Li, M. & Xu, C. CodeOcean release of ‘TopoDiff: improving diffusion-based protein backbone generation with global-geometry-aware latent encoding’, version 1. CodeOcean https://doi.org/10.24433/CO.8705528.v1 (2025).
Acknowledgements
This work has been supported by the Ministry of Science and Technology of China (no. 2023YFF1204400 to H.G.), the National Natural Science Foundation of China (no. 32171243 to H.G.) and the Beijing Frontier Research Center for Biological Structure. We thank the staff of beamlines BL02U1, BL10U2, BL18U1 and BL19U1 at the Shanghai Synchrotron Radiation Facility as well as the X-ray crystallography platform, National Protein Science Facility, Tsinghua University, for assistance in the X-ray diffraction data collection and analysis. We thank J. Hu, Z. Zhu, Y. Xue and C. Song for helpful discussions.
Author information
Contributions
Y.Z. and H.G. conceived the study. Y.Z. and Z.M. designed and implemented the model. Y.Z. and Z.M. performed the in silico experiments and analysed the results. Y.Z. designed the candidate proteins for experimental validation. Y.L. designed, executed and analysed all the wet-laboratory experiments. H.G. supervised the development of the model and the result analysis. C.X. supervised the design of the candidate proteins and wet-laboratory experiments. M.L. contributed to the X-ray structure determination. Y.Z. drafted the initial paper. Y.Z., Y.L. and Z.M. created the final figures. All authors contributed to writing and improving the paper, and approved the submission.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Zhuoran Qiao, Limei Wang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Persistent underrepresentation of mainly-beta proteins in de novo design.
All de novo designed proteins deposited in the PDB up to September 2024 were collected from the PDA database and filtered to exclude small peptides (length ≤ 50) as well as designs originating from sequence mutations or redesigns of naturally occurring backbones (maximum TM-score to PDB ≥ 0.9). (a) Cumulative number of de novo protein design entries over time. General proteins and mainly-beta proteins (with beta ratio ≥ 0.5) are colored in blue and purple, respectively. (b) Distribution of natural proteins of the CATH dataset (left) and de novo designed proteins (right) based on the proportion of beta sheets. (c) Scatter plot of all de novo designed proteins, where the horizontal axis represents novelty (maximum TM-score to PDB) and the vertical axis represents the proportion of beta sheets. Each protein is denoted as a point, colored by protein length. A detailed discussion of these data can be found in Supplementary Results 6.3.
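The filtering rule in this caption can be written as a short sketch. The three thresholds (length ≤ 50 excluded, maximum TM-score to PDB ≥ 0.9 excluded, beta ratio ≥ 0.5 counted as mainly-beta) are taken from the caption. The `Design` record, its field names and the example entries are hypothetical and introduced only for illustration; they do not reflect the PDA database schema or the authors' scripts.

```python
from dataclasses import dataclass

MAX_PEPTIDE_LENGTH = 50     # entries with length <= 50 are excluded (caption)
REDESIGN_TM_CUTOFF = 0.9    # entries with max TM-score >= 0.9 are excluded (caption)
MAINLY_BETA_CUTOFF = 0.5    # beta ratio >= 0.5 counts as mainly-beta (caption)


@dataclass
class Design:
    """Hypothetical record for one de novo design entry."""
    name: str
    length: int
    max_tm_to_pdb: float  # maximum TM-score to PDB, used as the novelty axis
    beta_ratio: float     # proportion of residues in beta sheets


def filter_designs(designs):
    """Drop small peptides and redesigns of naturally occurring backbones."""
    return [
        d for d in designs
        if d.length > MAX_PEPTIDE_LENGTH and d.max_tm_to_pdb < REDESIGN_TM_CUTOFF
    ]


def is_mainly_beta(design):
    """Apply the caption's beta-ratio threshold."""
    return design.beta_ratio >= MAINLY_BETA_CUTOFF


# Made-up example entries, for illustration only.
entries = [
    Design("short_peptide", 40, 0.50, 0.60),
    Design("redesign", 120, 0.95, 0.70),
    Design("novel_beta", 120, 0.60, 0.70),
    Design("novel_helical", 51, 0.89, 0.10),
]
kept = filter_designs(entries)
mainly_beta = [d.name for d in kept if is_mainly_beta(d)]
print([d.name for d in kept], mainly_beta)
```

Both exclusion thresholds are inclusive, so an entry of exactly 50 residues or with a TM-score of exactly 0.9 is removed.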
Supplementary information
Supplementary Information
Supplementary Figs. 1–28, Tables 1–16, Methods, Results and Algorithms (for pseudocodes).
Source data
Source Data Fig. 2
t-SNE-reduced embedding and annotations for all structures in the three databases.
Source Data Fig. 3
Statistical source data for Fig. 3b,c.
Source Data Fig. 4
Statistical source data for Fig. 4b.
Source Data Fig. 5
Source data for the SEC profiles, CD spectra and melting curve.
Source Data Fig. 5
Unprocessed western blots.
Source Data Extended Data Fig./Table 1
Statistics of filtered de novo designed proteins in PDB.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, Y., Liu, Y., Ma, Z. et al. Improving diffusion-based protein backbone generation with global-geometry-aware latent encoding. Nat Mach Intell 7, 1104–1118 (2025). https://doi.org/10.1038/s42256-025-01059-x
DOI: https://doi.org/10.1038/s42256-025-01059-x