Improving diffusion-based protein backbone generation with global-geometry-aware latent encoding

A preprint version of the article is available at bioRxiv.

Abstract

The global structural properties of a protein, such as shape, fold and topology, strongly affect its function. Although recent breakthroughs in diffusion-based generative models have greatly advanced de novo protein design, particularly in generating diverse and realistic structures, it remains challenging to design proteins of specific geometries without residue-level control over the topological details. A more practical, top-down approach is needed for prescribing the overall geometric arrangements of secondary structure elements in the generated protein structures. In response, we propose TopoDiff, an unsupervised framework that learns and exploits a global-geometry-aware latent representation, enabling both unconditional and controllable diffusion-based protein generation. Trained on the Protein Data Bank and CATH datasets, the structure encoder embeds protein global geometries into a 32-dimensional latent space, from which latent codes sampled by the latent sampler serve as informative conditions for the diffusion-based backbone decoder. In benchmarks against existing baselines, TopoDiff demonstrates comparable performance on established metrics including designability, diversity and novelty, and markedly improves coverage over the fold types of natural proteins in the CATH dataset. Moreover, latent conditioning enables versatile manipulations at the global-geometry level to control the generated protein structures, through which we derived a number of novel folds of mainly beta proteins with comprehensive experimental validation.
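The data flow described in the abstract (latent sampler, 32-dimensional latent code, latent-conditioned backbone decoder) can be illustrated with a toy sketch. This is not the TopoDiff implementation: the function names, the Gaussian stand-in for the learned latent sampler, the helix-like toy target and the simplified deterministic update loop are all assumptions made for illustration. Only the latent dimensionality of 32 comes from the text.

```python
import math
import random

LATENT_DIM = 32  # latent dimensionality stated in the abstract


def sample_latent(rng):
    """Stand-in for the latent sampler (assumption: the real one is learned).

    A standard Gaussian is used only to produce a code of the right shape.
    """
    return [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]


def toy_denoiser(coords, z, t):
    """Hypothetical conditional denoiser returning a predicted clean backbone.

    The toy ignores the noisy input values and the time t; it returns a
    helix-like curve whose radius depends on the latent code, so different
    codes yield different global shapes.
    """
    radius = 2.0 + abs(z[0])
    return [(radius * math.cos(i), radius * math.sin(i), 1.5 * i)
            for i in range(len(coords))]


def decode_backbone(z, length, steps, rng):
    """Simplified reverse loop: start from noise and move toward the prediction."""
    coords = [(rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1))
              for _ in range(length)]
    for step in range(steps, 0, -1):
        pred = toy_denoiser(coords, z, step / steps)
        w = 1.0 / step  # weight grows to 1 at the final step
        coords = [tuple(c + w * (p - c) for c, p in zip(xyz, pxyz))
                  for xyz, pxyz in zip(coords, pred)]
    return coords


rng = random.Random(0)
latent_code = sample_latent(rng)                       # 1. sample a latent code
backbone = decode_backbone(latent_code, 10, 20, rng)   # 2. decode conditioned on it
```

In this sketch, controllable generation would amount to choosing or editing `latent_code` before decoding rather than sampling it at random.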

Fig. 1: Overview of the work.
Fig. 2: Analysis of TopoDiff’s learned latent representations.
Fig. 3: Evaluation of TopoDiff’s generative performance for unconditional sampling.
Fig. 4: Exploring controllable protein structure generation with TopoDiff.
Fig. 5: Experimental validation of novel mainly beta protein designs.

Data availability

The dataset used for model training, along with the trained model weights, benchmark data and protein designs selected for experimental validation, is available via Zenodo at https://zenodo.org/records/13879811 (ref. 90). The crystal structure models have been deposited in the Protein Data Bank (accession codes 9KGZ and 9KGY). Source data are provided with this paper.

Code availability

The TopoDiff model is implemented in PyTorch. Full scripts (including the training code) and guidance for utilizing the model are available via GitHub at https://github.com/meneshail/TopoDiff/tree/main (ref. 91). A reproducible code capsule of TopoDiff is available via CodeOcean at https://doi.org/10.24433/CO.8705528.v1 (ref. 92).

References

  1. Chevalier, A. et al. Massively parallel de novo protein design for targeted therapeutics. Nature 550, 74–79 (2017).

  2. Silva, D.-A. et al. De novo design of potent and selective mimics of IL-2 and IL-15. Nature 565, 186–191 (2019).

  3. Roy, A. et al. De novo design of highly selective miniprotein inhibitors of integrins αvβ6 and αvβ8. Nat. Commun. 14, 5660 (2023).

  4. Yeh, A. H.-W. et al. De novo design of luciferases using deep learning. Nature 614, 774–780 (2023).

  5. Langan, R. A. et al. De novo design of bioactive protein switches. Nature 572, 205–210 (2019).

  6. Chen, Z. et al. De novo design of protein logic gates. Science 368, 78–84 (2020).

  7. Pan, X. & Kortemme, T. Recent advances in de novo protein design: principles, methods, and applications. J. Biol. Chem. 296, 100558 (2021).

  8. Wu, K. E. et al. Protein structure generation via folding diffusion. Nat. Commun. 15, 1059 (2024).

  9. Ni, B., Kaplan, D. L. & Buehler, M. J. Generative design of de novo proteins based on secondary-structure constraints using an attention-based diffusion model. Chem 9, 1828–1849 (2023).

  10. Lee, J. S., Kim, J. & Kim, P. M. Score-based generative modeling for de novo protein design. Nat. Comput. Sci. 3, 382–392 (2023).

  11. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

  12. Baek, M. et al. Efficient and accurate prediction of protein structure using RoseTTAFold2. Preprint at bioRxiv https://doi.org/10.1101/2023.05.24.542179 (2023).

  13. Anand, N. & Achim, T. Protein structure and sequence generation with equivariant denoising diffusion probabilistic models. Preprint at https://arxiv.org/abs/2205.15019 (2022).

  14. Luo, S. et al. Antigen-specific antibody design and optimization with diffusion-based generative models for protein structures. In Proc. 36th International Conference on Neural Information Processing Systems (eds Koyejo, S. et al.) 9754–9767 (Curran Associates Inc., 2022).

  15. Yim, J. et al. SE(3) diffusion model with application to protein backbone generation. In Proc. 40th International Conference on Machine Learning (eds Krause, A. et al.) 40001–40039 (JMLR.org, 2023).

  16. Lin, Y. & AlQuraishi, M. Generating novel, designable, and diverse protein structures by equivariantly diffusing oriented residue clouds. In Proc. 40th International Conference on Machine Learning (eds Krause, A. et al.) 20978–21002 (PMLR, 2023).

  17. Watson, J. L. et al. De novo design of protein structure and function with RFdiffusion. Nature 620, 1089–1100 (2023).

  18. Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000).

  19. Orengo, C. A. et al. CATH—a hierarchic classification of protein domain structures. Structure 5, 1093–1108 (1997).

  20. Sillitoe, I. et al. CATH: increased structural coverage of functional space. Nucleic Acids Res. 49, D266–D273 (2021).

  21. Bennett, N. R. et al. Atomically accurate de novo design of single-domain antibodies. Preprint at bioRxiv https://doi.org/10.1101/2024.03.14.585103 (2024).

  22. Ingraham, J. B. et al. Illuminating protein space with a programmable generative model. Nature 623, 1070–1078 (2023).

  23. Sadreyev, R. I., Kim, B.-H. & Grishin, N. V. Discrete-continuous duality of protein structure space. Curr. Opin. Struct. Biol. 19, 321–328 (2009).

  24. Pascual-García, A., Abia, D., Ortiz, A. R. & Bastolla, U. Cross-over between discrete and continuous protein structure space: insights into automatic classification and networks of protein structures. PLoS Comput. Biol. 5, e1000331 (2009).

  25. Martin, A. C. et al. Protein folds and functions. Structure 6, 875–884 (1998).

  26. Hegyi, H. & Gerstein, M. The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. J. Mol. Biol. 288, 147–164 (1999).

  27. Micheletti, C. Prediction of folding rates and transition-state placement from native-state geometry. Proteins 51, 74–84 (2003).

  28. Wang, J. & Panagiotou, E. The protein folding rate and the geometry and topology of the native state. Sci. Rep. 12, 6384 (2022).

  29. Luo, C. Understanding diffusion models: a unified perspective. Preprint at https://arxiv.org/abs/2208.11970 (2022).

  30. Maaten, L. v. d. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).

  31. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540 (1995).

  32. Hubbard, T. J., Murzin, A. G., Brenner, S. E. & Chothia, C. SCOP: a structural classification of proteins database. Nucleic Acids Res. 25, 236–239 (1997).

  33. Cheng, H. et al. ECOD: an evolutionary classification of protein domains. PLoS Comput. Biol. 10, e1003926 (2014).

  34. Day, R., Beck, D. A., Armen, R. S. & Daggett, V. A consensus view of fold space: combining SCOP, CATH, and the Dali Domain Dictionary. Protein Sci. 12, 2150–2160 (2003).

  35. Csaba, G., Birzele, F. & Zimmer, R. Systematic comparison of SCOP and CATH: a new gold standard for protein structure analysis. BMC Struct. Biol. 9, 23 (2009).

  36. Schaeffer, R. D., Kinch, L. N., Pei, J., Medvedev, K. E. & Grishin, N. V. Completeness and consistency in structural domain classifications. ACS Omega 6, 15698–15707 (2021).

  37. Mura, C., Veretnik, S. & Bourne, P. E. The Urfold: structural similarity just above the superfold level? Protein Sci. 28, 2119–2126 (2019).

  38. Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J. & Aila, T. Improved precision and recall metric for assessing generative models. In Proc. 33rd International Conference on Neural Information Processing Systems (eds Wallach, H. M. et al.) 3927–3936 (Curran Associates Inc., 2019).

  39. Listov, D., Goverde, C. A., Correia, B. E. & Fleishman, S. J. Opportunities and challenges in design and optimization of protein function. Nat. Rev. Mol. Cell Biol. 25, 639–653 (2024).

  40. Chu, A. E., Lu, T. & Huang, P.-S. Sparks of function by de novo protein design. Nat. Biotechnol. 42, 203–215 (2024).

  41. Krishna, R. et al. Generalized biomolecular modeling and design with RoseTTAFold All-Atom. Science 384, eadl2528 (2024).

  42. Naeem, M. F., Oh, S. J., Uh, Y., Choi, Y. & Yoo, J. Reliable fidelity and diversity metrics for generative models. In Proc. 37th International Conference on Machine Learning (eds Daumé, H. & Singh, A.) 7176–7185 (JMLR.org, 2020).

  43. Zhang, Y. & Skolnick, J. Scoring function for automated assessment of protein structure template quality. Proteins 57, 702–710 (2004).

  44. Greener, J. G. & Jamali, K. Fast protein structure searching using structure graph embeddings. Bioinform. Adv. 5, vbaf042 (2025).

  45. Bose, A. J. et al. Proc. 12th International Conference on Learning Representations (OpenReview.net, 2024).

  46. Lin, Y., Lee, M., Zhang, Z. & AlQuraishi, M. Out of many, one: designing and scaffolding proteins at the scale of the structural universe with Genie 2. Preprint at https://arxiv.org/abs/2405.15489 (2024).

  47. Huguet, G. et al. Sequence-augmented SE(3)-flow matching for conditional protein generation. In Advances in Neural Information Processing Systems 37 (eds Globerson, A. et al.) 33007–33036 (Curran Associates, Inc., 2024).

  48. Chronowska, M., Stam, M. J., Woolfson, D. N., Di Costanzo, L. F. & Wood, C. W. The Protein Design Archive (PDA): insights from 40 years of protein design. Nat. Biotechnol. 43, 669–671 (2024).

  49. Hermosilla, A. M., Berner, C., Ovchinnikov, S. & Vorobieva, A. A. Validation of de novo designed water-soluble and transmembrane β-barrels by in silico folding and melting. Protein Sci. 33, e5033 (2024).

  50. Liu, Y., Chen, L. & Liu, H. Diffusion in a quantized vector space generates non-idealized protein structures and predicts conformational distributions. Preprint at bioRxiv https://doi.org/10.1101/2023.11.18.567666 (2023).

  51. Fu, C. et al. A latent diffusion model for protein structure generation. In Proc. Second Learning on Graphs Conference (eds Villar, S. & Chamberlain, B.) 29:1–29:17 (PMLR, 2024).

  52. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. & Chen, M. Hierarchical text-conditional image generation with CLIP latents. Preprint at https://arxiv.org/abs/2204.06125 (2022).

  53. Preechakul, K., Chatthee, N., Wizadwongsa, S. & Suwajanakorn, S. Proc. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2022).

  54. Kim, S. W. et al. Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2023).

  55. Praetorius, F. et al. Design of stimulus-responsive two-state hinge proteins. Science 381, 754–760 (2023).

  56. Berger, S. et al. Preclinical proof of principle for orally delivered Th17 antagonist miniproteins. Cell 187, 4305–4317.e18 (2024).

  57. Glögl, M. et al. Target-conditioned diffusion generates potent TNFR superfamily antagonists and agonists. Science 386, 1154–1161 (2024).

  58. Huang, B. et al. Designed endocytosis-inducing proteins degrade targets and amplify signals. Nature 638, 796–804 (2024).

  59. Baker, D. et al. De novo designed proteins neutralize lethal snake venom toxins. Nature 639, 225–231 (2024).

  60. An, L. et al. Binding and sensing diverse small molecules using shape-complementary pseudocycles. Science 385, 276–282 (2024).

  61. Chu, A. E. et al. An all-atom protein generative model. Proc. Natl Acad. Sci. USA 121, e2311500121 (2024).

  62. Campbell, A., Yim, J., Barzilay, R., Rainforth, T. & Jaakkola, T. Generative flows on discrete state-spaces: enabling multimodal flows with applications to protein co-design. In Proc. 41st International Conference on Machine Learning (eds Salakhutdinov, R. et al.) 5453–5512 (JMLR.org, 2024).

  63. Dietmann, S. et al. A fully automatic evolutionary classification of protein folds: Dali Domain Dictionary version 3. Nucleic Acids Res. 29, 55–57 (2001).

  64. Xu, J. & Zhang, J. Impact of structure space continuity on protein fold classification. Sci. Rep. 6, 23263 (2016).

  65. Skolnick, J., Arakaki, A. K., Lee, S. Y. & Brylinski, M. The continuity of protein structure space is an intrinsic property of proteins. Proc. Natl Acad. Sci. USA 106, 15690–15695 (2009).

  66. Woolfson, D. N. et al. De novo protein design: how do we expand into the universe of possible protein structures? Curr. Opin. Struct. Biol. 33, 16–26 (2015).

  67. Greener, J. G., Moffat, L. & Jones, D. T. Design of metalloproteins and novel protein folds using variational autoencoders. Sci. Rep. 8, 16189 (2018).

  68. Hawkins-Hooker, A. et al. Generating functional protein variants with variational autoencoders. PLoS Comput. Biol. 17, e1008736 (2021).

  69. Guo, X., Du, Y., Tadepalli, S., Zhao, L. & Shehu, A. Generating tertiary protein structures via interpretable graph variational autoencoders. Bioinform. Adv. 1, vbab036 (2021).

  70. Eguchi, R. R., Choe, C. A. & Huang, P.-S. Ig-VAE: generative modeling of protein structure by direct 3D coordinate generation. PLoS Comput. Biol. 18, e1010271 (2022).

  71. Lai, B., McPartlon, M. & Xu, J. End-to-end deep structure generative model for protein design. Preprint at bioRxiv https://doi.org/10.1101/2022.07.09.499440 (2022).

  72. Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2022).

  73. Podell, D. et al. Proc. 12th International Conference on Learning Representations (OpenReview.net, 2024).

  74. Esser, P. et al. Scaling rectified flow transformers for high-resolution image synthesis. In Proc. 41st International Conference on Machine Learning (eds Salakhutdinov, R. et al.) 12606–12633 (JMLR.org, 2024).

  75. Poličar, P. G., Stražar, M. & Zupan, B. openTSNE: a modular Python library for t-SNE dimensionality reduction and embedding. J. Stat. Softw. 109, 1–30 (2024).

  76. Scott, D. W. Multivariate Density Estimation: Theory, Practice, and Visualization 1st edn (Wiley, 1992).

  77. Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).

  78. Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33, 2302–2309 (2005).

  79. Kryshtafovych, A., Schwede, T., Topf, M., Fidelis, K. & Moult, J. Critical assessment of methods of protein structure prediction (CASP)-Round XV. Proteins 91, 1539–1549 (2023).

  80. Greener, J. G. & Jamali, K. Fast protein structure searching using structure graph embeddings. Bioinform. Adv. 5, vbaf042 (2025).

  81. Dauparas, J. et al. Robust deep learning-based protein sequence design using ProteinMPNN. Science 378, 49–56 (2022).

  82. Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).

  83. Van Kempen, M. et al. Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. 42, 243–246 (2023).

  84. Song, J., Meng, C. & Ermon, S. Proc. 9th International Conference on Learning Representations (OpenReview.net, 2021).

  85. Otwinowski, Z. & Minor, W. in Methods in Enzymology (ed. Carter, C. W. Jr) 307–326 (Elsevier, 1997).

  86. Adams, P. D. et al. PHENIX: a comprehensive Python-based system for macromolecular structure solution. Acta Crystallogr. D 66, 213–221 (2010).

  87. Emsley, P. & Cowtan, K. Coot: model-building tools for molecular graphics. Acta Crystallogr. D 60, 2126–2132 (2004).

  88. The PyMOL Molecular Graphics System (Schrödinger, LLC, 2015).

  89. Meng, E. C. et al. UCSF ChimeraX: tools for structure building and analysis. Protein Sci. 32, e4792 (2023).

  90. Zhang, Y. et al. Improving diffusion-based protein backbone generation with global-geometry-aware latent encoding. Preprint at bioRxiv https://doi.org/10.1101/2024.10.05.616664 (2024).

  91. Zhang, Y. meneshail/TopoDiff: v1.1.0. GitHub https://github.com/meneshail/TopoDiff/tree/main (2025).

  92. Zhang, Y., Liu, Y., Ma, Z., Li, M. & Xu, C. CodeOcean release of ‘TopoDiff: improving diffusion-based protein backbone generation with global-geometry-aware latent encoding’, version 1. CodeOcean https://doi.org/10.24433/CO.8705528.v1 (2025).

Acknowledgements

This work has been supported by the Ministry of Science and Technology of China (no. 2023YFF1204400 to H.G.), the National Natural Science Foundation of China (no. 32171243 to H.G.) and the Beijing Frontier Research Center for Biological Structure. We thank the staff of beamlines BL02U1, BL10U2, BL18U1 and BL19U1 at the Shanghai Synchrotron Radiation Facility as well as the X-ray crystallography platform, National Protein Science Facility, Tsinghua University, for assistance in the X-ray diffraction data collection and analysis. We thank J. Hu, Z. Zhu, Y. Xue and C. Song for helpful discussions.

Author information

Contributions

Y.Z. and H.G. conceived the study. Y.Z. and Z.M. designed and implemented the model. Y.Z. and Z.M. performed the in silico experiments and analysed the results. Y.Z. designed the candidate proteins for experimental validation. Y.L. designed, executed and analysed all the wet-laboratory experiments. H.G. supervised the development of the model and the result analysis. C.X. supervised the design of the candidate proteins and wet-laboratory experiments. M.L. contributed to the X-ray structure determination. Y.Z. drafted the initial paper. Y.Z., Y.L. and Z.M. created the final figures. All authors contributed to writing and improving the paper, and approved the submission.

Corresponding authors

Correspondence to Chunfu Xu or Haipeng Gong.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Zhuoran Qiao, Limei Wang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Persistent underrepresentation of mainly-beta proteins in de novo design.

All de novo designed proteins deposited in the PDB up to September 2024 were collected from the PDA database and filtered to exclude small peptides (length ≤ 50) as well as designs originating from sequence mutations or redesigns of naturally occurring backbones (maximum TM-score to PDB ≥ 0.9). (a) Cumulative number of de novo protein design entries over time. General proteins and mainly-beta proteins (with beta ratio ≥ 0.5) are colored in blue and purple, respectively. (b) Distribution of natural proteins of the CATH dataset (left) and de novo designed proteins (right) based on the proportion of beta sheets. (c) Scatter plot of all de novo designed proteins, where the horizontal axis represents novelty (maximum TM-score to PDB) and the vertical axis represents the proportion of beta sheets. Each protein is denoted as a point, colored based on protein length. A detailed discussion of these data can be found in Supplementary Results 6.3.
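The filtering and labelling criteria in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Design` record, its field names and the toy entries are hypothetical, and only the three thresholds (length ≤ 50, maximum TM-score to PDB ≥ 0.9, beta ratio ≥ 0.5) are taken from the caption.

```python
from dataclasses import dataclass

# Thresholds taken from the caption of Extended Data Fig. 1.
SMALL_PEPTIDE_MAX_LENGTH = 50   # length <= 50 is excluded as a small peptide
REDESIGN_MIN_TM = 0.9           # max TM-score to PDB >= 0.9 is excluded as a redesign
MAINLY_BETA_MIN_RATIO = 0.5     # beta ratio >= 0.5 is labelled mainly-beta


@dataclass(frozen=True)
class Design:
    """Hypothetical record for one de novo design entry (field names are assumptions)."""
    entry_id: str
    length: int
    max_tm_to_pdb: float   # novelty proxy: maximum TM-score against PDB structures
    beta_ratio: float      # proportion of beta sheets


def keep(design):
    """Return True if the entry passes both exclusion filters."""
    is_small_peptide = design.length <= SMALL_PEPTIDE_MAX_LENGTH
    is_redesign = design.max_tm_to_pdb >= REDESIGN_MIN_TM
    return not (is_small_peptide or is_redesign)


def is_mainly_beta(design):
    """Return True if the entry meets the mainly-beta threshold."""
    return design.beta_ratio >= MAINLY_BETA_MIN_RATIO


def summarize(designs):
    """Count all entries, entries surviving the filters, and mainly-beta survivors."""
    kept = [d for d in designs if keep(d)]
    mainly_beta = [d for d in kept if is_mainly_beta(d)]
    return {"total": len(designs), "kept": len(kept), "mainly_beta": len(mainly_beta)}


# Made-up entries for illustration only; these are not data from the paper.
toy_entries = [
    Design("toy1", 40, 0.50, 0.70),    # excluded: small peptide
    Design("toy2", 120, 0.95, 0.60),   # excluded: too close to a natural backbone
    Design("toy3", 120, 0.60, 0.55),   # kept, mainly-beta
    Design("toy4", 200, 0.70, 0.10),   # kept, not mainly-beta
]
print(summarize(toy_entries))  # {'total': 4, 'kept': 2, 'mainly_beta': 1}
```

Both exclusion thresholds are inclusive, matching the "≤" and "≥" signs in the caption, so an entry of exactly 50 residues or with a TM-score of exactly 0.9 is dropped.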

Supplementary information

Supplementary Information

Supplementary Figs. 1–28, Tables 1–16, Methods, Results and Algorithms (for pseudocodes).

Reporting Summary

Source data

Source Data Fig. 2

t-SNE-reduced embedding and annotations for all structures in the three databases.

Source Data Fig. 3

Statistical source data for Fig. 3b,c.

Source Data Fig. 4

Statistical source data for Fig. 4b.

Source Data Fig. 5

Source data for the SEC profiles, CD spectra and melting curve.

Source Data Fig. 5

Unprocessed western blots.

Source Data Extended Data Fig./Table 1

Statistics of filtered de novo designed proteins in PDB.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, Y., Liu, Y., Ma, Z. et al. Improving diffusion-based protein backbone generation with global-geometry-aware latent encoding. Nat Mach Intell 7, 1104–1118 (2025). https://doi.org/10.1038/s42256-025-01059-x

  • DOI: https://doi.org/10.1038/s42256-025-01059-x
