Abstract
The recent success of RFdiffusion, a method for protein structure design with a denoising diffusion probabilistic model, has relied on fine-tuning the RoseTTAFold structure prediction network for protein backbone denoising. Here, we introduce SCUBA-diffusion (SCUBA-D), a protein backbone denoising diffusion probabilistic model freshly trained by considering co-diffusion of sequence representation to enhance model regularization and adversarial losses to minimize data-out-of-distribution errors. While matching the performance of the pretrained RoseTTAFold-based RFdiffusion in generating experimentally realizable protein structures, SCUBA-D readily generates protein structures with not-yet-observed overall folds that are different from those predictable with RoseTTAFold. The accuracy of SCUBA-D was confirmed by the X-ray structures of 16 designed proteins and a protein complex, and by experiments validating designed heme-binding proteins and Ras-binding proteins. Our work shows that deep generative models of images or texts can be fruitfully extended to complex physical objects like protein structures by addressing outstanding issues such as the data-out-of-distribution errors.
Data availability
Protein structures for training the models were downloaded from the PDB. The experimentally solved protein structures were deposited in the PDB under accession codes: 8K7Z (N1), 8K83 (N2), 8K84 (N3), 8KCJ (N7), 8KCK (N9), 8K8I (N14), 8KC4 (NA5), 8KA6 (NA7), 8KA7 (NB7), 8KC0 (NB8), 8KAC (NX1), 8KC1 (NX5), 8K7M (T01), 8KDQ (T03), 8WX8 (T09), 8KC8 (T11) and 8WWC (120-4). We referenced the structures 2ZDO and 4G0N from the PDB for the design of heme-binding proteins and Ras-binding proteins, respectively. The amino acid sequences and encoding DNA sequences of the experimentally examined proteins are available in Supplementary Tables 6–10 and Supplementary Data 1–3. The complete lists of proteins for training and testing the models, the data of experimental results (SEC, multi-angle light scattering, 15N-1H HSQC NMR, ITC, CD, validation reports for experimentally solved protein structures) and all in silico experimental results are available from Zenodo via https://doi.org/10.5281/zenodo.10911626 (ref. 45). Source data are provided with this paper.
Code availability
Executable computer programs and source codes of SCUBA-D (version 1.0) and SCUBA-sketch (version 1.0) are publicly available from Zenodo via https://doi.org/10.5281/zenodo.10947360 (ref. 46) and can be freely used for noncommercial purposes. The source codes for SCUBA-D are also available from GitHub at https://github.com/liuyf020419/SCUBA-D.git/.
References
Huang, P. S., Boyken, S. E. & Baker, D. The coming of age of protein design. Nature 537, 320–327 (2016).
Kuhlman, B. et al. Design of a novel globular protein fold with atomic-level accuracy. Science 302, 1364–1368 (2003).
Polizzi, N. F. & DeGrado, W. F. A defined structural unit enables de novo design of small-molecule–binding proteins. Science 369, 1227–1233 (2020).
Gainza, P. et al. De novo design of protein interactions with learned surface fingerprints. Nature 617, 176–184 (2023).
Yeh, A. H.-W. et al. De novo design of luciferases using deep learning. Nature 614, 774–780 (2023).
Li, H., Helling, R., Tang, C. & Wingreen, N. Emergence of preferred structures in a simple model of protein folding. Science 273, 666–669 (1996).
Kuhlman, B. & Baker, D. Native protein sequences are close to optimal for their structures. Proc. Natl Acad. Sci. USA 97, 10383–10388 (2000).
Grigoryan, G. & DeGrado, W. F. Probing designability via a generalized model of helical bundle geometry. J. Mol. Biol. 405, 1079–1100 (2011).
Huang, B. et al. A backbone-centred energy function of neural networks for protein design. Nature 602, 523–528 (2022).
Anishchenko, I. et al. De novo protein design by deep network hallucination. Nature 600, 547–552 (2021).
Eguchi, R. R., Choe, C. A. & Huang, P.-S. Ig-VAE: generative modeling of protein structure by direct 3D coordinate generation. PLoS Comput. Biol. 18, e1010271 (2022).
Lee, J. S., Kim, J. & Kim, P. M. Score-based generative modeling for de novo protein design. Nat. Comput. Sci. 3, 382–392 (2023).
Watson, J. L. et al. De novo design of protein structure and function with RFdiffusion. Nature 620, 1089–1100 (2023).
Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 33, 6840–6851 (2020).
Chen, N. et al. Wavegrad: estimating gradients for waveform generation. In Proc. International Conference on Learning Representations (ICLR, 2021).
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. & Chen, M. Hierarchical text-conditional image generation with clip latents. Preprint at https://arXiv.org/abs/2204.06125 (2022).
Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).
Anand, N. & Achim, T. Protein structure and sequence generation with equivariant denoising diffusion probabilistic models. Preprint at https://arXiv.org/abs/2205.15019 (2022).
Wu, K. E. et al. Protein structure generation via folding diffusion. Nat. Commun. 15, 1059 (2024).
Trippe, B. L. et al. Diffusion probabilistic modeling of protein backbones in 3d for the motif-scaffolding problem. In Proc. International Conference on Learning Representations (ICLR, 2023).
Ingraham, J. et al. Illuminating protein space with a programmable generative model. Nature 623, 1070–1078 (2023).
Yim, J. et al. SE(3) diffusion model with application to protein backbone generation. In Proc. International Conference on Machine Learning (ICML, 2023).
Zhao, H., Gallo, O., Frosio, I. & Kautz, J. Loss functions for image restoration with neural networks. IEEE Trans. Comput. Imaging 3, 47–57 (2016).
Blau, Y., Mechrez, R., Timofte, R., Michaeli, T. & Zelnik-Manor, L. The 2018 PIRM challenge on perceptual image super-resolution. In Proc. the European Conference on Computer Vision (ECCV) Workshops (eds Leal-Taixé, L. & Roth, S.) 334–355 (2019).
Goodfellow, I. et al. Generative adversarial networks. Commun. ACM 63, 139–144 (2020).
Liu, Y. et al. Rotamer-free protein sequence design based on deep learning and self-consistency. Nat. Comput. Sci. 2, 451–462 (2022).
Dauparas, J. et al. Robust deep learning-based protein sequence design using ProteinMPNN. Science 378, 49–56 (2022).
Popov, V., Vovk, I., Gogoryan, V., Sadekova, T. & Kudinov, M. Grad-TTS: a diffusion probabilistic model for text-to-speech. In Proc. International Conference on Machine Learning 8599–8608 (PMLR, 2021).
Lee, S.-g. et al. PriorGrad: improving conditional denoising diffusion models with data-dependent adaptive prior. In Proc. International Conference on Learning Representations (ICLR, 2022).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).
Sillitoe, I. et al. CATH: increased structural coverage of functional space. Nucleic Acids Res. 49, D266–D273 (2021).
Zhang, Y. & Skolnick, J. Scoring function for automated assessment of protein structure template quality. Proteins 57, 702–710 (2004).
Xu, J. & Zhang, Y. How significant is a protein structure similarity with TM-score = 0.5? Bioinformatics 26, 889–895 (2010).
Lin, Y. & AlQuraishi, M. Generating Novel, Designable, and Diverse Protein Structures by Equivariantly Diffusing Oriented Residue Clouds. In Proc. International Conference on Machine Learning (ICML, 2023).
van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn Res. 9, 2579–2605 (2008).
Wang, J. et al. Scaffolding protein functional sites using deep learning. Science 377, 387–394 (2022).
Lee, W. C., Reniere, M. L., Skaar, E. P. & Murphy, M. E. Ruffling of metalloporphyrins bound to IsdG and IsdI, two heme-degrading enzymes in Staphylococcus aureus. J. Biol. Chem. 283, 30957–30963 (2008).
Skaar, E. P., Gaspar, A. H. & Schneewind, O. IsdG and IsdI, heme-degrading enzymes in the cytoplasm of Staphylococcus aureus. J. Biol. Chem. 279, 436–443 (2004).
Fetics, S. K. et al. Allosteric effects of the oncogenic RasQ61L mutant on Raf-RBD. Structure 23, 505–516 (2015).
Evans, R. et al. Protein complex prediction with AlphaFold-Multimer. Preprint at bioRxiv https://doi.org/10.1101/2021.10.04.463034 (2021).
Remy, I., Campbell-Valois, F. & Michnick, S. W. Detection of protein–protein interactions using a simple survival protein–fragment complementation assay based on the enzyme dihydrofolate reductase. Nat. Protoc. 2, 2120–2125 (2007).
Jing, B., Eismann, S., Suriana, P., Townshend, R. J. & Dror, R. Learning from protein structure with geometric vector perceptrons. In Proc. International Conference on Learning Representations (ICLR, 2021).
Wang, G. & Dunbrack, R. L. Jr PISCES: a protein sequence culling server. Bioinformatics 19, 1589–1591 (2003).
Wang, S. Source data for manuscript: de novo protein design with a denoising diffusion network independent of pre-trained structure prediction models. Zenodo https://doi.org/10.5281/zenodo.10911626 (2024).
Wang, S. De novo protein design with a denoising diffusion network independent of pre-trained structure prediction models. Zenodo https://doi.org/10.5281/zenodo.10947360 (2024).
Acknowledgements
We thank the staff from the BL18U1 and BL19U1 beamlines of the National Facility for Protein Science in Shanghai for their assistance during crystallographic data collection. We also thank X. Hu, R. Wu and L. Zhang for their help with experimental techniques, as well as M. Lv and H. Yu for their help with crystal collection. This work was supported by the National Key R&D Program of China (2022YFA1303700 to H.L. and 2022YFF1203100 to Q.C.), National Natural Science Foundation of China (T2221005, 92253302 and 22177107 to H.L.; 32371487 and 32171411 to Q.C.), CAS Strategic Priority Research Program (XDB0500201 to H.L.), CAS Project for Young Scientists in Basic Research (YSBR-072 to Q.C.), Anhui Provincial Natural Science Foundation (2308085J01 to Q.C.) and Research Funds of Center for Advanced Interdisciplinary Science and Biomedicine of IHM (QYPY20230035 to Q.C.). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Author information
Contributions
Y.L. developed computational models and codes with the assistance of L.C. S.W. carried out the experimental work with the help of J.D., X.W. and Y.W. L.W., F.L. and C.W. helped with analysis of crystal structural data. J.Z. collected the NMR data. S.W. participated in the discussion. H.L. and Q.C. supervised the project. H.L., Y.L., S.W. and Q.C. wrote the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks Arne Elofsson and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Arunima Singh, in collaboration with the Nature Methods team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Evaluation of the variant models ‘no ESM’, ‘compressed ESM’, ‘full ESM’ and ‘full ESM with GAN’.
(a) Distributions of TM-scores and RMSDs between the initial natural backbones and the denoised backbones generated by the variant models. For each model, 75 protein backbones were generated by considering 3 independent ‘denoising’ runs from each of 25 initial natural backbones. (b) Distributions of scTM-scores and scRMSDs between the denoised backbones and the AlphaFold2-predicted structures for amino acid sequences designed (with ABACUS-R) on corresponding denoised backbones. (c) Distributions of per-residue ABACUS-R logits scores of amino acid sequences designed by ABACUS-R for the various denoised backbones and for the initial natural backbones. Larger logits scores indicate better compatibility between the designed sequences and the corresponding backbone structures. (d) Left: backbones ‘denoised’ with the ‘full ESM’ model (orange) and the ‘full ESM with GAN’ model (green) from the same initial natural backbone 1e1qA01 (CATH domain ID); the RMSD between the two denoised backbones is indicated. Middle: the backbone denoised with the ‘full ESM’ model (orange) superimposed with the AlphaFold2-predicted structure (gray) for the amino acid sequence designed on this backbone by ABACUS-R; the corresponding scRMSD is indicated. Right: the backbone denoised with the ‘full ESM with GAN’ model (green) superimposed with the AlphaFold2-predicted structure (gray) for the amino acid sequence designed on this backbone by ABACUS-R; the corresponding scRMSD is indicated. (e) The distributions of the ABACUS-R logits of different amino acid sequences for 25 natural backbones. The pESM sequences were obtained by projecting the single representation parts from the SCUBA-D output using a residue type classifier network of ESM. The boxplots in A to C and E show median, interquartile range, and minimum and maximum values excluding outliers (>1.5 times the interquartile range beyond the box) with the sample sizes being 75 (for the denoised backbones) or 25 (for the natural backbones).
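The boxplot convention stated in this caption (whiskers at the minimum and maximum values after excluding points more than 1.5 times the interquartile range beyond the box) can be illustrated with a minimal Python sketch. The function name `boxplot_summary`, the choice of the "inclusive" quartile method and the example values are assumptions made for illustration; the caption does not specify how quartiles were computed, and the numbers below are not data from the paper.

```python
import statistics


def boxplot_summary(values):
    """Summarize a sample with the 1.5 x IQR outlier rule described in the caption."""
    data = sorted(values)
    # Assumption: "inclusive" quartiles; the plotting software actually used may differ.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Points beyond the fences are outliers; whiskers end at the extreme inliers.
    inliers = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": min(inliers),
        "whisker_high": max(inliers),
        "outliers": outliers,
    }


# Hypothetical scRMSD values in angstroms, for illustration only.
example = [0.8, 1.0, 1.1, 1.3, 1.4, 1.6, 1.9, 2.2, 9.5]
summary = boxplot_summary(example)
print(summary)
```

In this made-up sample the value 9.5 lies beyond the upper fence, so it is reported as an outlier and the upper whisker ends at 2.2.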
Extended Data Fig. 2 Comparisons between SCUBA-D and other DDPM models for unconditional backbone generation.
(a) Averaged metrics of various models. For each method, the averages over two groups (one group comprised 100 backbones of 100 residues in chain length and the other group comprised 300 backbones of 200 to 400 residues in chain length) are reported, with data in the parentheses reporting the total number of backbones with scRMSD below 2.5 Å or the total number of backbones of high overall structural novelty (the highest TM-score to PDB below 0.5). (b) Two example backbones of 100 residues (100-9 and 100-7) generated by SCUBA-D without condition and with their highest TM-scores to both PDB and AlphaFold2 database below or equal to 0.5. The generated backbones (in blue) and their superimpositions with the corresponding structures from PDB or AlphaFold2 database (in salmon) are shown. The respective TM-scores and PDB IDs (with chain IDs) are indicated. Here the scRMSD of a generated backbone was determined as the RMSD of the backbone from the AlphaFold2-predicted structure for the amino acid sequence designed (here with ProteinMPNN) for that backbone. (c) The structures of ten example backbones with RoseTTAFold2-based scRMSDs above 6.0 Å. Both the ESM prediction-based scRMSDs and the RoseTTAFold2 prediction-based scRMSDs are indicated.
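The counts reported in parentheses in panel (a) follow from two cutoffs given in the caption: scRMSD below 2.5 Å, and highest TM-score to PDB below 0.5. The following Python sketch shows how such a tally could be made once per-backbone scores are available. The `Backbone` record, the `summarize` helper and all numerical values are hypothetical and are not taken from the SCUBA-D code or results.

```python
from dataclasses import dataclass

SCRMSD_CUTOFF = 2.5      # angstroms, cutoff stated in the caption
NOVELTY_TM_CUTOFF = 0.5  # highest TM-score to PDB, cutoff stated in the caption


@dataclass
class Backbone:
    name: str             # hypothetical identifier
    sc_rmsd: float        # RMSD to the structure predicted for the designed sequence
    max_tm_to_pdb: float  # highest TM-score against PDB structures


def summarize(backbones):
    """Average scRMSD plus counts of low-scRMSD and structurally novel backbones."""
    n_low_scrmsd = sum(b.sc_rmsd < SCRMSD_CUTOFF for b in backbones)
    n_novel = sum(b.max_tm_to_pdb < NOVELTY_TM_CUTOFF for b in backbones)
    mean_sc_rmsd = sum(b.sc_rmsd for b in backbones) / len(backbones)
    return {
        "mean_sc_rmsd": mean_sc_rmsd,
        "n_low_scrmsd": n_low_scrmsd,
        "n_novel": n_novel,
    }


# Made-up scores for illustration only.
generated = [
    Backbone("bb1", 1.2, 0.62),
    Backbone("bb2", 3.4, 0.45),
    Backbone("bb3", 0.9, 0.48),
    Backbone("bb4", 2.1, 0.71),
]
report = summarize(generated)
print(report)
```

The scRMSD and TM-score values themselves would come from external tools (structure prediction and structural alignment), which this sketch does not attempt to reproduce.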
Extended Data Fig. 3 Example results of size-exclusion chromatography (SEC) experiments.
Proteins designed in the five different tasks as indicated were analyzed. For each task, three example results are shown in the same row. The protein IDs and the types of the SEC columns are indicated.
Extended Data Fig. 4 The deviations between the loops in designed structures and in solved crystal structures.
(a) The RMSDs between the loops. The analysis included the 6 crystal structures obtained for proteins of backbones generated by SCUBA-D without condition. Each point corresponds to a loop, with the loops grouped according to their lengths and those of the same length displayed in the same column. The RMSDs were calculated by superimposing the flanking secondary structure segments for a pair of compared loops. An example shows the superimposed structures of a designed loop (blue) and the corresponding loop in the crystal structure (orange), with the RMSD indicated. (b) The same as A, but for the 6 experimentally determined structures of proteins generated for particular architectures.
Extended Data Fig. 5 Protein backbone generation without condition or with biased secondary structure (SS) distributions.
(a) The distribution of the mutual TM-scores between the set of backbones unconditionally generated by SCUBA-D. (b) Histograms of the proportions of residues in the α helix state (upper panel) and of residues in the β strand state (lower panel) for the set of unconditionally generated backbones with SCUBA-D (blue) and for a set of natural protein structures (salmon), which comprised PDB structures of resolutions higher than 2.0 Å, of mutual sequence identities below 40%, and of 100 to 500 residues in length. The proportions were calculated on individual backbones. The histograms represent the normalized frequencies of backbones with proportions in specific bins. (c) Scatter plot of the recovery rates of the input secondary structure (SS) states versus the scRMSDs for the set of 225 backbones generated using the 25 input structures. Each input structure was composed according to the SS distribution of a natural backbone. The gray box indicates the region with scRMSD < 2.0 Å and SS recovery rate > 70%. (d) The scRMSDs of the backbones generated with biased SS distributions and the SS recovery rate. For each SS distribution, 9 backbones were generated and evaluated, one data point in the plots corresponding to one designed backbone. The results for three different classes of SS distributions (all-α, all-β, and mixed αβ) are displayed in different plots. Within each plot, results biased towards the same SS distribution are numbered the same and displayed in the same column. Results for different SS distributions were arranged from left to right in an ascending order of the corresponding chain lengths.
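The per-backbone helix and strand proportions in panel (b) and the SS recovery rate in panels (c) and (d) can be sketched in Python as follows. This assumes that secondary structure is available as a per-residue three-state string (H for helix, E for strand, C for other), and that the recovery rate is the fraction of positions at which the generated backbone has the same state as the input. Both are assumptions made for illustration; the caption does not define the encoding or the exact formula, and the strings below are made up.

```python
def ss_fractions(ss):
    """Return the fractions of helix (H) and strand (E) residues in one backbone."""
    n = len(ss)
    return ss.count("H") / n, ss.count("E") / n


def ss_recovery(input_ss, generated_ss):
    """Fraction of positions whose SS state matches the input (assumed definition)."""
    if len(input_ss) != len(generated_ss):
        raise ValueError("SS strings must have the same length")
    matches = sum(a == b for a, b in zip(input_ss, generated_ss))
    return matches / len(input_ss)


# Hypothetical three-state SS strings, for illustration only.
input_ss = "CHHHHHHCCEEEEC"
generated_ss = "CHHHHHCCCEEEEC"

helix_frac, strand_frac = ss_fractions(generated_ss)
recovery = ss_recovery(input_ss, generated_ss)
sc_rmsd = 1.4  # hypothetical scRMSD in angstroms

# Thresholds of the gray box in panel (c): scRMSD < 2.0 and SS recovery > 70%.
in_gray_box = sc_rmsd < 2.0 and recovery > 0.70
print(helix_frac, strand_frac, recovery, in_gray_box)
```

In practice the SS strings would be assigned by a tool such as DSSP and reduced to three states before a comparison of this kind.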
Extended Data Fig. 6 Backbone generation with sketched input structures.
(a) Example scatter plots of scRMSD versus TM-score to initial structure for the backbones generated from initial structures ‘sketched’ according to three architectures of different natural proteins. The examples were of different fold classes (all-α, all-β, and mixed αβ). For each architecture, backbones were generated by applying SCUBA-D to 60 independently ‘sketched’ input structures. The dashed boxes indicate regions with scRMSDs < 2.0 Å and the TM-scores to initial backbones > 0.5. (b) An example for which no generated backbone for the particular architecture meets the criteria of scRMSD < 2.0 Å and TM-score > 0.5. Left: the scatter plot of scRMSD versus the TM-score to the initial backbone for the architecture. Middle: an example of the initial backbone. Right: an example generated backbone.
Extended Data Fig. 7 Designing proteins of the (αβ)n-barrel and the (β4)n-propeller architectures.
(a) Left: an example initial structure ‘sketched’ according to the (αβ)15-barrel architecture. Right: example backbones (blue) generated for the (αβ)n-barrel architectures superimposed with structures predicted by AlphaFold2 (gray) for amino acid sequences designed for these backbones with ProteinMPNN. The scRMSDs of the superimpositions are indicated. For each value of the repeat number n from 9 to 15, one example is shown. (b) The same as A, but for the (β4)n-propeller architectures with n ranging from 7 to 11. (c) Left: the crystal structure (gold and salmon) and the designed backbone (blue) of the designed (αβ)9-barrel protein T01. The crystal structure presents a domain-swapped dimer, with the monomers colored differently. The designed backbone is superimposed with one of the monomers. Right: the results of SEC (black curve) and static light scattering (red curve) experiments on T01, which indicate that the protein exists in the monomeric state in solution. (d) Left: the crystal structure (gold, yellow, and salmon) and the designed backbone (blue) of the designed (αβ)9-barrel protein T11. The crystal structure presents a domain-swapped trimer, with the monomers colored differently. The designed backbone is superimposed with one of the monomers. Right: the results of SEC (black curve) and static light scattering (red curve) experiments on T11, which indicate that the protein exists in the monomeric state in solution.
Extended Data Fig. 8 Designed heme-binding proteins.
(a) Scatter plot and histograms of the scRMSD and pLDDT scores of the designed heme-binding backbones. Structure predictions with AlphaFold2 were performed for amino acid sequences designed with the ABACUS-R program. (b) UV-Visible absorbance spectra of 9 designed heme-binding proteins are shown with the topology diagrams of the corresponding proteins. ‘NC’ represents negative control. ‘IsdG’ represents the natural iron-regulated surface determinant G protein which served as a positive control. Heme binding is indicated by the presence of the peak around 412 nm. (c) Experimental characterizations of the designed heme-binding protein H6. Left: SEC result. Middle: NMR 15N-1H HSQC spectrum. Right: ITC measurements on heme binding. (d) The same as C, but for H8. (e) The result of ITC experiments measuring the KD values of heme binding by the natural protein IsdG. (f) UV-Visible absorbance spectra showing the impacts of mutating the iron-coordinating histidine residues in the natural protein and the designed heme-binding proteins. Each panel shows the spectrum of a mutated protein together with the spectra of the corresponding original protein and of a non-heme-binding negative control protein (labeled as ‘NC’).
Extended Data Fig. 9 The designed structures and experimental characterizations of the Ras-binding proteins 90-4, 90-2 and 120-4.
(a) The designed proteins (90-4, 90-2 and 120-4, colored in blue) are correspondingly superimposed with the predicted structures (gray) in complex with Ras (green). For each designed protein, the residue to be mutated is shown with its surrounding residues in the predicted structure next to the overall superimposition. The scRMSD and ligand pLDDT are indicated. (b) NMR 15N-1H HSQC spectrum of Ras-binding protein 90-2. (c) The results of ITC measurements on the Ras binding of 90-4, 90-2 and 120-4 and their mutated variants. (d) The results of ITC measurements on the Ras binding of Raf-RBD and the mutated variant of Raf-RBD (R89L).
Extended Data Fig. 10 Assessing the designed Ras-binding proteins with the dihydrofolate reductase (DHFR)-based protein complementarity analysis assay.
(a) The protein complementarity analysis results on 14 designed Ras-binding proteins. In these experiments, the peptide chain of DHFR is split into two parts. Ras and the protein to be assessed were separately fused with each part. Bacterial cells expressing the two fused peptides were diluted to different levels of concentrations and titered on media containing different levels of trimethoprim (TMP), which can inhibit the endogenous DHFR activity of the cells. Possible binding between Ras and the protein to be assessed was detected through the resistance of the bacterial cells to the growth inhibition by TMP. The label ‘Raf-RBD’ represents the Ras-binding domain of Raf, which served as a positive control. The label ‘Raf-RBD R89L’ represents a mutant with abolished Ras binding activity, which served as a negative control. The stronger TMP resistance (relative to the negative control) exhibited by the cells expressing fusion peptides of the designed proteins indicated that the designed proteins examined here can bind Ras. (b) Results of competitive DHFR-PCA analysis of 4 designed proteins. In the experiments examining a designed protein, cells co-expressing isolated Raf-RBD and the DHFR-PCA system for the designed protein were analyzed. If the designed protein and Raf-RBD share binding sites on Ras, the expression of Raf-RBD, which is induced by L-arabinose, will lead to the competitive inhibition of the Ras binding of the designed protein, detected as reduced resistance to TMP.
Supplementary information
Supplementary Information
Supplementary Methods, Fig. 1, Tables 1–11 and References.
Supplementary Data 1
Raw data for Supplementary Fig. 1.
Supplementary Data 2
Partial computational data and crystallographic data for Supplementary Tables 1–4.
Supplementary Data 3
DNA sequences of experimentally characterized proteins.
Source data
Source Data Fig. 2
Statistical source data.
Source Data Fig. 3
Statistical source data.
Source Data Fig. 4
Statistical source data and unprocessed NMR data.
Source Data Fig. 5
Statistical source data.
Source Data Extended Data Fig./Table 1
Statistical source data.
Source Data Extended Data Fig./Table 2
Statistical source data.
Source Data Extended Data Fig./Table 3
Statistical source data.
Source Data Extended Data Fig./Table 4
Statistical source data.
Source Data Extended Data Fig./Table 5
Statistical source data.
Source Data Extended Data Fig./Table 6
Statistical source data.
Source Data Extended Data Fig./Table 7
Statistical source data.
Source Data Extended Data Fig./Table 8
Statistical source data and unprocessed NMR data.
Source Data Extended Data Fig./Table 9
Statistical source data and unprocessed NMR data.
Source Data Extended Data Fig./Table 10
Unprocessed images.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liu, Y., Wang, S., Dong, J. et al. De novo protein design with a denoising diffusion network independent of pretrained structure prediction models. Nat Methods 21, 2107–2116 (2024). https://doi.org/10.1038/s41592-024-02437-w
DOI: https://doi.org/10.1038/s41592-024-02437-w