  • Article

ECloudGen: leveraging electron clouds as a latent variable to scale up structure-based molecular design

A preprint version of the article is available at bioRxiv.

Abstract

Structure-based molecule generation represents a notable advancement in artificial intelligence-driven drug design. However, progress in this field is constrained by the scarcity of structural data on protein–ligand complexes. Here we propose a latent variable approach that bridges the gap between ligand-only data and protein–ligand complexes, enabling target-aware generative models to explore a broader chemical space, thereby enhancing the quality of molecular generation. Inspired by quantum molecular simulations, we introduce ECloudGen, a generative model that leverages electron clouds as meaningful latent variables. ECloudGen incorporates techniques such as latent diffusion models, Llama architectures and a contrastive learning task, which organizes the chemical space into a structured and highly interpretable latent representation. Benchmark studies demonstrate that ECloudGen outperforms state-of-the-art methods by generating more potent binders with superior physicochemical properties and by covering a broader chemical space. The incorporation of electron clouds as latent variables not only improves generative performance but also introduces model-level interpretability, as illustrated in our case studies.
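
The contrastive learning task mentioned in the abstract can be illustrated with a minimal sketch. It assumes a symmetric InfoNCE objective (in the spirit of ref. 52) over paired embeddings, where row i of one batch is the positive match for row i of the other. The function name `info_nce`, the temperature value, the embedding sizes and the random stand-in data are illustrative assumptions and are not taken from the ECloudGen code.

```python
import numpy as np


def info_nce(a: np.ndarray, b: np.ndarray, temperature: float = 0.1) -> float:
    """Symmetric InfoNCE loss; row i of `a` is the positive pair of row i of `b`."""
    # L2-normalise so that dot products are cosine similarities.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # (n, n) similarity matrix

    def cross_entropy_diag(z: np.ndarray) -> float:
        # Numerically stable row-wise log-softmax; the target class is the diagonal.
        z = z - z.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return float(-np.mean(np.diag(log_probs)))

    # Average the a->b and b->a directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))


# Hypothetical stand-ins for two views of the same 8 molecules
# (for example, an electron-cloud embedding and a ligand embedding).
rng = np.random.default_rng(0)
view_a = rng.normal(size=(8, 16))
view_b = view_a + 0.01 * rng.normal(size=(8, 16))

matched = info_nce(view_a, view_b)           # correct pairing
mismatched = info_nce(view_a, view_b[::-1])  # pairing deliberately reversed
print(matched < mismatched)
```

Minimizing such a loss pulls paired embeddings together and pushes unpaired ones apart, which is one common way to obtain a structured latent space; the actual objective used by the authors may differ.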

Fig. 1: Overview of the sparse chemical generation space paradox and the ECloudGen framework.
Fig. 2: Quantitative metrics and latent space mapping in ECloudGen.
Fig. 3: Workflow of ECloudGen on the V2R design case.
Fig. 4: Workflow of ECloudGen on the BRD4 design case.

Data availability

The data are available via Zenodo at https://zenodo.org/records/16846544 (ref. 56). Source data are provided with this paper.

Code availability

The source code is freely available via GitHub at https://github.com/HaotianZhangAI4Science/ECloudGen and via Zenodo https://zenodo.org/records/16876908 (ref. 57) to allow replication of the results.

References

  1. Meganck, R. M. & Baric, R. S. Developing therapeutic approaches for twenty-first-century emerging infectious viral diseases. Nat. Med. 27, 401–410 (2021).

  2. Xu, Y. et al. Deep learning for molecular generation. Fut. Med. Chem. 11, 567–597 (2019).

  3. Li, Z. MolGAN without mode collapse. GitHub https://github.com/ZiyaoLi/molgan-without-mode-collapse (2020).

  4. Madhawa, K., Ishiguro, K., Nakago, K. & Abe, M. GraphNVP: an invertible flow model for generating molecular graphs. Preprint at https://arxiv.org/abs/1905.11600 (2019).

  5. Zhang, O. et al. ResGen is a pocket-aware 3D molecular generation model based on parallel multiscale modelling. Nat. Mach. Intell. 5, 1020–1030 (2023).

  6. Song, Y. et al. Score-based generative modeling through stochastic differential equations. In Proc. International Conference on Learning Representations (ICLR, 2020).

  7. Touvron, H. et al. LLaMA: open and efficient foundation language models. Preprint at https://arxiv.org/abs/2302.13971 (2023).

  8. Lin, H. et al. DiffBP: generative diffusion of 3D molecules for target protein binding. Chem. Sci. 16, 1417–1431 (2025).

  9. Schneuing, A. et al. Structure-based drug design with equivariant diffusion models. Preprint at https://arxiv.org/abs/2210.13695 (2022).

  10. Guan, J. et al. 3D equivariant diffusion for target-aware molecule generation and affinity prediction. Preprint at https://arxiv.org/abs/2303.03543 (2023).

  11. Peng, X. et al. Pocket2Mol: efficient molecular sampling based on 3D protein pockets. In International Conference on Machine Learning 17644–17655 (PMLR, 2022).

  12. Zhang, O. et al. Learning on topological surface and geometric structure for 3D molecular generation. Nat. Comput. Sci. 3, 849–859 (2023).

  13. Feng, W. et al. Generation of 3D molecules in pockets via a language model. Nat. Mach. Intell. 6, 62–73 (2024).

  14. Gao, Z., Hu, Y., Tan, C. & Li, S. Z. PrefixMol: target- and chemistry-aware molecule design via prefix embedding. Preprint at https://arxiv.org/abs/2302.07120 (2023).

  15. Jiang, Y. et al. PocketFlow is a data-and-knowledge-driven structure-based molecular generative model. Nat. Mach. Intell. 6, 326–337 (2024).

  16. Chen, S. et al. Deep lead optimization enveloped in protein pocket and its application in designing potent and selective ligands targeting LTK protein. Nat. Mach. Intell. 7, 448–458 (2025).

  17. Irwin, J. J. & Shoichet, B. K. ZINC—a free database of commercially available compounds for virtual screening. J. Chem. Inf. Model. 45, 177–182 (2005).

  18. Wang, R., Fang, X., Lu, Y., Yang, C.-Y. & Wang, S. The PDBbind database: methodologies and updates. J. Med. Chem. 48, 4111–4119 (2005).

  19. Polishchuk, P. G., Madzhidov, T. I. & Varnek, A. Estimation of the size of drug-like chemical space based on GDB-17 data. J. Comput. Aided Mol. Des. 27, 675–679 (2013).

  20. Leckband, D. & Israelachvili, J. Intermolecular forces in biology. Q. Rev. Biophys. 34, 105–267 (2001).

  21. Francoeur, P. G. et al. Three-dimensional convolutional neural networks and a cross-docked data set for structure-based drug design. J. Chem. Inf. Model. 60, 4200–4215 (2020).

  22. Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. High-resolution image synthesis with latent diffusion models. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 10684–10695 (Curran Assoc., 2022).

  23. Zhang, Z., Min, Y., Zheng, S. & Liu, Q. Molecule generation for target protein binding with structural motifs. In International Conference on Learning Representations (ICLR, 2023).

  24. Zhang, O. et al. FragGen: towards 3D geometry reliable fragment-based molecular generation. Chem. Sci. 15, 19452–19465 (2024).

  25. Liu, M. et al. Generating 3D molecules for target protein binding. In International Conference on Machine Learning 13912–13924 (PMLR, 2022).

  26. Trott, O. & Olson, A. J. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J. Comput. Chem. 31, 455–461 (2010).

  27. Abad-Zapatero, C. & Metz, J. T. Ligand efficiency indices as guideposts for drug discovery. Drug Discov. Today 10, 464–469 (2005).

  28. Hopkins, A. L., Keserü, G. M., Leeson, P. D., Rees, D. C. & Reynolds, C. H. The role of ligand efficiency metrics in drug discovery. Nat. Rev. Drug Discov. 13, 105–121 (2014).

  29. Reynolds, C. H., Tounge, B. A. & Bembenek, S. D. Ligand binding efficiency: trends, physical basis, and implications. J. Med. Chem. 51, 2432–2438 (2008).

  30. Clark, D. E. & Pickett, S. D. Computational methods for the prediction of ‘drug-likeness’. Drug Discov. Today 5, 49–58 (2000).

  31. Ertl, P. & Schuffenhauer, A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J. Cheminformatics 1, 1–11 (2009).

  32. Ganesan, A. The impact of natural products upon modern drug discovery. Curr. Opin. Chem. Biol. 12, 306–317 (2008).

  33. Sangster, J. Octanol–water partition coefficients of simple organic compounds. J. Phys. Chem. Ref. Data 18, 1111–1229 (1989).

  34. Xie, Y., Xu, Z., Ma, J. & Mei, Q. How much of the chemical space has been explored? Selecting the right exploration measure for drug discovery. In 39th International Conference on Machine Learning (ICML, 2022).

  35. Edelsbrunner, H. & Harer, J. L. Computational Topology: An Introduction (American Mathematical Society, 2022).

  36. Zhang, O. et al. Deep lead optimization: leveraging generative AI for structural modification. J. Am. Chem. Soc. 146, 31357–31370 (2024).

  37. Liu, Z., Ma, Y., Schubert, M., Ouyang, Y. & Xiong, Z. Multi-modal contrastive pre-training for recommendation. In International Conference on Multimedia Retrieval 99–108 (ACM/PMLR, 2022).

  38. Barducci, A., Bonomi, M. & Parrinello, M. Metadynamics. Wiley Interdiscip. Rev. Comput. Mol. Sci. 1, 826–843 (2011).

  39. Fernández Martínez, J. L. & García Gonzalo, E. The PSO family: deduction, stochastic analysis and comparison. Swarm Intell. 3, 245–273 (2009).

  40. Chen, Z., Min, M. R., Parthasarathy, S. & Ning, X. A deep generative model for molecule optimization via one fragment modification. Nat. Mach. Intell. 3, 1040–1049 (2021).

  41. Jin, W., Barzilay, R. & Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. In International Conference on Machine Learning 2323–2332 (PMLR, 2018).

  42. Jin, W., Yang, K., Barzilay, R. & Jaakkola, T. Learning multimodal graph-to-graph translation for molecular optimization. In International Conference on Machine Learning 4839–4848 (PMLR, 2019).

  43. Wong, L. L. & Verbalis, J. G. Vasopressin V2 receptor antagonists. Cardiovasc. Res. 51, 391–402 (2001).

  44. Donati, B., Lorenzini, E. & Ciarrocchi, A. BRD4 and cancer: going beyond transcriptional regulation. Mol. Cancer 17, 164 (2018).

  45. Huang, W., Zheng, X., Yang, Y., Wang, X. & Shen, Z. An overview on small molecule inhibitors of BRD4. Mini Rev. Med. Chem. 16, 1403–1414 (2016).

  46. Sun, M. et al. MolSearch: search-based multi-objective molecular generation and property optimization. In Proc. 28th KDD Conference on Knowledge Discovery and Data Mining 4724–4732 (ACM, 2022).

  47. Filippakopoulos, P. et al. Histone recognition and large-scale structural analysis of the human bromodomain family. Cell 149, 214–231 (2012).

  48. Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 33, 6840–6851 (2020).

  49. Çiçek, Ö., Abdulkadir, A., Lienkamp, S. S., Brox, T. & Ronneberger, O. 3D U-Net: learning dense volumetric segmentation from sparse annotation. In Medical Image Computing and Computer-Assisted Intervention 424–432 (Springer, 2016).

  50. Chowdhery, A. et al. PaLM: scaling language modeling with pathways. J. Mach. Learn. Res. 24, 1–113 (2023).

  51. Su, J. et al. RoFormer: enhanced transformer with rotary position embedding. Neurocomputing 568, 127063 (2024).

  52. Oord, A. V. D., Li, Y. & Vinyals, O. Representation learning with contrastive predictive coding. Preprint at https://arxiv.org/abs/1807.03748 (2018).

  53. Schrödinger, E. An undulatory theory of the mechanics of atoms and molecules. Phys. Rev. 28, 1049 (1926).

  54. Bannwarth, C., Ehlert, S. & Grimme, S. GFN2-xTB—an accurate and broadly parametrized self-consistent tight-binding quantum chemical method with multipole electrostatics and density-dependent dispersion contributions. J. Chem. Theory Comput. 15, 1652–1671 (2019).

  55. Liu, C. et al. Revisit ligand-receptor interaction at the human vasopressin V2 receptor: a kinetic perspective. Eur. J. Pharmacol. 880, 173157 (2020).

  56. Zhang, O. et al. ECloudGen dataset. Zenodo https://doi.org/10.5281/zenodo.16846544 (2025).

  57. Zhang, O. et al. ECloudGen v1 code. Zenodo https://doi.org/10.5281/zenodo.16876908 (2025).

Acknowledgements

This study was supported by the National Key Research and Development Program of China (grant no. 2024YFA1300051), the National Natural Science Foundation of China (grant nos. 22220102001, 82204279 and 92370130), the Scientific Research Foundation for Talented Scholars in Xuzhou Medical University (grant no. D2024005) and the Natural Science Foundation of Jiangsu Province (grant no. BK.20241043).

Author information

Contributions

O.Z. led the overall conceptualization, formal analysis, investigation, project administration and validation and completed both the original draft and subsequent revisions. J.J. was responsible for the overall model training. Z.W. and Y.H. were involved in the overall model training. J.Z. was responsible for the quantum chemistry component. P.Y. was responsible for molecular synthesis. Y.Y. and H.L. were involved in the ablation study. H. Zhong was responsible for the BRD4 target experiments. X.Z. and C.H. were involved in the computational design for two targets. W.Z. was responsible for unifying the figure style. Z.Z. was responsible for the IC50 determination experiments. K.Y. and H. Zhao were involved in the conceptualization of the ECloudDecipher model. Y.K., P.P. and J.W. were involved in the design and discussion of two experiments. D.G. and S.Z. provided experimental platform support. C.-Y.H. reviewed the paper comprehensively and participated in the overall discussion. T.H. supervised the entire work, guided the scientific direction and contributed to critical paper revisions.

Corresponding authors

Correspondence to Dong Guo, Shuangjia Zheng, Chang-Yu Hsieh or Tingjun Hou.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Jiahui Chen, Zhiwei Feng and Duc Duy Nguyen for their contribution to the peer review of this work. Primary Handling Editor: Kaitlin McCardle, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–3, Tables 1 and 2 and discussion.

Reporting Summary

Source data

Source Data Fig. 3

Unprocessed IC50 data.

Source Data Fig. 4

Unprocessed IC50 data.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, O., Jin, J., Wu, Z. et al. ECloudGen: leveraging electron clouds as a latent variable to scale up structure-based molecular design. Nat Comput Sci (2025). https://doi.org/10.1038/s43588-025-00886-7

  • DOI: https://doi.org/10.1038/s43588-025-00886-7