Abstract
Structure-based molecule generation represents a notable advancement in artificial intelligence-driven drug design. However, progress in this field is constrained by the scarcity of structural data on protein–ligand complexes. Here we propose a latent variable approach that bridges the gap between ligand-only data and protein–ligand complexes, enabling target-aware generative models to explore a broader chemical space and thereby improving the quality of molecular generation. Inspired by quantum molecular simulations, we introduce ECloudGen, a generative model that leverages electron clouds as meaningful latent variables. ECloudGen incorporates latent diffusion models, Llama architectures and a contrastive learning task that organizes the chemical space into a structured and highly interpretable latent representation. Benchmark studies demonstrate that ECloudGen outperforms state-of-the-art methods by generating more potent binders with superior physicochemical properties and by covering a broader chemical space. The incorporation of electron clouds as latent variables not only improves generative performance but also introduces model-level interpretability, as illustrated in our case studies.
Data availability
The data are available via Zenodo at https://zenodo.org/records/16846544 (ref. 56). Source data are provided with this paper.
Code availability
The source code is freely available via GitHub at https://github.com/HaotianZhangAI4Science/ECloudGen and via Zenodo at https://zenodo.org/records/16876908 (ref. 57) to allow replication of the results.
References
Meganck, R. M. & Baric, R. S. Developing therapeutic approaches for twenty-first-century emerging infectious viral diseases. Nat. Med. 27, 401–410 (2021).
Xu, Y. et al. Deep learning for molecular generation. Fut. Med. Chem. 11, 567–597 (2019).
Li, Z. MolGAN without mode collapse. GitHub https://github.com/ZiyaoLi/molgan-without-mode-collapse (2020).
Madhawa, K., Ishiguro, K., Nakago, K. & Abe, M. GraphNVP: an invertible flow model for generating molecular graphs. Preprint at https://arxiv.org/abs/1905.11600 (2019).
Zhang, O. et al. ResGen is a pocket-aware 3D molecular generation model based on parallel multiscale modelling. Nat. Mach. Intell. 5, 1020–1030 (2023).
Song, Y. et al. Score-based generative modeling through stochastic differential equations. In Proc. International Conference on Learning Representations (ICLR, 2020).
Touvron, H. et al. LLaMA: open and efficient foundation language models. Preprint at https://arxiv.org/abs/2302.13971 (2023).
Lin, H. et al. DiffBP: generative diffusion of 3D molecules for target protein binding. Chem. Sci. 16, 1417–1431 (2025).
Schneuing, A. et al. Structure-based drug design with equivariant diffusion models. Preprint at https://arxiv.org/abs/2210.13695 (2022).
Guan, J. et al. 3D equivariant diffusion for target-aware molecule generation and affinity prediction. Preprint at https://arxiv.org/abs/2303.03543 (2023).
Peng, X. et al. Pocket2Mol: efficient molecular sampling based on 3D protein pockets. In International Conference on Machine Learning 17644–17655 (PMLR, 2022).
Zhang, O. et al. Learning on topological surface and geometric structure for 3D molecular generation. Nat. Comput. Sci. 3, 849–859 (2023).
Feng, W. et al. Generation of 3D molecules in pockets via a language model. Nat. Mach. Intell. 6, 62–73 (2024).
Gao, Z., Hu, Y., Tan, C. & Li, S. Z. PrefixMol: target- and chemistry-aware molecule design via prefix embedding. Preprint at https://arxiv.org/abs/2302.07120 (2023).
Jiang, Y. et al. PocketFlow is a data-and-knowledge-driven structure-based molecular generative model. Nat. Mach. Intell. 6, 326–337 (2024).
Chen, S. et al. Deep lead optimization enveloped in protein pocket and its application in designing potent and selective ligands targeting LTK protein. Nat. Mach. Intell. 7, 448–458 (2025).
Irwin, J. J. & Shoichet, B. K. ZINC—a free database of commercially available compounds for virtual screening. J. Chem. Inf. Model. 45, 177–182 (2005).
Wang, R., Fang, X., Lu, Y., Yang, C.-Y. & Wang, S. The PDBbind database: methodologies and updates. J. Med. Chem. 48, 4111–4119 (2005).
Polishchuk, P. G., Madzhidov, T. I. & Varnek, A. Estimation of the size of drug-like chemical space based on GDB-17 data. J. Comput. Aided Mol. Des. 27, 675–679 (2013).
Leckband, D. & Israelachvili, J. Intermolecular forces in biology. Q. Rev. Biophys. 34, 105–267 (2001).
Francoeur, P. G. et al. Three-dimensional convolutional neural networks and a cross-docked data set for structure-based drug design. J. Chem. Inf. Model. 60, 4200–4215 (2020).
Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. High-resolution image synthesis with latent diffusion models. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 10684–10695 (Curran Assoc., 2022).
Zhang, Z., Min, Y., Zheng, S. & Liu, Q. Molecule generation for target protein binding with structural motifs. In International Conference on Learning Representations (ICLR, 2023).
Zhang, O. et al. FragGen: towards 3D geometry reliable fragment-based molecular generation. Chem. Sci. 15, 19452–19465 (2024).
Liu, M. et al. Generating 3D molecules for target protein binding. In International Conference on Machine Learning 13912–13924 (PMLR, 2022).
Trott, O. & Olson, A. J. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J. Comput. Chem. 31, 455–461 (2010).
Abad-Zapatero, C. & Metz, J. T. Ligand efficiency indices as guideposts for drug discovery. Drug Discov. Today 10, 464–469 (2005).
Hopkins, A. L., Keserü, G. M., Leeson, P. D., Rees, D. C. & Reynolds, C. H. The role of ligand efficiency metrics in drug discovery. Nat. Rev. Drug Discov. 13, 105–121 (2014).
Reynolds, C. H., Tounge, B. A. & Bembenek, S. D. Ligand binding efficiency: trends, physical basis, and implications. J. Med. Chem. 51, 2432–2438 (2008).
Clark, D. E. & Pickett, S. D. Computational methods for the prediction of ‘drug-likeness’. Drug Discov. Today 5, 49–58 (2000).
Ertl, P. & Schuffenhauer, A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J. Cheminform. 1, 1–11 (2009).
Ganesan, A. The impact of natural products upon modern drug discovery. Curr. Opin. Chem. Biol. 12, 306–317 (2008).
Sangster, J. Octanol–water partition coefficients of simple organic compounds. J. Phys. Chem. Ref. Data 18, 1111–1229 (1989).
Xie, Y., Xu, Z., Ma, J. & Mei, Q. How much of the chemical space has been explored? Selecting the right exploration measure for drug discovery. In 39th International Conference on Machine Learning (ICML, 2022).
Edelsbrunner, H. & Harer, J. L. Computational Topology: An Introduction (American Mathematical Society, 2022).
Zhang, O. et al. Deep lead optimization: leveraging generative AI for structural modification. J. Am. Chem. Soc. 146, 31357–31370 (2024).
Liu, Z., Ma, Y., Schubert, M., Ouyang, Y. & Xiong, Z. Multi-modal contrastive pre-training for recommendation. In International Conference on Multimedia Retrieval 99–108 (ACM/PMLR, 2022).
Barducci, A., Bonomi, M. & Parrinello, M. Metadynamics. Wiley Interdiscip. Rev. Comput. Mol. Sci. 1, 826–843 (2011).
Fernández Martínez, J. L. & García Gonzalo, E. The PSO family: deduction, stochastic analysis and comparison. Swarm Intell. 3, 245–273 (2009).
Chen, Z., Min, M. R., Parthasarathy, S. & Ning, X. A deep generative model for molecule optimization via one fragment modification. Nat. Mach. Intell. 3, 1040–1049 (2021).
Jin, W., Barzilay, R. & Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. In International Conference on Machine Learning 2323–2332 (PMLR, 2018).
Jin, W., Yang, K., Barzilay, R. & Jaakkola, T. Learning multimodal graph-to-graph translation for molecular optimization. In International Conference on Machine Learning 4839–4848 (PMLR, 2019).
Wong, L. L. & Verbalis, J. G. Vasopressin V2 receptor antagonists. Cardiovasc. Res. 51, 391–402 (2001).
Donati, B., Lorenzini, E. & Ciarrocchi, A. BRD4 and cancer: going beyond transcriptional regulation. Mol. Cancer 17, 164 (2018).
Huang, W., Zheng, X., Yang, Y., Wang, X. & Shen, Z. An overview on small molecule inhibitors of BRD4. Mini Rev. Med. Chem. 16, 1403–1414 (2016).
Sun, M. et al. MolSearch: search-based multi-objective molecular generation and property optimization. In Proc. 28th KDD Conference on Knowledge Discovery and Data Mining 4724–4732 (ACM, 2022).
Filippakopoulos, P. et al. Histone recognition and large-scale structural analysis of the human bromodomain family. Cell 149, 214–231 (2012).
Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 33, 6840–6851 (2020).
Çiçek, Ö., Abdulkadir, A., Lienkamp, S. S., Brox, T. & Ronneberger, O. 3D U-Net: learning dense volumetric segmentation from sparse annotation. In Medical Image Computing and Computer-Assisted Intervention 424–432 (Springer, 2016).
Chowdhery, A. et al. PaLM: scaling language modeling with Pathways. J. Mach. Learn. Res. 24, 1–113 (2023).
Su, J. et al. RoFormer: enhanced transformer with rotary position embedding. Neurocomputing 568, 127063 (2024).
Oord, A. V. D., Li, Y. & Vinyals, O. Representation learning with contrastive predictive coding. Preprint at https://arxiv.org/abs/1807.03748 (2018).
Schrödinger, E. An undulatory theory of the mechanics of atoms and molecules. Phys. Rev. 28, 1049 (1926).
Bannwarth, C., Ehlert, S. & Grimme, S. GFN2-xTB—an accurate and broadly parametrized self-consistent tight-binding quantum chemical method with multipole electrostatics and density-dependent dispersion contributions. J. Chem. Theory Comput. 15, 1652–1671 (2019).
Liu, C. et al. Revisit ligand-receptor interaction at the human vasopressin V2 receptor: a kinetic perspective. Eur. J. Pharmacol. 880, 173157 (2020).
Zhang, O. et al. ECloudGen dataset. Zenodo https://doi.org/10.5281/zenodo.16846544 (2025).
Zhang, O. et al. ECloudGen v1 code. Zenodo https://doi.org/10.5281/zenodo.16876908 (2025).
Acknowledgements
This study was supported by the National Key Research and Development Program of China (grant no. 2024YFA1300051), the National Natural Science Foundation of China (grant nos. 22220102001, 82204279 and 92370130), the Scientific Research Foundation for Talented Scholars in Xuzhou Medical University (grant no. D2024005) and the Natural Science Foundation of Jiangsu Province (grant no. BK.20241043).
Author information
Contributions
O.Z. led the overall conceptualization, formal analysis, investigation, project administration and validation and completed both the original draft and subsequent revisions. J.J. was responsible for the overall model training. Z.W. and Y.H. were involved in the overall model training. J.Z. was responsible for the quantum chemistry component. P.Y. was responsible for molecular synthesis. Y.Y. and H.L. were involved in the ablation study. H. Zhong was responsible for the BRD4 target experiments. X.Z. and C.H. were involved in the computational design for two targets. W.Z. was responsible for unifying the figure style. Z.Z. was responsible for the IC50 determination experiments. K.Y. and H. Zhao were involved in the conceptualization of the ECloudDecipher model. Y.K., P.P. and J.W. were involved in the design and discussion of two experiments. D.G. and S.Z. provided experimental platform support. C.-Y.H. reviewed the paper comprehensively and participated in the overall discussion. T.H. supervised the entire work, guided the scientific direction and contributed to critical paper revisions.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Computational Science thanks Jiahui Chen, Zhiwei Feng and Duc Duy Nguyen for their contribution to the peer review of this work. Primary Handling Editor: Kaitlin McCardle, in collaboration with the Nature Computational Science team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figs. 1–3, Tables 1 and 2 and discussion.
Source data
Source Data Fig. 3
Unprocessed IC50 data.
Source Data Fig. 4
Unprocessed IC50 data.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, O., Jin, J., Wu, Z. et al. ECloudGen: leveraging electron clouds as a latent variable to scale up structure-based molecular design. Nat Comput Sci (2025). https://doi.org/10.1038/s43588-025-00886-7
DOI: https://doi.org/10.1038/s43588-025-00886-7