Abstract
Optimizing the use of molecular resources for molecular discovery requires collaboration across research institutions and organizations. However, because both successful and unsuccessful molecules produced by each institution (or organization) have high research value, these findings are typically kept private and confidential until formal publication or commercialization, and failed molecules are rarely disclosed at all. This confidentiality requirement poses a major challenge for most existing methods when they must collaboratively handle molecular data with heterogeneous distributions under stringent privacy constraints. Here we propose FedLG (federated learning Lanczos graph), a federated graph learning method that leverages the Lanczos algorithm to facilitate collaborative model training across multiple parties, achieving reliable prediction performance under strict privacy protection conditions. Compared with various existing federated learning methods, FedLG exhibits excellent model performance on 18 benchmark datasets in a simulated federated learning environment. Under different privacy-preserving mechanism settings, FedLG demonstrates robust performance and resistance to noise. Leave-one-client-out experiments and comparison tests across each simulated institution show that FedLG achieves improved heterogeneous data aggregation capabilities and more promising outcomes than localized training. In addition, we incorporate Bayesian optimization into FedLG to show its scalability and further stabilize model performance. Overall, FedLG can be considered an effective method to realize multi-party collaboration while ensuring that sensitive molecular information is protected from potential leakage.
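For readers unfamiliar with the Lanczos algorithm named above, the following is a minimal, generic sketch of the textbook Lanczos iteration in plain Python. It is not the FedLG implementation and makes no claim about how FedLG uses the algorithm. The helper names (`matvec`, `dot`, `lanczos`), the 4-node path graph and the starting vector are assumptions chosen only for illustration. Given a symmetric matrix A, the iteration builds orthonormal vectors Q and a small tridiagonal matrix T (diagonal `alphas`, off-diagonal `betas`) that approximates the spectrum of A.

```python
import math

def matvec(A, v):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    """Euclidean inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def lanczos(A, v0, k):
    """Run up to k steps of the Lanczos iteration on a symmetric matrix A.

    Returns (alphas, betas, Q): the diagonal and off-diagonal entries of
    the tridiagonal matrix T, and the list of orthonormal Lanczos vectors.
    """
    norm = math.sqrt(dot(v0, v0))
    q = [x / norm for x in v0]
    q_prev = [0.0] * len(v0)
    beta = 0.0
    alphas, betas, Q = [], [], []
    for _ in range(k):
        Q.append(q)
        w = matvec(A, q)
        alpha = dot(w, q)
        # Three-term recurrence: remove components along q and q_prev.
        w = [wi - alpha * qi - beta * pi for wi, qi, pi in zip(w, q, q_prev)]
        # Full reorthogonalization against all earlier vectors for stability.
        for u in Q:
            c = dot(w, u)
            w = [wi - c * ui for wi, ui in zip(w, u)]
        alphas.append(alpha)
        beta = math.sqrt(dot(w, w))
        if beta < 1e-12:  # Krylov subspace exhausted; stop early.
            break
        betas.append(beta)
        q_prev, q = q, [wi / beta for wi in w]
    # Keep only the off-diagonal entries that belong to T.
    return alphas, betas[:len(alphas) - 1], Q

# Hypothetical example: adjacency matrix of a 4-node path graph.
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
alphas, betas, Q = lanczos(A, [1.0, 2.0, 3.0, 4.0], k=3)
```

In this sketch the vectors in `Q` are mutually orthonormal, and projecting `A` onto them reproduces the tridiagonal entries in `alphas` and `betas`. This compact representation is why Lanczos-type methods are attractive for spectral operations on graphs.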
Data availability
All datasets used in this study for method construction are freely available. The MoleculeNet (BBBP, BACE, SIDER, Tox21, ToxCast, ESOL, Lipo and FreeSolv) datasets are available at http://moleculenet.ai/datasets-1. The LIT-PCBA (ALDH1, FEN1, GBA, KAT2A, MAPK1, PKM2 and VDR) datasets are available at https://github.com/idrugLab/FP-GNN/blob/main/Data.rar. The DrugBank and BIOSNAP datasets are available at https://github.com/kexinhuang12345/CASTER/tree/master/DDE/data. The CoCrystal dataset is available at https://github.com/Saoge123/ccgnet/tree/main/data.
Code availability
All source codes for this study are available via GitHub at https://github.com/Turningl/FedLG and Zenodo at https://doi.org/10.5281/zenodo.16872722 (ref. 67).
References
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Abramson, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630, 493–500 (2024).
Stokes, J. M. et al. A deep learning approach to antibiotic discovery. Cell 180, 688–702.e13 (2020).
Hartono, N. T. P. et al. How machine learning can help select capping layers to suppress perovskite degradation. Nat. Commun. 11, 4172 (2020).
Jiang, Y. et al. Coupling complementary strategy to flexible graph neural network for quick discovery of coformer in diverse co-crystal materials. Nat. Commun. 12, 5950 (2021).
Cao, Y. et al. Perovskite light-emitting diodes based on spontaneously formed submicrometre-scale structures. Nature 562, 249–253 (2018).
Gómez-Bombarelli, R. et al. Design of efficient molecular organic light-emitting diodes by a high-throughput virtual screening and experimental approach. Nat. Mater. 15, 1120–1127 (2016).
Müller, S. Small-molecule-mediated G-quadruplex isolation from human cells. Nat. Chem. 2, 1095–1098 (2010).
Raccuglia, P. et al. Machine-learning-assisted materials discovery using failed experiments. Nature 533, 73–76 (2016).
Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 9, 513–530 (2018).
Tran-Nguyen, V.-K., Jacquemard, C. & Rognan, D. LIT-PCBA: an unbiased data set for machine learning and virtual screening. J. Chem. Inf. Model. 60, 4263–4273 (2020).
Wishart, D. S. et al. DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res. 36, D901–D906 (2008).
Schneider, P. et al. Rethinking drug design in the artificial intelligence era. Nat. Rev. Drug Discov. 19, 353–364 (2020).
Tan, L. et al. Tackling assay interference associated with small molecules. Nat. Rev. Chem. 8, 319–339 (2024).
Durant, G. et al. The future of machine learning for small-molecule drug discovery will be driven by data. Nat. Comput. Sci. 4, 735–743 (2024).
Yang, Q., Liu, Y., Chen, T. & Tong, Y. Federated machine learning: concept and applications. ACM Trans. Intell. Syst. Technol. 10, 1–19 (2019).
Zhu, W. et al. Federated learning of molecular properties with graph neural networks in a heterogeneous setting. Patterns 3, 100521 (2022).
Xiong, Z. et al. Facing small and biased data dilemma in drug discovery with enhanced federated learning approaches. Sci. China Life Sci. 65, 529–539 (2022).
Heyndrickx, W. et al. MELLODDY: cross-pharma federated learning at unprecedented scale unlocks benefits in QSAR without compromising proprietary information. J. Chem. Inf. Model. 64, 2331–2344 (2024).
Gilmer, J. et al. Neural message passing for quantum chemistry. In Proc. 34th International Conference on Machine Learning (eds Precup, D. & Teh, Y. W.) 1263–1272 (PMLR, 2017).
Veličković, P. et al. Graph attention networks. Preprint at https://arxiv.org/abs/1710.10903 (2018).
Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations https://openreview.net/pdf?id=SJU4ayYgl (ICLR, 2017).
Ning, Y. et al. GFedKRL: graph federated knowledge re-learning for effective molecular property prediction via privacy protection. In International Conference on Artificial Neural Networks. 426–438 (Springer, 2023).
Cao, X., Jia, J., Zhang, Z. & Gong, N. Z. FedRecover: recovering from poisoning attacks in federated learning using historical information. In Proc. 2023 IEEE Symposium on Security and Privacy (SP) 1366–1383 (IEEE, 2023).
Gupta, S. et al. Recovering private text in federated learning of language models. In 36th Conference on Neural Information Processing Systems (NeurIPS 2022) https://papers.neurips.cc/paper_files/paper/2022/file/35b5c175e139bff5f22a5361270fce87-Paper-Conference.pdf (2022).
Zhang, K. et al. Flip: a provable defense framework for backdoor mitigation in federated learning. In International Conference on Learning Representations https://openreview.net/pdf?id=Xo2E217_M4n (ICLR, 2022).
Chen, J. et al. FederEI: federated library matching framework for electron ionization mass spectrum based compound identification. Anal. Chem. 96, 15840–15845 (2024).
Lanczos, C. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. J. Res. Natl Bur. Stand. 45, 255 (1950).
Liao, R., Zhao, Z., Urtasun, R. & Zemel, R. S. LanczosNet: multi-scale deep graph convolutional networks. In International Conference on Learning Representations (ICLR, 2019).
Olkin, I. & Rubin, H. Multivariate beta distributions and independence properties of the Wishart distribution. Ann. Math. Stat. 35, 261–269 (1964).
Alaggan, M., Gambs, S. & Kermarrec, A.-M. Heterogeneous differential privacy. J. Priv. Confid. 7(2) (2016).
McInnes, L. et al. UMAP: Uniform manifold approximation and projection. J. Open Source Softw. 3, 861 (2018).
Pelikan, M. Bayesian Optimization Algorithm. In Hierarchical Bayesian Optimization Algorithm Vol. 170, 31–48 (Springer, Berlin, Heidelberg, 2005).
Wu, C. et al. A federated graph neural network framework for privacy-preserving personalization. Nat. Commun. 13, 3091 (2022).
Liu, J., Lou, J., Xiong, L., Liu, J. & Meng, X. Projected federated averaging with heterogeneous differential privacy. Proc. VLDB Endow. 15, 828–840 (2021).
Wang, L. et al. Enhancing federated learning with in-cloud unlabeled data. In Proc. IEEE 38th International Conference on Data Engineering (ICDE) 136–149 (IEEE, 2022).
Lin, T. et al. Ensemble distillation for robust model fusion in federated learning. In 34th Conference on Neural Information Processing Systems (NeurIPS 2020) https://proceedings.neurips.cc/paper/2020/file/18df51b97ccd68128e994804f3eccc87-Paper.pdf (2020).
Li, Q. et al. Practical one-shot federated learning for cross-silo setting. Preprint at https://arxiv.org/abs/2010.01017 (2020).
Shao, J., Wu, F. & Zhang, J. Selective knowledge sharing for privacy-preserving federated distillation without a good teacher. Nat. Commun. 15, 349 (2024).
Park, J. et al. Sageflow: robust federated learning against both stragglers and adversaries. In 35th Conference on Neural Information Processing Systems (NeurIPS 2021) https://proceedings.neurips.cc/paper/2021/file/076a8133735eb5d7552dc195b125a454-Paper.pdf (2021).
Xie, C. et al. Zeno++: robust fully asynchronous SGD. In Proc. 37th International Conference on Machine Learning (eds Daumé III, H. & Singh, A.) 10495–10503 (PMLR, 2020).
Huang, K., Xiao, C., Hoang, T. N., Glass, L. M. & Sun, J. CASTER: predicting drug interactions with chemical substructure representation. In The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20) 702–709 (2020).
Li, Y. et al. An adaptive graph learning method for automated molecular interactions and properties predictions. Nat. Mach. Intell. 4, 645–651 (2022).
Zeng, X. et al. Accurate prediction of molecular properties and drug targets using a self-supervised image representation learning framework. Nat. Mach. Intell. 4, 1004–1016 (2022).
Zhang, X., Kang, Y., Chen, K., Fan, L. & Yang, Q. Trading off privacy, utility and efficiency in federated learning. ACM Trans. Intell. Syst. Technol. 14, 1–32 (2023).
Cai, H., Zhang, H., Zhao, D., Wu, J. & Wang, L. FP-GNN: a versatile deep learning architecture for enhanced molecular property prediction. Brief. Bioinform. 23, bbac408 (2022).
Hanser, T. Federated learning for molecular discovery. Curr. Opin. Struct. Biol. 79, 102545 (2023).
Boiko, D. A. et al. Autonomous chemical research with large language models. Nature 624, 570–578 (2023).
Mirza, A. et al. A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists. Nat. Chem. 17, 1027–1034 (2025).
McDuff, D. et al. Towards accurate differential diagnosis with large language models. Nature 642, 451–457 (2025).
Farayola, O. A. et al. Data privacy and security in IT: a review of techniques and challenges. Comput. Sci. IT Res. J. 5, 606–615 (2024).
Weber, R. H. Internet of things—new security and privacy challenges. Comput. Law Secur. Rev. 26, 23–30 (2010).
Smith, V., Chiang, C.-K., Sanjabi, M. & Talwalkar, A. S. Federated multi-task learning. In Advances in Neural Information Processing Systems 30 (NIPS 2017) https://papers.nips.cc/paper_files/paper/2017/file/6211080fa89981f66b1a0c9d55c61d0f-Paper.pdf (2017).
Liu, L. et al. GEM-2: next generation molecular property prediction network by modeling full-range many-body interactions. Preprint at http://arxiv.org/abs/2208.05863 (2022).
Hussain, M. S., Zaki, M. J. & Subramanian, D. Triplet interaction improves graph transformers: accurate molecular graph learning with triplet graph transformers. Preprint at http://arxiv.org/abs/2402.04538 (2024).
Wallach, I. & Heifets, A. Most ligand-based classification benchmarks reward memorization rather than generalization. J. Chem. Inf. Model. 58, 916–932 (2018).
Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
Parzen, E. On estimation of a probability density function and mode. Ann. Math. Stat. 33, 1065–1076 (1962).
Li, P. et al. TrimNet: learning molecular representation from triplet messages for biomedicine. Brief. Bioinform. 22, bbaa266 (2021).
Gao, W., Tang, Z., Zhao, J. & Chelikowsky, J. R. Efficient full-frequency GW calculations using a Lanczos method. Phys. Rev. Lett. 132, 126402 (2024).
Ma, W., Lou, Q., Kazemi, A., Faraone, J. & Afzal, T. Super efficient neural network for compression artifacts reduction and super resolution. In Proc. IEEE/CVF Winter Conference on Applications of Computer Vision 460–468 (2024).
Wang, S., Zhang, Z. & Zhang, T. Improved analyses of the randomized power method and block Lanczos method. Preprint at https://arxiv.org/pdf/1508.06429 (2015).
Bergstra, J., Yamins, D. & Cox, D. Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In Proc. 30th International Conference on Machine Learning (eds Dasgupta, S. & McAllester, D.) 115–123 (PMLR, 2013).
Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 27, 861–874 (2006).
Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In 33rd Conference on Neural Information Processing Systems (NeurIPS 2019) https://proceedings.neurips.cc/paper_files/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf (2019).
Fey, M. & Lenssen, J. E. Fast graph representation learning with PyTorch Geometric. Preprint at https://arxiv.org/abs/1903.02428 (2019).
Zhang, L. et al. A federated graph learning method to realize multi-party collaboration for molecular discovery. Zenodo https://doi.org/10.5281/zenodo.16872722 (2025).
Acknowledgements
This work was supported by the China Ministry of Science and Technology (2020YFA0710203), the Joint Funds of the National Natural Science Foundation of China (U23A2081), the National Natural Science Foundation of China (22201271, 92261105, 22025304, 22033007 and 22221003), the Anhui Provincial Natural Science Foundation (2108085UD06 and 2208085UD04), the Anhui Provincial Key Research and Development Project (2023z04020010 and 2022a05020053), USTC Research Funds of the Double First-Class Initiative (YD2060002029 and YD2060006005), the Joint Funds from Hefei National Synchrotron Radiation Laboratory (KY2060000180 and KY2060000195) and the Fundamental Research Funds for the Central Universities (WK2060000088). The AI-driven experiments, simulations and model training were performed on the robotic AI-Scientist platform of the Chinese Academy of Sciences.
Author information
Contributions
Y. Wu, J.J., K.C. and Y.Z. conceived the study and supervised the research. L.Z. designed and implemented the computational framework and carried out the benchmarks and case studies with assistance from J.Z., R.H., Y. Wang and L.L. L.Z. wrote the initial draft of the paper with assistance from Y.Z. Y. Wu, Y.Z., K.C. and L.Z. contributed major revisions. All authors discussed the results and provided feedback on the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Thierry Hanser, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Data flow of training/validation/testing splitting for simulated private institutional and open-access molecular databases.
The whole data partitioning process is consistent across 18 benchmark datasets.
Extended Data Fig. 2 Data distribution of multiple local models in the BBBP dataset.
a, Data distribution of random seed 1234. b, Data distribution of random seed 4567. c, Data distribution of random seed 7890. PLM1: private local model 1. PLM2: private local model 2. PLM3: private local model 3. OLM: open-access local model.
Extended Data Fig. 3 Data distribution of multiple local models in the FEN1 dataset.
a, Data distribution of random seed 1234. b, Data distribution of random seed 4567. c, Data distribution of random seed 7890. PLM1: private local model 1. PLM2: private local model 2. PLM3: private local model 3. OLM: open-access local model.
Extended Data Fig. 4 Data distribution of multiple local models in the CoCrystal dataset.
a, Data distribution of random seed 1234. b, Data distribution of random seed 4567. c, Data distribution of random seed 7890. PLM1: private local model 1. PLM2: private local model 2. PLM3: private local model 3. OLM: open-access local model. Data distribution is partitioned by the first molecule in each pair of molecules (see Methods).
Supplementary information
Supplementary Information
Supplementary Figs. 1–7, Notes 1–14, Tables 1–32 and References.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, L., Zhang, J., Huang, R. et al. A federated graph learning method to realize multi-party collaboration for molecular discovery. Nat Mach Intell 8, 246–256 (2026). https://doi.org/10.1038/s42256-026-01184-1
DOI: https://doi.org/10.1038/s42256-026-01184-1


