Abstract
A major challenge for artificial intelligence in scientific research is ensuring access to sufficient, high-quality data for the development of impactful models. Despite the abundance of public data, the most valuable knowledge often remains embedded within confidential corporate data silos. Although industries are increasingly open to sharing non-competitive insights, such collaboration is often constrained by the confidentiality of the underlying data. Federated learning makes it possible to share knowledge without compromising data privacy, but it has notable limitations. Here, we introduce FLuID (federated learning using information distillation), a data-centric application of federated distillation that is tailored to drug discovery and designed to preserve data privacy. We validate FLuID in two experiments: the first uses public data to simulate a virtual consortium, and the second is a real-world research collaboration between eight pharmaceutical companies. Although aligning the models with each partner-specific domain remains challenging, the data-driven nature of FLuID offers several avenues to mitigate domain shift. FLuID fosters knowledge sharing among pharmaceutical organizations, paving the way for a new generation of models with enhanced performance and an expanded applicability domain in biological activity predictions.
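The abstract names federated distillation without spelling out the mechanism, so a toy sketch of the generic pattern from the federated-distillation literature may help orient readers. It is not the FLuID implementation, which lives in the repository listed under Code availability. The sketch assumes each partner trains a "teacher" on private data, the teachers annotate a shared unlabelled transfer set, and a "student" is trained only on those shared annotations. Every function name, the one-dimensional nearest-centroid model and the toy data are hypothetical choices made for illustration.

```python
from statistics import mean


def train_teacher(labelled_data):
    """Fit a 1-D nearest-centroid classifier (a hypothetical stand-in for a real model).

    labelled_data is a list of (feature, label) pairs with label in {0, 1}.
    Returns a function that maps a feature to a score in [0, 1] for class 1.
    """
    pos = mean(x for x, y in labelled_data if y == 1)
    neg = mean(x for x, y in labelled_data if y == 0)

    def predict(x):
        d_pos, d_neg = abs(x - pos), abs(x - neg)
        if d_pos + d_neg == 0:
            return 0.5  # both centroids coincide with x: no preference
        return d_neg / (d_pos + d_neg)  # closer to the positive centroid -> higher score

    return predict


def distill_labels(teachers, transfer_set):
    """Average the teachers' scores on the shared transfer set.

    Only these scores are pooled; the private training data never leave a partner.
    """
    return [mean(teacher(x) for teacher in teachers) for x in transfer_set]


def train_student(transfer_set, soft_labels):
    """Train a student on the transfer set, using thresholded consensus scores as labels."""
    labelled = [(x, int(p >= 0.5)) for x, p in zip(transfer_set, soft_labels)]
    return train_teacher(labelled)


# Toy private data sets held by two hypothetical partners.
partner_a = [(0.0, 0), (1.0, 0), (4.0, 1), (5.0, 1)]
partner_b = [(0.5, 0), (1.5, 0), (4.5, 1), (6.0, 1)]

# Shared, unlabelled transfer set that every partner is allowed to see.
transfer_set = [0.2, 1.2, 2.0, 4.2, 5.5]

teachers = [train_teacher(partner_a), train_teacher(partner_b)]
soft_labels = distill_labels(teachers, transfer_set)
student = train_student(transfer_set, soft_labels)

print([round(p, 2) for p in soft_labels])
print(student(5.0) > 0.5, student(0.8) > 0.5)
```

In this toy setting the student never sees `partner_a` or `partner_b`; it learns only from the transfer set and the averaged teacher scores, which is the property that makes distillation-based schemes attractive when raw data cannot be shared.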
Data availability
All data for the public simulation experiment are held in the data folder of the GitHub repository (https://github.com/LhasaLimited/FLuID_POC). A release of the repository is also available from Zenodo (https://zenodo.org/records/14531198; ref. 68).
Code availability
All code necessary to run the public portion of the experiment can be found in the GitHub repository (https://github.com/LhasaLimited/FLuID_POC) and is also available from Zenodo (https://zenodo.org/records/14531198; ref. 68). Opening and running the Jupyter notebook executes the entire experiment using the included data. The code is licensed under the GPLv3 licence.
References
Liu, X. et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit. Health 1, e271–e297 (2019).
Zhou, W. et al. Ensembled deep learning model outperforms human experts in diagnosing biliary atresia from sonographic gallbladder images. Nat. Commun. 12, 1259 (2021).
Topaloglu, M. Y., Morrell, E. M., Rajendran, S. & Topaloglu, U. In the pursuit of privacy: the promises and predicaments of federated learning in healthcare. Front. Artif. Intell. 4, 746497 (2021).
Brauneck, A. et al. Federated machine learning in data-protection-compliant research. Nat. Mach. Intell. 5, 2–4 (2023).
Bak, M. et al. Federated learning is not a cure-all for data ethics. Nat. Mach. Intell. 6, 370–372 (2024).
Zhu, H., Xu, J., Liu, S. & Jin, Y. Federated learning on non-IID data: a survey. Neurocomputing 465, 371–390 (2021).
McMahan, B., Moore, E., Ramage, D., Hampson, S. & Arcas, B. A. y. Communication-efficient learning of deep networks from decentralized data. In Proc. 20th International Conference on Artificial Intelligence and Statistics PMLR 54, 1273–1282 (2017).
Zhou, J. et al. A survey on federated learning and its applications for accelerating industrial internet of things. Preprint at https://doi.org/10.48550/arXiv.2104.10501 (2021).
Li, L., Fan, Y., Tse, M. & Lin, K.-Y. A review of applications in federated learning. Comput. Ind. Eng. 149, 106854 (2020).
Li, T., Sahu, A. K., Talwalkar, A. & Smith, V. Federated learning: challenges, methods, and future directions. IEEE Signal Process. Mag. 37, 50–60 (2020).
Yin, X., Zhu, Y. & Hu, J. A comprehensive survey of privacy-preserving federated learning: a taxonomy, review, and future directions. ACM Comput. Surv. 54, 131:1–131:36 (2021).
Kairouz, P. et al. Advances and open problems in federated learning. Found. Trends Mach. Learn. 14, 1–210 (2021).
Liu, J. et al. From distributed machine learning to federated learning: a survey. Knowl. Inf. Syst. 64, 885–917 (2022).
Konečný, J., McMahan, H. B., Ramage, D. & Richtárik, P. Federated optimization: distributed machine learning for on-device intelligence. Preprint at https://doi.org/10.48550/arXiv.1610.02527 (2016).
Abadi, M. et al. Deep learning with differential privacy. In Proc. 2016 ACM SIGSAC Conference on Computer and Communications Security 308–318 (ACM, 2016).
Dwork, C. Differential privacy: a survey of results. In Proc. International Conference on Theory and Applications of Models of Computation (eds Agrawal, M. et al.) 1–19 (Springer, 2008).
Long, G., Tan, Y., Jiang, J. & Zhang, C. in Federated Learning: Privacy and Incentive (eds Yang, Q. et al.) 240–254 (Springer, 2020).
Rieke, N. et al. The future of digital health with federated learning. Npj Digit. Med. 3, 119 (2020).
Choudhury, O. et al. Predicting adverse drug reactions on distributed health data using federated learning. AMIA. Annu. Symp. Proc. 2019, 313–322 (2020).
Nguyen, D. C. et al. Federated learning for smart healthcare: a survey. ACM Comput. Surv. 55, 1–37 (2022).
Xiong, Z. et al. Facing small and biased data dilemma in drug discovery with enhanced federated learning approaches. Sci. China Life Sci. 65, 529–539 (2022).
Manu, D. et al. FL-DISCO: federated generative adversarial network for graph-based molecule drug discovery: special session paper. In Proc. 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD) 1–7 (IEEE, 2021).
Naz, S., Phan, K. T. & Chen, Y.-P. P. A comprehensive review of federated learning for COVID-19 detection. Int. J. Intell. Syst. 37, 2371–2392 (2022).
Goldsmith, M. R. et al. in Crop Protection Products for Sustainable Agriculture (eds Rauzan, B. M. & Lorsbach, B. A.) Vol. 1390, 181–200 (American Chemical Society, 2021).
Heyndrickx, W. et al. MELLODDY: cross-pharma federated learning at unprecedented scale unlocks benefits in QSAR without compromising proprietary information. J. Chem. Inf. Model. 64, 2331–2344 (2024).
Hanser, T. Federated learning for molecular discovery. Curr. Opin. Struct. Biol. 79, 102545 (2023).
Konečný, J. et al. Federated learning: strategies for improving communication efficiency. Preprint at https://doi.org/10.48550/arXiv.1610.05492 (2017).
Wu, C., Wu, F., Lyu, L., Huang, Y. & Xie, X. Communication-efficient federated learning via knowledge distillation. Nat. Commun. 13, 2032 (2022).
Zhu, X. Semi-Supervised Learning Literature Survey (Univ. Wisconsin, 2005); https://minds.wisconsin.edu/handle/1793/60444
Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network. Preprint at http://arxiv.org/abs/1503.02531 (2015).
Papernot, N., Abadi, M., Erlingsson, Ú., Goodfellow, I. & Talwar, K. Semi-supervised knowledge transfer for deep learning from private training data. Preprint at http://arxiv.org/abs/1610.05755 (2016).
Papernot, N. et al. Scalable private learning with PATE. Preprint at http://arxiv.org/abs/1802.08908 (2018).
Breiman, L. Bagging predictors. Mach. Learn. 24, 123–140 (1996).
Dietterich, T. G. Ensemble methods in machine learning. In Proc. International Workshop on Multiple Classifier Systems (eds Kittler, J. & Roli, F.) 1–15 (Springer, 2000).
Li, L., Gou, J., Yu, B., Du, L., Yi, Z. & Tao, D. Federated distillation: a survey. Preprint at https://doi.org/10.48550/arXiv.2404.08564 (2024).
Eldar, Y. C. et al. in Machine Learning and Wireless Communications (eds Goldsmith, A. et al.) 457–485 (Cambridge Univ. Press, 2022).
Li, D. & Wang, J. FedMD: heterogenous federated learning via model distillation. Preprint at https://doi.org/10.48550/arXiv.1910.03581 (2019).
Itahara, S., Nishio, T., Koda, Y., Morikura, M. & Yamamoto, K. Distillation-based semi-supervised federated learning for communication-efficient collaborative training with non-IID private data. IEEE Trans. Mob. Comput. 22, 191–205 (2023).
Sattler, F., Marban, A., Rischke, R. & Samek, W. Communication-efficient federated distillation. Preprint at http://arxiv.org/abs/2012.00632 (2020).
Sui, D. et al. FedED: federated learning via ensemble distillation for medical relation extraction. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds Webber, B. et al.) 2118–2128 (Association for Computational Linguistics, 2020).
Han, S. et al. FedX: unsupervised federated learning with cross knowledge distillation. In European Conference on Computer Vision. (eds Avidan, S. et al.) 691–707 (Springer Nature Switzerland, 2022).
Jeong, E. et al. Communication-efficient on-device machine learning: federated distillation and augmentation under non-IID private data. Preprint at http://arxiv.org/abs/1811.11479 (2023).
Choquette-Choo, C. A. et al. CaPC learning: confidential and private collaborative learning. Preprint at https://doi.org/10.48550/arXiv.2102.05188 (2021).
PyGrid: a peer-to-peer platform for private data science and federated learning OpenMined Blog https://blog.openmined.org/what-is-pygrid-demo/ (2020).
FLuID POC platform. GitHub https://github.com/LhasaLimited/FLuID_POC (2023).
Hancox, J. C., McPate, M. J., El Harchi, A. & Zhang, Y. H. The hERG potassium channel and hERG screening for drug-induced torsades de pointes. Pharmacol. Ther. 119, 118–132 (2008).
Wolford, B. What is GDPR, the EU’s new data protection law? GDPR.eu https://gdpr.eu/what-is-gdpr/ (2018).
Shokri, R., Stronati, M., Song, C. & Shmatikov, V. Membership inference attacks against machine learning models. In IEEE Symposium on Security and Privacy (SP) 3–18 (IEEE, 2017).
Raipuria, G., Bonthu, S. & Singhal, N. Noise robust training of segmentation model using knowledge distillation. In Proc. Pattern Recognition. ICPR International Workshops and Challenges. ICPR 2021 (eds Del Bimbo, A. et al.) 97–104 (Springer, 2021).
Matthews, B. W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405, 442–451 (1975).
Bassani, D., Brigo, A. & Andrews-Morger, A. Federated learning in computational toxicology: an industrial perspective on the Effiris Hackathon. Chem. Res. Toxicol. 36, 1503–1517 (2023).
Kim, S. et al. PubChem in 2021: new data content and improved web interfaces. Nucleic Acids Res. 49, D1388–D1395 (2021).
Bajusz, D., Rácz, A. & Héberger, K. Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? J. Cheminformatics 7, 20 (2015).
Glen, R. et al. Circular fingerprints: flexible molecular descriptors with applications from physical chemistry to ADME. IDrugs Investig. Drugs J. 9, 199–204 (2006).
Hudson, B. D., Hyde, R. M., Rahr, E., Wood, J. & Osman, J. Parameter based methods for compound selection from chemical databases. Quant. Struct. Act. Relatsh. 15, 285–289 (1996).
Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
Maggiora, G., Vogt, M., Stumpfe, D. & Bajorath, J. Molecular similarity in medicinal chemistry. J. Med. Chem. 57, 3186–3204 (2014).
Maggiora, G. M. Concepts and Applications of Molecular Similarity (eds Johnson, M. A. & Maggiora, G. M.) (John Wiley & Sons, 1990).
Pan, S. J. & Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22, 1345–1359 (2010).
Gaulton, A. et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 40, D1100–D1107 (2012).
MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proc. Fifth Berkeley Symp. Math. Stat. Probab. (eds Marie Le Cam, L. & Neyman, J.) Vol. 1, 281–298 (1967).
Siramshetty, V. B., Chen, Q., Devarakonda, P. & Preissner, R. The catch-22 of predicting hERG blockade using publicly accessible bioactivity data. J. Chem. Inf. Model. 58, 1224–1233 (2018).
Ho, T. K. Random decision forests. In Proc. 3rd International Conference on Document Analysis and Recognition Vol. 1, 278–282 (1995).
Hanser, T., Barber, C., Marchaland, J. F. & Werner, S. Applicability domain: towards a more formal definition. SAR QSAR Environ. Res. https://doi.org/10.1080/1062936X.2016.1250229 (2016).
Hanser, T. et al. Self organising hypothesis networks: a new approach for representing and structuring SAR knowledge. J. Cheminformatics 6, 21 (2014).
Carhart, R., Smith, D. H. & Venkataraghavan, R. Atom pairs as molecular features in structure-activity studies: definition and applications. J. Chem. Inf. Comput. Sci. https://doi.org/10.1021/ci00046a002 (1985).
Hanser, T., Steinmetz, F. P., Plante, J., Rippmann, F. & Krier, M. Avoiding hERG-liability in drug design via synergetic combinations of different (Q)SAR methodologies and data sources: a case study in an industrial setting. J. Cheminformatics 11, 9 (2019).
Hanser, T., Werner, S. & Plante, J. FLuID POC a simulation platform for federated distillation. Zenodo https://doi.org/10.5281/zenodo.14531198 (2024).
van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Acknowledgements
The Lhasa Limited authors acknowledge the invaluable collaborative effort, scientific discussions and knowledge sharing made possible by all the industrial partners in this project.
Author information
Contributions
E.A., A.A., L.T.A., R.J.B., A.B., A.D., S.G., N.G., D.K., L.K., W.M., F.R., Y.S., F.S., A.W., J.W. and T.Y.: data access and preparation, domain expertise and continuous input and knowledge sharing. T.H., J.-F.M., J.P., R.v.D. and S.W.: methodology design and implementation, experiment orchestration, virtual simulation platform development, data analytics and paper preparation. C.B. and L.J.: managing the partner–Lhasa relationships.
Ethics declarations
Competing interests
The authors declare no competing interests, except for R.J.B., who is an employee and shareholder of Sanofi, a pharmaceutical R&D company that may benefit from the outcome of this research.
Peer review
Peer review information
Nature Machine Intelligence thanks Alissa Brauneck, Gabriele Buchholtz, Stuart McLennan and Umit Topaloglu for their contribution to the peer review of this work.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hanser, T., Ahlberg, E., Amberg, A. et al. Data-driven federated learning in drug discovery with knowledge distillation. Nat Mach Intell 7, 423–436 (2025). https://doi.org/10.1038/s42256-025-00991-2
DOI: https://doi.org/10.1038/s42256-025-00991-2