Data-driven federated learning in drug discovery with knowledge distillation

Abstract

A main challenge for artificial intelligence in scientific research is ensuring access to sufficient, high-quality data for the development of impactful models. Despite the abundance of public data, the most valuable knowledge often remains embedded within confidential corporate data silos. Although industries are increasingly open to sharing non-competitive insights, such collaboration is often constrained by the confidentiality of the underlying data. Federated learning makes it possible to share knowledge without compromising data privacy, but it has notable limitations. Here, we introduce FLuID (federated learning using information distillation), a data-centric application of federated distillation tailored to drug discovery that aims to preserve data privacy. We validate FLuID in two experiments: first using public data to simulate a virtual consortium, and second in a real-world research collaboration between eight pharmaceutical companies. Although the alignment of the models with the partner-specific domain remains challenging, the data-driven nature of FLuID offers several avenues to mitigate domain shift. FLuID fosters knowledge sharing among pharmaceutical organizations, paving the way for a new generation of models with enhanced performance and an expanded applicability domain in biological activity predictions.
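
For readers unfamiliar with federated distillation, the minimal Python sketch below illustrates the general pattern the abstract refers to: each partner trains a "teacher" model on its private data, every teacher labels a shared public transfer set, and a "student" model is trained on the aggregated labels. The synthetic data, random-forest models and plain averaging rule are illustrative assumptions for this sketch only; they are not the FLuID implementation described in the paper.

    # Conceptual sketch of federated distillation (illustrative only, not the FLuID protocol).
    # Assumptions: synthetic binary "fingerprint" data, random-forest teachers and student,
    # and simple averaging of teacher probabilities as the aggregation rule.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)

    def make_private_set(n=500, d=64):
        # Simulate one partner's private data: binary descriptors and a toy activity label.
        X = rng.integers(0, 2, size=(n, d)).astype(float)
        y = (X[:, :8].sum(axis=1) > 4).astype(int)
        return X, y

    # 1. Each partner trains a teacher locally; private data never leave the partner.
    teachers = []
    for _ in range(3):
        X_priv, y_priv = make_private_set()
        teachers.append(RandomForestClassifier(n_estimators=100, random_state=0).fit(X_priv, y_priv))

    # 2. Every teacher labels a shared public transfer set; only predictions are exchanged.
    X_transfer = rng.integers(0, 2, size=(2000, 64)).astype(float)
    soft_labels = np.mean([t.predict_proba(X_transfer)[:, 1] for t in teachers], axis=0)

    # 3. A student model distils the consensus knowledge from the aggregated labels.
    student = RandomForestClassifier(n_estimators=100, random_state=0)
    student.fit(X_transfer, (soft_labels > 0.5).astype(int))

The distillation step is what allows knowledge to be shared while each partner's private training data remain behind its firewall.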

Fig. 1: FL schemes.
Fig. 2: FLuID methodology.
Fig. 3: Concept validation and teacher–student comparison for a simulated consortium.
Fig. 4: Teacher and hybrid model comparison.
Fig. 5: Concept validation in an industrial setup.
Fig. 6: Public proof-of-concept datasets.

Data availability

All data for the public simulation experiment are held within the GitHub repository (https://github.com/LhasaLimited/FLuID_POC) in the data folder. A release of the repository is also available from Zenodo (ref. 68; https://zenodo.org/records/14531198).

Code availability

All code necessary to run the public portion of the experiment can be found in the GitHub repository (https://github.com/LhasaLimited/FLuID_POC) and is also available from Zenodo (ref. 68; https://zenodo.org/records/14531198). Opening and running the Jupyter notebook executes the entire experiment using the included data. The code is licensed under the GPLv3 licence.
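
The notebook can also be executed non-interactively. The snippet below is a minimal sketch using nbformat and nbclient; the notebook filename is a placeholder, not one taken from the repository.

    # Hypothetical non-interactive run of the repository's notebook (filename is a placeholder).
    import nbformat
    from nbclient import NotebookClient

    nb = nbformat.read("FLuID_POC.ipynb", as_version=4)   # placeholder path; adjust to the repository
    NotebookClient(nb, timeout=None).execute()            # runs every cell in order
    nbformat.write(nb, "FLuID_POC_executed.ipynb")        # saves the executed notebook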

References

  1. Liu, X. et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit. Health 1, e271–e297 (2019).

  2. Zhou, W. et al. Ensembled deep learning model outperforms human experts in diagnosing biliary atresia from sonographic gallbladder images. Nat. Commun. 12, 1259 (2021).

  3. Topaloglu, M. Y., Morrell, E. M., Rajendran, S. & Topaloglu, U. In the pursuit of privacy: the promises and predicaments of federated learning in healthcare. Front. Artif. Intell. 4, 746497 (2021).

  4. Brauneck, A. et al. Federated machine learning in data-protection-compliant research. Nat. Mach. Intell. 5, 2–4 (2023).

  5. Bak, M. et al. Federated learning is not a cure-all for data ethics. Nat. Mach. Intell. 6, 370–372 (2024).

  6. Zhu, H., Xu, J., Liu, S. & Jin, Y. Federated learning on non-IID data: a survey. Neurocomputing 465, 371–390 (2021).

  7. McMahan, B., Moore, E., Ramage, D., Hampson, S. & Agüera y Arcas, B. Communication-efficient learning of deep networks from decentralized data. In Proc. 20th International Conference on Artificial Intelligence and Statistics Vol. 54, 1273–1282 (PMLR, 2017).

  8. Zhou, J. et al. A survey on federated learning and its applications for accelerating industrial internet of things. Preprint at https://doi.org/10.48550/arXiv.2104.10501 (2021).

  9. Li, L., Fan, Y., Tse, M. & Lin, K.-Y. A review of applications in federated learning. Comput. Ind. Eng. 149, 106854 (2020).

  10. Li, T., Sahu, A. K., Talwalkar, A. & Smith, V. Federated learning: challenges, methods, and future directions. IEEE Signal Process. Mag. 37, 50–60 (2020).

  11. Yin, X., Zhu, Y. & Hu, J. A comprehensive survey of privacy-preserving federated learning: a taxonomy, review, and future directions. ACM Comput. Surv. 54, 131:1–131:36 (2021).

  12. Kairouz, P. et al. Advances and open problems in federated learning. Found. Trends Mach. Learn. 14, 1–210 (2021).

  13. Liu, J. et al. From distributed machine learning to federated learning: a survey. Knowl. Inf. Syst. 64, 885–917 (2022).

  14. Konečný, J., McMahan, H. B., Ramage, D. & Richtárik, P. Federated optimization: distributed machine learning for on-device intelligence. Preprint at https://doi.org/10.48550/arXiv.1610.02527 (2016).

  15. Abadi, M. et al. Deep learning with differential privacy. In Proc. 2016 ACM SIGSAC Conference on Computer and Communications Security 308–318 (ACM, 2016).

  16. Dwork, C. Differential privacy: a survey of results. In Proc. International Conference on Theory and Applications of Models of Computation (eds Agrawal, M. et al.) 1–19 (Springer, 2008).

  17. Long, G., Tan, Y., Jiang, J. & Zhang, C. in Federated Learning: Privacy and Incentive (eds Yang, Q. et al.) 240–254 (Springer, 2020).

  18. Rieke, N. et al. The future of digital health with federated learning. npj Digit. Med. 3, 119 (2020).

  19. Choudhury, O. et al. Predicting adverse drug reactions on distributed health data using federated learning. AMIA Annu. Symp. Proc. 2019, 313–322 (2020).

  20. Nguyen, D. C. et al. Federated learning for smart healthcare: a survey. ACM Comput. Surv. 55, 1–37 (2022).

  21. Xiong, Z. et al. Facing small and biased data dilemma in drug discovery with enhanced federated learning approaches. Sci. China Life Sci. 65, 529–539 (2022).

  22. Manu, D. et al. FL-DISCO: federated generative adversarial network for graph-based molecule drug discovery: special session paper. In Proc. 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD) 1–7 (IEEE, 2021).

  23. Naz, S., Phan, K. T. & Chen, Y.-P. P. A comprehensive review of federated learning for COVID-19 detection. Int. J. Intell. Syst. 37, 2371–2392 (2022).

  24. Goldsmith, M. R. et al. in Crop Protection Products for Sustainable Agriculture (eds Rauzan, B. M. & Lorsbach, B. A.) Vol. 1390, 181–200 (American Chemical Society, 2021).

  25. Heyndrickx, W. et al. MELLODDY: cross-pharma federated learning at unprecedented scale unlocks benefits in QSAR without compromising proprietary information. J. Chem. Inf. Model. 64, 2331–2344 (2024).

  26. Hanser, T. Federated learning for molecular discovery. Curr. Opin. Struct. Biol. 79, 102545 (2023).

  27. Konečný, J. et al. Federated learning: strategies for improving communication efficiency. Preprint at https://doi.org/10.48550/arXiv.1610.05492 (2017).

  28. Wu, C., Wu, F., Lyu, L., Huang, Y. & Xie, X. Communication-efficient federated learning via knowledge distillation. Nat. Commun. 13, 2032 (2022).

  29. Zhu, X. Semi-Supervised Learning Literature Survey (Univ. Wisconsin, 2005); https://minds.wisconsin.edu/handle/1793/60444

  30. Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network. Preprint at http://arxiv.org/abs/1503.02531 (2015).

  31. Papernot, N., Abadi, M., Erlingsson, Ú., Goodfellow, I. & Talwar, K. Semi-supervised knowledge transfer for deep learning from private training data. Preprint at http://arxiv.org/abs/1610.05755 (2016).

  32. Papernot, N. et al. Scalable private learning with PATE. Preprint at http://arxiv.org/abs/1802.08908 (2018).

  33. Breiman, L. Bagging predictors. Mach. Learn. 24, 123–140 (1996).

  34. Dietterich, T. G. Ensemble methods in machine learning. In Proc. International Workshop on Multiple Classifier Systems (eds Kittler, J. & Roli, F.) 1–15 (Springer, 2000).

  35. Li, L. et al. Federated distillation: a survey. Preprint at https://doi.org/10.48550/arXiv.2404.08564 (2024).

  36. Eldar, Y. C. et al. in Machine Learning and Wireless Communications (eds Goldsmith, A. et al.) 457–485 (Cambridge Univ. Press, 2022).

  37. Li, D. & Wang, J. FedMD: heterogenous federated learning via model distillation. Preprint at https://doi.org/10.48550/arXiv.1910.03581 (2019).

  38. Itahara, S., Nishio, T., Koda, Y., Morikura, M. & Yamamoto, K. Distillation-based semi-supervised federated learning for communication-efficient collaborative training with non-IID private data. IEEE Trans. Mob. Comput. 22, 191–205 (2023).

  39. Sattler, F., Marban, A., Rischke, R. & Samek, W. Communication-efficient federated distillation. Preprint at http://arxiv.org/abs/2012.00632 (2020).

  40. Sui, D. et al. FedED: federated learning via ensemble distillation for medical relation extraction. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds Webber, B. et al.) 2118–2128 (Association for Computational Linguistics, 2020).

  41. Han, S. et al. FedX: unsupervised federated learning with cross knowledge distillation. In Proc. European Conference on Computer Vision (eds Avidan, S. et al.) 691–707 (Springer, 2022).

  42. Jeong, E. et al. Communication-efficient on-device machine learning: federated distillation and augmentation under non-IID private data. Preprint at http://arxiv.org/abs/1811.11479 (2023).

  43. Choquette-Choo, C. A. et al. CaPC learning: confidential and private collaborative learning. Preprint at https://doi.org/10.48550/arXiv.2102.05188 (2021).

  44. PyGrid: a peer-to-peer platform for private data science and federated learning. OpenMined Blog https://blog.openmined.org/what-is-pygrid-demo/ (2020).

  45. FLuID POC platform. GitHub https://github.com/LhasaLimited/FLuID_POC (2023).

  46. Hancox, J. C., McPate, M. J., El Harchi, A. & Zhang, Y. H. The hERG potassium channel and hERG screening for drug-induced torsades de pointes. Pharmacol. Ther. 119, 118–132 (2008).

  47. Wolford, B. What is GDPR, the EU’s new data protection law? GDPR.eu https://gdpr.eu/what-is-gdpr/ (2018).

  48. Shokri, R., Stronati, M., Song, C. & Shmatikov, V. Membership inference attacks against machine learning models. In Proc. IEEE Symposium on Security and Privacy (SP) 3–18 (IEEE, 2017).

  49. Raipuria, G., Bonthu, S. & Singhal, N. Noise robust training of segmentation model using knowledge distillation. In Proc. Pattern Recognition. ICPR International Workshops and Challenges. ICPR 2021 (eds Del Bimbo, A. et al.) 97–104 (Springer, 2021).

  50. Matthews, B. W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405, 442–451 (1975).

  51. Bassani, D., Brigo, A. & Andrews-Morger, A. Federated learning in computational toxicology: an industrial perspective on the Effiris Hackathon. Chem. Res. Toxicol. 36, 1503–1517 (2023).

  52. Kim, S. et al. PubChem in 2021: new data content and improved web interfaces. Nucleic Acids Res. 49, D1388–D1395 (2021).

  53. Bajusz, D., Rácz, A. & Héberger, K. Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? J. Cheminformatics 7, 20 (2015).

  54. Glen, R. et al. Circular fingerprints: flexible molecular descriptors with applications from physical chemistry to ADME. IDrugs Investig. Drugs J. 9, 199–204 (2006).

  55. Hudson, B. D., Hyde, R. M., Rahr, E., Wood, J. & Osman, J. Parameter based methods for compound selection from chemical databases. Quant. Struct. Act. Relatsh. 15, 285–289 (1996).

  56. Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).

  57. Maggiora, G., Vogt, M., Stumpfe, D. & Bajorath, J. Molecular similarity in medicinal chemistry. J. Med. Chem. 57, 3186–3204 (2014).

  58. Maggiora, G. M. Concepts and Applications of Molecular Similarity (eds Johnson, M. A. & Maggiora, G. M.) (John Wiley & Sons, 1990).

  59. Pan, S. J. & Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22, 1345–1359 (2010).

  60. Gaulton, A. et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 40, D1100–D1107 (2012).

  61. MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proc. Fifth Berkeley Symp. Math. Stat. Probab. (eds Le Cam, L. M. & Neyman, J.) Vol. 1, 281–298 (1967).

  62. Siramshetty, V. B., Chen, Q., Devarakonda, P. & Preissner, R. The catch-22 of predicting hERG blockade using publicly accessible bioactivity data. J. Chem. Inf. Model. 58, 1224–1233 (2018).

  63. Ho, T. K. Random decision forests. In Proc. 3rd International Conference on Document Analysis and Recognition Vol. 1, 278–282 (1995).

  64. Hanser, T., Barber, C., Marchaland, J. F. & Werner, S. Applicability domain: towards a more formal definition. SAR QSAR Environ. Res. https://doi.org/10.1080/1062936X.2016.1250229 (2016).

  65. Hanser, T. et al. Self organising hypothesis networks: a new approach for representing and structuring SAR knowledge. J. Cheminformatics 6, 21 (2014).

  66. Carhart, R., Smith, D. H. & Venkataraghavan, R. Atom pairs as molecular features in structure-activity studies: definition and applications. J. Chem. Inf. Comput. Sci. https://doi.org/10.1021/ci00046a002 (1985).

  67. Hanser, T., Steinmetz, F. P., Plante, J., Rippmann, F. & Krier, M. Avoiding hERG-liability in drug design via synergetic combinations of different (Q)SAR methodologies and data sources: a case study in an industrial setting. J. Cheminformatics 11, 9 (2019).

  68. Hanser, T., Werner, S. & Plante, J. FLuID POC: a simulation platform for federated distillation. Zenodo https://doi.org/10.5281/zenodo.14531198 (2024).

  69. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).


Acknowledgements

The Lhasa Limited authors acknowledge the invaluable collaborative effort, scientific discussions and knowledge sharing made possible by all the industrial partners in this project.

Author information

Contributions

E.A., A.A., L.T.A., R.J.B., A.B., A.D., S.G., N.G., D.K., L.K., W.M., F.R., Y.S., F.S., A.W., J.W. and T.Y.: data access and preparation, domain expertise and continuous input and knowledge sharing. T.H., J.-F.M., J.P., R.v.D. and S.W.: methodology design and implementation, experiment orchestration, virtual simulation platform development, data analytics and paper preparation. C.B. and L.J.: managing the partner–Lhasa relationships.

Corresponding author

Correspondence to Thierry Hanser.

Ethics declarations

Competing interests

The authors declare no competing interests, except for R.J.B., who is an employee and shareholder of Sanofi, a pharmaceutical R&D company that may benefit from the outcome of this research.

Peer review

Peer review information

Nature Machine Intelligence thanks Alissa Brauneck, Gabriele Buchholtz, Stuart McLennan and Umit Topaloglu for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Hanser, T., Ahlberg, E., Amberg, A. et al. Data-driven federated learning in drug discovery with knowledge distillation. Nat Mach Intell 7, 423–436 (2025). https://doi.org/10.1038/s42256-025-00991-2
